VoIP security architecture in brief

By lee · November 21, 2013

Voice over IP (VoIP) has been around for a long time. It’s ubiquitous in homes, data centers and carrier networks. Despite this ubiquity, security is rarely a priority. With the combination of a handful of important standard protocols, it is possible to make untappable end to end encryption for an established VoIP call.

TLS is the security protocol between the signaling endpoints of the session. It’s the same technology that exists for SSL web sites; ecommerce, secure webmail, Tor and many others use TLS for security. Unlike web sites, VoIP uses a different protocol called the Session Initiation Protocol (SIP) for signaling: actions like ringing an endpoint, answering a call and hanging up. This is the metadata of calls. SIP-TLS uses the standard Certificate Authorities for key agreement. This implies trust between the certificate issuer and the calling endpoints.

An example of a SIP dialog

To add a little complexity, the content of calls has only a small relationship to SIP. The key agreement protocol for P2P VoIP content is called ZRTP. In a true P2P system, all the key agreement and encryption of a call’s content happens in the endpoint applications. An important distinction between VoIP and other networked communications is that all devices are both client and server at once, so we have only “endpoints” rather than “clients” or “servers”. Once the endpoints agree on a shared secret, the ZRTP session ends and the SRTP session begins. When established, all audio and video content going over the network is encrypted. Only the two peer endpoints who established a session with ZRTP can decrypt the media stream. This is the part of the conversation that cannot be wiretapped nor can metadata of sessions in progress be spied on.

An example ZRTP key exchange

To step back a little, let’s review some acronyms. First there is SIP (Session Initialization Protocol). This protocol is encrypted with TLS. It contains the IP addresses of the endpoints who wish to communicate but it does not interact with the audio or video stream.

Second, there is ZRTP. This protocol enters into the mix after a successful SIP dialog establishes a call session by locating the two endpoints. It transmits key agreement information over a unverified SRTP channel between the peers. The peers use their voices to speak a secret that verifies that the channel is secure between only the two peers.

Third, enter SRTP. Only after the ZRTP key exchange succeeds is the call content encrypted with the Secure Real Time Protocol. From this point forward, all audio and video is secure and uniquely keyed to each individual session.

This brief was inspired by the numerous discussions I’ve participated in online and offline during my ongoing operation of ostel.co, a secure VoIP service sponsored by The Guardian Project. I understand that VoIP is complex when compared to HTTP and the mainstream understanding of the securirty elements often omits the ZRTP/SRTP content, rather focusing on only the SIP-TLS signaling. While signaling is important, few calls would be useful without content.