VoIP security architecture in brief

Voice over IP (VoIP) has been around for a long time. It’s ubiquitous in homes, data centers and carrier networks. Despite this ubiquity, security is rarely a priority. By combining a handful of important standard protocols, it is possible to provide end-to-end encryption for an established VoIP call, so that the call’s content cannot be tapped in transit.

TLS secures the connection between the signaling endpoints of the session. It’s the same technology that protects HTTPS web sites; ecommerce, secure webmail, Tor and many others use TLS for security. Unlike web sites, VoIP uses a different protocol called the Session Initiation Protocol (SIP) for signaling: actions like ringing an endpoint, answering a call and hanging up. This is the metadata of calls. SIP-TLS relies on the standard Certificate Authorities to authenticate the parties during key agreement. This implies trust between the certificate issuer and the calling endpoints.
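As a rough illustration of the signaling layer, the sketch below builds a minimal SIP INVITE as text and prepares a TLS client context that verifies the peer against the platform's Certificate Authorities. The addresses, the Call-ID value and the `build_invite` helper are hypothetical, and the request omits the session description a real call would carry.

```python
import ssl


def build_invite(caller: str, callee: str, host: str, call_id: str) -> str:
    """Return a minimal, illustrative SIP INVITE (no message body)."""
    lines = [
        f"INVITE sips:{callee} SIP/2.0",
        f"Via: SIP/2.0/TLS {host};branch=z9hG4bK-example",
        "Max-Forwards: 70",
        f"From: <sips:{caller}>;tag=1234",
        f"To: <sips:{callee}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sips:{caller}>",
        "Content-Length: 0",
    ]
    # SIP, like HTTP, separates lines with CRLF and ends the headers
    # with an empty line.
    return "\r\n".join(lines) + "\r\n\r\n"


def make_tls_context() -> ssl.SSLContext:
    """TLS client context that trusts the platform's CA store."""
    # create_default_context() turns on certificate and hostname checks.
    return ssl.create_default_context()


# All values below are made-up examples.
invite = build_invite("alice@example.org", "bob@example.org",
                      "client.example.org", "a84b4c76e66710")
context = make_tls_context()
print(invite.splitlines()[0])
```

Actually sending the request would mean connecting a TCP socket to the SIP proxy, conventionally on port 5061 for SIP over TLS, and wrapping it with `context.wrap_socket(sock, server_hostname=...)`. The point of the sketch is only that the signaling is readable text inside a CA-verified TLS tunnel.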

Figure: An example of a SIP dialog

To add a little complexity, the content of calls has only a small relationship to SIP. The key agreement protocol for P2P VoIP content is called ZRTP. In a true P2P system, all the key agreement and encryption of a call’s content happens in the endpoint applications. An important distinction between VoIP and other networked communications is that all devices are both client and server at once, so we have only “endpoints” rather than “clients” or “servers”. Once the endpoints agree on a shared secret, the ZRTP exchange ends and the SRTP session begins. Once it is established, all audio and video content going over the network is encrypted. Only the two peer endpoints that established a session with ZRTP can decrypt the media stream. This is the part of the conversation that cannot be wiretapped, nor can metadata of sessions in progress be spied on.

Figure: An example ZRTP key exchange

To step back a little, let’s review some acronyms. First there is SIP (Session Initiation Protocol). This protocol is encrypted with TLS. It contains the IP addresses of the endpoints that wish to communicate but it does not interact with the audio or video stream.

Second, there is ZRTP. This protocol enters the mix after a successful SIP dialog establishes a call session by locating the two endpoints. It transmits key agreement information over the still-unverified media channel between the peers. The peers then use their voices to read a short authentication string to each other; if both sides hear the same value, the channel is secure between only the two peers.
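To make the spoken check concrete, here is a heavily simplified sketch of the idea: a Diffie-Hellman exchange followed by a short string derived from the shared secret. It is not the real ZRTP message flow, which adds further safeguards such as a hash commitment and proper key derivation. The group parameters are toy values chosen for brevity, and the helper names are hypothetical.

```python
import hashlib
import secrets

# Toy parameters: a 127-bit Mersenne prime and generator 3 are NOT secure.
# Real deployments use much larger standardized groups or elliptic curves.
P = 2**127 - 1
G = 3

# z-base-32 alphabet, used here to render the string as four characters.
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"


def keypair():
    """Return a random private exponent and the matching public value."""
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def shared_secret(private: int, peer_public: int) -> int:
    """Combine our private exponent with the peer's public value."""
    return pow(peer_public, private, P)


def short_auth_string(secret: int) -> str:
    """Derive a 4-character string from the leftmost 20 bits of a hash."""
    digest = hashlib.sha256(secret.to_bytes(16, "big")).digest()
    bits = int.from_bytes(digest[:4], "big") >> 12
    return "".join(ZBASE32[(bits >> shift) & 31] for shift in (15, 10, 5, 0))


# Honest exchange: both sides derive the same secret and the same string.
a_priv, a_pub = keypair()
b_priv, b_pub = keypair()
sas_alice = short_auth_string(shared_secret(a_priv, b_pub))
sas_bob = short_auth_string(shared_secret(b_priv, a_pub))
print(sas_alice, sas_bob, sas_alice == sas_bob)

# Interception: Mallory substitutes her own public value toward each side,
# so Alice and Bob no longer share a secret with each other.
m_priv, m_pub = keypair()
secret_alice = shared_secret(a_priv, m_pub)
secret_bob = shared_secret(b_priv, m_pub)
print(secret_alice == secret_bob)
```

When nobody interferes, both callers see the same four characters and can confirm them by voice. With an interceptor in the middle, each caller holds a different secret, so the strings they read out will almost always disagree.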

Third, enter SRTP. Only after the ZRTP key exchange succeeds is the call content encrypted with the Secure Real-time Transport Protocol. From this point forward, all audio and video is secure and uniquely keyed to each individual session.
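The sketch below illustrates the general shape of a protected media packet: the RTP header stays readable, the payload is encrypted with a per-session key, and an authentication tag is appended. It is a teaching toy only. SRTP uses AES for encryption, while this example substitutes a keystream built from HMAC-SHA256 so that it runs with the Python standard library alone. The keys, the function names and the sample packet are all assumptions made for illustration.

```python
import hashlib
import hmac
import struct

TAG_LEN = 10  # bytes of authentication tag kept in this sketch


def keystream(key: bytes, ssrc: int, seq: int, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode (stand-in for AES)."""
    out = b""
    block = 0
    while len(out) < length:
        counter = struct.pack("!IHI", ssrc, seq, block)
        out += hmac.new(key, counter, hashlib.sha256).digest()
        block += 1
    return out[:length]


def protect(enc_key: bytes, auth_key: bytes, header: bytes, payload: bytes) -> bytes:
    """Encrypt the payload, leave the 12-byte header in clear, append a tag."""
    seq, _timestamp, ssrc = struct.unpack("!HII", header[2:12])
    stream = keystream(enc_key, ssrc, seq, len(payload))
    cipher = bytes(p ^ k for p, k in zip(payload, stream))
    tag = hmac.new(auth_key, header + cipher, hashlib.sha256).digest()[:TAG_LEN]
    return header + cipher + tag


def unprotect(enc_key: bytes, auth_key: bytes, packet: bytes) -> bytes:
    """Verify the tag first, then decrypt the payload."""
    header, cipher, tag = packet[:12], packet[12:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(auth_key, header + cipher, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    seq, _timestamp, ssrc = struct.unpack("!HII", header[2:12])
    stream = keystream(enc_key, ssrc, seq, len(cipher))
    return bytes(c ^ k for c, k in zip(cipher, stream))


# Fixed RTP header: version 2, payload type 0, sequence 1, timestamp 160.
header = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234ABCD)
enc_key, auth_key = b"k" * 16, b"a" * 20  # placeholders for per-session keys
packet = protect(enc_key, auth_key, header, b"voice frame")
print(packet[:12] == header)
print(unprotect(enc_key, auth_key, packet))
```

Because the keys come out of the key exchange for that one call, a packet captured from one session is useless against another, and anyone without the keys sees only the header and opaque bytes.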

This brief was inspired by the numerous discussions I’ve participated in online and offline during my ongoing operation of a secure VoIP service sponsored by The Guardian Project. I understand that VoIP is complex when compared to HTTP, and the mainstream understanding of the security elements often omits the ZRTP/SRTP content, focusing instead on only the SIP-TLS signaling. While signaling is important, few calls would be useful without content.

3 comments for “VoIP security architecture in brief”

  1. paul
    2013/11/30 at 6:46 pm

    A couple questions:

    1) Why do you need to introduce ZRTP into the mix? In the carrier VOIP space, the standard for encryption is SIP over TLS, with SRTP keys negotiated in the INVITE transaction. While I’ve heard of ZRTP before, I’ve never heard of any network operator using it. What does it add over the simpler SIPS/SRTP combination?

    2) I noticed that you guys have created some Docker images for secure VOIP servers. This is something that I’ve been idly sketching out, but I was going to use Asterisk. What are the advantages of Kamailio over Asterisk?

  2. lee
    2013/12/03 at 5:36 pm

    Paul, ZRTP ensures the content of calls is secure between both endpoints without the dependence on an SSL Certificate Authority to validate the connection. Because ZRTP exists only on the call’s endpoints, there is no need for any network operators to support it. The only requirement is that the RTP media stream pass through any proxies without modification. Two services which follow this practice are and

    Regarding your second question, Kamailio is a modular SIP router. Asterisk is a PBX. Kamailio and Asterisk can work together since Kamailio is not able to answer calls. For example, the echo test is provided by a PBX behind Kamailio.
