This article describes the end-to-end encryption used for Telegram voice calls.
Before a voice call is ready, some preliminary actions have to be performed. The calling party needs to contact the party to be called and check whether it is ready to accept the call. Besides that, the parties have to negotiate the protocols to be used, learn each other's IP addresses or those of the Telegram relay servers to be used (so-called reflectors), and generate a one-time encryption key for this voice call with the aid of Diffie-Hellman key exchange. All of this is accomplished in parallel with the aid of several Telegram API methods and related notifications. This document details the generation of the encryption key. Other negotiations will eventually be documented elsewhere.
The Diffie-Hellman key exchange, as well as the whole protocol used to create a new voice call, is quite similar to the one used for Secret Chats. We recommend studying the linked article before proceeding.
However, we have introduced some important changes to facilitate the key verification process. Below is the entire exchange between the two communicating parties, the Caller (A) and the Callee (B), through the Telegram servers (S).
The first request sent by A to S has the field g_a_hash:bytes, among others. For this request, the field is to be filled with g_a_hash, not g_a itself.
The fingerprint of the computed key is checked against the value of key_fingerprint:long received from the other side, as an implementation sanity check.
At this point, the Diffie-Hellman key exchange is complete, and both parties have a 256-byte shared secret key, key, which is used to encrypt all further exchanges between A and B.
It is of paramount importance to accept each update only once for each instance of the key generation protocol, discarding any duplicates or alternative versions of already received and processed messages (updates).
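The rule above can be sketched as a small guard that remembers which updates have already been processed for one run of the key generation protocol. This is only an illustration: the class name KeyExchangeSession and the update labels are hypothetical, and a real client would key this on the actual update constructors it receives.

```python
class KeyExchangeSession:
    """Tracks which protocol updates were already processed for one call.

    Hypothetical helper: one instance per run of the key generation protocol.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, update_kind):
        """Return True the first time an update kind arrives, False afterwards.

        Duplicates and alternative versions of an already processed update
        share the same kind, so both are discarded.
        """
        if update_kind in self._seen:
            return False
        self._seen.add(update_kind)
        return True
```

A client would call accept() before acting on an update and silently drop the update when it returns False.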
Both parties A (the Caller) and B (the Callee) transform the voice information into a sequence of small chunks or packets, not more than 1 kilobyte each. This information is to be encrypted using the shared key key generated during the initial exchange, and sent to the other party, either directly (P2P) or through Telegram's relay servers (so-called reflectors). This document describes only the encryption process for each chunk, leaving out voice encoding and the network-dependent parts.
The low-level data chunk raw_data:string, obtained from the voice encoder, is first encapsulated into one of the two constructors for the DecryptedDataBlock type, similar to DecryptedMessage used in secret chats:
decryptedDataBlock#dbf948c1 random_id:long random_bytes:string flags:# voice_call_id:flags.2?int128 in_seq_no:flags.4?int out_seq_no:flags.4?int recent_received_mask:flags.5?int proto:flags.3?int extra:flags.1?string raw_data:flags.0?string = DecryptedDataBlock;
simpleDataBlock#cc0d0e76 random_id:long random_bytes:string raw_data:string = DecryptedDataBlock;
Here, out_seq_no is the chunk's sequence number among all chunks sent by this party (starting from one), and in_seq_no is the highest known out_seq_no among the received packets. The parameter recent_received_mask is a 32-bit mask, used to track delivery of the last 32 packets sent by the other party. The bit i is set if a packet with out_seq_no equal to in_seq_no-i has been received.
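As an illustration of the mask described above, the following sketch builds recent_received_mask from a set of received sequence numbers. It assumes that bit i is counted from the least significant bit; the function and parameter names are hypothetical.

```python
def recent_received_mask(received_seq_nos, in_seq_no):
    """Build the 32-bit delivery mask for the last 32 packets of the peer.

    received_seq_nos: set of out_seq_no values received from the other party.
    in_seq_no: the highest out_seq_no received so far.
    Assumption: bit i is the i-th least significant bit of the mask.
    """
    mask = 0
    for i in range(32):
        seq_no = in_seq_no - i
        # Sequence numbers start from one, so anything below 1 never existed.
        if seq_no >= 1 and seq_no in received_seq_nos:
            mask |= 1 << i
    return mask
```

For example, if packets 5, 4 and 2 have arrived and in_seq_no is 5, bits 0, 1 and 3 are set, while bit 2 (packet 3) stays clear.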
The higher 8 bits in flags are reserved for use by the lower-level protocol (the one which generates and interprets raw_data), and will never be used for future extensions of decryptedDataBlock. The parameters voice_call_id and proto are mandatory until the other side confirms reception of at least one packet by sending a packet with a non-zero in_seq_no. After that, they become optional, and the simpleDataBlock constructor can be used if the lower-level protocol chooses to.
The parameter voice_call_id is computed from the key key and equals the lower 128 bits of its SHA-256. The random_bytes string should contain at least 7 bytes of random data. The field random_id also contains 8 random bytes, which can be used as a unique packet identifier if necessary.
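A minimal sketch of these fields follows. It assumes that "the lower 128 bits" means the last 16 bytes of the SHA-256 digest, and that random_id is stored as a signed 64-bit TL long; both are assumptions for illustration, and the function names are hypothetical.

```python
import hashlib
import os


def voice_call_id(key):
    """Derive voice_call_id from the shared key.

    Assumption: the lower 128 bits are the last 16 bytes of the digest.
    """
    return hashlib.sha256(key).digest()[-16:]


def new_random_fields():
    """Generate fresh random_id and random_bytes for one data block."""
    # 8 random bytes, interpreted here as a signed 64-bit integer (assumption).
    random_id = int.from_bytes(os.urandom(8), "little", signed=True)
    # The text requires at least 7 random bytes; this sketch uses exactly 7.
    random_bytes = os.urandom(7)
    return random_id, random_bytes
```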
Once the data is encapsulated in DecryptedDataBlock, it is TL-serialized and encrypted with MTProto, using key instead of auth_key; the parameter x is to be set to 0 for messages from A to B, and to 8 for messages in the opposite direction. Encrypted data are prepended by the 128-bit msg_key (usual for MTProto); before that, either the 128-bit voice_call_id (if P2P is used) or the peer_tag (if reflectors are used) is prepended. The resulting data packet is sent by UDP either directly to the other party (if P2P is possible) or to the Telegram relay servers (reflectors).
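The packet layout described above can be sketched as follows. The sketch does not perform MTProto encryption: msg_key and the encrypted payload are taken as inputs, and the function names frame_packet and direction_x are hypothetical.

```python
def direction_x(sender_is_caller):
    """Return the MTProto parameter x: 0 for A to B, 8 for B to A."""
    return 0 if sender_is_caller else 8


def frame_packet(msg_key, encrypted, voice_call_id=None, peer_tag=None):
    """Prepend the routing prefix and msg_key to already encrypted data.

    Exactly one of voice_call_id (P2P) or peer_tag (reflectors) must be given.
    """
    if len(msg_key) != 16:
        raise ValueError("msg_key must be 128 bits")
    if (voice_call_id is None) == (peer_tag is None):
        raise ValueError("pass exactly one of voice_call_id or peer_tag")
    if voice_call_id is not None and len(voice_call_id) != 16:
        raise ValueError("voice_call_id must be 128 bits")
    prefix = voice_call_id if voice_call_id is not None else peer_tag
    # Layout: prefix, then msg_key, then the encrypted data.
    return prefix + msg_key + encrypted
```

The resulting bytes would then be handed to a UDP socket; the choice between the two prefixes follows the transport actually in use.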
To verify the key, both parties concatenate the secret key key with the value g_a of the Caller (A), compute the SHA256 hash of the result, and use it to generate a sequence of emoticons. More precisely, the SHA256 hash is split into four 64-bit integers; each of them is divided by the total number of emoticons used (currently 333), and the remainder is used to select a specific emoticon. The specifics of the protocol guarantee that comparing four emoticons out of a set of 333 is sufficient to prevent eavesdropping (MiTM attack on DH) with a probability of 0.9999999999.
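The selection of the four emoticons can be sketched as below. The text does not specify how the 64-bit integers are read from the hash, so the big-endian, unsigned interpretation used here is an assumption, as is the function name; key and g_a are expected as byte strings in whatever encoding both clients agree on.

```python
import hashlib

EMOJI_COUNT = 333  # total number of emoticons, as stated in the text


def emoji_indices(key, g_a):
    """Return four indices into the emoticon table for key verification.

    Assumption: each 8-byte slice of the digest is read as an unsigned
    big-endian integer before taking the remainder.
    """
    digest = hashlib.sha256(key + g_a).digest()
    return [
        int.from_bytes(digest[i:i + 8], "big") % EMOJI_COUNT
        for i in range(0, 32, 8)
    ]
```

Both parties run the same computation on their own copies of key and g_a, and the users compare the four resulting emoticons.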
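This is because, instead of the standard Diffie-Hellman key exchange, which requires only two messages between the parties:
A -> B: (generates a and) sends g_a := g^a
B -> A: (generates b and the key (g_a)^b, then) sends g_b := g^b
we use a three-message modification thereof that works well when both parties are online (which also happens to be a requirement for voice calls):
A -> B: (generates a and) sends g_a_hash := hash(g^a)
B -> A: (stores g_a_hash, generates b and) sends g_b := g^b
A -> B: (computes the key (g_b)^a, then) sends g_a := g^a
B: checks that hash(g_a) == g_a_hash, then computes the key (g_a)^b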
The idea here is that A commits to a specific value of a (and of g_a) without disclosing it to B. B has to choose its value of b and g_b without knowing the true value of g_a, so that it cannot try different values of b to force the final key (g_a)^b to have any specific properties (such as fixed lower 32 bits of SHA256(key)). At this point, B commits to a specific value of g_b without knowing g_a. Then A has to send its value g_a; it cannot change it even though it knows g_b now, because the other party B would accept only a value of g_a that has a hash specified in the very first message of the exchange.
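The commitment idea can be illustrated with a toy run of the three-message exchange. The prime and generator below are small demonstration values chosen for this sketch, not the 2048-bit parameters used by the real protocol, and the helper names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Toy parameters for demonstration only (assumption, NOT the real DH config).
P = 2**127 - 1
G = 3


def to_bytes(n):
    """Encode a group element as 16 big-endian bytes (toy encoding)."""
    return n.to_bytes(16, "big")


def random_exponent():
    """Pick a random exponent strictly between 1 and P - 1."""
    return secrets.randbelow(P - 3) + 2


def commit(g_a):
    """Message 1 (A to B): the hash that commits A to its value of g_a."""
    return hashlib.sha256(to_bytes(g_a)).digest()


def callee_finish(g_a, g_a_hash, b):
    """B's last step: accept g_a only if it matches the earlier commitment."""
    if not hmac.compare_digest(commit(g_a), g_a_hash):
        raise ValueError("g_a does not match the committed hash")
    return pow(g_a, b, P)


# Message 1: A picks a and sends only the hash of g_a.
a = random_exponent()
g_a = pow(G, a, P)
g_a_hash = commit(g_a)
# Message 2: B picks b without knowing g_a and sends g_b.
b = random_exponent()
g_b = pow(G, b, P)
# Message 3: A computes the key and reveals g_a; B verifies and derives the key.
key_a = pow(g_b, a, P)
key_b = callee_finish(g_a, g_a_hash, b)
```

If A (or an impostor) tried to reveal a different g_a after seeing g_b, callee_finish would reject it, which is what removes the second guess.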
If some impostor is pretending to be either A or B and tries to perform a Man-in-the-Middle Attack on this Diffie-Hellman key exchange, the above still holds. Party A will generate a shared key with B (or whoever pretends to be B) without having a second chance to change its exponent a depending on the value g_b received from the other side; and the impostor will not have a chance to adapt its value of b depending on g_a, because it has to commit to a value of g_b before learning g_a. The same is valid for the key generation between the impostor and the party B.
The use of hash commitment in the DH exchange constrains the attacker to only one guess to generate the correct visualization in their attack, which means that using just over 33 bits of entropy represented by four emoji in the visualization is enough to make a successful attack highly improbable.
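The figures quoted above can be checked with a short calculation. It assumes the four emoticons are chosen independently and uniformly from the set of 333, which is an idealisation; the variable names are illustrative.

```python
import math

EMOJI_COUNT = 333  # size of the emoticon set, from the text
EMOJI_SHOWN = 4    # emoticons compared by the users

# Entropy of the visualization: 4 * log2(333), a little over 33 bits.
entropy_bits = EMOJI_SHOWN * math.log2(EMOJI_COUNT)

# With a single guess, the attacker matches all four with probability 1 / 333^4.
attack_success = 1 / EMOJI_COUNT ** EMOJI_SHOWN
attack_failure = 1 - attack_success

print(f"entropy: {entropy_bits:.2f} bits")
print(f"attack fails with probability {attack_failure:.12f}")
```

The single-guess limit is what makes this small amount of entropy sufficient: without the hash commitment, an attacker could try many exponents until the visualizations matched.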
For a slightly more user-friendly explanation of the above see: How are calls authenticated?