Introducing Symphony’s Media Bridge

April 25, 2023
Victor Klykov
Olof Kallander
Tech4Fin

Symphony started in 2014 as a messaging platform built for data security and compliance; today it is a leading common connector for market workflows. Among other things, Symphony provides more than half a million users with secure teleconferencing services. At the core of these services lies a piece of software known as a Selective Forwarding Unit (SFU). We call ours the Symphony Media Bridge (SMB).

The purpose of SMB is to provide media “routing” between participants in video conferences. It saves bandwidth and CPU for the clients by handling the forwarding of media to many destinations. SMB orchestrates media quality by selecting the right media sources, at the right quality, at the right time, depending on the capabilities of the clients. SMB is therefore a key component of the media experience and critical to performance.

What is SMB?

Skip this section if you are already familiar with MCU (Multipoint Control Unit) and SFU (Selective Forwarding Unit). If not, let’s recap the basics:

There are three competing video-conferencing architectures:

p2p Mesh (peer-to-peer)
MCU (Multipoint Control Unit)
SFU (Selective Forwarding Unit)

The first one, p2p Mesh, is serverless: each participant sends media to, and receives media from, every other participant. The latter two (MCU and SFU) are server-based: peers send their media to a server. That allows centralized implementation of compliance policies and recording, and it facilitates scalability.

The primary reason for using a server-based architecture (SFU/MCU) rather than the serverless p2p mesh is that the latter scales poorly. In a p2p mesh network the number of connections grows quadratically with the number of participants, drastically increasing the amount of traffic handled by each client and severely limiting the maximum capacity of the video conference.
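To make the quadratic growth concrete, here is a small illustrative calculation. The function names are hypothetical and not part of SMB; the sketch only counts connections, assuming a full mesh where every pair of participants is connected directly, versus one connection per participant to a central server.

```python
def mesh_connections(participants: int) -> int:
    """Distinct peer-to-peer connections in a full mesh: n * (n - 1) / 2."""
    return participants * (participants - 1) // 2


def server_connections(participants: int) -> int:
    """With an MCU or SFU, each participant connects to the server only."""
    return participants


if __name__ == "__main__":
    # Compare how both architectures grow with conference size.
    for n in (2, 5, 10, 50):
        print(f"{n} participants: mesh={mesh_connections(n)}, "
              f"server-based={server_connections(n)}")
```

In the mesh case each client additionally has to encode and upload its media to every other peer, which is why client bandwidth and CPU become the limiting factor long before the conference gets large.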

MCU architecture assumes that the server will receive one video stream from each participant, decode these streams frame by frame, arrange the resulting images into a new single image, e.g. in a rectangular grid, and finally produce a single output video stream (see Fig. 1). Thus each participant sends one and receives one video stream. In theory, that should mean better scalability, but in practice, assembling and re-encoding composite video, possibly different for each participant, puts a lot of computational stress on the MCU server and thus limits scalability.

Advantages of the MCU architecture include lower bandwidth and CPU stress on the clients.

Fig.1 MCU in a nutshell: receive video from all participants, decrypt, decode, compose, re-encode, encrypt and send back to the participants.

In the SFU case, each participant sends the same video as several independent Real-time Transport Protocol (RTP) streams, each encoded at a different quality/bitrate (simulcast). The SFU then selectively combines these streams into a multiplex, providing each participant with its own set of received videos matching that participant’s available decoding capacity and the bandwidth of the link from the SFU.

Fig.2 SFU in a nutshell: receive simulcast from each participant, selectively forward to each endpoint only the necessary streams.

This means that SFU does not need to compose and re-encode incoming videos. That has several crucial implications for secure video-conferencing:

CPU load is greatly reduced for two reasons:

Firstly, SFU bypasses computationally expensive video re-encoding of several composite video streams;
Secondly, the SFU can afford to be smart: it detects which limited subset of the received video streams actually needs to be forwarded, focuses all work on those, and discards all other streams, bypassing even the decryption stage.

True end-to-end encryption can be implemented because forwarded media is undisturbed by SFU (see previous point: SFU does not need to decode video to recompose and re-encode it, thus it does not need to decrypt it, and thus it does not need to know the key).

Fig.3 Symphony end-to-end encryption (plus hop-by-hop WebRTC encryption)

SMB is Symphony’s in-house implementation of an SFU server for secure video-conferencing.

Why SMB?

Now that we know the advantages that SFU architecture brings to secure video-conferencing, let’s look at another question: why implement an in-house Symphony Media Bridge?

There are many qualitative and quantitative aspects of SFU implementations to consider when selecting the best one for your needs, but the main ones are:

Is it secure and robust?
Is it scalable?
Is it performant?
Is the selection of audio sources and video sources suitable to provide a pleasant user experience in a meeting setting?
What particular unique features make it stand out from the crowd?

After testing existing solutions, we determined that we couldn’t tick all of the necessary boxes unless we built it ourselves, tailored to our needs, but with the ambition that it would be useful for many others.

Security

Security in real-time communications (RTC) usually revolves around the selection of encryption algorithms and key handling, but further concerns arise if you build your product on top of other open source libraries, which is often the case. If these libraries are flawed or need to be modified, you must have the resources and ability to modify and fix them, and to maintain those fixes. If any library contains security flaws, your product will have to be upgraded or contained as soon as possible, and there is a risk that security issues will not be fixed as quickly as the security profile of your product requires.

This is one of the key reasons we decided to create an SFU with very limited use of open source libraries. It reduces the frequency of security issues, and with a smaller code base, issues are less likely to occur and easier to fix. There are still a few dependencies on OpenSSL, Secure Real-time Transport Protocol (SRTP), etc., that we monitor, but overall we see fewer reports than observed in other solutions.

Another feature we wanted to allow in SMB is double encryption. This means the client encrypts the media payload with a separate key before it is sent through SRTP, where it is encrypted again with a hop-by-hop key. Double encryption means the media cannot be eavesdropped on at any server and can only be finally decrypted at the receiving clients. Some of the VP8 video header must remain single encrypted to allow forwarding; otherwise, the video content is not accessible to anyone along the media path.

One of our goals in publishing SMB as open source was to provide transparency into what SMB does. That adds trust.

SMB uses RTP demultiplexing, allowing a few ports on SMB to serve many clients and conferences. This in turn makes it easier to maintain firewall rules for allowed traffic.

Due to the independent nature of the SMB, it is also easy to run it in a protected VPC network for additional security.

Finally, SMB uses strong encryption and authentication algorithms for the media. A minimum of AES-256 is used for payload encryption, and RSA-2048 is used for signing and authentication.

Scalability

Scaling up an SFU is primarily about increasing the total number of users that can connect and the number of users that can connect to each meeting. The main method of achieving this is to reduce the number of packets the SFU actually needs to receive, decrypt and forward. Another aspect is to allow a client to participate in a meeting with 1000 participants without breaking down the call setup or causing millions of call setup updates as a thousand people join the meeting.

The SFU also has to be smart with regard to bandwidth. If there is insufficient bandwidth, forwarding more streams to a client will not succeed. The SFU may adjust and pick lower bit rate streams to forward if the client cannot display the streams in high resolution.
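The idea of picking a lower bit rate stream can be sketched as follows. This is a minimal illustration, not SMB’s actual selection logic: the layer names, the bitrates and the `pick_layer` function are all assumptions made for the example, and the bandwidth budget is simply taken as given.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SimulcastLayer:
    name: str
    bitrate_kbps: int


# Hypothetical simulcast layers sent by one participant (values are made up).
LAYERS = [
    SimulcastLayer("low", 150),
    SimulcastLayer("mid", 500),
    SimulcastLayer("high", 1500),
]


def pick_layer(layers: List[SimulcastLayer],
               budget_kbps: int) -> Optional[SimulcastLayer]:
    """Return the highest-bitrate layer that fits the budget, or None."""
    fitting = [layer for layer in layers if layer.bitrate_kbps <= budget_kbps]
    if not fitting:
        return None  # not even the lowest layer fits; forward no video
    return max(fitting, key=lambda layer: layer.bitrate_kbps)
```

For example, with a budget of 600 kbps this sketch picks the "mid" layer, and with 100 kbps it forwards no video at all for that source.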

Once the capacity of a single SFU has been exhausted, the only way to continue scaling out is to involve multiple SFUs in the same meeting, by connecting them in a mesh, aka barbelling. While a single SFU forwards a set of streams to each participant, it only has to forward that set once to another SFU to allow that second SFU to forward it again to all its participants.

There are three particular problems that one faces when trying to increase the number of participants in a video conference:

SSRC mapping;
Bandwidth constraints;
Multi-location (multi-SFU) conferencing.

Let’s describe them one by one before showing how we managed to solve them all together.

To provide some context, we’ll start with an overview of what each endpoint receives. The SFU sends each endpoint a collection of audio and video RTP streams. Each media stream is labeled with an identifier known as the synchronization source (SSRC). An SSRC, in a nutshell, is a simple numeric label, present in various RTP headers, that allows the receiver to understand “who this data is from” within the multiplex of RTP streams it receives. A client uses the SSRC whenever it needs to identify the media source, e.g. when requesting that the SFU resend some packets after packet loss is detected, or for UI purposes.

Thus, adding a new participant to an existing conference requires an extensive exchange of signaling information in the form of SDP (Session Description Protocol) data between the server and each endpoint. SDP, among other things, carries the mapping of media channels to SSRCs. As the number of participants grows, the SDP grows linearly as well, and it needs to be sent to each endpoint, which makes the SDP traffic grow quadratically. SMB solves this problem by reusing a limited number of SSRCs and supplementing SDP with SSRC mapping messages in a custom protocol. We call this technique SSRC-rewriting.
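The bookkeeping behind SSRC-rewriting can be illustrated with a short sketch. The `SsrcRewriter` class below is hypothetical and is not SMB’s implementation or its custom protocol; it only assumes that a fixed pool of outbound SSRCs was announced once in SDP and that inbound streams are bound to pool entries on demand. It also ignores everything else a real rewriter has to keep consistent in the forwarded RTP stream.

```python
from typing import Dict, List, Optional


class SsrcRewriter:
    """Maps sender SSRCs onto a fixed pool of pre-announced outbound SSRCs."""

    def __init__(self, pool: List[int]) -> None:
        self._free: List[int] = list(pool)      # outbound SSRCs not in use
        self._out_by_in: Dict[int, int] = {}    # inbound SSRC -> outbound SSRC

    def bind(self, inbound_ssrc: int) -> int:
        """Bind an inbound stream to a pooled SSRC and return that SSRC.

        The (outbound, inbound) pair is what a mapping message would announce.
        """
        if inbound_ssrc in self._out_by_in:
            return self._out_by_in[inbound_ssrc]
        if not self._free:
            raise RuntimeError("pool exhausted; release a stream first")
        outbound = self._free.pop(0)
        self._out_by_in[inbound_ssrc] = outbound
        return outbound

    def release(self, inbound_ssrc: int) -> None:
        """Return the pooled SSRC when a stream is no longer forwarded."""
        outbound = self._out_by_in.pop(inbound_ssrc, None)
        if outbound is not None:
            self._free.append(outbound)

    def rewrite(self, inbound_ssrc: int) -> Optional[int]:
        """SSRC to stamp on a forwarded packet, or None if not forwarded."""
        return self._out_by_in.get(inbound_ssrc)
```

Because the pool is fixed, the SDP sent to endpoints never has to change when participants join or leave; only the small mapping messages do.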

The second problem arises from the fact that an endpoint’s resources, such as bandwidth and CPU, are limited. It can’t receive and decode video from all other participants. Moreover, it makes little sense to do so, since “real estate” in the UI is limited as well: there is no practical use in displaying thousands of thumbnail-sized videos when in fact only a few people (usually one) are talking or presenting (e.g. slides). SMB’s approach is to forward at most some limited number ‘N’ (e.g., 9) of videos to each participant, depending on who is presenting, the client’s bandwidth, who the active speaker is, and other factors.

The last problem arises because the server’s resources are limited as well: even with all the optimizations implemented, there is still a clear limit on how many participants an SFU server with limited CPU and memory can serve. SMB provides an API to build a mesh of SMB servers to solve this. In the simplest configuration, two SMB servers work together in a topology that looks like a “barbell”, where the “bar” is the connection between the two SMBs. Thus we call this API barbelling.

Fig.4 SMB Barbelling.

In a typical video conference with a hundred (or even thousands of) users, there is a practical need to forward only a few quality media streams from those who are presenting: e.g., participants showing slides or talking, or several speakers in a discussion. Let’s call these presenters Active Talkers, and the one currently speaking or presenting slides the Dominant Speaker. SMB always sends video from the Dominant Speaker and several Active Talkers, which together form a group of automatically forwarded videos. The size of this group is relatively small, to fit the client UI, and can be configured per conference. When a user stops speaking, they will in due time be dropped from the Active Talkers group and replaced by a new, automatically detected active speaker. In addition to this group of automatically forwarded videos that the SFU maintains by itself, each user might want to see video from one particular endpoint in high quality. We call this video pinned. This means that each endpoint will always receive at most a known number ‘N’ of videos: some automatically forwarded, plus, possibly, a pinned one.
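The way these groups combine into a bounded set can be sketched in a few lines. The function below is purely illustrative: its name, its parameters and in particular the priority order (pinned first, then the Dominant Speaker, then the remaining Active Talkers) are assumptions made for the example, not a description of SMB’s actual algorithm.

```python
from typing import List, Optional


def select_forwarded(dominant: Optional[str],
                     active_talkers: List[str],
                     pinned: Optional[str],
                     receiver: str,
                     max_videos: int) -> List[str]:
    """Return the endpoint ids whose video is forwarded to `receiver`.

    At most `max_videos` entries are returned, and a receiver never
    gets its own video back.
    """
    # Assumed priority order: pinned, dominant speaker, other active talkers.
    candidates: List[str] = []
    if pinned is not None:
        candidates.append(pinned)
    if dominant is not None:
        candidates.append(dominant)
    candidates.extend(active_talkers)

    selected: List[str] = []
    for endpoint in candidates:
        if len(selected) >= max_videos:
            break
        if endpoint == receiver or endpoint in selected:
            continue  # skip the receiver itself and duplicates
        selected.append(endpoint)
    return selected
```

Whatever the real priority rules are, the important property is the hard upper bound: the result never exceeds `max_videos`, no matter how many participants are in the conference.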

The Symphony Media Bridge selectively forwards at most a known, limited number ‘N’ of video streams, which allows us to tackle all three problems together:

All endpoints receive the same static SDP from the SFU with pre-allocated SSRCs, thus eliminating the problem of signaling traffic growth. The mapping between the RTP streams’ SSRCs and endpoints is provided separately via custom SSRC mapping messages that the SFU sends to the endpoints whenever the need arises: e.g., when the active speaker changes and the previous speaker’s video no longer needs to be forwarded, or when a user drops out of the conference. This is achieved by the SFU rewriting the SSRCs of the received video streams before forwarding them to the endpoints.
Each endpoint receives at most a predefined number ‘N’ of video streams, and the SFU selects the forwarded streams based on endpoint uplink capacity, which is carefully estimated by a custom high-efficiency bandwidth estimation algorithm.
SMB implements an efficient stream director algorithm, which allows separate SMB servers to quickly converge on a decision about which subset of endpoints to forward video from. SMBs that are connected into a barbelling mesh send each other SSRC mapping messages when needed, and forward only those video streams that are needed on a remote SMB server.
All other video streams that are not among the ‘N’ to be forwarded are ignored even before decryption, thus saving a lot of resources.

Summarizing: thanks to the barbelling API and SSRC-rewriting, Symphony Media Bridge servers can be connected into an SFU mesh, greatly increasing scalability potential.

Performance

An SFU selects and forwards data from one endpoint to another. The SFU server performs many repetitive operations related to encryption and network IO. Most of them are independent and operate on separate contexts, and are thus suitable for parallel processing. So, to maximize performance, it’s imperative to:

Avoid memory operations (real-time memory allocations, copying, etc.);
Avoid unnecessary CPU-heavy operations (encryption/decryption);
Use wait-free/lock-free concurrent data structures;
Use efficient multitasking architecture.

In addition to all the algorithmic improvements, extra performance can be squeezed out by writing the codebase in a “proper” language. The SMB codebase is written in C++17, which provides a significant performance advantage over something like Java when it comes to CPU-heavy tasks. The disadvantage of C++ is its somewhat limited standard library, which is why we had to implement wait-free concurrent data structures ourselves.

Let’s shed some light on the aforementioned aspects of optimization.

SMB has real-time critical components (running on worker threads) and non-real-time critical components (running on a manager thread) which “talk” to each other. All memory structures are preallocated on the manager thread, thus reducing the number of potential waits that the worker threads might incur due to system calls.

The SMB implementation can be considered memory-heavy, due to pre-allocations and heavy use of wait-free/lock-free concurrent data structures by components running on real-time critical worker threads. Among such structures, we use:

MPMC (Multiple Producer – Multiple Consumer) HashMap;
MPMC Publisher-Subscriber;
MPMC Queue;
MPSC Queue (Multiple Producer – Single Consumer), e.g. for logging (consumer) from multiple threads.

All these structures allow us to “trade” memory for CPU cycles, by guaranteeing concurrent threads access to resources preallocated in advance instead of making them wait for one to be freed.

Memory copying is first of all reduced to the socket syscalls for reading and writing, by passing a reference to the data packet instead of copying its content. Furthermore, as soon as the SSRC is obtained, the SFU checks whether the packet is going to be used or not. If the packet belongs to a video stream that is not going to be forwarded to any endpoint, it is dropped immediately, saving CPU cycles on unnecessary decryption.
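The early-drop check is possible because the SSRC sits at a fixed offset in the RTP header (bytes 8 to 11 of the fixed header defined in RFC 3550), and SRTP leaves that header unencrypted. The sketch below illustrates the idea in Python for brevity; the function names and the `forwarded_ssrcs` set are hypothetical, and SMB itself is written in C++.

```python
import struct
from typing import Optional, Set

RTP_FIXED_HEADER_LEN = 12  # bytes, per RFC 3550
RTP_VERSION = 2


def peek_ssrc(packet: bytes) -> Optional[int]:
    """Read the SSRC from the fixed RTP header without touching the payload."""
    if len(packet) < RTP_FIXED_HEADER_LEN:
        return None  # too short to be an RTP packet
    if packet[0] >> 6 != RTP_VERSION:
        return None  # top two bits of the first byte carry the version
    # SSRC is a 32-bit big-endian integer at byte offset 8.
    return struct.unpack_from("!I", packet, 8)[0]


def should_process(packet: bytes, forwarded_ssrcs: Set[int]) -> bool:
    """True only if the packet belongs to a stream that will be forwarded.

    Packets for which this returns False can be dropped before any
    decryption work is spent on them.
    """
    ssrc = peek_ssrc(packet)
    return ssrc is not None and ssrc in forwarded_ssrcs
```

The check costs one length test, one shift and one set lookup per packet, which is negligible compared with authenticating and decrypting a payload that would be thrown away anyway.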

Our latest load tests indicated that the best configuration, with an even load across CPU cores, is 1 manager thread and 7 workers on an 8-core, 8 GB Intel-based server. Most of the delays happen in syscalls to sockets, so we expect that further improvement can be achieved by switching to the io_uring API instead of epoll and the socket API, though the expected improvements are marginal at best.

Features

This introduction to the Symphony Media Bridge would not be complete without listing some pivotal features that help it stand out from the competition:

Hybrid meetings with both SFU and MCU participants;
Very large meeting support without reconfiguring clients repeatedly;
Accurate, efficient and reliable bandwidth estimation for client uplinks to serve people working on limited networks;
Fairly clever client downlink bandwidth estimation that is used to control meeting video and screen share quality enabling people at lower bandwidth to participate in the meeting;
Highly efficient internal load balancing using a few wait-free concurrent data structures;
Neighbor support where participants in the same acoustic group will not hear each other which avoids acoustic double talk;
SFU clustering with inter-meeting connections to create larger meetings, region based meetings, lower latency on average.

Where to find SMB?

We implemented SMB as an open source project on GitHub to ensure transparency and leverage community support. Currently SMB is being used by Symphony, Cloud9 and Eyevinn Technology for various purposes. Symphony Meetings has a specific need for audio and video calls in addition to screen sharing in medium sized meetings. Cloud9 uses it in trader voice scenarios with thousands of participants in the same communication channel. Eyevinn uses it to test live video feeds to a huge number of destinations.

Please find the code here: https://github.com/finos/SymphonyMediaBridge