Carrying over history to the mesh

In our journey building an event mesh, we face the challenge of reconciling with Symphony’s engineering history. Specifically, we are transitioning from a single-tenant, monolithic architecture to a multi-tenant, microservice architecture. And critically, due to the nature of our business, this shift must be both progressive and smooth.

Ultimately, we want to enable our microservices to access the data events originating from our existing primary backend service, without being dependent on the evolution of that backend service. Additionally, the pre-existing events have evolved in isolation over the years, encompassing content for fairly orthogonal concerns and placing ever greater expectations on producers to provide context that is not strictly required. Our objective was to rework the schemas of our existing events within the framework established in our newly formed events catalog. So we evaluated the different patterns we could adopt to carry that baggage over and reduce its weight.

Architectural context

The diagram below represents the basics of Symphony's asynchronous communications architecture. Our backend service is deployed on the AWS and GCP public clouds. Depending on the cloud provider, it relies on either SNS/SQS or Pub/Sub for asynchronous communications, through a custom-built abstraction.

We are introducing our event mesh on top of Kafka, running in Confluent Cloud managed clusters, and want to rely on Schema Registry to enforce adherence to mutually agreed-upon event structures.

Option #1: Using Confluent-supported connectors

One of the reasons we chose to build our event mesh on top of the Confluent Kafka platform is that it supports interconnection with a variety of services through supported Kafka connectors, available through Confluent Hub.

There are two source connectors of interest to consider in our scenario: Amazon SQS Source Connector and Google Cloud Pub/Sub Source Connector.

This approach provides an easy-to-follow pattern when there is a limited number of SQS queues or Pub/Sub subscriptions from which to ingest events. Otherwise, operating hundreds of connector instances on top of Kafka Connect would come with its own set of challenges around observability and connector error resolution. It also fits well when only small changes need to be applied to the event structure, which can be handled with Single Message Transforms (SMTs).
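As an illustration, a single connector instance with a lightweight transform could be declared roughly as follows. The transforms.* keys are standard Kafka Connect SMT configuration; the connector-specific keys (connector.class, kafka.topic, sqs.url) are indicative only and should be taken from the connector's documentation.

    {
      "name": "legacy-sqs-source",
      "config": {
        "connector.class": "io.confluent.connect.sqs.source.SqsSourceConnector",
        "tasks.max": "1",
        "kafka.topic": "legacy.events.raw",
        "sqs.url": "https://sqs.us-east-1.amazonaws.com/<account-id>/legacy-events-queue",
        "transforms": "tagOrigin",
        "transforms.tagOrigin.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.tagOrigin.static.field": "origin",
        "transforms.tagOrigin.static.value": "legacy-backend"
      }
    }

Each additional queue or subscription would require another such declaration, which is precisely where the operational burden builds up.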

Our objective, however, is to reshape the events wholly, and the required transformations are too complex to fit this model. For instance, existing event streams require decrypting events with per-tenant encryption keys before applying advanced mapping logic. Additionally, while we could set up a single SQS queue to subscribe to and aggregate many SNS topics, Pub/Sub subscriptions are per topic, leading to a potentially high cardinality of Pub/Sub connector instances. This pattern therefore did not work for us.

Option #2: Updating producers to maintain dual event streams

An alternative we considered was to update producers to maintain dual streams: the existing one, and a new one that would feed the event mesh. This would have the benefit of paving the way for the evolution of the producing service, particularly considering that the legacy event stream will be dropped in the not-so-distant future.

However, this approach had two major drawbacks. First, it falls into the dual-write problem: since two independent actions are triggered when an event occurs, the second one may fail while the first one succeeds, with no way to roll back that first action. This would introduce inconsistencies, and compensating for failed events would add complexity. Moving entirely to a transactional pattern would also be costly in development effort and would diminish the system's stability and scalability. Second, as Symphony has hundreds of servers running single-tenant publishers, this method would exceed the limits on parallel connections to our Kafka clusters.
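To make the first drawback concrete, here is a deliberately naive sketch, with hypothetical types, of two independent publishes triggered by the same domain event. If the second call fails after the first has succeeded, the legacy stream and the mesh diverge and there is nothing to roll back.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical sketch of the dual-write problem: the same domain event triggers
    // two independent, non-transactional publishes.
    public class DualWritePublisher {

        // Placeholder for the existing SNS / Pub/Sub publishing path.
        public interface LegacyPublisher {
            void publish(String tenantId, byte[] payload);
        }

        private final LegacyPublisher legacy;
        private final KafkaProducer<String, byte[]> mesh;

        public DualWritePublisher(LegacyPublisher legacy, KafkaProducer<String, byte[]> mesh) {
            this.legacy = legacy;
            this.mesh = mesh;
        }

        public void onDomainEvent(String tenantId, byte[] payload) {
            legacy.publish(tenantId, payload);                                  // write #1 succeeds...
            mesh.send(new ProducerRecord<>("mesh.events", tenantId, payload));  // ...write #2 can fail,
            // leaving the legacy stream and the mesh inconsistent, with no way to undo write #1.
        }
    }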

Option #3: Building our own connector service

The final approach we considered, and ultimately decided upon, was to build our own connector service. We traditionally use Java for backend development, and could therefore benefit from all the libraries and tools that come with Spring, including Spring Cloud, and in particular the SQS support in Spring Cloud AWS and the Pub/Sub support in Spring Cloud GCP. These libraries make it easy to spin up a microservice that consumes from both SQS and Pub/Sub. Such an independent microservice is stateless and can scale on demand as a standard Kubernetes pod.
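As a rough sketch of the SQS side, assuming a recent Spring Cloud AWS version (where the @SqsListener annotation lives under io.awspring.cloud), and with an illustrative queue name and forwarding component:

    import io.awspring.cloud.sqs.annotation.SqsListener;
    import org.springframework.stereotype.Component;

    // Minimal sketch of an SQS consumer built with Spring Cloud AWS. The queue name
    // and the EventMeshForwarder component are illustrative placeholders.
    @Component
    public class LegacySqsConsumer {

        // Hypothetical component holding the mapping and Kafka publishing logic.
        public interface EventMeshForwarder {
            void mapAndForward(String payload);
        }

        private final EventMeshForwarder forwarder;

        public LegacySqsConsumer(EventMeshForwarder forwarder) {
            this.forwarder = forwarder;
        }

        // Spring Cloud AWS polls the queue and dispatches each message body here.
        @SqsListener("legacy-events-queue")
        public void onMessage(String payload) {
            forwarder.mapAndForward(payload);
        }
    }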

For SQS, AWS already supports subscribing a single queue to multiple SNS topics without any documented restriction, so we can aggregate all the events from all the tenants into a single queue for mapping and forwarding to the event mesh. For Pub/Sub, the support for reactive stream subscribers makes it possible to pipe the consumption from multiple subscriptions onto a single bounded thread pool, minimizing the compute resources required.
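A sketch of that idea using Spring Cloud GCP's reactive support might look like the following; the package names match recent Spring Cloud GCP releases, and the subscription list, polling period, and pool sizes are assumptions.

    import com.google.cloud.spring.pubsub.reactive.PubSubReactiveFactory;
    import com.google.cloud.spring.pubsub.support.AcknowledgeablePubsubMessage;
    import reactor.core.Disposable;
    import reactor.core.publisher.Flux;
    import reactor.core.scheduler.Scheduler;
    import reactor.core.scheduler.Schedulers;

    import java.util.List;

    // Minimal sketch: merge several Pub/Sub subscriptions into one reactive pipeline
    // that is processed on a single bounded thread pool.
    public class PubSubIngestion {

        public Disposable ingest(PubSubReactiveFactory reactiveFactory, List<String> subscriptions) {
            // Bounded pool shared across all subscriptions (sizes are illustrative).
            Scheduler scheduler = Schedulers.newBoundedElastic(4, 10_000, "pubsub-ingest");

            return Flux.fromIterable(subscriptions)
                    // One polling Flux per subscription, merged into a single stream.
                    .flatMap(subscription -> reactiveFactory.poll(subscription, 500))
                    .publishOn(scheduler)
                    .subscribe(message -> {
                        mapAndForward(message);   // placeholder for conversion + Kafka publishing
                        message.ack();            // acknowledge once the message has been handled
                    });
        }

        private void mapAndForward(AcknowledgeablePubsubMessage message) {
            // Hypothetical mapping and event mesh publishing logic.
        }
    }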

For the mapping logic, we looked into MapStruct as an existing mapping library. However, our mapping logic was too complex to fit nicely into its formalism, and likely too complex for other similar libraries as well. We therefore proceeded with building dedicated conversion logic. We ultimately relied on classes generated from the agreed-upon event schemas, and serialized them with the KafkaProtobufSerializer to ensure there is no deviation from the currently registered schema.
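On the serialization side, the relevant detail is the serializer configuration: pointing the KafkaProtobufSerializer at Schema Registry and preventing the producer from registering new schema versions, so that the mesh only ever sees events matching what is already in the registry. A minimal sketch, with placeholder endpoints:

    import io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of a producer publishing Protobuf-generated event classes to the
    // mesh while relying on the schema already registered in Schema Registry.
    public class MeshProducerFactory {

        public KafkaProducer<String, com.google.protobuf.Message> buildProducer() {
            Map<String, Object> props = new HashMap<>();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<confluent-cloud-bootstrap>"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaProtobufSerializer.class);
            props.put("schema.registry.url", "<schema-registry-url>");                         // placeholder
            // Do not register new schema versions from the producer; rely on the
            // version already present in the registry.
            props.put("auto.register.schemas", false);
            props.put("use.latest.version", true);
            return new KafkaProducer<>(props);
        }
    }

The following diagram recaps the intended architecture.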

In implementing this pattern, we meet our original goal of bringing events originating from our existing deployments into our event mesh, without compromising on our aim to improve our data streams. A direct benefit of this pattern is that it allows us to build and evolve microservices without depending on the evolution of a legacy backend undergoing significant transformation.
