Introducing: Symphony’s Engineering Blog

Do mesh with me!

At Symphony, we provide secure and compliant collaboration as a service to financial services organizations. Founded in 2014 by a consortium of financial institutions, we inherited a monolithic, single-tenant architecture that has served us well so far. As our open platform solidifies its place in the fintech ecosystem, we continue to push the boundaries of interoperability, connect with multiple partners, and expand our portfolio with acquisitions. In executing this mission, our globally distributed engineering team faces the challenge of modernizing our core architecture while maintaining best-in-class services to power mission-critical applications and workflows.

In our first blog of this series, we will focus more precisely on how a microservices architecture can be choreographed to deliver overall product functions, allow data sharing, and power data analytics in an enterprise architecture. As software architects, we need to answer these questions:

  • How can we scope and organize services in a microservice architecture so that it scales to a large number of services and follows organizational evolution?
  • How can we ease interconnections across the services?
  • Which patterns will help provide operational excellence?

Recognizing that synchronous request/response patterns for communication between microservices can hamper the scalability of a service-oriented architecture, we sought other patterns for scaling our platform.

Enter… data mesh

To begin our architecting exercise, we needed to understand how to group data and functions into cohesive services to support our various product lines, anticipating the need for data sharing. We decided to follow the data mesh approach, a concept developed by ThoughtWorks’ Zhamak Dehghani, starting with Domain Driven Design (DDD).

With our first architecture designs, we quickly realized that we would likely have to deal with dozens of services to provide all the functionality of our product portfolio. As a result, the concept of bounded context proved to be critical. A bounded context identifies models that are interrelated because they appear together in related business needs. We can then identify touchpoints between contexts, which often surface as concepts with similar semantics in adjacent capabilities. Extending this reasoning from data models to functions and services, bounded contexts help group functions by context and utility: functions that manipulate the same data and interact frequently, functions that bridge data flows between contexts, and unrelated functions.

From this view (fully acknowledging Conway’s law), it becomes easier to attain organizational alignment even as our organization constantly evolves. It becomes easier to identify team ownership of a set of services and to make organizational decisions without major rewrites of our services. We also gain a better understanding of runtime coupling between services, and can start applying governance at the interconnection points.

A simplified, partial decomposition of our DDD model with three identified bounded contexts could be defined as follows for our secure and compliant collaboration services:

Three main bounded contexts stand out: collaboration, identity, and compliance. Concepts like “participant” and “public profile” are linked across the contexts. This model also allows us to highlight data sets from other contexts that are required for services. For instance, profile discovery and conversations services depend on compliance control rules to authorize access to public profiles or addition of participants to conversations.
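
To make this dependency concrete, here is a minimal sketch in Java of how a conversations service in the collaboration context might rely on compliance control rules exposed by the compliance context before adding a participant. The interface and type names are illustrative assumptions, not Symphony’s actual APIs.

```java
// Hypothetical types; names are illustrative, not Symphony's actual APIs.

// Owned by the compliance bounded context and exposed as a contract
// at the touchpoint between the two contexts.
interface ComplianceControls {
    boolean mayAddParticipant(String conversationId, String participantId);
}

// Owned by the collaboration bounded context.
final class ConversationsService {
    private final ComplianceControls compliance;

    ConversationsService(ComplianceControls compliance) {
        this.compliance = compliance;
    }

    void addParticipant(String conversationId, String participantId) {
        // The collaboration context never reimplements compliance rules;
        // it only depends on the contract at the context boundary.
        if (!compliance.mayAddParticipant(conversationId, participantId)) {
            throw new IllegalStateException("Participant addition blocked by compliance controls");
        }
        // ... persist the participant within the collaboration context ...
    }
}
```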

The more entities, services, and contexts are set up and running, the harder it becomes for development teams to identify and rely on existing assets.

Discoverability

In our architecture group, we initiated registries for all data entities, APIs, and service documentation to provide visibility and standardization throughout the organization. Here is the information we collect from our engineering teams at Symphony, along with example standards.

| Registry type | Contracts | Addressing | Metrics |
| --- | --- | --- | --- |
| Schema | Entity schema – JSON Schema, Protobuf, Avro | Link to APIs or topics | Duplication, latency to data update, cardinality, changes throughput |
| API | API specifications – OpenAPI, gRPC | Paths; link to service | Response time, throughput, uptime |
| Service | Not standardized | Domains, routes, clusters | Response time, throughput, uptime |

We enrich registries with automated or gated checks and documentation to apply governance across the deliverables from the different teams, provide best practices, ensure consistency, and enforce API or service scopes to maintain separation of concerns.
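
As a simplified sketch of such a gated check (the registry entry shape, field names, and scope rule here are illustrative assumptions, not our actual tooling), a gate might verify that every path declared in an API registry entry stays within the scope assigned to the owning service:

```java
import java.util.List;

// Hypothetical registry entry; field names are illustrative only.
record ApiRegistryEntry(String service, String scopePrefix, List<String> paths) {}

final class ApiScopeCheck {
    // Gate: flag any path that escapes the service's assigned scope.
    static List<String> violations(ApiRegistryEntry entry) {
        return entry.paths().stream()
                .filter(path -> !path.startsWith(entry.scopePrefix()))
                .toList();
    }

    public static void main(String[] args) {
        var entry = new ApiRegistryEntry(
                "conversations",
                "/collaboration/conversations",
                List.of("/collaboration/conversations/{id}/participants",
                        "/identity/profiles/{id}"));   // out of scope for this service
        System.out.println("Out-of-scope paths: " + violations(entry));
    }
}
```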

Service mesh

A data mesh provides an excellent framework for asynchronous communication, but there will still be a need for some synchronous interconnection between services. A service mesh pattern can address this need.
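
Since the service mesh lives in the infrastructure layer (typically as sidecar proxies), the application code for a synchronous call can stay deliberately plain. The sketch below uses hypothetical service names and assumes that retries, timeouts, mutual TLS, and load balancing are delegated to the mesh rather than hand-rolled in each client:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ProfileClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // Hypothetical synchronous call from the collaboration context to the
    // identity context. Resilience concerns (retries, timeouts, mTLS,
    // load balancing) are assumed to be handled by the service mesh sidecar.
    String fetchPublicProfile(String userId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://identity.internal/profiles/" + userId))
                .GET()
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```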

Event mesh

A particular slice of data mesh is about real-time data sharing, providing an excellent framework for an event-driven architecture. We refer to this subdivision as the event mesh.

A key aspect of implementing this pattern is the ability to incorporate data from external sources by building data caches, replicas, or materialized views within bounded contexts. We can achieve this without tight coupling, details of which will be presented in a future article in this series.
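
As a minimal sketch of that idea (assuming Kafka as the event broker; the topic, group, and type names are hypothetical), a service in the collaboration context could maintain a local materialized view of compliance control rules by consuming events instead of calling the compliance service synchronously:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Maintains a local, read-only view of compliance rules inside the
// collaboration bounded context, updated from events.
final class ComplianceRuleView {
    private final Map<String, String> rulesByPolicyId = new ConcurrentHashMap<>();

    void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption
        props.put("group.id", "collaboration-compliance-view");    // hypothetical
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("compliance.control-rules"));   // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Upsert the latest rule payload, keyed by policy id.
                    rulesByPolicyId.put(record.key(), record.value());
                }
            }
        }
    }

    String ruleFor(String policyId) {
        return rulesByPolicyId.get(policyId);
    }
}
```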

Tradeoffs in mesh approaches

At Symphony, as in most IT or SaaS companies, one of our primary goals is to maintain the highest service levels possible for mission-critical applications. We use the following metrics to measure progress:

  • Uptime SLA / SLO: percentage of time the service is able to process requests successfully over a one-month or one-year period
  • Mean Time to Recovery (MTTR): the mean time required to bring the service back into a stable state after an incident
  • Incident severity distribution: a score measuring the severity of an incident’s impact on organizations or individuals
  • Latency: average, p90, and p99 time required to process a service request (see the sketch after this list)
  • Throughput: number of requests a functioning service can process per unit of time
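
For reference, here is a minimal sketch of how p90 and p99 latency could be computed from a batch of request timings; the nearest-rank method and the sampling window are illustrative assumptions rather than our production implementation:

```java
import java.util.Arrays;

final class LatencyPercentiles {
    // Nearest-rank percentile over a batch of request latencies (milliseconds).
    static long percentile(long[] latenciesMs, double percentile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 14, 200, 13, 16, 15, 14, 900, 13};
        System.out.println("p90 = " + percentile(samples, 90) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```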

Here is a comparison between service mesh and event mesh and their impact on these metrics. Each assessment is relative to a do-nothing approach.

| Metric | Service Mesh | Event Mesh |
| --- | --- | --- |
| Uptime SLA | Increased through standardizing exception-handling flows | Greatly increased through contained blast radius |
| MTTR | Reduced through standardizing exception-handling flows | Greatly reduced through contained blast radius |
| Incident severity | Reduced through standardizing exception-handling flows | Greatly reduced through contained blast radius |
| Latency | Significant variance during incidents, low end-to-end latency | Lesser variance, longer end-to-end latency |
| Throughput | Capped by throughput of slowest service in end-to-end flow | Driven by front APIs throughput only |
| Key benefit | Minimize end-to-end latency | Full isolation |

Conclusion

Through an example and a simplified model, we have introduced the use of DDD and shown how it extends to structuring microservices and governing their interactions. We have highlighted the impact of two service communication patterns, service mesh and event mesh, on key operational metrics. Event mesh has the most significant positive impact, but not all communications can be asynchronous. It therefore makes sense to favor event mesh whenever possible, and otherwise rely on service mesh.

In our next blog, we will explore the positioning of event modeling in the software development lifecycle, and detail our practical approach to building a collaborative and governed event catalog.