Systemicity

Using BAPO and DDD to design scalable product orgs

Andrew Nicholson — Wed, 29 Jan 2025 19:30:33 GMT

Product companies that want to scale quickly often adopt a distributed organizational model - small, mostly autonomous, vertically enabled, product oriented teams. These teams can build features and entire products mostly independently, which supports organizational growth through horizontal scaling. Ideally, these teams are aligned with business goals, and have some level of technical alignment.

This organizational model does indeed allow companies to grow quickly. However, this growth comes at a cost. Paying Conway’s Law the respect it deserves, distributed organizations ship distributed systems. Distributed systems are complex. Understanding the sources of complexity in distributed systems and the strategies for managing that complexity will be a differentiator for tech companies that want to grow fast while maintaining long-term sustainability. Complexity kills, simplicity survives.

The primary source of complexity in distributed systems is managing distributed data. The CAP Theorem sums up the core technical tradeoff - in systems where data is located and mutated in multiple locations, consistency comes at a cost to availability. That’s a tough pill for stakeholders to swallow - both consistency and availability (or more generally, performance) are desirable. Complexity emerges in the tension between those desires, along with the competing desire for teams to operate independently and fast, within the technical constraints inherent to distributed systems and the organizational constraints implied by Conway’s Law.

Software businesses all compete in this space, and each applies a combination of frameworks, philosophies, and intuitions. Agile, Scrum, and Lean seem to dominate the current thinking around process management. Microservices architectures seem to be winning tech leaders over with their promise of unlocking organizational growth through horizontal scaling. DevSecOps, product triads, and autonomous teams are other angles on the problem of how to deliver business value through software. All of these frameworks have a common focus - how to enable independent, fast value delivery streams so that you can grow your organization to meet your business needs without incurring massive overhead and complexity from inter-team dependencies.

But none of these frameworks tells us what to do about Conway’s Law, which in my opinion is a massive shortcoming. I contend that many of the core problems facing software businesses today, whether they can name them or not, are related to a mismatch between system architecture and business needs. Conway’s Law informs us of the link between the organization (specifically its communication structure) and the systems it builds. If companies only look at the organizational structure and processes and try to optimize those for delivering value, they aren’t paying attention to the assets that are the value of the business - the software systems that are the product these software businesses build.

Software as a product isn’t like shampoo. If your shampoo business’s leadership team decides they need to be certified vegan to align with their business strategy, they can change their formula pretty much overnight. When you recognize that complexity in your software architecture has metastasized to the point where it’s consuming more energy from the business than your actual business goals, you can’t just change the formula overnight - you have a very long and painful project ahead.

Software leaders should apply a framework that holistically co-designs organization with architecture. Jan Bosch’s Business-Architecture-Process-Organization (BAPO) model respects the interdependence of these features of software businesses and tells us which direction to approach from. Domain Driven Design (DDD), Event Driven Architecture (EDA), Command/Query Responsibility Segregation (CQRS), and Consistency Models offer technical solutions to the hard problems of distributed systems that should feed back into organizational design considerations. In this article, I will present these together in such a way that the technical complexity inherent to distributed systems can inform the sociotechnical tradeoffs technology leaders must make when designing distributed organizations.

Where to begin

BAPO informs a strategy for designing an organization that ships software to solve business problems, and in particular which end of the design space to start from - the business. The BAPO model states that the Business informs the Architecture, which informs Processes, which inform Organization (i.e. teams). Jan Bosch claims that most organizations get this exactly backwards. They start with Organization structure and work back: they align Process with Organization, allow Architecture to emerge according to Process and Organization (Conway’s Law), struggle to align all of that with Business value, and then wonder why the software teams can’t deliver what the business asks of them.

The interesting situation is that most companies are not BAPO but instead they are OPAB: the existing organization is used as a basis for the definition of convenience-driven processes, which in turn leads to an accidental architecture. This restrictive architecture, driven by the past of the company, rather than its future, then offers a highly limited set of business strategy options. - Jan Bosch

So, what can we use to translate the business needs to software architecture? Domain Driven Design (DDD) is exactly that - a set of practices for breaking down complex business domains to inform software design. I think it’s fair to say that DDD has survived the test of time and proven itself as a reliable toolset for modeling software businesses. One of its primary tools is Event Storming - a collaborative brainstorming process where as many stakeholders and builders as you can stomach get together to braindump what your system does to perform its business function. You start with events, add commands that cause those events, the actors who trigger the commands, and the groups (or aggregates) of models those events concern, to ultimately arrive at the transactions your business needs to support.

This list of transactions is extremely valuable because it maps business primitives to technical primitives. Transactions are both the unit of business operations and the unit of technical operations. This is alchemy - you start with a business need and you transform it to a technical primitive. By defining your essential business operations in terms of transactions, you also define the essential processes of your technical system. The goal of software engineering organizations is to translate business needs into technical systems, and this is where that transformation occurs. Whether or not a business practices Event Storming and DDD, they are performing that translation somewhere, and probably poorly if they aren’t explicit about it.

Transactions are a special type of technical primitive because they are the atomic unit of concurrency - you can’t break down a mutation beyond the level of transaction, because by definition changes in a transaction happen together or not at all. That’s the point, after all. Concurrency management may not be the hardest technical problem a particular engineering organization faces, but it is the hardest technical problem all engineering organizations face. Knowing what transactions you need to support tells you how to structure your architecture to match your business domain in a way that avoids unnecessary concurrency management and the costs that come along with it.

Architecture

Event Storming maps out all the transactions your system has to support, and the groups of models each of those transactions concerns. This grouping informs which data should live together in a service topology. The fact that two models change together is a very good reason to group them together in a single database. This applies the principle of high cohesion, which states that things that change together should be grouped together. Grouping transactionally related objects in a single database lets you offload the highly complex problem of concurrency control to your database management system. Could most engineers reimplement, for example, a flavor of Postgres’s Multi-Version Concurrency Control between multiple databases, over a network?
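To make the "offload it to the database" point concrete, here is a minimal sketch. It assumes two hypothetical models, `orders` and `inventory`, that always change together, and it uses an in-memory SQLite database purely for illustration (the table names, the `place_order` function, and the stock rule are all invented for this example). Because both models live in one database, the all-or-nothing behavior is a single transaction block rather than a protocol you have to write yourself.

```python
import sqlite3


def place_order(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    """Record an order and reserve stock atomically (hypothetical example)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
            # The CHECK constraint rejects negative stock, which aborts the
            # whole transaction, including the order row inserted above.
            conn.execute(
                "UPDATE inventory SET on_hand = on_hand - ? WHERE sku = ?",
                (qty, sku),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "sku TEXT PRIMARY KEY, on_hand INTEGER NOT NULL CHECK (on_hand >= 0))"
)
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()
```

If `orders` and `inventory` lived in separate services, the rollback of the order row would become a compensating action that someone has to design, ship, and monitor.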

The shape of your service topology starts to emerge when you group together objects that are related and change together, and decouple objects that are not - this is the other side of the high cohesion coin, loose coupling. Loose coupling is what allows systems (and teams) to operate independently. In DDD language, the tool to apply here is Domain Modeling. Where Event Storming maps atomic business events and transactions in a sequence, Domain Modeling topologically maps data objects and operations into distinct Bounded Contexts. Bounded Contexts concern a specific, causally-closed, “bounded” problem space in the business.

The foundation for your architecture is your Bounded Contexts. These will inform your service boundaries. Communication structure is the other critical component of your architecture. There are two types of communication - synchronous and asynchronous. The choice of communication protocol is absolutely critical to the types of consistency you can support, the transactions you can perform, and the product experiences you can build. Communication is also the fundamental force behind Conway’s Law - the communication protocol two teams’ services use dictates the communication processes for those teams.

Because of this link between team communication and service communication, this is where I diverge slightly from the BAPO model. You can’t model your architectural communication and then model your organizational processes (which are communication processes). The Processes have to feed back into the Architecture. Instead of linearly going through B-A-P-O, I believe there is an iterative feedback loop between A and P.

Process

How do you choose how teams should communicate? This is where it is critical to identify Consistency Models. Consistency Models describe the ordering guarantees a concurrent system makes for the operations in the system, and the performance tradeoffs you must make as you provide stronger ordering guarantees. In the strongest model, the system behaves as though all operations occurred in the order they actually occur in the real world. This comes with a high performance tradeoff - it requires a high degree of synchronous coordination between separate processes (in the computational meaning of the term), so when communication fails, no process can proceed.

In the weakest model, the system provides no ordering guarantees, but values between processes eventually converge, and even when communication fails independent processes can proceed, because coordination is asynchronous. This Eventual Consistency model allows intermediate states that the system never would have been in if it respected the real-world order of operations, which may or may not be acceptable.

There’s a middle ground model called Causal Consistency which requires no synchronous coordination and thus performs as well as Eventual Consistency, but also provides some ordering guarantees based on “causality” - operations that depend on previous operations are invisible until their dependencies are visible. It’s also built on asynchronous communication, but requires additional metadata to track dependencies, and therefore requires a degree of coordination between teams implementing Causal Consistency.

This is an oversimplification of Consistency Models leaving the majority of them out of the discussion, but the point is there are nuanced tradeoffs to navigate here between system performance, consistency, complexity of implementation, and team coordination. It’s crucial to clearly identify the target level of consistency between any data that may need to be replicated between teams.

My hope is that explicitly addressing these tradeoffs will lead stakeholders to the realization that it would be great to avoid them altogether. There’s a model for that, too - CQRS with Single Write Stores. CQRS informs us to conceptually and architecturally separate read and write storage. Single Write Stores are a simplifying constraint on this architecture that allows you to avoid the hard problem of managing concurrency that is the source of data consistency problems. If you avoid writing data in multiple locations, you avoid concurrent updates, and thus the need to manage concurrency. You can still replicate data for reads where performance mandates, in the form of caching, and you can probably accept a weaker form of consistency like Eventual or Causal Consistency for those caches, built on asynchronous communication.

To get this right, it’s critical to group transactionally related models in the same database. Even if a single model only lives in one database, if it’s part of a transaction that involves models from other databases, then you have to manage concurrent transactions across multiple databases, and you’re in the business of distributed transactions. That’s probably not related to your actual business, and probably a lot more technically complex than whatever your business concerns (unless you work on a highly technical problem).

The type of communication between services dictates the communication processes teams must follow to collaborate. If a service communicates synchronously with another service, the team maintaining the dependency has to ship that dependency before the team depending on it can use it, and there must be a process for managing that dependency in project planning. It’s crucial to address this upfront, as it comes at a cost to project management complexity, and introduces dependencies between teams the org wants to operate independently.

However, the reasons for introducing this project management overhead may be very good if you consider the complexity of consistency models - using the Single Write Store model, if a team needs to mutate an object in another team’s service, they may need to do it synchronously, but they avoid much more complex forms of coordination required to provide consistency between two write stores. The good news is that the team maintaining the write store should already know what commands they need to support from the Event Storming exercise. They will have to add new commands as the business needs evolve, though, and there has to be a process for managing that change.

On the other hand, teams that don’t have synchronous dependencies can operate independently. A team can asynchronously publish mutations to their domain objects to a centralized event bus, and not care about who is consuming those events. This architectural decoupling also decouples the processes between teams, and enables organizational scalability. It’s a lot easier to add teams when you don’t get a combinatorial explosion in inter-team dependencies.
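As a rough illustration of that decoupling, here is a toy in-memory event bus. It is only a sketch: the `EventBus` class, its method names, and the `order.placed` topic are hypothetical, and a real system would use a broker rather than a Python list. The point it shows is that `publish` returns without knowing who, if anyone, is subscribed.

```python
from collections import defaultdict


class EventBus:
    """Toy stand-in for a message broker (hypothetical, in-memory only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = []

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Enqueue and return immediately; the publisher never sees consumers.
        self._queue.append((topic, event))

    def deliver_pending(self):
        # Stand-in for the broker's delivery loop, which runs independently
        # of the publishing team's code.
        delivered = 0
        while self._queue:
            topic, event = self._queue.pop(0)
            for handler in self._subscribers[topic]:
                handler(event)
                delivered += 1
        return delivered


bus = EventBus()
billing_inbox = []
bus.subscribe("order.placed", billing_inbox.append)

bus.publish("order.placed", {"order_id": 1})
seen_before_delivery = list(billing_inbox)  # nothing has arrived yet
delivered_count = bus.deliver_pending()
```

Adding a second consuming team is one more `subscribe` call; the publishing team's code does not change.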

Organization

With your Architecture and Processes specified, including communication between services, it becomes pretty clear how to map Organization on top of that structure. Again, don’t fight Conway’s Law. The BAPO model lets you apply the Inverse Conway Maneuver - structure your organization to build the architecture you believe meets your business goals.

There will be product teams that own Bounded Contexts, and perhaps a Platform Team to maintain connective tissue like an Event Bus. Product teams are already aligned with business needs because they were designed to solve the use cases defined by the Event Storming exercise. The Platform Team can provide infrastructure and abstractions for encapsulating the complexity of managing data correctly. They can provide product teams with patterns for sharing their updates, like Transactional Outbox, and orchestration of multi-step “sagas”. Sagas should be non-transactional by design, informed by the grouping of transactionally related objects discovered in Event Storming. Durable Workflow Execution may also be a need, in order to perform multi-step in-order operations across multiple systems, including third parties. These systems can help manage causal dependencies between asynchronous operations without the need for the metadata to implement Causal Consistency.
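For readers who haven't met the Transactional Outbox pattern, here is a minimal sketch of the idea, assuming SQLite and hypothetical `users` and `outbox` tables (the function names and the `UserRenamed` event are invented for illustration). The state change and the event describing it are written in the same local transaction, and a separate relay publishes the recorded events afterwards.

```python
import json
import sqlite3


def rename_user(conn, user_id, new_name):
    """Change state and record the event in ONE local transaction."""
    with conn:
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
        event = {"type": "UserRenamed", "id": user_id, "name": new_name}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn, publish):
    """Publish unsent rows in insertion order, marking each one as sent."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # in practice, a call to the event bus
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.commit()

published = []
rename_user(conn, 1, "alicia")
first_pass = relay_outbox(conn, published.append)
second_pass = relay_outbox(conn, published.append)  # nothing left to send
```

Note that a crash between `publish` and the `sent = 1` update would cause the event to be published again on the next pass, so this sketch gives at-least-once delivery and consumers need to tolerate duplicates.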

Designing for growth

Architecture should be designed to support some well known maturation and growth strategies. Applying BAPO early on will likely highlight a core domain that performs much of the function of the business, with most domain models transactionally related in this core domain. You shouldn’t expect to be able to nicely divide your business by a single denominator (which is what setting team size as a constraint and then trying to slice up your architecture by teams attempts to do). One domain will probably outsize all the others combined, if there are others.

As the business grows, it will need to support new transactions. At some point your core domain may need to be broken into subdomains. Modularity should be baked into the system design from the bottom up. Again, high cohesion, loose coupling is the guiding principle for building systems that are able to change. Modules encapsulate areas of high cohesion, and are loosely coupled. A well-modularized system is easier to break apart into a separate domain service than a tightly coupled one.

As you decompose the core domain, it’s important to respect transactional relations. If you separate models that are highly transactionally related, you will incur the costs of managing distributed transactions. I have often seen the desire to grow the organization supersede the design of an architecture that can support organizational growth - OPAB instead of BAPO. If you set organizational headcount goals before you figure out how you’re going to have more engineers working on a given system, you are putting the cart before the horse. If you add teams, they will create services, and those services may have a lot of overlap with other services, and without a clear architectural vision of how to properly modularize software and data, you will end up with an emergent, undesigned, and unnecessarily complex system.

That said, engineers should anticipate organizational growth. A healthy organization has to grow. If there is no growth pipeline, every time a person leaves the organization dies a little. A small organization in particular can’t afford to lose people and not have replacements lined up. Systems have to be designed with modularity in mind from the beginning to support decomposing bigger domains into smaller ones (modular design is a good idea anyway). Starting with the business needs and breaking those down using Domain Driven Design will set your organization up for healthy growth.

Synchronous vs asynchronous communication: the leakiest abstraction

Andrew Nicholson — Sat, 25 Jan 2025 00:43:18 GMT

There are two categories of communication in software systems - synchronous and asynchronous. Synchronous communication is two-way - Alice sends a request to Bob, and Alice waits for Bob’s response. Asynchronous communication is one-way - Alice sends a message to Bob, and Bob does not respond (except maybe with an acknowledgment). For Alice to learn what Bob’s doing with her message, they have to communicate again, either by Bob sending Alice a message or Alice synchronously asking what Bob is up to.

“Synchronous” means “happening at the same time” - the request and response happen together. “Asynchronous” is the opposite, not happening together. Synchronous communication protocols include HTTP, FTP, SSH, and TCP. Asynchronous protocols include AMQP (used by message brokers like RabbitMQ), Kafka Protocol, WebSockets, HTTP Server-Sent Events, and UDP. Synchronous protocols are used for blocking, two-way communication between a user and a system or between systems. Asynchronous protocols are used to decouple systems in time - a producer of a message continues with its business while a consumer does something with it.

Synchronous communication is a very useful pattern. When a user or system does something, they get feedback on whether or not what they were trying to do was successful. You lose this near immediate feedback with asynchronous communication. You may “request” that something happens, and never hear back about the status.

This seems like a major downside for asynchronous communication, so what’s the upside? By decoupling the producer of a message from its consumers, you allow the consumer a ton of leeway in how they handle the message. One of the most important patterns this unlocks is retrying. Retrying failures is obviously critical to building resiliency into a system. Services go down, and messages that need to get to those services eventually, but not necessarily right away, can be queued for retry. In a synchronous system, all you can do when a call fails is tell the caller to try again. With asynchronous protocols in place you can try again for them. A related pattern is throttling requests to a third party to avoid exceeding a rate limit. These can’t happen synchronously, because you can’t have the user waiting around indefinitely for things that may take minutes or hours (even days). There is very likely a timeout somewhere in your web stack on the order of 1 minute.
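The retry pattern can be sketched in a few lines. This is a simplified illustration, not a production queue: `drain_with_retries`, the `flaky` handler, and the message names are hypothetical, and a real system would wait between attempts (typically with backoff) instead of retrying immediately.

```python
from collections import deque


def drain_with_retries(messages, handler, max_attempts=3):
    """Process queued messages, re-enqueueing the ones that fail transiently."""
    queue = deque((msg, 1) for msg in messages)
    delivered, dead = [], []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            delivered.append(msg)
        except ConnectionError:
            if attempt < max_attempts:
                queue.append((msg, attempt + 1))  # try again later
            else:
                dead.append(msg)  # give up; needs human attention
    return delivered, dead


# Hypothetical downstream service: "a" fails once, "b" never, "c" keeps failing.
failures_left = {"a": 1, "b": 0, "c": 5}


def flaky(msg):
    if failures_left[msg] > 0:
        failures_left[msg] -= 1
        raise ConnectionError(f"service unavailable for {msg}")


delivered, dead = drain_with_retries(["a", "b", "c"], flaky)
```

Here `a` succeeds on its second attempt without the original sender doing anything, while `c` exhausts its attempts and lands in a dead-letter list. Note also that delivery order is no longer the order of sending, which is one of the costs the following paragraphs are about.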

This is clearly a pretty significant tradeoff - immediate feedback on the one hand telling the user when something worked or not (pretty useful information), and resiliency on the other hand. It’s also a completely leaky abstraction to use asynchronous communication. Once you are communicating asynchronously in any layer in your stack, everything built on top of that layer that uses that communication has to know that it’s asynchronous. You can’t communicate asynchronously on the backend when the user sends a request, and still give them a response right away, if the response depends on what happens behind the “asynchronous boundary”.

This presents a subtle problem. A backend architect may want systems to communicate asynchronously, say using an Event Driven Architecture. They may have good reasons for this, around resiliency, scale, decoupling, performance, etc. However, if the result of any action behind this asynchronous boundary needs to be communicated to the user, the entire product experience has to change, all the way up to the frontend. Designers have to consider what happens when a request gets accepted but not worked on for hours. There will be many cases where the user isn’t paying attention to your application when a request finally gets processed, so you’ll have to consider introducing asynchronous communication like email to the user. If you’re not careful, a design constraint on backend systems can have quite negative effects on user experience.

So how do you balance resiliency with good UX? The best way I have found is to pay careful attention to this leaky abstraction, and design your system in such a way that the applications users interact with have all the data they need to respond synchronously to requests letting them know whether they are successful. There may still be side effects of user-issued requests that need to happen asynchronously for resiliency, but the user-facing application should know whether to accept or reject a request. If it is accepted, there should not be anything that prevents the system from eventually fulfilling the side effects. There can be temporary failures that resolve themselves, but if an enqueued side effect is failing due to some invariant that has been violated, that’s a system design flaw that needs to be fixed.

For example, let’s say you have a validation constraint that users should have unique usernames, and you need to create an account in multiple services when a user signs up. If a username is accepted as unique but when you go to create it in another service you find that it isn’t unique in that service, that’s a system design flaw. You either have to ensure that the data flows through one service first, so that it has all the data it needs to know whether to accept new user requests, or implement a form of (synchronous) distributed transaction that makes sure the user can be created in all systems when the request comes in. This is a big reason I advocate strongly for single write stores.
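A minimal sketch of the first option, where signups flow through one service that owns the uniqueness invariant. The `AccountService` class and the side-effect names are hypothetical; the in-memory set stands in for a single write store with a unique constraint.

```python
class AccountService:
    """Hypothetical single write store for usernames."""

    def __init__(self):
        self._usernames = set()
        self.side_effects = []  # work queued for other services

    def sign_up(self, username):
        # The accept/reject decision is made synchronously, using only
        # data this service owns.
        if username in self._usernames:
            return False
        self._usernames.add(username)
        # Everything downstream is a side effect that can be retried;
        # none of it is allowed to reject the username again.
        self.side_effects.append(("create_profile", username))
        self.side_effects.append(("send_welcome_email", username))
        return True


accounts = AccountService()
first = accounts.sign_up("alice")
second = accounts.sign_up("alice")  # rejected immediately, nothing enqueued
```

The user gets a definitive answer right away, and the queued side effects can only fail for transient reasons.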

We’re accustomed to being able to hide details as software engineers. That’s a big part of the art of software design. Communication protocols are a case where you can’t hide details, and the art has to happen at a higher-order. The entire system has to be designed to weigh tradeoffs between user experience, resiliency, possible data flows, performance, scale, and team collaboration. The fact that this seemingly simple, super low-level distinction mandates completely different system designs and products is a reminder that we are playing by rules we didn’t write and don’t get to change.

Eventual Consistency is no silver bullet for the CAP Theorem

Andrew Nicholson — Fri, 17 Jan 2025 20:34:11 GMT

There is a fundamental tradeoff in distributed systems between consistency and performance, coarsely defined by the CAP Theorem. This tradeoff pops up whenever you have more than one copy of any given piece of data, which is frequent. I have seen a common narrative play out over my career when builders and stakeholders are confronted with this tradeoff. Inevitably, when data is distributed, there comes a point when lack of consistency or performance becomes noticeable enough to be classified as a defect. Someone points out that you can’t have one without losing the other. And then someone replies with the line I have heard several times - “we don’t need real-time consistency, Eventual Consistency is good enough” - mic drop.

The appealing idea here is that if you can accept a period of staleness between two copies of data, then you don’t have to suffer the performance cost of synchronizing the copies on write. You can queue updates and propagate them to other copies in the background. At a high level, it seems like an easy tradeoff. People are accustomed to a bit of staleness, and it’s a small cost to pay to not have to block write operations on updating all copies, which also means they fail if you can’t reach the other copies. Often it’s the failures in one service causing failures in another that brings this issue to a head.

The problem with this idea is that Eventual Consistency is not just a bit of staleness. Eventual Consistency is not the silver bullet solution to this problem (spoiler - there is no silver bullet). In fact, much of the literature on consistency in distributed systems excludes Eventual Consistency altogether because it is a very weak consistency model, and there are “stronger” models that perform just as well [1].

The primary weakness of Eventual Consistency is that it provides absolutely no ordering guarantees, and therefore changes may be applied out of order, resulting in intermediate anomalous states. Values may appear in replicas that would have never been seen in the primary. Let’s say you have a distributed data object with two operations, increment (+) and decrement (-). There is a primary data store and a single read replica. The data value starts at 0, and a user issues several increment and decrement commands in fast succession. These are applied in the correct order to the primary, because the communication between the client and server is synchronous, let’s say +, -, +, -, +. The data value ends up as 1 in the primary. Concurrent to the operations mutating our data value, the primary service enqueues the increment and decrement events to propagate to the read replica.

Eventual Consistency offers no guarantees that these events are published, consumed, or made visible in order. The read replica could process them in the order -, -, +, +, +. Eventually, the value converges to 1, which is good. However, in the interim the value reads -1 and -2, states the primary never entered. There may be business invariants that specify that this value should never be negative. How does an application gracefully handle these anomalous states that violate business invariants? You can probably think of cases where these anomalies would be acceptable, and cases where they would not. That highlights the point of this article - Eventual Consistency is not the solution to the consistency/performance tradeoff of distributed systems.
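The arithmetic is easy to check with a few lines of Python. The helper name `visible_values` is made up for this illustration; it simply lists every value a reader could observe as the operations are applied one at a time.

```python
from itertools import accumulate


def visible_values(ops, start=0):
    """Return each value observable after applying ops one by one."""
    deltas = [1 if op == "+" else -1 for op in ops]
    # accumulate(..., initial=start) yields the start value first; drop it.
    return list(accumulate(deltas, initial=start))[1:]


primary = visible_values(["+", "-", "+", "-", "+"])  # order the user issued
replica = visible_values(["-", "-", "+", "+", "+"])  # one possible replay order

print(primary)  # [1, 0, 1, 0, 1]
print(replica)  # [-1, -2, -1, 0, 1]
```

Both sequences end at 1, so the replica does converge, but only the replica ever exposes a negative value.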

There are, in fact, many other solutions that are more or less appropriate in different contexts where the tradeoffs have different dimensions and weights. Eventual Consistency is but one Consistency Model. Another consistency model I quite like because it performs just as well as Eventual Consistency but offers logical ordering guarantees is Causal Consistency. Causal Consistency is basically Eventual Consistency plus causal ordering, meaning that the protocol for replicating changes has to include some data structure for tracking the causality relation that orders events. Events that can have a causal effect on other events have to be applied (or “visible”) before their dependents. In the example above, the out-of-order decrement events just wouldn’t be visible until the increment events they depend on are visible.
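Here is one way the dependency tracking could look, as a sketch rather than a reference implementation. It assumes each event carries the ids of the events it depends on, and that each operation from the single client depends on the one issued before it; the `Event` and `CausalReplica` names are hypothetical. The replica buffers any event whose dependencies have not yet been applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    id: int
    delta: int
    deps: frozenset = frozenset()  # ids that must be visible first


class CausalReplica:
    def __init__(self):
        self.value = 0
        self.applied = set()
        self.pending = []
        self.history = []  # every value a reader could have observed

    def receive(self, event):
        self.pending.append(event)
        progressed = True
        while progressed:  # applying one event may unblock others
            progressed = False
            for ev in list(self.pending):
                if ev.deps <= self.applied:
                    self.pending.remove(ev)
                    self.value += ev.delta
                    self.applied.add(ev.id)
                    self.history.append(self.value)
                    progressed = True


# The primary's order was +, -, +, -, + (ids 1..5), each depending on the
# previous one. The network delivers them as -, -, +, +, + (ids 2, 4, 1, 3, 5).
events = {
    i: Event(i, 1 if i % 2 else -1, frozenset({i - 1}) if i > 1 else frozenset())
    for i in range(1, 6)
}
replica = CausalReplica()
for event_id in [2, 4, 1, 3, 5]:
    replica.receive(events[event_id])

print(replica.history)  # [1, 0, 1, 0, 1]
```

The replica sees the same intermediate values as the primary, and it never blocks the writer to get them. The cost is the `deps` metadata, which every team producing or consuming these events has to agree on.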

While Causal Consistency strikes a great balance of consistency and performance, the tracking of causal ordering and logic to make visible only those events whose dependencies are satisfied does add complexity. For that reason, I have rarely seen Causal Consistency “in the wild” - it’s hard to get everyone aligned on a complex solution, especially across team boundaries, which is typically where data gets replicated (see Conway’s Law - a single team is unlikely to distribute their data in a way that requires complex synchronization). I think the implementation of Causal Consistency is actually not excessively complex, but understanding why you would care about an abstract concept like “causality” requires understanding of a complex problem - which is why I’m trying to spread that understanding.

Another aspect to consider is that both Eventual and Causal Consistency are “single-object” Consistency Models, which means they are only relevant for modeling consistency of single operations on individual distributed data objects, and not distributed transactions that may span multiple operations on multiple distributed data objects. If you have a transactional operation that either requires multiple distributed objects to commit or roll back atomically or requires multiple operations on a distributed object to commit or roll back atomically, then you have to use a different strategy. There are transactional Consistency Models, for example Serializable Consistency, which can be implemented with a 2- or 3-phase commit algorithm. You may be familiar with some of these in the form of transaction Isolation Levels in SQL DBMSs (e.g. Read Committed, Read Uncommitted).

Another nuance to consider when shifting to an eventually consistent model is that you cross the boundary from synchronous to asynchronous communication. This is a qualitatively different low-level detail that leaks all the way up the stack. If you were to, for example, use a CQRS style architecture where writes go to one store and reads happen in an eventually (or causally) consistent replica, then you must account for the latency between writes and reads in your product design. As a perhaps archaic-seeming example in 2024, if you are issuing a write and then reading the result of that write in a full request/response cycle, like submitting a form and then rendering a view of the form as the response to the form submission, you may render an old view. While old school, request/response is pretty darn simple, and therein lies its value. CQRS and eventually consistent stores also have value, but it’s important to think about the door you walk through when you change from a synchronous to an asynchronous communication protocol.
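The stale form view can be reproduced with a toy model. This is a sketch under obvious simplifications: `WriteStore`, `ReadReplica`, and the `form:1` key are hypothetical, and replication is a method call instead of a background process.

```python
class WriteStore:
    """Hypothetical single write store that logs changes for replication."""

    def __init__(self):
        self.data = {}
        self.log = []  # changes not yet shipped to the replica

    def put(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class ReadReplica:
    """Hypothetical read model, updated some time after each write."""

    def __init__(self):
        self.data = {}

    def catch_up(self, store):
        while store.log:
            key, value = store.log.pop(0)
            self.data[key] = value


store, replica = WriteStore(), ReadReplica()
store.put("form:1", "draft v1")
replica.catch_up(store)

# The user submits the form, and the response is rendered from the replica
# before replication has run.
store.put("form:1", "draft v2")
rendered_immediately = replica.data["form:1"]

replica.catch_up(store)
rendered_later = replica.data["form:1"]
```

`rendered_immediately` is still the old draft. The product design has to decide what to do about that window, for example by rendering the response from the write store's result instead of the replica.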

All this is to say that Eventual Consistency is not always the answer to consistency and performance woes. It certainly can be, but there are complex tradeoffs to consider. Too often, I think, when Eventual Consistency is offered as a solution, it is without full understanding of what it really means, which is not just “a little delay in consistency”. It’s more fair to characterize Causal Consistency that way. Eventual Consistency is relatively simple, but still more complex than fully synchronous consistency. It is a weak model that allows anomalous intermediate states that may violate business invariants. And it may not be appropriate for replicating transactional writes. Those downsides may very well be outweighed in some contexts, but it’s important to be aware of them and explicitly accept them, or choose another solution.

See, for example, Jepsen’s great resource on consistency models and Kleppmann’s great paper on the CAP Theorem.

Teams should optimize for small changes

Andrew Nicholson — Tue, 07 Jan 2025 23:26:52 GMT

When it comes to shipping changes, bigger is not better. Every deployment is a bet that the changes will add value to your product. This isn’t something you can prove before shipping - it has to be tested over time in the wild. There may be bugs that don’t show up until a certain set of conditions is met that can only occur in your production environment, there are quality defects that act as death-by-1000-cuts that you only see over time, and even if everything works as intended the customer may not like it. The way to mitigate the downside of these bets is with small deployments.

I learned this lesson when working with a team that had a high defect rate. Applying lessons from Continuous Delivery, I wrote a PR template that broke down the risk of a changeset by three dimensions - size, consequence, and mitigations. Consequence is a measure of the potential negative impact to the business if something goes wrong, essentially how critical the code being changed is. Mitigations are ways of minimizing the likelihood and/or consequences of defects, e.g. testing, feature flagging, or rollback strategy. Size is just the size of the changeset. Each of these was quantified in t-shirt sizes.

This rubric essentially broke the risk down into probability (size minus mitigations) and consequences. This is a common way to evaluate risk known as the Risk Assessment Matrix. A goal of software engineering teams should be to decrease the risk of releases, so our bets are more likely to pay off. There is very little you can do to affect consequences - if you’re changing a critical business process, a defect is always going to be bad. Mitigations like testing and feature flagging should be a standard practice, but can only get you so far. Which leaves us with one lever to pull to minimize risk - size.
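
As a rough illustration of how such a rubric could be turned into a number, here is a sketch. The numeric values for the t-shirt sizes, the floor on likelihood, and the `risk` function are assumptions made up for this example - the original template's scale is not reproduced here.

```python
# Hypothetical numeric scale for t-shirt sizes (an assumption, not the original template).
TSHIRT = {"S": 1, "M": 2, "L": 3}


def risk(size, mitigations, consequence):
    """Risk = likelihood x consequence, in the spirit of a risk assessment matrix."""
    # Likelihood is "size minus mitigations", floored so no change counts as risk free.
    likelihood = max(TSHIRT[size] - TSHIRT[mitigations], 0) + 1
    return likelihood * TSHIRT[consequence]


# Same critical code path and same mitigations; only the size differs.
big_change = risk(size="L", mitigations="S", consequence="L")
small_change = risk(size="S", mitigations="S", consequence="L")
```

With consequence and mitigations held fixed, size is the only input left that moves the score.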

It was surprising to me to learn that minimizing risk, which is a core concern of my job, was reducible essentially to making small changes (in addition to following standards of doing my best to ensure the changes contained no defects). There are some strong reasons for this. First, there is the simple fact that every line of code has a probability of containing a defect, so the more lines of code you group together, the higher the probability of a defect in the group. For example, if each line has a 1% chance of containing a defect, then a PR with 10 lines has a 9.6% chance of containing a defect, whereas one with 100 lines has a 63.4% chance [1]. This makes the simplifying assumption that the probability of a defect in a given line of a changeset is independent of the probability of defects in other lines, and I think you could make the case that the probability of defects in a given line goes up with the number of other changes (more changes mean more possibility for interactions). Either way, bigger changesets obviously mean more chances for defects.
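
The arithmetic behind those percentages can be checked in a few lines. This is a sketch under the same independence assumption; the function name `p_any_defect` is made up for the example.

```python
def p_any_defect(lines: int, p_line: float = 0.01) -> float:
    """Probability that at least one of `lines` independent lines has a defect."""
    return 1 - (1 - p_line) ** lines


ten_lines = p_any_defect(10)       # roughly 0.096
hundred_lines = p_any_defect(100)  # roughly 0.634

print(f"{ten_lines:.1%} vs {hundred_lines:.1%}")
```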

What is nonobvious, though, is that the probability of a defect making it into production grows super-linearly with the size of the changeset. That’s because the mechanisms we have for catching defects are human-oriented and humans don’t scale to the size of a changeset. To catch defects, we write tests. It would be appealing to think that 100% test coverage means proof of absence of defects, but it really means absence of proof of a defect in those particular tests. There may be tests that were not written that would have identified defects. It’s up to humans to write tests, and other humans to double-check the right tests were written.

The ultimate mechanism for checking for defects is code review. A reviewer has to hold interacting changes in their head to reason about whether they contain defects. A person can hold 10 changes in their head pretty easily, but probably not 100, and definitely not 1000. Given that the probability that a 1000 line changeset contains a defect is relatively high, and the probability that a reviewer can identify those defects is very low, it’s clear that changes this large are risky.

A careful reader will have noticed that I claimed the reviewer needs to hold interacting changes in their head. Changes that are unrelated can be evaluated independently. If your code has good modularization, it’s clear which changes can affect which code. So, what’s the big deal if you ship a 1000 line change if it’s well modularized? Well, what would be the big deal in breaking it into smaller changes, if it’s well modularized? Aggregating unrelated changes makes it hard to know which changes are doing what. If you find a bug in production after a release that contained 5 unrelated changes that could have been shipped independently, how do you know which change caused the bug? If users really dislike something you just shipped, how do you know what it is they dislike? Breaking apart changes that aren’t related helps not only reduce the probability of a defect in any given changeset, but also helps you identify which changes are causing which effects.

The fact that something as simple as the lines of code of a changeset is the best lever you can pull to reduce release risk should come as good news. It’s not very complicated to optimize for small release size, compared to something abstract like “velocity”. In fact, it encourages writing code that is easier to change in small chunks later. If you ship early, often, and small, then it encourages you to write code in modules. It makes you think about how to break the problem down into its component parts, as best you can. In addition to modularization, you may need to use feature flags to deliver partial functionality before it’s ready for your end users. This also encourages best practices for product development - it allows your product owners to use features and give feedback before they are live to customers, therefore tightening feedback loops.

In summary, teams should optimize for small changes. There are many things teams could optimize for, but change size is the only easy metric I know of that reduces risk, encourages modularization, and improves product release cycles.

[1] Using the formula $1 - \prod_{i=1}^{n} (1 - P_i)$, where $P_i$ is 1% and $n$ is 10 or 100.

My favorite functions for data manipulation

Andrew Nicholson — Sun, 15 Dec 2024 16:39:10 GMT

Software engineering is all about breaking complex problems into simpler parts. Great engineers have a real talent for this. Early in my career, I learned a strategy from one such engineer that I use almost every time I write code. It’s simple, low-level, and widely applicable. I’ll call it map, reduce, or filter. The basic idea is that much of what programs do boils down to these 3 fundamental operations on collections of data - mapping each element to another element, reducing the collection to a single value (or, another container for the collection), and filtering out elements in the collection.

Most of the low-level work applications perform is simple data manipulation. You read data, transform it, and filter, both when reading and when writing it. There is unlimited variability in how programs can be written. Programming languages are extremely powerful and flexible in this way, but the unlimited possibility space for how to solve a problem doesn’t make the job of the programmer easy - you have to search this space for solutions that meet your tradeoff goals.

A skilled programmer can rule out huge branches of the tree of possibility space very quickly, and focus on the ones that are more likely to lead to great solutions. Thinking about the problem as a combination of map, reduce, and filter operations on collections focuses you on tools I have found solve a surprisingly high percentage of problems.

Very often, you have a collection of elements, and you need to extract a value from each element, or more generally perform some transformation on each element. For example, let’s say you have a collection of user objects, and you need a list of their ids. At a high level, you are performing a map - you are mapping each value to another value.
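
A minimal sketch of that map in Python, with made-up user records standing in for real data:

```python
# Hypothetical user records, invented for illustration.
users = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Edsger"},
]

# Map each user to its id; the original list is left untouched.
user_ids = list(map(lambda user: user["id"], users))
```

In Python a list comprehension (`[user["id"] for user in users]`) expresses the same map and is usually considered more idiomatic.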

Almost as often, you have a collection of elements that you need to repackage in another container, or extract a single value from. Examples of repackaging include putting the elements in an array of user objects in a dictionary where the key is the user id and the value is the user object (in some languages this might be doable with map, but in JavaScript for example, an object is a single value, not a collection). You might need to extract a single value from a collection, like the newest user, or the sum of users’ account balances. These are reduce operations - you are reducing a collection to a single value.
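
The three reduce examples above might look like this in Python, using `functools.reduce`. The user records and field names are invented for the sketch, and the join dates are ISO strings so that plain string comparison orders them correctly.

```python
from functools import reduce

# Hypothetical user records, invented for illustration.
users = [
    {"id": 1, "joined": "2023-04-01", "balance": 120},
    {"id": 2, "joined": "2024-09-15", "balance": 0},
    {"id": 3, "joined": "2022-11-30", "balance": 45},
]

# Repackage the list as a dictionary keyed by user id.
users_by_id = reduce(lambda acc, user: {**acc, user["id"]: user}, users, {})

# Reduce to a single number: the sum of account balances.
total_balance = reduce(lambda total, user: total + user["balance"], users, 0)

# Reduce to a single element: the most recently joined user.
newest_user = reduce(lambda a, b: a if a["joined"] >= b["joined"] else b, users)
```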

Also very frequently you will need to filter a collection. The filter operation returns a subset of the collection. For example, you may want to filter out inactive users in a particular view.
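
A sketch of that filter, again over made-up records with a hypothetical `active` flag:

```python
# Hypothetical user records, invented for illustration.
users = [
    {"id": 1, "active": True},
    {"id": 2, "active": False},
    {"id": 3, "active": True},
]

# Keep only the users for which the predicate returns True.
active_users = list(filter(lambda user: user["active"], users))
```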

These three operations obviously don’t solve every problem you’ll run into, but you might be surprised how many they do solve. I find these to be a great starting point for breaking down a problem. Start with the data, and think about the operations you need to perform on it in terms of map, reduce, and filter. You will probably get to a working solution pretty quickly, and from there you can optimize for performance, brevity, intentionality, etc. For example, it might be more intention-revealing to use a find function than reduce, if you need the newest user.

You could perform all of these operations with your favorite brand of iteration, like a for loop. These functions have some non-obvious advantages over looping. Chief among these is that map, reduce, and filter are pure functions that promote immutability. They discourage reassignable variables or mutating collections. They don’t mutate the collection you perform them on - they return a new collection or object. There’s no need to instantiate reassignable variables so that you can reassign them in a for loop. Immutability at the reference and value level cuts out two classes of cognitive load that usually just get in the way of understanding what the program is really doing. Along that same line, another benefit of using map, reduce, and filter as functional primitives is that you clearly inform the reader of your intention - you know that these functions do one and exactly one thing, whereas a for loop could be used for any iterative algorithm.

Use of map and reduce in particular also promotes writing code that is composable. The function you pass to map or reduce is often reusable for mapping/reducing collections of different types of elements, or on individual elements. Code that was written to transform a single value can be easily adapted to handle a collection of values of the same type - just pass the same function to map.
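
For instance, a function written for one value can be reused unchanged on a collection. The `display_name` function and the record shape are hypothetical, chosen only to show the reuse.

```python
def display_name(user):
    """Transform a single user record into a display string."""
    return f'{user["first"]} {user["last"]}'


one_user = {"first": "Ada", "last": "Lovelace"}
many_users = [one_user, {"first": "Grace", "last": "Hopper"}]

single = display_name(one_user)               # used on an individual element
several = list(map(display_name, many_users)) # same function, passed to map
```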

This is borrowing from and bastardizing the great canon on functional programming. My favorite exploration of the ideas of functional programming is from Tom Harding. Part of the magic of map is in its functional programming properties - arrays/collections are just one example of “containers” (to avoid using the infamous m word that rhymes with gonad), and the map function just means you are transforming the value a container holds without changing the type of the container. This can be useful for other constructs like Result, which can be a Success or Failure, both of which implement map, where Success maps a value to another Success with the return value from the function passed to map, and Failure just returns itself.
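
Here is one way the Result idea could be sketched in Python. The `Success` and `Failure` classes and the `parse_age` helper are assumptions written for this example, not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Success:
    value: Any

    def map(self, fn: Callable[[Any], Any]) -> "Success":
        # Transform the held value, keep the container type.
        return Success(fn(self.value))


@dataclass(frozen=True)
class Failure:
    error: str

    def map(self, fn: Callable[[Any], Any]) -> "Failure":
        # Nothing to transform; the failure passes through untouched.
        return self


def parse_age(raw: str):
    """Hypothetical helper that returns a Result instead of raising."""
    if raw.isdigit():
        return Success(int(raw))
    return Failure(f"not a number: {raw!r}")


ok = parse_age("41").map(lambda age: age + 1)
bad = parse_age("forty").map(lambda age: age + 1)
```

The caller applies the same `map` either way and never has to branch on whether parsing worked until it actually needs the value.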

To summarize, a simple approach I take to breaking down problems involving data manipulation is to reach first for map, reduce, or filter as my functional primitives. This one cool trick reduces a surprising amount of cognitive load when searching for solutions to a problem, and provides building blocks for understanding much of what programs do.

Four ways to handle concurrency in distributed systems

Andrew Nicholson — Fri, 13 Dec 2024 18:36:52 GMT

Distributed systems are complex because concurrency is complex. When multiple operations can mutate or read the same piece of data at the same time, you get conflicts and inconsistencies, and you have to define how the system will handle those. Otherwise, you get nondeterministic behavior like “last write wins” and “phantom reads”. In this article, I briefly introduce four strategies for managing concurrency in distributed systems - locking (handle it ad hoc), single write stores (avoid it), immutability (become immune to it), and consistency models (model it).

The first strategy is to handle it ad hoc with locking. Locks are a fundamental primitive for managing concurrency. Because they are quite low-level, they can be applied as needed when you find concurrency issues here and there. The basic idea of locking is that you can specify which operations are allowed concurrently on specific sets of data. You have fine-grained control over what data you lock (e.g. an entire table, or just a row), and which operations you consider conflicting (e.g. concurrent writes, or any read concurrent with a write). Locking is general and foundational, but it doesn’t scale well as an architectural abstraction. The following strategies are system design strategies, and provide patterns at the system level for dealing with the problems introduced by concurrency.
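
As an in-process analogy (a sketch, not a distributed lock), here is a lock protecting a check-then-write on a single piece of data. The `Account` class is hypothetical; a database row lock plays the same role at the data-store level.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # The lock makes "check balance, then write" one indivisible step,
        # so two concurrent withdrawals can't both pass the check.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


account = Account(100)
results = []
threads = [
    threading.Thread(target=lambda: results.append(account.withdraw(30)))
    for _ in range(10)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Whatever order the threads run in, exactly three withdrawals succeed and the balance never goes negative.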

The second strategy is to avoid it with single write stores. Single write stores are a system design constraint that any given piece of data is only ever written in one database. This lets you offload the complex problem of concurrency control to your database. You can’t avoid concurrent operations entirely (unless you only accept one operation at a time which isn’t really an option for the obvious performance reasons), but you can avoid concurrent conflicting operations across the system. How you divide your data is important, too - you could partition your data in such a way that even if a given piece of data is only written in one store, a distributed transaction requires writing to multiple stores transactionally, and your database can’t manage distributed transactions for you. It’s important to consider the transaction as the basic unit of write operations, and group transactionally related data together (this is the principle of high cohesion - things that change together should live together). You can still distribute data for reads and realize much of the performance benefits of replication without taking on the much harder problem of coordinating concurrent writes.

The third strategy for dealing with concurrency is to become immune to it with immutability. Concurrency frequently doesn’t affect operations on immutable data. If you can’t mutate a piece of data, you can’t mutate it concurrently (with another mutation or a read), and concurrent mutations are really the source of complexity. But isn’t the whole point of most application interfaces to data to allow users to mutate it? Yes, and the lever you get to play with as an engineer is how to read the data vs how to write/store the data. You can model your data and write operations as immutable records of actions/events, and when you need to read the current state you transform that history to a single value. This is known as event sourcing. For example, you can store a counter as a series of increment and decrement events, then reduce over those events to get the current value. In this example, order doesn’t matter, but there may be intermediate inconsistencies or violations of business invariants (e.g. the value might go below 0 when that’s supposed to be impossible, if the events are received out of order and there is no check on this constraint). In cases where ordering matters, you may need additional ordering metadata or constraints, and account for reading events when they may be out of order.
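
The counter example can be sketched as follows. Events are modeled as `+1` and `-1` for brevity, which is an assumption of this sketch; real event records would carry more metadata.

```python
from functools import reduce
from itertools import accumulate


def apply(state, event):
    """Fold one immutable event into the current value."""
    return state + event


in_order = [+1, +1, -1]       # increment, increment, decrement
out_of_order = [-1, +1, +1]   # same events, decrement received first

# The current value is a reduce over the event history.
value_in_order = reduce(apply, in_order, 0)
value_out_of_order = reduce(apply, out_of_order, 0)

# Intermediate states differ: the out-of-order history dips below zero.
states_in_order = list(accumulate(in_order, apply, initial=0))
states_out_of_order = list(accumulate(out_of_order, apply, initial=0))
```

Both histories end at the same value, but only one of them ever passes through a state that violates a "never below zero" invariant.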

These complex cases bring us to the final strategy for managing concurrency I want to introduce - model it with consistency models. Consistency models are formal specifications of system behavior that model how the system handles ordering of operations, in particular concurrent ones. Examples of consistency models include linearizability (where the system behaves as though all operations occur in all processes in the order they occur in the real world, which requires coordination), eventual consistency (where the system just ensures values converge eventually), and causal consistency (where the system ensures values converge, and never applies an operation before the operations it depends on). The simple example of the counter would implement eventual consistency, because the values eventually converge but can enter intermediate, invalid states. To beef that up to causal consistency, you’d have to track metadata about the ordering of operations - event B depends on event A, so if you have B but not A, you need to wait before you can apply B. There are consistency models that describe every permutation of event ordering given the possible ordering anomalies of distributed systems and the tools for managing concurrency and coordination (including locking and immutability), and each of these has its own performance tradeoffs. Every distributed system behaves according to one of the consistency models, whether or not it’s intentionally designed with a particular model in mind. If you don’t pay any mind to concurrency control, you will likely end up in a system with weak consistency, i.e. no consistency guarantees at all.
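
The "wait for A before applying B" rule can be sketched with a small buffer. The event shape (an `id` plus a single `depends_on`) and the `apply_causally` function are simplifications invented for this example; real systems typically track richer dependency metadata.

```python
def apply_causally(events):
    """Apply events in arrival order, holding back any whose dependency
    has not been applied yet. Returns (applied ids, still-waiting ids)."""
    applied = []   # ids in the order they were actually applied
    waiting = []   # events buffered until their dependency arrives
    for event in events:
        waiting.append(event)
        progress = True
        while progress:   # applying one event may unblock others
            progress = False
            for candidate in list(waiting):
                dependency = candidate["depends_on"]
                if dependency is None or dependency in applied:
                    applied.append(candidate["id"])
                    waiting.remove(candidate)
                    progress = True
    return applied, [event["id"] for event in waiting]


# B and C arrive before the event they (transitively) depend on.
arrivals = [
    {"id": "B", "depends_on": "A"},
    {"id": "C", "depends_on": "B"},
    {"id": "A", "depends_on": None},
]
applied, still_waiting = apply_causally(arrivals)
```

Nothing is applied until A shows up, at which point B and C are released in dependency order.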

This is a very high-level and oversimplified view of a complex problem space, but hopefully it gets you thinking about managing concurrency. Too often, I have seen concurrency considered an edge case (e.g. the “race condition”) or an afterthought, but to deliver trustworthy software, you have to manage concurrency. The most formalized way to approach the problem is through consistency models, and the strategies of locking, single write stores, and immutability are some of the tools that help you achieve levels of consistency.

Avoid the autonomy trap

Andrew Nicholson — Tue, 10 Dec 2024 18:32:50 GMT

I have worked at startups pre, during, and post hypergrowth. Each of them used a version of the same strategy to unlock organizational scaling - autonomous teams. Autonomous teams are vertically integrated, business-function oriented, and small. They can build, ship, and maintain basically independently. More than one of these organizations referenced “The Spotify Model” as inspiration.

The insight most people seem to take away from The Spotify Model is that it’s better to focus on autonomy than alignment early on. The line of reasoning goes that overindexing on alignment too early presents the risk that you will block progress and delivery, and in the early stage of a company you can’t afford to do that. Getting out of the way of teams lets them build what they want, and you try your best to align that with business goals with KPIs and stakeholders to whom they are accountable. You can accept the risk that teams may be duplicating work, not building a cohesive product or system, and generally optimizing for local rather than global maxima, but you can’t accept the risk that nobody can get anything done without some centralized authority rubberstamping a plan, or without depending on other teams that have their own priorities.

I think all of this is true. However, in practice what I’ve seen is a near total disregard for alignment when it comes at any cost to autonomy (it always does). Very early in a company’s lifecycle I have seen the extreme focus on autonomy help the company cross the chasm into product-market fit. A small group of people working independently can build a lot of features, and if the features are low-hanging fruit that everyone is pretty confident will add a lot of value to the product, then it’s smart to get out of the way of the builders as much as possible. Being late to market is a failure mode.

However, the technical debt incurred in this stage adds up pretty darn quickly. With no coordination and little collaboration, independent builders will make systems that solve immediate problems, and as a whole the system will be very poorly designed (because it wasn’t designed at all). This is Conway’s Law at an extreme end.

Many people think that technical debt is mostly about cutting corners to move faster, as if writing low-quality code takes less time than writing high-quality code (it doesn’t, granted the developer knows how to write high-quality code). This isn’t what I mean - technical debt really describes the fact that the needs of the business and (more commonly) your understanding of those needs change faster than you can change your technical systems.

When a group of independent builders creates an emergent, distributed system (as will happen according to Conway’s Law), the business incurs massive technical debt related to the difficult problems of distributed systems. These problems are usually completely orthogonal to the real needs the business has for its technical systems. Most companies never get to the scale where it’s necessary to solve hard distributed systems problems to meet the traffic demands of their users. Despite what microservices evangelists say, you can scale a monolith to very high demand, in a mature organization - Shopify still runs its business on a monolith.

Here’s the real kicker - distributed systems problems require alignment to solve. The canonical problem in distributed systems is summed up by the CAP Theorem, which states a tradeoff between Consistency (basically data correctness) and Availability when data is writable in multiple places. There are myriad ways to navigate this tradeoff, and each of them requires a degree of coordination between the data stores. You may have to implement specific APIs, track metadata related to the order of operations, or adopt an asynchronous communication protocol. No matter what, you won’t solve this problem without communication and coordination between the data stores, and communication and coordination between the teams building them. This requires some degree of technical alignment, which will come at a cost to team autonomy by definition.

You can solve these problems ad hoc as data becomes untrustworthy or performance suffers, but they are common problems that you will face time and again across the organization, and they are unrelated to the specifics of your business. This type of problem screams out for a reusable abstraction, to ensure that they are solved correctly, consistently, and efficiently. And to have teams get on the same page about a reusable abstraction, communication protocols, shared patterns, etc. requires alignment. It will cost something upfront, but it will very likely pay off not very far down the road.

What doesn’t work is a culture where alignment is always sacrificed for the sake of autonomy. The technical realities of software engineering push organizations toward common solutions to common difficult problems, and a coordinated effort can avoid the unnecessary complexity sinks you will otherwise create by neglecting distributed system design.

Beyond the CAP Theorem - consistency models

Andrew Nicholson — Mon, 02 Dec 2024 18:57:18 GMT

tl;dr: Consistency models describe the consistency guarantees between processes in a distributed system. Each has its own performance bounds. The CAP Theorem describes the most all-or-nothing of these, where linearizable consistency comes at a cost to availability. Other common and useful consistency models to know are eventual and causal consistency.

The CAP Theorem describes a fundamental tradeoff all distributed systems must make when the system fails to communicate internally - the system can either respond to requests with a potentially inconsistent view of the data to maintain Availability, or sacrifice Availability to preserve Consistency.

This is a simple, high-level description of an inescapable tradeoff. However, the rigorous reader will be asking what Availability and Consistency actually mean. These terms conceal lower-level bits of exchange in the tradeoff space. These bits are formally specified with “consistency models”.

Consistency models describe the different consistency guarantees distributed systems can make, each of which has an upper bound on performance or “Availability”. Generally, “stronger” consistency means worse performance. The CAP Theorem uses the strongest consistency model as its definition for “Consistency” - linearizability [1]. It also takes the least granular notion of performance for its definition of “Availability” - the system is either fully available, or not. So, while useful, the CAP Theorem only describes the tradeoff at the coarsest possible level - either the system is as strongly consistent as possible, or as available as possible.

In distributed systems, data can be concurrently read and written in multiple processes. These operations happen in real-world time, but the system doesn’t know the real-world time at which they begin or end - computers only have access to a local clock, and the clocks on different computers are not synchronized. To define consistency is to define how the system handles ordering these operations given the lack of a shared clock. Consistency models define the ordering guarantees the system makes considering all the possible orders in which operations may appear to each process, and what coordination is required between processes to make those guarantees. Ordering guarantees can be stronger or weaker, and require blocking or nonblocking coordination (or none at all).

The linearizability consistency model describes a system that behaves as though it does know the real-world ordering of operations, as if the operations were occurring within a single global clock-space. Because there is in fact no global clock-space, behaving like there is one requires a synchronization protocol - the processes have to coordinate in a blocking way to preserve clock synchrony. When communication fails, this synchronization can’t occur.

Linearizability has the benefit of matching most people’s expectation of what it means to be “consistent” and being easy to reason about, but in some contexts it is stronger than required and the cost to performance is too high. This is where other consistency models are useful - some weaker models don’t require communication at all times, and therefore perform better during failures.

Some useful alternative consistency models are causal consistency and eventual consistency. The most useful resource I have found that explores the details and tradeoffs of consistency models is provided by Jepsen (they omit eventual consistency, which is not uncommon in the literature because it is quite weak). The following image does a great job of highlighting which models have which availability characteristics.

Source: https://jepsen.io/consistency/models

The left of the tree concerns distributed transactions, which are most relevant to database design. The right side will be more relevant to distributed application design. Jepsen frequently references Consistency in Non-Transactional Distributed Storage Systems by Viotti and Vukolić, which is a very thorough resource containing more consistency models and more details (including eventual consistency).

Another great read on this topic is A Critique of the CAP Theorem by Martin Kleppmann. In this paper, Kleppmann advocates the terminology “delay sensitivity” to describe the sensitivity of common consistency models to network delays, for both read and write operations. Something I find interesting is that the sequential consistency model, which provides very strong consistency guarantees, can be implemented to be insensitive to delays for either reads or writes, but not both. This highlights just how nuanced this tradeoff space really is.

As you can see, there is a lot to digest here. Here’s my quick and dirty take on some key, common consistency models and their tradeoffs:

Linearizability - strong, simple, but poor performance

The strongest consistency model, and a good choice when correctness is really important. It has the huge benefit of being easy to reason about, and matching expectations. It’s also fairly simple to implement. The downside is that no process can proceed when there’s an internal communication failure. Even in healthy conditions, all operations are exposed to added latency from communicating with other processes while they synchronize.

Eventual Consistency - good performance, simple, but poor correctness

I have often heard this referenced as the silver bullet solution to the CAP Theorem in industry, but it’s actually one of the weakest models. Eventual consistency only guarantees that data stores eventually converge to the same value. Sometimes this can be as simple as propagating operations asynchronously with at-least-once (if the operations are commutative and idempotent) or exactly-once (if the operations are commutative but not idempotent) delivery. However, since eventual consistency provides no ordering guarantees, operations applied out of order may result in temporary anomalous states - states that are inconsistent with the actual order that operations occurred. This can be a major downside, as you lose a lot of trust in your data if it enters states it never really should be in, even temporarily. So it’s not really a silver bullet. Clients can proceed during internal outages, and it’s simple, which are its major upsides.
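
The delivery requirements can be seen in a small simulation. The `replay` helper, which redelivers two operations and shuffles the order, is a made-up stand-in for an at-least-once message channel.

```python
import random


def replay(ops, seed):
    """Simulate at-least-once delivery: every op arrives, two of them
    arrive twice, and the arrival order is shuffled."""
    rng = random.Random(seed)
    delivered = ops + rng.sample(ops, k=2)
    rng.shuffle(delivered)
    return delivered


tags = ["red", "green", "blue"]

# Set-add is commutative and idempotent, so replicas that see different
# orders and different duplicates still converge.
replica_a = set(replay(tags, seed=1))
replica_b = set(replay(tags, seed=2))

# Increment is commutative but not idempotent, so redelivery corrupts
# the total; this is where exactly-once delivery would be needed.
increments = [1, 1, 1]
counter_with_duplicates = sum(replay(increments, seed=1))
```

The two set replicas end up identical, while the counter ends at 5 instead of 3.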

Causal Consistency - good correctness, good performance, but complex

Causal consistency respects the causal dependencies between operations in its ordering. This means out-of-order operations are essentially not applied until their dependencies are received. Data stores eventually converge in this model, and don’t ever enter intermediate anomalous states, which is why it is sometimes loosely called “strong eventual consistency”. Clients can proceed during internal outages, provided they communicate with the same process. The major downside to causal consistency is the complexity of implementation and the alignment required between processes (and therefore the people working on those processes, which can be separate teams working on separate services) to implement it correctly. Every operation must include some additional metadata about its dependencies, and every client must understand how to interpret that metadata when rendering the current value.

[1] Sometimes other consistency models are used to define Consistency in discussions of the CAP Theorem, but the original model discussed by Gilbert and Lynch is linearizability (also called atomic consistency in their paper). Furthermore, while linearizability is commonly recognized as the strongest consistency model for read/write objects, the Jepsen diagram includes Strict Serializable as an apparently stronger model for transactional objects (multiple operations on multiple read/write objects). As a simplification, this article doesn’t get into the differences between distributed read/write and transactional objects. Transactional objects are most relevant for database design, whereas application protocols typically work at the read/write object level.

The hidden costs of low-code tools

Andrew Nicholson — Mon, 02 Dec 2024 18:42:53 GMT

There has been a recent boom of powerful off-the-shelf tools that allow non-programmers to build and operate software systems that can run a business. There are some big advantages to using these tools - folks closer to the business domain can build solutions, you can prototype products to find product-market fit without investing heavily in tech, and you don’t have to reinvent the wheel, to name a few. However, everything comes with tradeoffs, and there’s one in particular I want to highlight - when you store your data in a third party, you don’t control how you get to interact with it.

There are some obvious downsides to this. First, you may need a view of the data that just isn’t available. Or, you may want to write the data in a way that isn’t possible - perhaps a batch update for performance reasons. This may force you to replicate the data, and operate on a shim application that sits in front of the third party tool.

This gets to a downside that can be very, very costly - third party tools force you into a distributed system early on, and can necessitate that you build more components and data stores in the distributed system than your org really needs.

Consider that the major upside of third parties is that you need fewer engineers in order to build features. If you have a small team of engineers, though, they will likely build a relatively simple monolithic application that solves the problems your business needs solved, and nothing more. A small team operating a distributed system, some components of which they don’t control, is going to spend a lot of time grappling with difficult distributed systems problems that are entirely orthogonal to your product needs.

Controlling APIs matters a lot in distributed systems design. If you have distributed data, a major problem you will run into is providing some form of consistency for that data. There are significant tradeoffs here - the CAP Theorem, a simplification of these tradeoffs, states that when communication between components fails, a distributed system has to choose between availability and consistency.

As an example, say you have data stored in Airtable and a shim application that many of your users interact with. Some simple operations in your shim may require many hundreds or thousands of requests to Airtable - maybe you needed to build the shim to get around that constraint. This pattern doesn’t scale, and you’ll run into rate limits which essentially make Airtable unavailable to you. You then have to choose whether your application is unavailable, too, or whether to accept updates and risk concurrent and conflicting updates that threaten consistency. You can’t change Airtable’s API, so distributed systems solutions like 3-phase commit or causal consistency models aren’t an option.

I think this risk is mostly hidden from those making the decision to opt for off-the-shelf tooling over engineering, because of the technical nature of distributed systems problems. If they were aware of it, it might change the cost/benefit calculation of betting on non-engineers using powerful low-code solutions, versus a small team of engineers building a simple solution to the business’s problems, as many startups have done in the past.

We’re at an interesting point in the experiment with low-code off-the-shelf solutions driving software systems for early stage startups. The tools have been around long enough for some products to have been built on them, but we are just now getting to a point where those products are reaching maturity. Now is a good time to consider whether the promised value of low upfront cost and non-engineer development has panned out to be net-positive.

What is a distributed system?

Andrew Nicholson — Mon, 02 Dec 2024 18:31:54 GMT

Distributed systems are just that - systems whose components are distributed in space. This seemingly trivial quality has important implications. It means that components of distributed systems do not share a global clock. There is, in fact, no such thing as a global clock - clocks tick on their own. There is no universal time to keep them “in sync”.

Because components in distributed systems don’t share a clock, ordering events becomes a tricky problem. You can’t simply use timestamps - just because the timestamp of event a on computer A is numerically less than the timestamp of event b on computer B, a didn’t necessarily happen before b, because the clocks that provide the timestamps drift.

The inability to order events based on timestamps has huge implications. When two events should result in the modification of the same data, and you can’t tell which one happened first, how do you know what the value of the data should be? There are protocols you can use to implement varying levels of event ordering, but they each come with their costs in runtime performance and complexity.
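One of the simplest such protocols is a Lamport clock, which replaces wall-clock timestamps with a counter that travels along with each message. The sketch below is a minimal illustration, and the class and method names are assumptions. It guarantees that if one event causally precedes another, its counter is smaller. The converse does not hold, so a smaller counter alone does not prove an event happened first.

```python
class LamportClock:
    """Minimal logical clock: a counter instead of a wall-clock timestamp."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the advanced counter."""
        return self.tick()

    def receive(self, msg_time):
        """Merge the sender's counter so the receive is ordered after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
for _ in range(5):
    a.tick()                       # A records five local events
sent_at = a.send()                 # A stamps the message with 6
received_at = b.receive(sent_at)   # B jumps from 0 to 7
print(sent_at, received_at)        # 6 7
```

The cost is visible even here: every message has to carry the counter, and every process has to follow the merge rule on receipt.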

Another problem you must confront in distributed systems is that communication relies on inherently non-instantaneous and error-prone networks. This, too, has huge implications. If you can’t trust your means of communication, how do you know if you just haven’t received a message yet, or if it was never sent?

There aren’t any tricks here - you have to work within the theoretical bounds implied by the fact that distributed systems are built in a world where clocks aren’t shared and communication is slow and faulty. Lamport clocks, vector clocks, consistency models, distributed locks, etc. are the beautiful fruits of solutions that have grown from the gnarly branch of software engineering that concerns distributed systems. As application engineers working in these systems, it is our privilege to get to work with these elegant solutions, and our responsibility to learn them in order to build sound systems that meet our business goals.

A single write store is the "source of truth"

Andrew Nicholson — Thu, 28 Nov 2024 18:34:59 GMT

The phrase “source of truth” has become a popular attempt to simplify distributed systems. It sounds great - there is a single place where data is “true”, so if you need the “true” value of an object, you go there for it. However, the phrase is poorly defined and doesn’t tell you what to actually do to simplify your system. What does “truth” mean in a system where data may be located and mutated in multiple places? “Source of truth” tries to approximate “consistency”, but it loses important details. In a system where data can be read and written in multiple locations, “consistency” describes the guarantees the system makes for handling concurrent and conflicting writes. You can’t just say “this is the source of truth” without constraining the system in a way that makes the claim tenable.

The best design constraint I have found to simplify distributed systems is to restrict writes for a piece of data to a single store. The term “single write store” is a precise, formal, and actionable design constraint - a given record can only be mutated in a single place. This gives the single write store the trustworthiness that people want when they talk about the “source of truth”. You know where to go if you need to mutate a record, and where to get the most up-to-date value.

When you distribute data, you also distribute operations on that data - reads, writes, or both. Replicating data for the purpose of reads is relatively simple. Any time a client reads data, even in a centralized data store, the system has to assume that the data could have been mutated by the time the value reaches the client - unless you lock data when it’s read, concurrent processes can update it after the read operation starts. You can’t trust that the value a client has is up to date.

Read replicas rely on a synchronization process to periodically update them to the current value from the primary. They are more likely to be “stale” than data read from the primary, but the system already has to assume that any value is stale by the time it reaches a client. Distributed reads present possible usability concerns by increasing staleness, but they don’t change any fundamental assumptions about how the system can treat this data.

Distributed writes, on the other hand, do change a fundamental assumption about your system - if data can be written in multiple places concurrently, then the data in storage itself may be stale. This means you can’t trust any copy - you have to coordinate between multiple services over network connections to avoid conflicting writes. This is the source of the gnarliest problems of distributed systems.

Consider as an example data representing a user’s calendar and appointments in it. There is an obvious constraint you want to enforce - a user can’t have two appointments at the same time. If this data is stored in a single database, a simple database constraint can enforce this business invariant. No one will ever have a conflicting appointment, because the database won’t allow it, even in a highly concurrent environment. You can offload the complex problem of concurrency control to the database.
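As a minimal sketch of that idea, assume appointments occupy fixed-length slots, so "no two appointments at the same time" reduces to a uniqueness constraint on the user and slot start. The table, column, and function names here are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE appointments (
        user_id    INTEGER NOT NULL,
        slot_start TEXT    NOT NULL,  -- assumption: fixed-length slots
        title      TEXT    NOT NULL,
        UNIQUE (user_id, slot_start)  -- the business invariant
    )
    """
)


def schedule(conn, user_id, slot_start, title):
    """Return True if booked, False if the user already has that slot."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO appointments VALUES (?, ?, ?)",
                (user_id, slot_start, title),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(schedule(conn, 1, "2025-01-06T09:00", "Standup"))  # True
print(schedule(conn, 1, "2025-01-06T09:00", "Dentist"))  # False
```

Appointments of arbitrary length need an overlap check rather than a simple uniqueness constraint (PostgreSQL's exclusion constraints are one option), but the principle is the same: the single database rejects the conflicting write for you.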

If the data can be updated in two databases, neither database can ensure that the other doesn’t have a concurrent transaction ongoing that conflicts. The services themselves must coordinate over the network to implement some form of concurrency control (or consistency model). If either service is unavailable, then you have to face a hard tradeoff - is the schedule operation unavailable, or do you accept an attempt to schedule and reconcile it later? What happens if there are conflicts in the latter case? A single write store completely avoids this class of distributed systems problems.
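The failure mode can be sketched in a few lines. Assume two hypothetical stores that each enforce the "one appointment per slot" invariant locally and accept writes without coordinating. `CalendarStore` and `find_conflicts` are illustrative names only.

```python
class CalendarStore:
    """Hypothetical write store that enforces the invariant locally only."""

    def __init__(self, name):
        self.name = name
        self.slots = {}  # (user_id, slot_start) -> title

    def schedule(self, user_id, slot_start, title):
        key = (user_id, slot_start)
        if key in self.slots:
            return False
        self.slots[key] = title
        return True


def find_conflicts(store_a, store_b):
    """Slots booked in both stores with different appointments."""
    shared = store_a.slots.keys() & store_b.slots.keys()
    return sorted(k for k in shared if store_a.slots[k] != store_b.slots[k])


east, west = CalendarStore("east"), CalendarStore("west")
slot = (1, "2025-01-06T09:00")
print(east.schedule(*slot, "Standup"))  # True
print(west.schedule(*slot, "Dentist"))  # True: west cannot see east's write
print(find_conflicts(east, west))       # [(1, '2025-01-06T09:00')]
```

Each store's local check passes, so the conflict only surfaces at reconciliation time, when someone has to decide which appointment wins.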

If having a single write store is simple, and multiple write stores complex, why don’t organizations naturally select the simpler option most of the time? All decisions are tradeoffs, and there are costs to constraining your architecture in this way. A primary cost is that it requires organizational cross-team alignment to constrain your architecture at all. Alignment exists in tension with autonomy, and many organizations opt for autonomy, especially early in their growth, which is exactly when it’s important to get architecture decisions right and not bite off unnecessary complexity. It costs something to intentionally design your architecture - you have to invest the time in fitting your architecture to your business needs, when the path of least resistance is to let it emerge as you build feature after feature.

These costs lead to a couple of antipatterns I have seen repeated that result in the unintended emergence of multiple write stores. The first is starting with a read replica, or “cache”, and building client interactions on top of it that evolve to eventually require writes. Because the client is reading from the cache, it seems to make sense to mutate the cache first, then update the primary, as it appears simpler from the perspective of the client. However, this neglects that there may be concurrent read or write operations executing in the primary store, and opens the door for conflicts in both the primary and the replica.
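A toy version of the cache-first antipattern, using plain dictionaries as stand-ins for the primary store and its replica (the `stock` field is a made-up example), shows how a concurrent write to the primary gets silently lost:

```python
primary = {"stock": 10}
cache = dict(primary)      # read replica, synced at some earlier point

# A concurrent writer updates the primary directly.
primary["stock"] -= 3      # primary is now 7, the cache still says 10

# Antipattern: mutate the cache first, then push the cached value upstream.
cache["stock"] -= 1        # 9, computed from a stale read
primary["stock"] = cache["stock"]

print(primary["stock"])    # 9, although decrements of 3 and 1 should leave 6
```

Routing the decrement through the primary instead, and letting the replica catch up afterwards, avoids the lost update.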

Another antipattern is building a new application on top of a third-party platform that users interact with directly. Perhaps the platform provides some base functionality that got your product off the ground, but is unreliable or doesn’t support all the operations you now need. If you build an application as a shim in front of the platform, you may have to store data you need to operate that application in the application itself. It is then an easy leap to start mutating the data directly in the application. However, if users are still working in the third-party platform, now you have two write stores, and the data must flow bidirectionally.

This is a complex situation. The way out of it is to move users to your new application, and treat that as the new primary write store. However, it takes time to rebuild the required functionality that the platform provides, and you’ll likely want to ship something before you have a complete product. There are complex tradeoffs here - a platform may appear to offer lots of great functionality you don’t have to build early on, but if you anticipate growing out of it then you shouldn’t underestimate the difficulty of a migration. And while it’s generally a best practice to ship small features incrementally, this may be a case where going from 0 to 1 in a single step is the better option, to avoid the enormous complexity of the intermediate state where you have to support bidirectional data flow between multiple write stores. You can partition users to derisk this strategy, but you have to partition them in such a way that they never perform writes that can conflict between partitions.

To summarize, the design constraint of “single write store” can help guide organizations to avoid unnecessary sources of huge complexity. In my experience, this complexity is too often taken on unknowingly, and with some knowledge it can be intentionally avoided.