A single write store is the “source of truth”
The phrase “source of truth” has become a popular attempt to simplify distributed systems. It sounds great - there is a single place where data is “true”, so if you need the “true” value of an object, you go there for it. However, the phrase is poorly defined and doesn’t tell you what to actually do to simplify your system. What does “truth” mean in a system where data may be located and mutated in multiple places? “Source of truth” tries to approximate “consistency”, but it loses important details. In a system where data can be read and written in multiple locations, “consistency” describes the guarantees the system makes for handling concurrent and conflicting writes. You can’t just say “this is the source of truth” without constraining the system in a way that makes the claim tenable.
The best design constraint I have found to simplify distributed systems is to restrict writes for a piece of data to a single store. The term “single write store” names a precise, formal, and actionable design constraint - a given record can only be mutated in a single place. This gives the single write store the trustworthiness that people want when they talk about the “source of truth”. You know where to go if you need to mutate a record, and where to get the most up-to-date value.
When you distribute data, you also distribute operations on that data - reads, writes, or both. Replicating data for the purpose of reads is relatively simple. Any time a client reads data, even in a centralized data store, the system has to assume that the data could have been mutated by the time the value reaches the client - unless you lock data when it’s read, concurrent processes can update it after the read operation starts. You can’t trust that the value a client has is up to date.
Read replicas rely on a synchronization process to periodically update them with the current values from the primary. Data read from a replica is more likely to be “stale” than data read from the primary, but the system already has to assume that any value is stale by the time it reaches a client. Distributed reads can raise usability concerns by increasing staleness, but they don’t change any fundamental assumptions about how the system can treat the data.
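To make the mechanics concrete, here is a minimal sketch of pull-based replication in Python with SQLite. The schema, the version column, and every name in it are invented for illustration - production replication is usually log-based rather than a polling loop, but the staleness property is the same:

```python
import sqlite3

# Pull-based replication sketch. Assumes the primary stamps every row
# with a monotonically increasing version; schema and names are made up.
SCHEMA = """CREATE TABLE records (
    id      INTEGER PRIMARY KEY,
    value   TEXT    NOT NULL,
    version INTEGER NOT NULL
)"""
primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
primary.execute(SCHEMA)
replica.execute(SCHEMA)

def sync(last_seen: int) -> int:
    """Copy every row changed since last_seen from primary to replica."""
    rows = primary.execute(
        "SELECT id, value, version FROM records WHERE version > ?",
        (last_seen,)).fetchall()
    for row_id, value, version in rows:
        # Keyed on the primary key, so re-synced rows are overwritten.
        replica.execute("INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
                        (row_id, value, version))
        last_seen = max(last_seen, version)
    return last_seen

primary.execute("INSERT INTO records VALUES (1, 'a', 1)")
last_seen = sync(0)  # replica catches up to version 1
primary.execute("UPDATE records SET value = 'b', version = 2 WHERE id = 1")
# Until the next sync(), the replica serves 'a' - stale, but not wrong in
# any way the system didn't already have to tolerate.
print(replica.execute("SELECT value FROM records WHERE id = 1").fetchone())
```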
Distributed writes, on the other hand, do change a fundamental assumption about your system - if data can be written in multiple places concurrently, then the data in storage itself may be stale. This means you can’t trust any copy - you have to coordinate between multiple services over network connections to avoid conflicting writes. This is the source of the gnarliest problems of distributed systems.
Consider, as an example, the data representing a user’s calendar and the appointments in it. There is an obvious constraint you want to enforce - a user can’t have two appointments at the same time. If this data is stored in a single database, a simple database constraint can enforce this business invariant. No one will ever have a conflicting appointment, because the database won’t allow it, even in a highly concurrent environment. You can offload the complex problem of concurrency control to the database.
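As a minimal sketch of that offloading (Python with SQLite, with appointments simplified to fixed hourly slots so a plain UNIQUE constraint can express the invariant - a production schema might instead use time ranges with something like a PostgreSQL exclusion constraint):

```python
import sqlite3

# Appointments simplified to fixed hourly slots, so a UNIQUE constraint
# can state the business invariant directly.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE appointments (
        user_id    INTEGER NOT NULL,
        start_hour TEXT    NOT NULL,   -- e.g. '2024-06-01T09'
        title      TEXT    NOT NULL,
        UNIQUE (user_id, start_hour)   -- no two appointments at the same time
    )
""")

db.execute("INSERT INTO appointments VALUES (1, '2024-06-01T09', 'dentist')")
try:
    # A second write for the same slot is rejected by the database itself,
    # no matter how many concurrent writers there are.
    db.execute("INSERT INTO appointments VALUES (1, '2024-06-01T09', 'standup')")
except sqlite3.IntegrityError as e:
    print(f"rejected: {e}")  # rejected: UNIQUE constraint failed: ...
```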
If the data can be updated in two databases, neither database can ensure that the other doesn’t have a conflicting transaction in flight. The services themselves must coordinate over the network to implement some form of concurrency control (or consistency model). If either service is unavailable, you face a hard tradeoff - is the schedule operation unavailable, or do you accept the attempt to schedule and reconcile it later? And what happens when reconciliation finds a conflict? A single write store avoids this class of distributed systems problems entirely.
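To see where the guarantee breaks down, here is the same toy invariant with two write stores - a contrived sketch, but the failure mode is the real one:

```python
import sqlite3

# Two write stores with the same schema. Each store's constraint still
# holds locally; the invariant breaks between them.
def make_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE appointments (user_id INTEGER, start_hour TEXT,"
               " title TEXT, UNIQUE (user_id, start_hour))")
    return db

east, west = make_store(), make_store()

# Both writes succeed: neither database can see the other's transaction.
east.execute("INSERT INTO appointments VALUES (1, '2024-06-01T09', 'dentist')")
west.execute("INSERT INTO appointments VALUES (1, '2024-06-01T09', 'standup')")

# Reconciliation later surfaces the conflict, and now you - not the
# database - must decide which write wins.
merged: dict[tuple[int, str], str] = {}
for store in (east, west):
    for user_id, start_hour, title in store.execute("SELECT * FROM appointments"):
        key = (user_id, start_hour)
        if key in merged and merged[key] != title:
            print(f"conflict at {key}: {merged[key]!r} vs {title!r}")
        merged[key] = title
```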
If having a single write store is simple, and multiple write stores are complex, why don’t organizations naturally select the simpler option most of the time? All decisions are tradeoffs, and there are costs to constraining your architecture in this way. A primary cost is that constraining your architecture at all requires cross-team organizational alignment. Alignment exists in tension with autonomy, and many organizations opt for autonomy, especially early in their growth - which is exactly when it’s important to get architecture decisions right and not bite off unnecessary complexity. It costs something to design your architecture intentionally - you have to invest time in fitting it to your business needs, when the path of least resistance is to let it emerge as you build feature after feature.
These costs lead to a couple of antipatterns I have seen repeated, each resulting in the unintended emergence of multiple write stores. The first is starting with a read replica, or “cache”, and building client interactions on top of it that eventually evolve to require writes. Because the client is reading from the cache, it seems to make sense to mutate the cache first and then update the primary, since that appears simpler from the client’s perspective. However, this neglects that concurrent read or write operations may be executing in the primary store, and opens the door to conflicts in both the primary and the replica.
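A sketch of the safe ordering, reusing the toy schema from above - the primary validates the write first, and the cache is refreshed only after the primary accepts:

```python
import sqlite3

# Primary-first write ordering. Mutating the cache first would let a write
# appear to succeed that the primary later rejects.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE appointments (user_id INTEGER, start_hour TEXT,"
                " title TEXT, UNIQUE (user_id, start_hour))")
cache: dict[tuple[int, str], str] = {}  # the read-optimized copy

def schedule(user_id: int, start_hour: str, title: str) -> bool:
    try:
        # 1. The single write store validates the write under its constraint.
        primary.execute("INSERT INTO appointments VALUES (?, ?, ?)",
                        (user_id, start_hour, title))
    except sqlite3.IntegrityError:
        return False  # rejected, and the cache was never touched
    # 2. Refresh the cache only after the primary accepts. If this step
    #    fails, the cache is merely stale - a state the system tolerates.
    cache[(user_id, start_hour)] = title
    return True

print(schedule(1, "2024-06-01T09", "dentist"))  # True
print(schedule(1, "2024-06-01T09", "standup"))  # False - conflict rejected
```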
Another antipattern is building a new application on top of a third-party platform that users interact with directly. Perhaps the platform provides some base functionality that got your product off the ground, but it is unreliable or doesn’t support all the operations you now need. If you build an application as a shim in front of the platform, you may have to store the data that application needs to operate in the application itself. It is then an easy leap to start mutating that data directly in the application. However, if users are still working in the third-party platform, you now have two write stores, and the data must flow bidirectionally.
This is a complex situation. The way out of it is to move users to your new application and treat it as the new primary write store. However, it takes time to rebuild the functionality the platform provides, and you’ll likely want to ship something before you have a complete product. There are complex tradeoffs here - a platform may appear to offer lots of great functionality you don’t have to build early on, but if you anticipate growing out of it, don’t underestimate the difficulty of the migration. And while it’s generally a best practice to ship small features incrementally, this may be a case where going from 0 to 1 is the better option, to avoid the enormous complexity of the intermediate state where you have to support bidirectional data flow between multiple write stores. You can partition users to derisk the migration, but you have to partition them so that they are never performing writes that can conflict across partitions.
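Here is a hypothetical sketch of that partitioning - the store and partition names are made up, and the only point is that any given user’s writes always land in exactly one store:

```python
# Partition-aware write routing during a migration. Store and partition
# names are hypothetical; the invariant is that each user's writes always
# land in exactly one store, so the two stores never accept conflicting
# writes for the same records.
class Store:
    """Stand-in for a real data store."""
    def __init__(self, name: str):
        self.name = name

    def schedule(self, user_id: int, start_hour: str, title: str) -> None:
        print(f"{self.name} accepts write for user {user_id}")

MIGRATED_USERS: set[int] = {42, 97}  # users cut over to the new application

class WriteRouter:
    def __init__(self, legacy: Store, new: Store):
        self.legacy, self.new = legacy, new

    def store_for(self, user_id: int) -> Store:
        # The partition map is the only source of routing decisions;
        # moving a user means flipping their entry here.
        return self.new if user_id in MIGRATED_USERS else self.legacy

    def schedule(self, user_id: int, start_hour: str, title: str) -> None:
        self.store_for(user_id).schedule(user_id, start_hour, title)

router = WriteRouter(Store("legacy_platform"), Store("new_app"))
router.schedule(42, "2024-06-01T09", "dentist")  # routed to new_app
router.schedule(7, "2024-06-01T09", "standup")   # routed to legacy_platform
```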
To summarize, the design constraint of a “single write store” can guide organizations away from unnecessary sources of huge complexity. In my experience, this complexity is too often bitten off unknowingly, when with a little knowledge it can be intentionally avoided.