Four ways to handle concurrency in distributed systems
Distributed systems are complex because concurrency is complex. When multiple operations can mutate or read the same piece of data at the same time, you get conflicts and inconsistencies, and you have to define how the system will handle those. Otherwise, you get nondeterministic behavior like silent “last write wins” overwrites and “phantom reads”. In this article, I briefly introduce four strategies for managing concurrency in distributed systems - locking (handle it ad hoc), single write stores (avoid it), immutability (become immune to it), and consistency models (model it).
The first strategy is to handle it ad hoc with locking. Locks are a fundamental primitive for managing concurrency. Because they are quite low-level, they can be applied as needed when you find concurrency issues here and there. The basic idea of locking is that you specify which operations are allowed to run concurrently on specific sets of data. You have fine-grained control over what data you lock (e.g. an entire table, or just a row) and which operations you consider conflicting (e.g. concurrent writes, or any read concurrent with a write). Locking is general and foundational, but it doesn’t scale well as an architectural abstraction. The remaining three are system design strategies: patterns at the system level for dealing with the problems introduced by concurrency.
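To make that concrete, here’s a minimal sketch of row-level pessimistic locking against a relational database. The `accounts` table, the connection string, and the amounts are hypothetical, and the details vary by database (this assumes Postgres via psycopg2); the point is that only the rows being touched are locked, so unrelated writes can proceed concurrently.

```python
import psycopg2

# Hypothetical schema: accounts(id, balance). We lock just the rows we touch,
# not the whole table, so transfers between other accounts aren't blocked.
def transfer(conn, from_id, to_id, amount):
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Row-level pessimistic lock: a concurrent transaction touching the
            # same rows blocks here until this one commits or rolls back.
            cur.execute(
                "SELECT id, balance FROM accounts WHERE id IN (%s, %s) FOR UPDATE",
                (from_id, to_id),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (amount, from_id),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (amount, to_id),
            )

conn = psycopg2.connect("dbname=bank")  # hypothetical database
transfer(conn, from_id=1, to_id=2, amount=50)
```

The tradeoff is the classic one for locking: correctness through blocking, at the cost of throughput and possible deadlocks if concurrent transactions lock rows in different orders.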
The second strategy is to avoid it with single write stores. A single write store is a system design constraint: any given piece of data is only ever written in one database. This lets you offload the complex problem of concurrency control to your database. You can’t avoid concurrent operations entirely (unless you only accept one operation at a time, which isn’t really an option for obvious performance reasons), but you can avoid concurrent conflicting operations across the system. How you divide your data matters, too - you could partition your data in such a way that even though each piece of data is only written in one store, a single business transaction has to write to multiple stores, and your database can’t manage that distributed transaction for you. It’s important to treat the transaction as the basic unit of write operations and group transactionally related data together (this is the principle of high cohesion - things that change together should live together). You can still distribute data for reads and realize much of the performance benefit of replication without taking on the much harder problem of coordinating concurrent writes.
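A minimal sketch of the ownership rule, assuming a hypothetical set of three stores and a simple hash-based mapping (real systems often use consistent hashing or an explicit partition map so keys don’t move when stores are added or removed):

```python
import hashlib

# Hypothetical setup: three independent stores (e.g. separate databases).
# Every key is owned by exactly one store, so all writes to that key land in
# one place and that store's own concurrency control handles conflicts.
STORES = ["users-db-0", "users-db-1", "users-db-2"]

def owning_store(key: str) -> str:
    """Deterministically map a key to the single store allowed to write it."""
    digest = hashlib.sha256(key.encode()).digest()
    return STORES[int.from_bytes(digest[:8], "big") % len(STORES)]

# Every writer computes the same owner, so writes to "user:42" never race
# across databases. Reads can still be served from replicas of any store.
print(owning_store("user:42"))
```

If data that changes together shares the key used for routing, it also lands in the same store, which keeps transactions local rather than distributed.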
The third strategy for dealing with concurrency is to become immune to it with immutability. Concurrency frequently doesn’t affect operations on immutable data: if you can’t mutate a piece of data, you can’t mutate it concurrently (with another mutation or a read), and concurrent mutations are really the source of the complexity. But isn’t the whole point of most application interfaces to data to allow users to mutate it? Yes, and the lever you get to play with as an engineer is how you read the data vs how you write/store it. You can model your data and write operations as immutable records of actions/events, and when you need to read the current state, you transform that history into a single value. This is known as event sourcing. For example, you can store a counter as a series of increment and decrement events, then reduce over those events to get the current value. In this example, order doesn’t matter for the final value, but there may be intermediate inconsistencies or violations of business invariants (e.g. the value might go below 0 when that’s supposed to be impossible, if events are applied out of order and nothing checks that constraint). In cases where ordering matters, you may need additional ordering metadata or constraints, and you may need to account for reading events that arrive out of order.
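Here’s a toy sketch of that counter, with an in-memory list standing in for an append-only event store:

```python
from functools import reduce

# Toy event store: writes append immutable events, reads fold the history.
# In a real system this would be an append-only log or event store, not a list.
events: list[tuple[str, int]] = []

def increment() -> None:
    events.append(("increment", 1))

def decrement() -> None:
    events.append(("decrement", 1))

def apply(total: int, event: tuple[str, int]) -> int:
    kind, amount = event
    return total + amount if kind == "increment" else total - amount

def current_value() -> int:
    # Reduce the history to a single value. For a plain counter, the result
    # is the same regardless of the order the events were applied in.
    return reduce(apply, events, 0)

increment()
increment()
decrement()
print(current_value())  # 1
```

Enforcing a “never below zero” invariant would mean checking the folded state when an event is appended, or adding ordering metadata - which leads to the last strategy.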
These complex cases bring us to the final strategy for managing concurrency I want to introduce - model it with consistency models. Consistency models are formal specifications of system behavior that describe how the system handles the ordering of operations, in particular concurrent ones. Examples include linearizability (the system behaves as though all operations occur, in every process, in the order they occur in real time, which requires coordination), eventual consistency (the system only ensures values converge eventually), and causal consistency (the system ensures values converge and never applies causally dependent operations out of order). The simple counter example would implement eventual consistency, because the values eventually converge but can pass through intermediate, invalid states. To strengthen that to causal consistency, you’d have to track metadata about the ordering of operations - event B depends on event A, so if you have B but not A, you have to wait before you can apply B. There are consistency models describing every permutation of event ordering allowed by the ordering anomalies of distributed systems and the tools for managing concurrency and coordination (including locking and immutability), and each has its own performance tradeoffs. Every distributed system behaves according to some consistency model, whether or not it was intentionally designed with a particular model in mind. If you don’t pay any mind to concurrency control, you will likely end up with weak consistency, i.e. no consistency guarantees at all.
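As a rough illustration of the causal-ordering idea, here’s a sketch of a replica that buffers an event until everything it depends on has been applied. The event ids and callback shape are invented for illustration; real systems typically track causality with vector clocks or version vectors.

```python
# Each event names the events it depends on, and the replica only applies an
# event once all of its dependencies have already been applied.
applied: set[str] = set()   # ids of events this replica has applied
buffered: list = []         # events whose dependencies aren't satisfied yet

def receive(event_id: str, depends_on: set[str], apply) -> None:
    buffered.append((event_id, depends_on, apply))
    drain()

def drain() -> None:
    progress = True
    while progress:
        progress = False
        for pending in list(buffered):
            event_id, depends_on, apply = pending
            if depends_on <= applied:   # all causal dependencies are in place
                apply()
                applied.add(event_id)
                buffered.remove(pending)
                progress = True

# Event B depends on event A: even if B arrives first, it waits for A.
receive("B", {"A"}, lambda: print("apply B"))
receive("A", set(), lambda: print("apply A"))
# prints "apply A" then "apply B"
```

With this in place, B can arrive before A, but it is never applied before A.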
This is a very high-level and oversimplified view of a complex problem space, but hopefully it gets you thinking about managing concurrency. Too often, I have seen concurrency treated as an edge case (e.g. the “race condition”) or an afterthought, but to deliver trustworthy software, you have to manage it deliberately. The most formalized way to approach the problem is through consistency models, and the strategies of locking, single write stores, and immutability are some of the tools that help you achieve a given level of consistency.