Using BAPO and DDD to design scalable product orgs
Business-Architecture-Process-Organization (BAPO) and Domain Driven Design (DDD) inform a strategy for translating your business needs into a technical system design that avoids the incidental complexity that can eat your team's time as you grow
Product companies that want to scale quickly often adopt a distributed organizational model - small, mostly autonomous, vertically enabled, product-oriented teams. These teams can build features and entire products mostly independently, which supports organizational growth through horizontal scaling. Ideally, these teams are aligned with business goals and have some level of technical alignment.
This organizational model does indeed allow companies to grow quickly. However, that growth comes at a cost. If we pay Conway’s Law the respect it deserves, we should expect distributed organizations to ship distributed systems. Distributed systems are complex. Understanding the sources of complexity in distributed systems, and the strategies for managing that complexity, will be a differentiator for tech companies that want to grow fast while maintaining long-term sustainability. Complexity kills, simplicity survives.
The primary source of complexity in distributed systems is managing distributed data. The CAP Theorem sums up the core technical tradeoff - in systems where data lives and mutates in multiple locations, you cannot guarantee both consistency and availability when the network partitions. That’s a tough pill for stakeholders to swallow - both consistency and availability (or more generally, performance) are desirable. Complexity emerges in the tension between those desires, along with the competing desire for teams to operate independently and fast, within the technical constraints inherent to distributed systems and the organizational constraints implied by Conway’s Law.
Software businesses all compete in this space, and each applies a combination of frameworks, philosophies, and intuitions. Agile, Scrum, and Lean seem to dominate the current thinking around process management. Microservices architectures seem to be winning tech leaders over with their promise of unlocking organizational growth through horizontal scaling. DevSecOps, product triads, and autonomous teams are other angles on the problem of how to deliver business value through software. All of these frameworks share a common focus - how to enable independent, fast value delivery streams so that you can grow your organization to meet your business needs without incurring massive overhead and complexity from inter-team dependencies.
But none of these frameworks tells us what to do about Conway’s Law, which in my opinion is a massive shortcoming. I contend that many of the core problems facing software businesses today, whether they can name them or not, stem from a mismatch between system architecture and business needs. Conway’s Law informs us of the link between the organization (specifically its communication structure) and the systems it builds. If companies only look at organizational structure and processes and try to optimize those for delivering value, they are neglecting the assets where the business’s value actually lives - the software systems that are the product these businesses build.
Software as a product isn’t like shampoo. If your shampoo business’s leadership team decides they need to be certified vegan to align with their business strategy, they can change their formula pretty much overnight. When you recognize that complexity in your software architecture has metastasized to the point where it’s consuming more of the business’s energy than your actual business goals are, you can’t just change the formula overnight - you have a very long and painful project ahead.
Software leaders should apply a framework that holistically co-designs organization with architecture. Jan Bosch’s Business-Architecture-Process-Organization (BAPO) model respects the interdependence of these features of software businesses and tells us which direction to approach from. Domain Driven Design (DDD), Event Driven Architecture (EDA), Command/Query Responsibility Segregation (CQRS), and Consistency Models offer technical solutions to the hard problems of distributed systems that should feed back into organizational design considerations. In this article, I will present these together in such a way that the technical complexity inherent to distributed systems can inform the sociotechnical tradeoffs technology leaders must make when designing distributed organizations.
Where to begin
BAPO informs a strategy for designing an organization that ships software to solve business problems, and in particular which end of the design space to start from - the business. The BAPO model states that the Business informs the Architecture, which informs the Processes, which inform the Organization (i.e. teams). Jan Bosch claims that most organizations get this exactly backwards - they start with Organization structure and work back, aligning Process with Organization, allowing Architecture to emerge according to Process and Organization (Conway’s Law), then struggling to align all of that with Business value, and then wondering why the software teams can’t deliver what the business asks of them.
The interesting situation is that most companies are not BAPO but instead they are OPAB: the existing organization is used as a basis for the definition of convenience-driven processes, which in turn leads to an accidental architecture. This restrictive architecture, driven by the past of the company, rather than its future, then offers a highly limited set of business strategy options. - Jan Bosch
So, what can we use to translate the business needs to software architecture? Domain Driven Design (DDD) is exactly that - a set of practices for breaking down complex business domains to inform software design. I think it’s fair to say that DDD has survived the test of time and proven itself as a reliable toolset for modeling software businesses. One of its primary tools is Event Storming - a collaborative brainstorming process where as many stakeholders and builders as you can stomach get together to braindump what your system does to perform its business function. You start with events, add commands that cause those events, the actors who trigger the commands, and the groups (or aggregates) of models those events concern, to ultimately arrive at the transactions your business needs to support.
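To make this concrete, here’s a minimal sketch of what the output of an Event Storming session might look like written down as data. All the names (PlaceOrder, OrderPlaced, and so on) are hypothetical placeholders for whatever your own domain surfaces.

```python
from dataclasses import dataclass

# Hypothetical Event Storming output for an e-commerce checkout flow.

@dataclass(frozen=True)
class Command:
    name: str        # the action an actor requests, e.g. "PlaceOrder"
    actor: str       # who triggers it, e.g. "Customer"
    aggregate: str   # the cluster of models it mutates, e.g. "Order"

@dataclass(frozen=True)
class Event:
    name: str        # the fact recorded, e.g. "OrderPlaced"
    aggregate: str   # the aggregate whose state changed

# One row of the storming board per business transaction: a command
# produces an event, and both are grouped by the aggregate they concern.
transactions = [
    (Command("PlaceOrder", "Customer", "Order"), Event("OrderPlaced", "Order")),
    (Command("ReserveStock", "System", "Inventory"), Event("StockReserved", "Inventory")),
    (Command("CapturePayment", "System", "Payment"), Event("PaymentCaptured", "Payment")),
]
```

Even in this toy form, the list already tells you which aggregates change together and which never do - exactly the raw material the rest of this article builds on.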
This list of transactions is extremely valuable because it maps business primitives to technical primitives. Transactions are both the unit of business operations and the unit of technical operations. This is alchemy - you start with a business need and you transform it to a technical primitive. By defining your essential business operations in terms of transactions, you also define the essential processes of your technical system. The goal of software engineering organizations is to translate business needs into technical systems, and this is where that transformation occurs. Whether or not a business practices Event Storming and DDD, they are performing that translation somewhere, and probably poorly if they aren’t explicit about it.
Transactions are a special type of technical primitive because they are the atomic unit of concurrency - you can’t break down a mutation beyond the level of transaction, because by definition changes in a transaction happen together or not at all. That’s the point, after all. Concurrency management may not be the hardest technical problem a particular engineering organization faces, but it is the hardest technical problem all engineering organizations face. Knowing what transactions you need to support tells you how to structure your architecture to match your business domain in a way that avoids unnecessary concurrency management and the costs that come along with it.
Architecture
Event Storming maps out all the transactions your system has to support, and the groups of models each of those transactions concerns. This grouping informs which data should live together in a service topology. The fact that two models change together is a very good reason to group them together in a single database. This applies the principle of high cohesion, which states that things that change together should be grouped together. Grouping transactionally related objects in a single database lets you offload the highly complex problem of concurrency control to your database management system. Could most engineers reimplement, say, a flavor of Postgres’s Multi-Version Concurrency Control across multiple databases, over a network?
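Here’s a minimal sketch of what you get for free by co-locating transactionally related models: one local transaction covers both mutations, and the DBMS handles atomicity and concurrency for you. I’m using SQLite from the standard library purely for illustration, and the schema is hypothetical; the shape is the same in Postgres or any other DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER);
    INSERT INTO inventory VALUES ('WIDGET', 10);
""")

def place_order(sku: str, qty: int) -> None:
    with conn:  # BEGIN ... COMMIT; rolls back automatically on exception
        cur = conn.execute(
            "UPDATE inventory SET on_hand = on_hand - ? WHERE sku = ? AND on_hand >= ?",
            (qty, sku, qty),
        )
        if cur.rowcount == 0:
            raise ValueError("insufficient stock")  # aborts the whole transaction
        conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))

place_order("WIDGET", 3)  # the order row and the stock decrement commit atomically
```

If orders and inventory lived in two different services’ databases, this one-liner transaction would become a distributed coordination problem.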
The shape of your service topology starts to emerge when you group together objects that are related and change together, and decouple objects that are not - this is the other side of the high cohesion coin, loose coupling. Loose coupling is what allows systems (and teams) to operate independently. In DDD language, the tool to apply here is Domain Modeling. Where Event Storming maps atomic business events and transactions in a sequence, Domain Modeling topologically maps data objects and operations into distinct Bounded Contexts. Bounded Contexts concern a specific, causally-closed, “bounded” problem space in the business.
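As a quick illustrative sketch of what a Bounded Context boundary looks like in code: two contexts each keep their own model of the same real-world customer, tailored to their own problem space, and neither imports the other’s. The module layout and fields are hypothetical.

```python
# billing/models.py (hypothetical module layout)
class BillingCustomer:
    """The Billing context's view of a customer: payment details only."""
    def __init__(self, customer_id: str, payment_method: str):
        self.customer_id = customer_id
        self.payment_method = payment_method

# support/models.py
class SupportCustomer:
    """The Support context's view: contact history, not payment data."""
    def __init__(self, customer_id: str, open_tickets: int):
        self.customer_id = customer_id
        self.open_tickets = open_tickets

# The two contexts share only an identifier; each evolves its own model
# independently, and integration happens by explicit translation at the
# boundary rather than a shared "Customer" class coupling the teams.
```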
The foundation for your architecture is your Bounded Contexts. These will inform your service boundaries. Communication structure is the other critical component of your architecture. There are two types of communication - synchronous and asynchronous. The choice of communication protocol is absolutely critical to the types of consistency you can support, the transactions you can perform, and the product experiences you can build. Communication is also the fundamental force behind Conway’s Law - the communication protocol two teams’ services use dictates the communication processes for those teams.
Because of this link between team communication and service communication, this is where I diverge slightly from the BAPO model. You can’t fully model your architectural communication first and only then model your organizational processes (which are communication processes). The Processes have to feed back into the Architecture. Instead of moving linearly through B-A-P-O, I believe there is an iterative feedback loop between A and P.
Process
How do you choose how teams should communicate? This is where it is critical to identify Consistency Models. Consistency Models describe the ordering guarantees a concurrent system makes for the operations in the system, and the performance tradeoffs you must make as you provide stronger ordering guarantees. In the strongest model (linearizability), the system behaves as though all operations occurred in the order they actually occurred in the real world. This comes with a high performance cost - it requires a high degree of synchronous coordination between separate processes (in the computational meaning of the term), so when communication fails, no process can proceed.
In the weakest model, the system provides no ordering guarantees, but values between processes eventually converge, and even when communication fails independent processes can proceed, because coordination is asynchronous. This Eventual Consistency model allows intermediate states that the system never would have been in if it respected the real-world order of operations, which may or may not be acceptable.
There’s a middle ground model called Causal Consistency which requires no synchronous coordination and thus performs as well as Eventual Consistency, but also provides some ordering guarantees based on “causality” - operations that depend on previous operations are invisible until their dependencies are visible. It’s also built on asynchronous communication, but requires additional metadata to track dependencies, and therefore requires a degree of coordination between teams implementing Causal Consistency.
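To give a feel for that metadata, here’s a toy sketch of the version-vector bookkeeping a causally consistent system might use: each update carries the dependencies it was produced under, and a replica buffers it until those dependencies are visible locally. This is illustrative, not a production protocol.

```python
# Toy sketch of Causal Consistency's dependency metadata.

def happened_before(a: dict, b: dict) -> bool:
    """True if version vector a is causally before version vector b."""
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def deliverable(update_deps: dict, local_vector: dict) -> bool:
    """An update can be applied once its dependencies are visible here."""
    return all(local_vector.get(k, 0) >= version
               for k, version in update_deps.items())

local = {"replica_a": 3, "replica_b": 1}         # what this replica has seen
deps = {"replica_b": 2}                          # the update needs b's 2nd write
print(deliverable(deps, local))                  # False - buffer the update
print(happened_before({"replica_b": 1}, local))  # True - already ordered before
```

Note the cost being hinted at: every team publishing updates has to agree on how this metadata is produced and interpreted, which is exactly the cross-team coordination mentioned above.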
This is an oversimplification of Consistency Models leaving the majority of them out of the discussion, but the point is there are nuanced tradeoffs to navigate here between system performance, consistency, complexity of implementation, and team coordination. It’s crucial to clearly identify the target level of consistency between any data that may need to be replicated between teams.
My hope is that explicitly addressing these tradeoffs will lead stakeholders to the realization that it would be great to avoid them altogether. There’s a model for that, too - CQRS with Single Write Stores. CQRS informs us to conceptually and architecturally separate read and write storage. Single Write Stores are a simplifying constraint on this architecture that allows you to avoid the hard problem of managing concurrency that is the source of data consistency problems. If you avoid writing data in multiple locations, you avoid concurrent updates, and thus the need to manage concurrency. You can still replicate data for reads where performance mandates, in the form of caching, and you can probably accept a weaker form of consistency like Eventual or Causal Consistency for those caches, built on asynchronous communication.
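Here’s a deliberately tiny in-memory sketch of that shape: one write store that all commands go through, and a read cache that is updated asynchronously and therefore lags behind. The queue-based “bus” and all names are stand-ins for whatever infrastructure you’d actually use.

```python
import queue
import threading

write_store: dict[str, int] = {}     # the single source of truth
read_cache: dict[str, int] = {}      # replicated for reads only
events: queue.Queue = queue.Queue()  # asynchronous change feed

def handle_command(key: str, value: int) -> None:
    write_store[key] = value         # the only place writes happen
    events.put((key, value))         # publish the change, don't wait

def cache_updater() -> None:         # runs on the read side
    while True:
        key, value = events.get()
        read_cache[key] = value      # converges eventually

threading.Thread(target=cache_updater, daemon=True).start()
handle_command("inventory:WIDGET", 7)
# read_cache may briefly lag write_store - that's the accepted tradeoff.
```

Because nothing ever writes to the cache except the change feed, there are no concurrent writers to reconcile - only staleness to tolerate.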
To get this right, it’s critical to group transactionally related models in the same database. Even if a single model only lives in one database, if it’s part of a transaction that involves models from other databases, then you have to manage concurrent transactions across multiple databases, and you’re in the business of distributed transactions. That’s probably not related to your actual business, and probably a lot more technically complex than whatever your business concerns (unless you work on a highly technical problem).
The type of communication between services dictates the communication processes teams must follow to collaborate. If a service communicates synchronously with another service, the team maintaining the dependency has to ship that dependency before the team depending on it can use it, and there must be a process for managing that dependency in project planning. It’s crucial to address this upfront, as it comes at a cost to project management complexity, and introduces dependencies between teams the org wants to operate independently.
However, there may be very good reasons to accept this project management overhead once you consider the complexity of consistency models - under the Single Write Store model, if a team needs to mutate an object in another team’s service, they may need to do it synchronously, but they avoid the much more complex forms of coordination required to provide consistency between two write stores. The good news is that the team maintaining the write store should already know what commands they need to support from the Event Storming exercise. They will have to add new commands as the business needs evolve, though, and there has to be a process for managing that change.
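In code, that synchronous dependency might look like the following sketch: one team cancels an order by calling a command endpoint the Orders team owns, rather than touching the Orders database directly. The URL and payload shape are hypothetical.

```python
import json
import urllib.request

def cancel_order(order_id: str) -> None:
    """Synchronously ask the Orders team's service to cancel an order."""
    req = urllib.request.Request(
        "https://orders.internal.example.com/commands/cancel-order",  # hypothetical
        data=json.dumps({"order_id": order_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # If the Orders service is down, this raises and our workflow blocks -
    # the cost of a synchronous dependency - but Orders stays the single writer.
    with urllib.request.urlopen(req, timeout=5):
        pass
```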
On the other hand, teams that don’t have synchronous dependencies can operate independently. A team can asynchronously publish mutations to their domain objects to a centralized event bus, and not care about who is consuming those events. This architectural decoupling also decouples the processes between teams, and enables organizational scalability. It’s a lot easier to add teams when you don’t get a combinatorial explosion in inter-team dependencies.
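A minimal sketch of that decoupling, with an in-memory stand-in for a real event bus like Kafka or SNS: the publisher never references its consumers, so new teams can subscribe without the publishing team changing anything.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """In-memory stand-in for a centralized event bus."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)  # the publisher never names its consumers

bus = EventBus()
# Another team subscribes later, without the Orders team changing anything:
bus.subscribe("order.placed", lambda e: print("analytics saw", e))
bus.publish("order.placed", {"order_id": "123", "total": 42})
```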
Organization
With your Architecture and Processes specified, including communication between services, it becomes pretty clear how to map Organization on top of that structure. Again, don’t fight Conway’s Law. The BAPO model lets you apply the Inverse Conway Maneuver - structure your organization to build the architecture you believe meets your business goals.
There will be product teams that own Bounded Contexts, and perhaps a Platform Team to maintain connective tissue like an Event Bus. Product teams are already aligned with business needs because they were designed to solve the use cases defined by the Event Storming exercise. The Platform Team can provide infrastructure and abstractions for encapsulating the complexity of managing data correctly. They can provide product teams with patterns for sharing their updates, like the Transactional Outbox, and orchestration of multi-step “sagas”. Sagas should be non-transactional by design - if you respected the grouping of transactionally related objects discovered in Event Storming, no saga step should need atomicity across services. Durable Workflow Execution may also be needed, in order to perform multi-step, in-order operations across multiple systems, including third parties. These systems can help manage causal dependencies between asynchronous operations without the metadata needed to implement full Causal Consistency.
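As one example of what such a packaged pattern might look like, here’s a sketch of the Transactional Outbox: the domain change and the outgoing event row commit in one local transaction, and a relay later forwards the outbox rows to the bus. SQLite and the schema are illustrative only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    # One local transaction: the order row and its event row commit together,
    # so we never announce a change that didn't happen (or vice versa).
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id})),
        )

def relay_once(publish) -> None:
    # The relay polls unpublished rows and forwards them to the real bus.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

place_order(1)
relay_once(lambda topic, event: print("published", topic, event))
```

Packaging this once, as a platform abstraction, spares every product team from reinventing it with subtly different failure modes.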
Designing for growth
Architecture should be designed to support some well known maturation and growth strategies. Applying BAPO early on will likely highlight a core domain that performs much of the function of the business, with most domain models transactionally related in this core domain. You shouldn’t expect to be able to nicely divide your business by a single denominator (which is what setting team size as a constraint and then trying to slice up your architecture by teams attempts to do). One domain will probably outsize all the others combined, if there are others.
As the business grows, it will need to support new transactions. At some point your core domain may need to be broken into subdomains. Modularity should be baked into the system design from the bottom up. Again, high cohesion, loose coupling is the guiding principle for building systems that are able to change. Modules encapsulate areas of high cohesion, and are loosely coupled to one another. A well-modularized system is much easier to break apart into separate domain services than a tightly coupled one.
As you decompose the core domain, it’s important to respect transactional relations. If you separate models that are highly transactionally related, you will incur the costs of managing distributed transactions. I have often seen the desire to grow the organization supersede the design of an architecture that can support organizational growth - OPAB instead of BAPO. If you set organizational headcount goals before you figure out how you’re going to have more engineers working on a given system, you are putting the cart before the horse. If you add teams, they will create services, and those services may have a lot of overlap with other services, and without a clear architectural vision of how to properly modularize software and data, you will end up with an emergent, undesigned, and unnecessarily complex system.
That said, engineers should anticipate organizational growth. A healthy organization has to grow. If there is no growth pipeline, every time a person leaves, the organization dies a little. A small organization in particular can’t afford to lose people without replacements lined up. Systems have to be designed with modularity in mind from the beginning to support decomposing bigger domains into smaller ones (modular design is a good idea anyway). Starting with the business needs and breaking those down using Domain Driven Design will set your organization up for healthy growth.