Teams should optimize for small changes
When it comes to shipping changes, bigger is not better. Every deployment is a bet that the changes will add value to your product. This isn't something you can prove before shipping - it has to be tested over time in the wild. There may be bugs that only surface under conditions that exist in your production environment, there are quality defects that act as death-by-a-thousand-cuts and only become visible over time, and even if everything works as intended, customers may not like it. The way to mitigate the downside of these bets is with small deployments.
I learned this lesson while working with a team that had a high defect rate. Applying lessons from Continuous Delivery, I wrote a PR template that broke down the risk of a changeset along three dimensions - size, consequence, and mitigations. Consequence is a measure of the potential negative impact to the business if something goes wrong, essentially how critical the code being changed is. Mitigations are ways of minimizing the likelihood and/or consequences of defects, e.g. testing, feature flagging, or a rollback strategy. Size is just the size of the changeset. Each of these was quantified in t-shirt sizes.
This rubric essentially broke risk down into probability (size minus mitigations) and consequence. This is a common way to evaluate risk, known as the Risk Assessment Matrix. A goal of software engineering teams should be to decrease the risk of releases so that our bets are more likely to pay off. There is very little you can do to affect consequence - if you're changing a critical business process, a defect is always going to be bad. Mitigations like testing and feature flagging should be standard practice, but can only get you so far. That leaves one lever to pull to minimize risk: size.
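To make this concrete, here is a minimal sketch of how such a rubric could be expressed in code. The numeric t-shirt scale, the way mitigations offset size, and the matrix thresholds are illustrative assumptions of mine, not the exact template the team used.

```python
# Sketch of a probability-x-consequence risk matrix for a changeset.
# The scale values and thresholds below are illustrative assumptions.
TSHIRT = {"S": 1, "M": 2, "L": 3, "XL": 4}

def risk_level(size: str, consequence: str, mitigations: str) -> str:
    # Probability of an escaped defect is driven by size, offset by mitigations.
    probability = max(TSHIRT[size] - TSHIRT[mitigations], 1)
    score = probability * TSHIRT[consequence]
    if score <= 2:
        return "low"
    if score <= 6:
        return "medium"
    return "high"

# A large change to critical code, with only light mitigations, is high risk.
print(risk_level(size="XL", consequence="L", mitigations="S"))  # high
# The same critical code changed in a small increment is much less risky.
print(risk_level(size="S", consequence="L", mitigations="S"))   # medium
```

Note how the second call shows the point about consequence: the code being touched is just as critical, but shrinking the change moves the overall risk down anyway.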
It was surprising to learn that minimizing risk, which is a core concern of my job, reduces essentially to making small changes (in addition to the standard practice of doing my best to ensure the changes contain no defects). There are some strong reasons for this. First, there is the simple fact that every line of code has some probability of containing a defect, so the more lines of code you group together, the higher the probability of a defect somewhere in the group. For example, if each line has a 1% chance of containing a defect, then a PR with 10 lines has a 9.6% chance of containing a defect, whereas one with 100 lines has a 63.4% chance¹. This relies on the simplifying assumption that the probability of a defect in a given line of a changeset is independent of defects in the other lines; if anything, you could make the case that the probability of a defect in a given line goes up with the number of other changes, since more changes mean more possible interactions. Either way, bigger changesets clearly mean more chances for defects.
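Those percentages come straight from the footnoted formula; a few lines of Python (nothing specific to any codebase, just the arithmetic) reproduce them:

```python
def p_defect(lines: int, p_per_line: float = 0.01) -> float:
    """Probability that at least one line in a changeset contains a defect,
    assuming each line is defective independently with probability p_per_line."""
    return 1 - (1 - p_per_line) ** lines

print(f"{p_defect(10):.1%}")    # 9.6%
print(f"{p_defect(100):.1%}")   # 63.4%
print(f"{p_defect(1000):.1%}")  # 100.0% - essentially certain
```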
What is nonobvious, though, is that the probability of a defect making it into production grows super-linearly with the size of the changeset. That's because the mechanisms we have for catching defects are human-oriented, and human attention doesn't scale with the size of a changeset. To catch defects, we write tests. It would be appealing to think that 100% test coverage proves the absence of defects, but it really only shows that those particular tests found none. There may be tests that were never written that would have identified defects. It's up to humans to write tests, and up to other humans to double-check that the right tests were written.
The ultimate mechanism for checking for defects is code review. A reviewer has to hold interacting changes in their head to reason about whether they contain defects. A person can hold 10 changes in their head pretty easily, but probably not 100, and definitely not 1000. Given that the probability that a 1000-line changeset contains a defect is quite high, and the probability that a reviewer can identify those defects is very low, it's clear that changes this large are risky.
A careful reader will have noticed that I claimed the reviewer needs to hold interacting changes in their head. Changes that are unrelated can be evaluated independently. If your code has good modularization, it’s clear which changes can affect which code. So, what’s the big deal if you ship a 1000 line change if it’s well modularized? Well, what would be the big deal in breaking it into smaller changes, if it’s well modularized? Aggregating unrelated changes makes it hard to know which changes are doing what. If you find a bug in production after a release that contained 5 unrelated changes that could have been shipped independently, how do you know which change caused the bug? If users really dislike something you just shipped, how do you know what it is they dislike? Breaking apart changes that aren’t related helps not only reduce the probability of a defect in any given changeset, but also helps you identify which changes are causing which effects.
The fact that something as simple as the line count of a changeset is the best lever you can pull to reduce release risk should come as good news. Optimizing for small release size is not very complicated, compared to something abstract like "velocity". It also encourages writing code that is easier to change in small chunks later: shipping early, often, and small pushes you to write code in modules and to break the problem down into its component parts as best you can. In addition to modularization, you may need to use feature flags to deliver partial functionality before it's ready for your end users. This also encourages good product development practice - it allows your product owners to use features and give feedback before they are live to customers, thereby tightening feedback loops.
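For illustration, here is a minimal feature-flag sketch; the flag name, allowlist, and lookup function are hypothetical, not taken from any particular feature-flag library.

```python
# Hypothetical feature-flag guard: partially built functionality ships to
# production in small increments, but only an internal allowlist sees it
# until it is ready for customers.
INTERNAL_USERS = {"product-owner@example.com"}

def flag_enabled(flag: str, user_email: str) -> bool:
    # A real system would consult a feature-flag service or config store here.
    return flag == "new-checkout-flow" and user_email in INTERNAL_USERS

def checkout(user_email: str) -> str:
    if flag_enabled("new-checkout-flow", user_email):
        return "new checkout flow"    # deployed, but only visible behind the flag
    return "legacy checkout flow"     # what everyone else still sees

print(checkout("product-owner@example.com"))  # new checkout flow
print(checkout("customer@example.com"))       # legacy checkout flow
```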
In summary, teams should optimize for small changes. There are many things teams could optimize for, but change size is the only easy metric I know of that reduces risk, encourages modularization, and improves product release cycles.
¹ Using the formula $1 - \prod_{i=1}^{n}(1 - P_i)$, where each $P_i$ is 1% and $n$ is 10 or 100.