Synchronous vs asynchronous communication: the leakiest abstraction
Synchronous and asynchronous communication are so fundamentally different that if you have to communicate asynchronously anywhere in the stack, everything above that layer also has to go async.
There are two categories of communication in software systems - synchronous and asynchronous. Synchronous communication is two-way - Alice sends a request to Bob, and Alice waits for Bob’s response. Asynchronous communication is one-way - Alice sends a message to Bob, and Bob does not respond (except maybe with an acknowledgment). For Alice to learn what Bob’s doing with her message, they have to communicate again, either by Bob sending Alice a message or Alice synchronously asking what Bob is up to.
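To make the distinction concrete, here is a minimal sketch in Python. The names (alice_sync, alice_async, bob_inbox, and so on) are made up for illustration, and an in-process queue stands in for whatever transport would carry the message in a real system.

```python
import queue


def bob_handle(request: str) -> str:
    """Bob's work: turn a request into a result."""
    return f"done: {request}"


# Synchronous: Alice calls Bob and blocks until she has his response.
def alice_sync(request: str) -> str:
    return bob_handle(request)


# Asynchronous: Alice drops a message in Bob's inbox and moves on.
# (Hypothetical in-memory inbox; a real system would use a broker or similar.)
bob_inbox: "queue.Queue[str]" = queue.Queue()


def alice_async(message: str) -> None:
    bob_inbox.put(message)  # returns immediately, with no result for Alice


def bob_drain_inbox() -> list[str]:
    """Bob processes whatever has piled up, on his own schedule."""
    results = []
    while not bob_inbox.empty():
        results.append(bob_handle(bob_inbox.get()))
    return results
```

Note that alice_async returns nothing. If Alice wants to know what happened, she needs a second communication, which is exactly the extra step described above.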
“Synchronous” means “happening at the same time” - the request and response happen together. “Asynchronous” is the opposite, not happening together. Synchronous communication protocols include HTTP, FTP, SSH, and TCP. Asynchronous protocols include AMQP (used by message brokers like RabbitMQ), the Kafka protocol, WebSockets, HTTP Server-Sent Events, and UDP. Synchronous protocols are used for blocking, two-way communication between a user and a system or between systems. Asynchronous protocols are used to decouple systems in time - the producer of a message goes about its business while a consumer does something with it.
Synchronous communication is a very useful pattern. When a user or system does something, they get feedback on whether what they were trying to do succeeded. You lose this near-immediate feedback with asynchronous communication. You may “request” that something happens, and never hear back about the status.
This seems like a major downside for asynchronous communication, so what’s the upside? By decoupling the producer of a message from its consumers, you allow the consumer a ton of leeway in how they handle the message. One of the most important patterns this unlocks is retrying. Retrying failures is obviously critical to building resiliency into a system. Services go down, and messages that need to get to those services eventually, but not necessarily right away, can be queued for retry. In a synchronous system, all you can do when a call fails is tell the caller to try again. With asynchronous protocols in place you can try again for them. A related pattern is throttling requests to a third party to avoid exceeding a rate limit. Neither retrying nor throttling can happen synchronously, because you can’t have the user waiting around indefinitely for things that may take minutes, hours, or even days. There is very likely a timeout somewhere in your web stack on the order of 1 minute.
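Here is a rough sketch of the retry pattern, assuming an in-memory queue and a made-up FlakyService that fails a fixed number of times before recovering. Job, FlakyService, and drain are all hypothetical names. A real worker would also wait between attempts (backoff), which is left out to keep the example short.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    payload: str
    attempts: int = 0


class FlakyService:
    """Hypothetical downstream service that fails its first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: list[str] = []

    def deliver(self, payload: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("service unavailable")
        self.received.append(payload)


def drain(jobs: deque, service: FlakyService, max_attempts: int = 5) -> list[Job]:
    """Process queued jobs, re-enqueueing failures. Returns jobs that gave up."""
    dead_letters = []
    while jobs:
        job = jobs.popleft()
        try:
            service.deliver(job.payload)
        except ConnectionError:
            job.attempts += 1
            if job.attempts < max_attempts:
                jobs.append(job)  # try again later, on the caller's behalf
            else:
                dead_letters.append(job)  # needs human attention
    return dead_letters
```

The caller who enqueued the job is long gone by the time a retry succeeds. That is the whole point, and also the cost: nobody is waiting on the line to hear the result.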
This is clearly a pretty significant tradeoff - immediate feedback on the one hand telling the user when something worked or not (pretty useful information), and resiliency on the other hand. It’s also a completely leaky abstraction to use asynchronous communication. Once you are communicating asynchronously in any layer in your stack, everything built on top of that layer that uses that communication has to know that it’s asynchronous. You can’t communicate asynchronously on the backend when the user sends a request, and still give them a response right away, if the response depends on what happens behind the “asynchronous boundary”.
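One way to see the leak is to look at what a user-facing handler can actually return once the work sits behind a queue. The sketch below is illustrative only: submit, get_status, and run_worker_once are invented names, and plain in-memory structures stand in for a real queue and status store.

```python
import itertools
from collections import deque

_ids = itertools.count(1)
statuses: dict[int, str] = {}  # hypothetical status store
work_queue: deque = deque()    # hypothetical queue behind the async boundary


def submit(order: str) -> dict:
    """User-facing handler. It can only say 'pending', never 'done'."""
    request_id = next(_ids)
    statuses[request_id] = "pending"
    work_queue.append((request_id, order))
    return {"request_id": request_id, "status": "pending"}


def get_status(request_id: int) -> str:
    """The extra call the frontend now has to make to learn the outcome."""
    return statuses[request_id]


def run_worker_once() -> None:
    """Backend worker, running at some unknown later time."""
    if work_queue:
        request_id, _order = work_queue.popleft()
        statuses[request_id] = "completed"
```

The response shape, the status lookup, and whatever the frontend does while the status is still pending all exist only because of the queue underneath.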
This presents a subtle problem. A backend architect may want systems to communicate asynchronously, say using an event-driven architecture. They may have good reasons for this, around resiliency, scale, decoupling, performance, etc. However, if the results of any actions behind this asynchronous boundary need to be communicated to the user, the entire product experience has to change, all the way up to the frontend. Designers have to consider what happens when a request gets accepted but not worked on for hours. There will be many cases where the user isn’t paying attention to your application when a request finally gets processed, so you’ll have to consider introducing asynchronous communication like email to the user. If you’re not careful, a design constraint on backend systems can have quite negative effects on user experience.
So how do you balance resiliency with good UX? The best way I have found is to pay careful attention to this leaky abstraction, and design your system in such a way that the applications users interact with have all the data they need to respond synchronously to requests, letting users know whether their requests succeeded. There may still be side effects of user-issued requests that need to happen asynchronously for resiliency, but the user-facing application should know whether to accept or reject a request. If it is accepted, there should not be anything that prevents the system from eventually fulfilling the side effects. There can be temporary failures that resolve themselves, but if an enqueued side effect is failing due to some invariant that has been violated, that’s a system design flaw that needs to be fixed.
For example, let’s say you have a validation constraint that users should have unique usernames, and you need to create an account in multiple services when a user signs up. If a username is accepted as unique but when you go to create it in another service you find that it isn’t unique in that service, that’s a system design flaw. You either have to ensure that the data flows through one service first, so that it has all the data it needs to know whether to accept new user requests, or implement a form of (synchronous) distributed transaction that makes sure the user can be created in all systems when the request comes in. This is a big reason I advocate strongly for single write stores.
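A minimal sketch of the first option, where all sign-ups flow through one service that owns username uniqueness. AccountService, DownstreamService, and publish are hypothetical names, and in-memory sets and a deque stand in for real databases and a message broker.

```python
from collections import deque


class AccountService:
    """Hypothetical single write store: the only place uniqueness is decided."""

    def __init__(self) -> None:
        self.usernames: set[str] = set()
        self.outbox: deque = deque()  # side effects queued for other services

    def sign_up(self, username: str) -> bool:
        """Synchronously accept or reject, using only local data."""
        if username in self.usernames:
            return False  # the user hears about the conflict right away
        self.usernames.add(username)
        self.outbox.append(username)  # fulfilled later, asynchronously
        return True


class DownstreamService:
    """Hypothetical consumer. It trusts the decision and never re-validates."""

    def __init__(self) -> None:
        self.accounts: set[str] = set()

    def create_account(self, username: str) -> None:
        self.accounts.add(username)  # idempotent, so retries are safe


def publish(accounts: AccountService, downstream: DownstreamService) -> None:
    """Stand-in for the async delivery step (a broker plus retries in practice)."""
    while accounts.outbox:
        downstream.create_account(accounts.outbox.popleft())
```

Because the downstream service has no validation of its own that could reject the username, the only failures left in the asynchronous step are temporary ones that a retry can resolve.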
We’re accustomed to being able to hide details as software engineers. That’s a big part of the art of software design. Communication protocols are a case where you can’t hide details, and the art has to happen at a higher level. The entire system has to be designed to weigh tradeoffs between user experience, resiliency, possible data flows, performance, scale, and team collaboration. The fact that this seemingly simple, super low-level distinction mandates completely different system designs and products is a reminder that we are playing by rules we didn’t write and don’t get to change.