From Orchestration to Choreography — Migrating Batch Processing to the Cloud
From years of working for, or consulting with, large enterprises in a dozen different industries, I’ve recognized a lot of recurring patterns. As different and unique as companies believe themselves to be, many of them tend to solve the same problems the same way.
One of the most common solutions to these common problems is a classic example of Conway’s Law (organizations design systems that are mirrors of their own internal communication patterns). Teams form up around individual business units, and when data needs to flow between those units, orchestration patterns appear along already existing lines of human communication. In countless enterprises, this takes the form of what I call “dump and poll” orchestration implemented with files and file transfers. One side “dumps” the file, the other side “polls”, waiting for the arrival of files. This starts off small, as a simple nightly exchange between departments. It works, so more teams adopt this approach. Years later, the company has woven a tangled web of semi-related file transfers to get work done.
You can almost smell it when you’re using a highly responsive mobile or web application that fronts one of these enterprises. Changes you make don’t show up until tomorrow, your rewards points for your hotel stay don’t show up in the mobile app until a week after you check out. Customers push a button on a website and nothing happens until 24 hours later. One conjures up images of vast warehouses of little automatons scurrying back and forth, acting as couriers for tiny envelopes of information, all while the outside world sees what appears to be a modern enterprise.
I pass no judgment here. This is just a natural evolution of the way enterprises shifted from the “assembly line” mentality to more modern approaches to software development. Folks had to keep the lights on while trying to innovate with limited tools or support for radical change. Whether it was true or not, most of these organizations didn’t think they could stop what they were doing to re-architect these solutions. So, over time, these solutions became more and more entrenched, growing barnacles and attaching more process-bound tentacles.
Enter the cloud. Now these enterprises are trying to take advantage of all that the cloud offers, and they start looking at how they can lift these applications out of the past and into the modern, cloud-native future of microservice ecosystems.
What are you going to do with systems like this when running on PaaS (Platform as a Service)? Cloud native applications cannot rely on the underlying file system. The file system is transient, subject to removal at any time, and many file writing scenarios might not even be supported inside a linux container.
This is where I like to recommend changing the mentality from orchestration to choreography. Orchestration is the old model, based on tight coupling between components, all in service of a single overarching process. Choreography is reactive and consistent with modern microservice ecosystem design philosophy. Each component acts in its own isolated interests, unaffected by and unaware of the activities of other actors.
The single most important difference between these two ideas is that in orchestration, every individual component has explicit, tightly coupled knowledge of where their output is going and what consumer is using it. Worse, these components are often tightly coupled to how the consumer will be using the data/output.
Let’s take an (admittedly oversimplified) example of a complex system. In this system, we have an application that accepts bids for work to be performed. When bids are accepted, those become work orders (some domains also call these purchase orders). When a work order is complete, that becomes an invoice.
In a legacy, file-based orchestration system, we might have a flow that looks like this:
In this diagram, we’re looking at data flowing through a system used to trigger work. The problem with this system is that each component knows exactly from where its work arrives and how to push work downstream.
What do we do if we want to start a parallel stream of work in response to the arrival of a bid? What if we want the work management app to pull information from further upstream that isn’t available in the flat files its working with? What do we do if we want to run six concurrent bid processors — what kind of drastic code rewrites is that going to require?
The list of failure points in this architecture is endless, but one of the biggest is that if any piece of this system fails, the entire system grinds to a halt and nothing works properly. This system is brittle, easily confused, and very far from fault tolerant.
The good news is that we’ve decided to modernize our system to include proper microservices that are each cloud native and running on a PaaS that gives us elastic scalability, easy deploys, and resiliency. We want to move away from orchestration and toward choreography. Our services in turn then need to adhere to the Single Responsibility Principle (SRP) and must not have direct knowledge of where and how their outputs will be used. How do we design a system like that?
One simple solution is to set up a queue (or bus) architecture for communication.
In this diagram, you can see that we’ve got a bid processor service. It might emit BidProcessed events. Then, a service that produces Work Orders consumes these BidProcessed events and in turn emits WorkOrderCreated events. We have another service that consumes WorkOrderCreated events, which might provide some kind of dashboard for users of the system to manage work. When work is completed, another service emits WorkOrderCompleted events into the system. Each of these services has its own private data store, and this is a very important aspect of this architecture.
In this same system, we can add consumers of messages and introduce entire new parallel work streams and analytics flows without ever having to rewrite or redeploy our existing services. For example, we could add a service that consumes all of the events and provides management with a dashboard showing the transaction flow rate through the system, or we could add some intelligence to flag people when too much time has elapsed between a WorkOrderCreated event and a WorkOrderCompleted event.
Another benefit we get is that now each of our services are independently deployable and independently scalable. They can follow separate release cadences, be worked on by separate teams, and can scale up or down to meet demand without impacting any other part of the system.
We could introduce additional listeners for work orders that might start entire new flows that track and manage the actual work being performed.
This is just one example based on one trivialized domain. The real world is a much more complicated place where the solutions are far less cut and dry. However, the reason I’m writing this post is to hopefully show that even ancient legacy processes supported by NAS, FTP, Samba shares, or homing pigeons carrying printed messages can be brought forward into the world of modern software development with a little thought devoted to architecture and design.
Further Reading: Choreography and reactive design aren’t easy, and they aren’t magic pills. Nor are they without their own sets of drawbacks and concerns that you need to take into consideration. For example, when switching to a queue-based reactive approach, you’re stepping into the world of event sourcing and you’re going to run into the idea that there is no such thing as “exactly once” message delivery, and so you’ll need to build idempotency or de-duping (or both) into your system, as discussed in this blog post.