Distributed Entity Component System Architecture in the Cloud
Every so often, my mind will fixate on a topic and I won’t be able to eat, sleep, or function properly until I get some kind of closure on that topic. Recently, this topic was the Entity Component System (ECS) architecture and how this translates to distributed, cloud native development. I found plenty of resources on ECS, but couldn’t find enough to slake my thirst for distributed ECS.
Before I get to the “cloudy”, elastic-scaling part of this post, let me take a moment to describe ECS. The Entity Component System pattern is a way of modeling domains that uses composition rather than inheritance. In almost all cases, this pattern is used in game development, either on the server side, the client, or both.
In a typical inheritance-style system we often run into frustrating barriers trying to model the various pieces of our system. With single inheritance, we lose a lot of flexibility. Let’s say we’re building a game and we have “things” called monster, player, and chair. Monsters and players can initiate or respond to combat, and all three of these entities have a position within the game world represented as a Cartesian coordinate. Finally, you can swing a sword at a chair a few times until it shatters.
Using inheritance we might start with a class called Mobile that can have a position and move, and a class called Combatant that inherits from Mobile to support full combat capabilities. Player and Monster inherit from Combatant and Chair inherits from … Combatant? This seems awkward and is an example of the limitations imposed by single inheritance. With multiple inheritance, we might gain a little more flexibility, but we run into the “diamond problem” and can end up creating more problems than we solve. It gets worse as we add more varieties of functionality. If Mobile objects can have an inventory, then we now have to give chairs the ability to move, attack players, and hold items, all just so we can allow someone to destroy them.
This is the problem that the Entity Component System pattern attempts to solve. As the name implies, there are three core facets to this pattern:
- Entity: An entity is an extremely simple object that has little more than a unique identifier (and maybe some other housekeeping bits depending on implementation details). An entity represents anything that exists within the game world.
- Component: A component is an aspect of something that exists within the game world. It is a simple attribute or set of tightly related attributes. In my inheritance example, we might have components such as health, position, inventory, etc. Components are not behavior; they are simple data attributes.
- System: Systems provide behavior and logic within a game world. Systems operate on components of certain types and usually do so without any explicit knowledge of the entities to which the components are bound. Usually some kind of loop or dispatcher is responsible for invoking systems, which then perform logic on components. The result of a system invocation is either direct mutation of components or, in more robust implementations, dispatched events which can then be used to modify component data.
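To make the three facets concrete, here is a minimal sketch in Python. It is not how Specs or any particular engine does it: the World class, its spawn and query methods, and the damage_system function are all names I made up for illustration. The point is only that a chair can be destructible by holding a Health component, without inheriting anything from a combatant.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Health:
    points: int


@dataclass
class Position:
    x: float
    y: float


class World:
    """Hypothetical container: entities are bare IDs, components live in per-type stores."""

    def __init__(self):
        self._ids = count(1)
        self._stores = {}  # component type -> {entity id -> component}

    def spawn(self, *components):
        entity = next(self._ids)
        for component in components:
            self._stores.setdefault(type(component), {})[entity] = component
        return entity

    def query(self, *types):
        """Yield (entity, components...) for entities holding every requested type."""
        stores = [self._stores.get(t, {}) for t in types]
        for entity in set(stores[0]).intersection(*stores[1:]):
            yield (entity, *(store[entity] for store in stores))


def damage_system(world, amount):
    """A system: behavior that only ever touches Health components."""
    for _entity, health in world.query(Health):
        health.points = max(0, health.points - amount)


world = World()
player = world.spawn(Health(100), Position(0.0, 0.0))
chair = world.spawn(Health(10), Position(3.0, 4.0))  # destructible, but not a combatant
marker = world.spawn(Position(9.0, 9.0))             # no Health, so damage skips it
damage_system(world, 25)
```

After the single damage_system call the player is down to 75 points, the chair is at 0, and the marker entity was never visited because it has no Health component.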
The classic ECS example involves modeling an entity with a position and a velocity component. We then have a Physics system which iterates through every entity that has both components and updates each position according to its velocity.
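The classic example might look something like the sketch below. I am assuming the simplest possible storage (one plain dict per component type, keyed by entity ID) and a fixed time step dt; the field names dx and dy and the physics_system function are mine, not from any library.

```python
from dataclasses import dataclass


@dataclass
class Position:
    x: float
    y: float


@dataclass
class Velocity:
    dx: float
    dy: float


def physics_system(positions, velocities, dt):
    """Advance every entity that has both a position and a velocity."""
    for entity, velocity in velocities.items():
        position = positions.get(entity)
        if position is None:
            continue  # velocity without a position: nothing to move
        position.x += velocity.dx * dt
        position.y += velocity.dy * dt


# Component stores keyed by entity ID (plain dicts for the sketch).
positions = {1: Position(0.0, 0.0), 2: Position(5.0, 5.0)}
velocities = {1: Velocity(2.0, -1.0)}  # entity 2 has no velocity, so it never moves

physics_system(positions, velocities, dt=0.5)
```

Notice that the system never asks what entity 1 actually is. A player, a bullet, and a rolling boulder would all be moved by the same few lines.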
Popular game engines like Unity and Unreal allow developers to build games up from these fundamental elements. This pattern provides such flexibility and power that it can even support marketplaces where developers can buy and share systems and their associated components. Not everyone has the skills to develop a full physics system with realistic parabolic acceleration and deceleration curves, but using ECS, you can slot such a system into virtually any game and attach its components to any entity you like.
Entities can be little things or even things that might not have a direct physical presence. An entity can be a bullet with destination, position, and velocity components. Entities can also be things like spell effects and invisible things like collision rectangles. Other entities can be spawned dynamically: for example, the Collision system might create a shrapnel bomb entity when it detects a collision between a mortal entity and a triggerbox entity whose trigger component holds the ID of a new entity to spawn. As you can see, the possibilities are endless.
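Here is a rough sketch of that trigger scenario. I have made several assumptions to keep it small: collision shapes are axis-aligned boxes, the trigger component names a template to spawn (spawn_template) rather than holding a numeric ID, each trigger fires only once, and all of the storage is a handful of module-level dicts. None of this comes from a real engine.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other):
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)


@dataclass
class Trigger:
    spawn_template: str  # what to spawn on contact (hypothetical field)
    armed: bool = True


ids = count(1)
boxes, triggers, kinds = {}, {}, {}
mortals = set()  # IDs of entities tagged as mortal


def spawn(kind, box, trigger=None, mortal=False):
    entity = next(ids)
    boxes[entity], kinds[entity] = box, kind
    if trigger is not None:
        triggers[entity] = trigger
    if mortal:
        mortals.add(entity)
    return entity


def collision_system():
    """Spawn a trigger's template when a mortal entity overlaps an armed trigger box."""
    spawned = []
    for trigger_id, trigger in list(triggers.items()):
        if not trigger.armed:
            continue
        if any(boxes[m].overlaps(boxes[trigger_id]) for m in mortals):
            trigger.armed = False  # assumption: triggers fire once
            spawned.append(spawn(trigger.spawn_template, boxes[trigger_id]))
    return spawned


player = spawn("player", Box(0, 0, 1, 1), mortal=True)
trap = spawn("triggerbox", Box(0.5, 0.5, 2, 2), trigger=Trigger("shrapnel bomb"))
first_tick = collision_system()
second_tick = collision_system()
```

The first tick returns the ID of a freshly spawned shrapnel bomb entity, and the second tick returns an empty list because the trigger has already been disarmed.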
If you’re looking for implementations of the ECS pattern, I’m particularly fond of the Specs library in Rust and this one for Elixir.
But what about the cloud?
Now that we’ve had a quick introduction to ECS, let’s get to the main reason why I’m posting this: distributed ECS and ECS in the cloud. Let’s say we’ve got a single process that’s holding a few hundred thousand, maybe even a million entities, each with a pile of components, and we’ve got a dozen systems running in that process. Since it’s a server, we don’t have a Rendering system, but we might have a NetworkSync system to deal with communication with the game clients. The more concurrent users we have online and the further spread out across the world they are, the more resources they will consume. Entities will get added and the system loops will get longer and longer as it takes more effort to process all the logic for everything in memory.
Having this many users is obviously a good problem to have, because then we can go spend all the profits from our mega-hit game. But our game is going to start crashing pretty soon, because a single server, however optimized, isn’t going to hold up under that load.
When building traditional microservices, the answer is to just spin up more instances of those services. The services are stateless, so we can run an arbitrary number of them behind a load balancer and the clients will be unaware of the strategy. This doesn’t really work for online, massive user simulations like MMORPGs. System loops need to run really fast. Consider that a firebolt spell launched from one entity at another is itself an entity with position, velocity, damage, and other components, and that more entities might spawn upon impact (e.g. blast radius fire). Now multiply that by all the spellcasters in the game world casting spells at any given time. And that’s just a single type of entity. You would be shocked if you saw how many entities were “live” at any given moment in a AAA MMORPG.
It would be horribly inefficient to take a microservice-y RESTful approach and load the firebolt components from persistence during each tick of a system loop. We would also have to work out how to keep the firebolt movement idempotent so that multiple system instances don’t duplicate the ticks.
One of the popular methods for dealing with this problem is to create smaller worlds. Rather than a single monolithic process to manage the ECS world instance for everything in the known universe, we dedicate a smaller process to manage an ECS instance that only contains entities for a subset of the universe.
If you’ve played online multiplayer games you’ve probably seen this, even if you haven’t explicitly noticed. In games built this way, you might have some common areas of the game that always have a pile of players in them, even if they’re not on your team. Then, when your team leaves this area to go on other adventures, everyone else disappears and you seem to be in a private copy of that part of the world. Developers will often use components and systems to deal with the transition between ECS world instances — a player could collide with a trigger box where the box’s trigger action component indicates a transition to another part of the world, at which point the networking components will take you out of one world and into another.
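A transition like that could be sketched as follows. Big caveat: in a real deployment the two worlds would live in separate host processes and the handoff would be a network message, whereas here both are plain objects in one process so the idea stays visible. WorldInstance, ZoneTransition, and transition_system are hypothetical names, and I am assuming a collision system has already produced the list of entities that touched a transition trigger this tick.

```python
from dataclasses import dataclass, field


@dataclass
class WorldInstance:
    name: str
    entities: dict = field(default_factory=dict)  # entity id -> {component name: value}


@dataclass
class ZoneTransition:
    destination: str  # name of the target world instance
    arrival: tuple    # where the entity appears once it gets there


def transition_system(world, worlds, touched):
    """Move each entity that touched a transition trigger into its destination world.

    `touched` is a list of (entity_id, ZoneTransition) pairs, assumed to come
    from a collision system earlier in the same tick.
    """
    for entity_id, transition in touched:
        components = world.entities.pop(entity_id, None)
        if components is None:
            continue  # already moved or despawned
        components["position"] = transition.arrival
        worlds[transition.destination].entities[entity_id] = components


lobby = WorldInstance("lobby")
dungeon = WorldInstance("dungeon-instance-1")
worlds = {w.name: w for w in (lobby, dungeon)}

lobby.entities[42] = {"position": (10.0, 3.0), "health": 80}
door = ZoneTransition(destination="dungeon-instance-1", arrival=(0.0, 0.0))
transition_system(lobby, worlds, [(42, door)])
```

Entity 42 keeps its health but leaves the lobby entirely, which is why everyone else in the common area "disappears" from that player's point of view.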
Back in the good old days of the early MMORPGs like EverQuest, this process was called “zoning” and could take several minutes. Today we can create all kinds of optimizations in our code and networking infrastructure that can make these transitions seamless.
If you’ve encountered games like World of Warcraft or its ilk, then you should be familiar with the concept of a game server. They often have thematic names and even custom rules (e.g. “player versus player combat allowed”), but their true purpose is to shard traffic. These so-called game servers are often actually server clusters. These clusters are scheduling multiple processes that are managing ECS worlds as described above — sometimes as monoliths and other times as implementations of the “lobby and instance” model, or a hybrid approach.
ECS worlds often have systems built into them for monitoring, or run under the watchful eyes of other external monitors. As an example let’s take a “lobby” in a game where players can hang out, form teams, and interact with shops and vendors. Developers may have determined that on the current architecture, each lobby can support about 150 players before performance begins to suffer. Whether an internal system is doing the monitoring in a dispatch loop, or an external system is watching, you might kick off a new instance of an ECS world host process for the lobby when the player count reaches 140. Then, at 150, you start redirecting newly arriving players to the new world (often also called a shard). We can also front lobby host processes with a load balancer that is informed by the current size and capacity of each lobby. Such load balancers might round robin unaffiliated players into lobbies while sending grouped players or friends into the same lobby.
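The 140/150 thresholds translate into surprisingly little logic. The sketch below is an assumption-heavy toy: LobbyDirector is a made-up name, starting a lobby just appends an object instead of asking a scheduler for a new host process, placement is first-fit rather than round robin, and groups are assumed to be smaller than a lobby's capacity.

```python
from dataclasses import dataclass, field

WARM_UP_AT = 140  # start a spare lobby at this player count (figure from the text)
CAPACITY = 150    # stop sending new players to a lobby at this count


@dataclass
class Lobby:
    name: str
    players: set = field(default_factory=set)

    @property
    def free(self):
        return CAPACITY - len(self.players)


class LobbyDirector:
    """Hypothetical capacity-aware placement for lobby world hosts."""

    def __init__(self):
        self.lobbies = [Lobby("lobby-1")]

    def _start_lobby(self):
        # Stand-in for asking a scheduler to launch another ECS world host.
        lobby = Lobby(f"lobby-{len(self.lobbies) + 1}")
        self.lobbies.append(lobby)
        return lobby

    def place(self, group):
        """Place a whole group (one or more player IDs) in a single lobby."""
        group = set(group)
        target = next((lobby for lobby in self.lobbies if lobby.free >= len(group)), None)
        if target is None:
            target = self._start_lobby()
        target.players |= group
        # Warm up a spare before the newest lobby fills.
        if len(self.lobbies[-1].players) >= WARM_UP_AT:
            self._start_lobby()
        return target.name


director = LobbyDirector()
for player_id in range(139):
    director.place([player_id])
lobbies_before = len(director.lobbies)
director.place([139])  # the 140th player triggers the warm-up
lobbies_after = len(director.lobbies)
for player_id in range(140, 149):
    director.place([player_id])  # lobby-1 now holds 149 players
party = director.place(["ann", "bob", "cy"])  # only one free slot left in lobby-1
```

The party of three lands together in lobby-2 even though lobby-1 still has a free slot, which is the "keep friends together" behavior described above.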
When we build microservices in the cloud, we expect that we can dynamically scale up our instances to deal with increased demand and scale back down when demand reduces. We can do the same thing with ECS world hosts, and we can even have such hosts spun up by popular container schedulers like Kubernetes. I would be interested to find out if we could use basic Kubernetes constructs like services and ingress controllers to manage UDP routing to ECS world hosts while not impacting the low-level performance such simulations demand.
But what about fault tolerance? In our cozy microservices world, we expect that stateless services can crash and recover with no one noticing. We can accomplish this using the standard strategy of persisting the important information so that it can be loaded when a process starts. You might actually be surprised at how little information from a simulation/game needs to be persisted across instances. Core player data needs to survive, but a lot of things are perfectly fine resetting to their default states if an ECS host restarts. Some ECS zones might require a higher SLA than that, so games often run hot backups of each zone that are kept in sync and take over automatically if the primary fails.
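One way to picture the "persist only what matters" idea is a snapshot that filters components by type. Which components count as durable is entirely my assumption here (inventory and experience), as are the snapshot and restore function names and the choice of JSON as the storage format.

```python
import json

# Assumption: the component types that must survive a host restart.
DURABLE = {"inventory", "experience"}


def snapshot(entities):
    """Serialize only durable components; transient state is simply dropped."""
    kept = {}
    for entity_id, components in entities.items():
        durable = {k: v for k, v in components.items() if k in DURABLE}
        if durable:
            kept[str(entity_id)] = durable  # JSON object keys must be strings
    return json.dumps(kept, sort_keys=True)


def restore(blob, defaults):
    """Rebuild entities from a snapshot, filling transient components with defaults."""
    return {int(entity_id): {**defaults, **components}
            for entity_id, components in json.loads(blob).items()}


live = {
    7: {"inventory": ["sword"], "experience": 1200, "position": (88, 14), "health": 3},
    8: {"position": (1, 2), "velocity": (0, 9)},  # a firebolt: nothing worth saving
}
blob = snapshot(live)
recovered = restore(blob, defaults={"position": (0, 0), "health": 100})
```

After the simulated restart, the player comes back with their sword and experience at a default position with full health, and the in-flight firebolt is simply gone.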
In conclusion, I just want to reiterate that the Entity Component System pattern is a powerful one when applied to the right problem, and if we get creative we can still apply much of our knowledge and infrastructure from the world of microservices and the cloud. In addition, I think ECS might be ready to expand outside the realm of game development in certain modern simulation scenarios. For example, I can think of a few simulation applications that might provide a virtual or augmented reality layer on top of IoT systems.
TL;DR — go forth and build games!