Rust Async and the Terrible, Horrible, No Good, Very Bad Day
Yesterday I was working on an S3 Provider to provide blob store capabilities to WebAssembly modules in the waSCC ecosystem. If you’re not familiar with the waSCC project, the idea is to take non-functional requirements (or cloud-native capabilities) and dynamically bind them to portable, secure WebAssembly modules that contain pure business logic. Our goal is to make it so you can write your business logic and plug in whatever key-value store, message broker, or blob store you want. You can execute your WebAssembly module logic against a local file system when developing on your laptop and something like S3 or Azure or Google blobs when you’re running in the cloud.
The problem I had sounds deceptively simple — just give the WebAssembly module access to the S3 blob store. It’s actually pretty tricky, but for a pretty fascinating reason if you’re into distributed systems design.
WebAssembly modules are single-threaded. Even though the waSCC dispatching and multiplexing is multi-threaded, once execution enters a wasm module, it’s single threaded internally.
This has some interesting consequences when you’re trying to get these wasm modules (we call them actors in waSCC parlance) to run in the cloud. If a particular function call to an actor takes a long time, waSCC won’t drop connections or miss incoming HTTP requests, but all the inbound work will pile up at the actor’s “front door” like a mountain of delivered but unopened Amazon boxes, waiting for processing to complete.
You can imagine the kind of problems this would create if you’ve got an actor that handles HTTP requests and downloads large files from S3 or Azure or wherever the blob store is. Every long download would immediately gum up the works and block all incoming requests. This also means that a single actor couldn’t download files to handle different requests concurrently (as opposed to in parallel, which is a subtle distinction).
The solution to this wasn’t code, it was distributed system design. Instead of blocking the actor to download a giant file, the actor asks the capability provider to start a download. Control is then released and the function call is allowed to return, which then allows waSCC to pull more work off of incoming queues. The notion of doing small bits of work and then “relinquishing control” is key to patterns like the “reactor” that you see in frameworks like Node.js.
In the background (this will be important when I get to my discussion of async and await), the capability provider is pulling small chunks of a blob out of the blob store. Each chunk is then asynchronously dispatched to the requesting actor. This dispatch goes through the same work queue system that all other work for the actor uses. This means it can interleave HTTP request handling with receipt of file chunks with message broker subscription messages, etc. This allows the actor to be highly concurrent without being internally multi-threaded, a hallmark of the actor pattern.
As long as there’s enough contextual metadata on each chunk delivered to the actor, the actor can deal with it accordingly. In my case for Gantry, that was re-publishing the chunk on a NATS subject to allow a consumer to download the file from the service.
I knew this pattern was feasible, because I was able to get it to work with a local file system provider. However, this is where the terrible, horrible, no good, bad day began…
To recap: the code I needed to write would need to handle a request to initiate a download and immediately return from that handler. In the background, it would download small chunks of the requested file from S3 (or an S3-compliant store like Minio) and deliver them asynchronously to the actor using the waSCC-supplied dispatcher.
I quickly ran into trouble because rusoto — the crate suite that gives Rust developers access to Amazon Web Services (AWS) client functionality — is written entirely in asynchronous code. This did not mesh well with my original code that launched a single background thread to deliver chunks in a loop.
When I couldn’t get the borrow checker to let me combine the use of thread spawning and the invocation of async code, I turned to async_std, a crate that provides a bunch of useful abstractions and short-cuts. I was happy, because switching from my thread spawn to an
async move shortcut magically made the borrow checker happy. My troubles were over! (Narrator: it was not over).
The first painful lesson I learned was something I think Rust developers newly investigating asynchronous development don’t often understand. It certainly took me multiple firm head-to-desk slams to figure it out.
The Rust language contains structs and other utilities to deal with futures. These futures are now hidden for you behind the use of the async keyword on functions and code blocks, and the await keyword on asynchronous invocations. There’s “magic” happening there, and I dislike implicit magic.
Futures don’t do anything until they’re polled. You can chain together 500 futures but no work happens until something attempts poll that unit of work. The frustration so many experience comes from the first lesson learned yesterday:
Lesson I — The Rust language does not include any intrinsic high-level means to poll futures. That is left as an “exercise for the reader”
Sure, we can use the future traits and invoke
poll directly. We can write our own wrappers around it, but that’s going to get tedious quickly and, given the relative complexity, we’re likely to write our pollers the wrong way. This is where tokio and async_std come in. Both of these are crates, they are not a part of the core language, but they contain polling runtimes for use with the futures that are magically produced behind the scenes via the async and await keywords.
With my happy borrow checker, I wrote up some tests to see if I could stream file chunks out of S3 and into a dispatcher (the dispatcher represents the “edge” for my code under test). I immediately got an error shouting at me that the code was not running “within the tokio runtime.”
I thought this was odd, since I’d never once asked to run the tokio runtime. Why would the code complain that I wasn’t using tokio? I thought polling runtimes were supposed to be independent of futures?
After much gnashing of teeth, I realized that rusoto used tokio as a direct dependency. Further, the rusoto crate cannot function without the use of tokio. It is not written in a runtime-agnostic way. This brings up the second hard learned lesson from yesterday:
Lesson II — It is possible for a single dependency of your crate to be so tightly coupled to a future polling runtime that it effectively makes that runtime mandatory for all consumers.
Lesson 2 may be the thing that enraged me the most yesterday. While I can understand the reasoning behind the language not having its own native future poller/runtime, I also absolutely loathe being forced into a technology choice by one of my dependencies. In this case, you can’t invoke rusoto functions unless you do so within a tokio runtime.
In my ideal world, we’d all be writing runtime-agnostic futures using “pure” async and await syntax, and our crate consumers could then happily choose whichever runtime they wanted based on their needs. When my anger subsides, I might take a deeper look into what it would take to make such a library, and the choices people make that prevent this from happening.
Lesson III — If you drop (or allow to be dropped) the tokio runtime, any pending tasks are cancelled.
After I’d finally gotten things to compile, made the borrow checker happy, and had eliminated the cross-pollination of future runtimes, I ran into another issue — all my chunk deliveries were getting cancelled before they could finish! I learned another valuable lesson here: if you’ve decorated a function as a
tokio::main or an
async_std::main , then that function is the lifetime scope of the runtime. If the function returns, the runtime drops, and any active tasks are cancelled.
This doesn’t normally show up in “hello world” samples because people decorate their
main() functions with it, so the runtime lasts as long as the process. My runtime’s scope began with the actor’s request to download and ended immediately thereafter, so my background tasks were all cancelled.
To fix this I decided to create explicitly-scoped runtimes for each of my functions, and I created a runtime dedicated to the background thread in which the chunk delivery happened so I could guarantee it wouldn’t go out of scope until I was done grabbing file chunks. I’m still stewing on this design, and may experiment with some different ways of controlling the runtime scopes in future iterations.
Lesson IV — Each runtime has a single blocking entry point, so it can only ever block on a single unit of asynchronous work.
This one really stumped me. I kept getting errors from the tokio runtime claiming that you can’t create a new runtime from inside another one. I was baffled —”I’m only creating one runtime!”, I shouted at my monitor.
It turns out that deep inside the rusoto code there was a function called
into_sync_reader() that converted a byte stream future into a synchronous object that implemented the
Read trait. I’d used that to get the chunk bytes from S3. As it turns out, the only way to convert asynchronous work into synchronous work is to block on it. So, inside that function, it calls
block_on(), which is the tokio runtime’s singleton entry point. For reasons that are only obvious to me now, you can’t block inside a section of code that’s already being blocked on by an outer scope (aka a deadlock). I fixed this problem by mapping the asynchronous stream into a vector of bytes and awaiting the result in an asynchronous fashion that didn’t block or create a new runtime.
My code finally worked! It was time to celebrate!
Before I conclude this post, I need to thank https://twitter.com/o0ignition0o because he was on the receiving side of my unending tirade of questions and complaints and frustration. I would not have been able to get through the day without his support, and the fact that this community is so willing to help random people who aren’t coworkers is a beautiful thing. I hope I can one day pay that support forward to others.
In conclusion — I failed yesterday. I failed a lot. I made a ton of mistakes, I saw dozens of bugs arise from my own incorrect assumptions about how things worked and why. However, because of all of the mistakes yesterday, I learned a ton of useful information, four key lessons, and I feel infinitely better equipped to write concurrent code going forward.
I failed with purpose, and so I counted the day as a victory.