Creating a Stream-Chunking Iterator in Rust

Kevin Hoffman
4 min read · Feb 26, 2018

Iterators are, in my opinion, one of the most underrated aspects of the Rust programming language. Every time I learn something new about Rust, I find myself in awe of the subtle little design touches that enable amazing developer power and productivity.

The other day I needed to transmit large (multi-gigabyte) files to a remote server over gRPC (as one does). gRPC is an amazing protocol, but there are ways you can abuse it. For example, if you attempt to send messages larger than ~2MB, you’re likely to start getting the dreaded EOF in the middle of a transmission. So to get this large file from my client to my server, the server’s API accepts a stream (see gRPC Streaming) of file segments.

To verify that the file segment wasn’t corrupted in transit (and that no other error occurred), the client hashes the segment’s bytes and puts the hash on the message. The server, upon receipt of a file segment, hashes the bytes as well, compares its hash with the one on the message, and, if everything is kosher, accepts the segment. Additionally, there’s a “whole file” hash that needs to be sent on the last message, just to be extra diligent.
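To make that concrete, here’s a minimal sketch of the server-side check. FileSegment is a hypothetical stand-in for the gRPC message type, and hash_data is the hashing helper shown later in this post:

struct FileSegment {
    bytes: Vec<u8>,
    hash: String,
}

// Hypothetical server-side verification of a single segment.
fn verify_segment(segment: &FileSegment) -> Result<(), String> {
    let computed = hash_data(&segment.bytes);
    if computed == segment.hash {
        Ok(()) // hashes match: accept the segment
    } else {
        Err(format!("hash mismatch: got {}, expected {}", computed, segment.hash))
    }
}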

There are a number of language primitives that let me read an entire file into memory, convert it to bytes, and slice it up into chunks (in my case, 1MB chunks, or 1,048,576 bytes). The problem is that they pull the whole file into memory: if I’m transmitting a 7GB file, I certainly don’t want my app to consume 7GB of RAM before it even starts sending. The more efficient way to deal with this is to use an iterator.
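For contrast, the read-everything approach being rejected here might look like this (the file name is just a placeholder):

use std::fs::File;
use std::io::Read;

// The eager approach: simple, but a 7GB file means 7GB of RAM.
let mut all_bytes = Vec::new();
File::open("input.dat").unwrap().read_to_end(&mut all_bytes).unwrap();
for _chunk in all_bytes.chunks(1024 * 1024) {
    // each slice is 1,048,576 bytes (the last one may be shorter)
}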

Iterators are lazy. They only produce information when asked (by invoking next). If you’ve used them in other languages, this probably sounds familiar. You can also chain these lazy iterators together into a pipeline of sorts, maintaining the per-item laziness. It is the ability to produce an output iterator from an input iterator that allows for some pretty amazing and powerful code.
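As a tiny illustration of that chaining, nothing in this pipeline runs until collect asks for elements:

// map is lazy here: the closure runs only as collect pulls each element.
let doubled: Vec<i32> = (1..=3).map(|n| n * 2).collect();
assert_eq!(doubled, vec![2, 4, 6]);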

I felt like an ideal way to simplify my task would be to create an iterator that produces a single hashed 1MB chunk of data from some source whenever you ask for an element. This essentially guarantees that my chunk-and-send operation will never consume more than 1MB of RAM, plus whatever minimal overhead I need for processing.

Combining the power of iterators with the beauty of traits, I can make my work even more flexible. I don’t have to build my iterator on a file, or a stream, or even any concrete type I know of at build time. Instead, I can build my iterator on top of the Read trait. Anything that implements Read lets me pull a buffer of bytes from it, and is therefore an ideal candidate as an input source for my iterator.
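A small sketch of what that flexibility buys: a function generic over Read accepts files, TCP streams, and in-memory buffers alike (the file name and address below would be placeholders):

use std::io::{Cursor, Read};

// Generic over Read: files, sockets, and in-memory buffers all qualify.
fn first_byte<R: Read>(mut source: R) -> std::io::Result<Option<u8>> {
    let mut buf = [0u8; 1];
    let n = source.read(&mut buf)?;
    Ok(if n == 1 { Some(buf[0]) } else { None })
}

// first_byte(File::open("input.dat")?) and
// first_byte(TcpStream::connect("127.0.0.1:9000")?) work just as well as:
let byte = first_byte(Cursor::new(vec![42u8]));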

For lack of a better name, let’s call my starting struct HashedChunker. This struct just needs to own the source, the thing that implements Read.

use std::io::Read;

pub struct HashedChunker<T> where T: Read {
    source: T,
}
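Two pieces are implied but never shown: the Chunk type (its data and hash fields appear later) and the HashedChunker::new constructor called in the final snippet. Minimal sketches of both, inferred from that usage:

// Inferred from later usage: a chunk owns its bytes and their hash.
pub struct Chunk {
    pub data: Vec<u8>,
    pub hash: String,
}

impl<T: Read> HashedChunker<T> {
    // Matches the HashedChunker::new(f) call shown later in this post.
    pub fn new(source: T) -> Self {
        HashedChunker { source }
    }
}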

Now let’s implement Iterator for HashedChunker:

impl<T> Iterator for HashedChunker<T> where T: Read {
    type Item = Chunk;

    fn next(&mut self) -> Option<Self::Item> {
        // A fresh 1MB buffer; read may fill less than the whole thing.
        let mut buffer = [0u8; 1024 * 1024];
        let res = self.source.read(&mut buffer);
        match res {
            Ok(count) => {
                if count > 0 {
                    Some(produce_chunk(&buffer[..count]))
                } else {
                    None // zero bytes read means EOF
                }
            },
            Err(_) => None, // an I/O error simply ends the iteration
        }
    }
}

This is the core of the functionality I needed, and it’s remarkably terse. I could even simplify some of this a little more, but I left some things on separate lines just to keep things as self-explanatory as possible.
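Because the chunker is generic over Read, it’s easy to sanity-check against an in-memory source. A quick sketch using std::io::Cursor and the hypothetical new constructor from earlier:

use std::io::Cursor;

// 2MB + 10 bytes of dummy data; Cursor<Vec<u8>> implements Read.
let data = vec![7u8; 2 * 1024 * 1024 + 10];
let mut chunker = HashedChunker::new(Cursor::new(data));

assert_eq!(chunker.next().unwrap().data.len(), 1024 * 1024);
assert_eq!(chunker.next().unwrap().data.len(), 1024 * 1024);
assert_eq!(chunker.next().unwrap().data.len(), 10); // the short tail
assert!(chunker.next().is_none()); // EOF ends the iteration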

The core of produce_chunk just converts the borrowed slice (we can’t own a slice) into a vector of bytes and computes the hash:

use std::hash::Hasher;
use twox_hash::XxHash; // xxHash implementation from the twox-hash crate

fn produce_chunk(in_data: &[u8]) -> Chunk {
    Chunk {
        data: in_data.to_vec(),
        hash: hash_data(in_data),
    }
}

fn hash_data(in_data: &[u8]) -> String {
    let mut hasher = XxHash::with_seed(0);
    hasher.write(in_data);
    let res = hasher.finish();
    // {:16x} space-pads the hex output to a width of 16, hence the trim
    format!("{:16x}", res).trim().to_string()
}
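Worth noting: {:16x} space-pads to a width of 16, which is exactly why the trim call is there; {:016x} would zero-pad instead and give fixed-width hashes. Exercising produce_chunk on a few bytes:

let chunk = produce_chunk(b"hello world");
assert_eq!(chunk.data.len(), 11);
println!("hash = {}", chunk.hash); // up to 16 hex digits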

This is great, and took far less code than I expected when going down this road. My inner Go developer was screaming at me to just write a for loop over the file and be done with it, but when you see what this abstraction lets me do, I think you’ll agree that these experiments are worth it.

Here’s the code to read from a file in 1MB chunks and transmit them to a remote server along with the hash of each segment:

use std::fs::File;

let f = File::open("input.dat").unwrap();
let chunker = HashedChunker::new(f);
let sent_chunks: Vec<Result<Chunk, String>> =
    chunker.into_iter().map(|c| transmit_chunk(c)).collect();
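transmit_chunk itself isn’t shown in this post; a hypothetical stand-in matching the Result<Chunk, String> signature implied above, with the actual gRPC call elided, might look like:

// Hypothetical stub: the real version would invoke the generated gRPC client,
// e.g. client.send_segment(&chunk).map_err(|e| e.to_string())?;
fn transmit_chunk(chunk: Chunk) -> Result<Chunk, String> {
    Ok(chunk)
}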

That’s it! The transmit_chunk function just attempts the gRPC call and returns either Ok(chunk) or an error. If I don’t collect the results into a vector right away, I can actually partition the results (using another iterator!) into successes and failures. With a type alias I can clean the code up even further:

type ChunkTransmitResults = Vec<Result<Chunk, String>>;

let (good, bad): (ChunkTransmitResults, ChunkTransmitResults) =
    chunker.into_iter()
        .map(|c| transmit_chunk(c))
        .partition(|c| c.is_ok());

This brief piece of code is extremely expressive and should be pretty easy to read. I can look at it, hopefully even after having been away from it for a while, and still tell that I’m going to get back two vectors after transmitting my file to the server: one containing the good chunks and one containing the failures.

if !bad.is_empty() {
    panic!("transmit failed!");
}

To wrap it all up: I had a pretty straightforward problem that needed solving. Instead of just writing a one-off for loop, I embraced Rust’s iterators, which let me create something both powerful and expressive.
