Easypack: a data packer in Rust

2022-04-15 21:05

I've been wondering how hard would it be to write a data packer. And since Rust is my language for experimenting these things... Here we are! This short entry is about the few things I've experimented with when writing Easypack, my new toy.

First of all, I would like to say that it is all JBlow's fault :) In here he shows how to create a simple file packer in his programming language, and I just took the inspiration to try to do the same. Of course, my Easypack is not to be compared to some full-fledged packer - it does not handle (yet) compression, update, remove. But it is easy enough to be an instructive experience.

In short, the binary format

The format itself is nothing crazy. As any binary file, it comes with a 4 bytes header for the name of the format, and then a couple of bytes for the version. The body of the file is really a series of bytes (the input data).

The footer is a bit more interesting, as the Table of Contents is written down there. It contains the name of the content, and its position in the file. So, retrieving the content consists in: read the end of the file to locate the ToC, read it to find the interesting label, and go to the point in the file where this entry is located.

The fact that the ToC is at the bottom (instead of the header) allows the file itself to be updated quite fast, as we only need to read the ToC (fast, from the end of the file), write the new data (if the update is an add), and write the updated ToC again. If it where at the beginning, we would not be able to do that, as adding an entry would mean a complete rewrite, since we would not have space left to add other entries.

So, the tradeoff for this approach is clear: we waste some storage to perform fast update (we don't need to rewrite the whole file).

Enforce good API usage...

... through the type system

As this is basically only a playground, I wanted to play with the Rust type system to make the API harder to misuse. In particular, it is easy to see that when writing the data there are few "states" the packer can be in: we may be writing the header, or the records, or the footer.

If we follow this logic, we can use the type system to enforce a change of state when we move from one step to another. In Rust, this is pretty easy to achieve.

The Packer, for instance, is implemented like this:

struct Packer<S: Steps>

The generic parameter S is what's interesting here. It allows us to encode the state of the struct in the struct itself, so that we can only use methods associated to that state.

For instance, when we are in the header step, we only want to be able to write the header. If we were able to write the records, that would be an API misuse. In other programming language, I would have maybe put the state in the struct itself, and then assert at runtime. Of maybe I would have had a different struct for a different state, but it would have been hard to prevent the use of the old one.

Rust, with its borrow checker and its expressive type system, allow us to write something like:

impl Packer<NoneStep> {
    fn only_in_none_step(&mut self)
    fn other_state(self) -> Packer<HeaderStep>
}

impl Packer<HeaderStep> {
    fn only_in_header_step(&mut self)
    fn other_state(self) -> Packer<RecordStep>
}

which allows us to express 2 things:

some functions are available only when we're in certain states, and
state transition consumes the structure, that is unavailable after that.

This is great! We can enforce that we are only able to (let's say) write records after we've done with the headers phase, and so on!

... using Drop for sanity check

Let's start about what I would like to achieve first.

I would love to have a way to force the user to call an API method when s/he is done with the Packer. In particular, I would like the user to say Ok, I'm done with the records, now I write the ToC.

What I've ended up doing is adding a check in the Drop implementation, which is far from great since it is a runtime assert. Indeed, a simple test would clearly show misuse of the API, but I'm not really happy with this :'(