2022-04-15 21:05
I've been wondering how hard would it be to write a data packer. And since Rust is my language for experimenting these things... Here we are! This short entry is about the few things I've experimented with when writing Easypack, my new toy.
First of all, I would like to say that it is all
JBlow's fault :) In
here he shows how to create a simple
file packer in his programming language, and I just took the inspiration to try
to do the same. Of course, my Easypack
is not to be compared to some
full-fledged packer - it does not handle (yet) compression, update, remove. But
it is easy enough to be an instructive experience.
The format itself is nothing crazy. As any binary file, it comes with a 4 bytes header for the name of the format, and then a couple of bytes for the version. The body of the file is really a series of bytes (the input data).
The footer is a bit more interesting, as the Table of Contents is written
down there. It contains the name of the content, and its position in the
file. So, retrieving the content consists in: read the end of the file to
locate the ToC
, read it to find the interesting label, and go to the point in
the file where this entry is located.
The fact that the ToC
is at the bottom (instead of the header) allows the
file itself to be updated quite fast, as we only need to read the ToC
(fast,
from the end of the file), write the new data (if the update is an add
), and
write the updated ToC
again. If it where at the beginning, we would not be
able to do that, as add
ing an entry would mean a complete rewrite, since we
would not have space left to add other entries.
So, the tradeoff for this approach is clear: we waste some storage to perform fast update (we don't need to rewrite the whole file).
As this is basically only a playground, I wanted to play with the Rust
type system to make the API harder to misuse. In particular, it is easy to see
that when writing the data there are few "states" the packer can be in: we may
be writing the header
, or the records
, or the footer
.
If we follow this logic, we can use the type system to enforce a change of state when we move from one step to another. In Rust, this is pretty easy to achieve.
The Packer, for instance, is implemented like this:
struct Packer<S: Steps>
The generic parameter S
is what's interesting here. It allows us to encode
the state of the struct in the struct itself, so that we can only use methods
associated to that state.
For instance, when we are in the header
step, we only want to be able to
write the header. If we were able to write the records, that would be an API
misuse.
In other programming language, I would have maybe put the state in the struct itself,
and then assert at runtime. Of maybe I would have had a different struct for a
different state, but it would have been hard to prevent the use of the old one.
Rust, with its borrow checker and its expressive type system, allow us to write something like:
impl Packer<NoneStep> {
fn only_in_none_step(&mut self)
fn other_state(self) -> Packer<HeaderStep>
}
impl Packer<HeaderStep> {
fn only_in_header_step(&mut self)
fn other_state(self) -> Packer<RecordStep>
}
which allows us to express 2 things:
consumes
the structure, that is unavailable after that.This is great! We can enforce that we are only able to (let's say) write records after we've done with the headers phase, and so on!
Let's start about what I would like to achieve first.
I would love to have a way to force the user to call an API method when s/he is
done with the Packer. In particular, I would like the user to say Ok, I'm done with the records, now I write the ToC
.
What I've ended up doing is adding a
check
in the Drop
implementation, which is far from great since it is a runtime
assert. Indeed, a simple test would clearly show misuse of the API, but I'm not
really happy with this :'(