Wednesday, November 21, 2012

A Formal Language for Data Definitions

Lately, I've started to think again about the irritating problem that there is no formal language for describing binary data layouts (at least not that I know of). So when people attempt to describe a file format or a network protocol they have to resort to vague and nondescript things like:

Each section in the file starts with a header with the format:

4 bytes   header identifier
2 bytes   header length
0--20 bytes  extra data in header

The extra data is described below.

As anyone who has tried to decipher such descriptions can testify, they are not always clear-cut, which leads to a lot of unnecessary work when trying to coax data out of a document.
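
To make the ambiguity concrete, here is a minimal Python sketch of a reader for the header described above. Everything the description leaves unsaid is something I had to guess: the byte order, and whether "header length" counts the six fixed bytes or only the extra data. The function name `read_header`, its parameters, and the sample bytes are made up for illustration.

```python
import struct

def read_header(buf, byte_order="<", length_includes_fixed_part=False):
    """Parse the informally described header under one set of guesses.

    The prose description says nothing about byte order or about what
    "header length" counts, so both are parameters here (assumptions).
    """
    # 4 bytes identifier + 2 bytes length, no padding
    ident, length = struct.unpack_from(byte_order + "4sH", buf, 0)
    extra_len = length - 6 if length_includes_fixed_part else length
    if not 0 <= extra_len <= 20:
        raise ValueError("extra data length %d is outside 0..20" % extra_len)
    if len(buf) < 6 + extra_len:
        raise ValueError("buffer too short for declared header length")
    return ident, buf[6:6 + extra_len]

# Hypothetical sample: identifier "SECT", length bytes 08 00, then 8 more bytes.
sample = b"SECT" + b"\x08\x00" + b"ABCDEFGH"

guess_a = read_header(sample)                                   # length = extra bytes only
guess_b = read_header(sample, length_includes_fixed_part=True)  # length = whole header
```

Both guesses parse the same bytes without complaint but disagree about where the extra data ends, which is exactly the kind of thing a formal definition would pin down.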

It is even worse when I create my own data formats (for our engine's runtime data). I would like to document those formats in a clear and unambiguous way, so that others can understand them. But since I have no standardized way of doing that, I too have to resort to ad-hoc methods.

This whole thing reminds me of the state of mathematics before formal algebraic notation was introduced, when you had to write things like: the sum of the squares of these two numbers equals the square of the previous number. Formal notation can bring a lot of benefits (just look at what it has done for mathematics, music, and chess).

For data layouts, a formal definition language would allow us to write a tool that could open any binary file (that we had a data definition for) and display its content in a human-readable way:

height = 128
width = 128
comment = "A funny cat animation"
frames = [
 {display_time = 0.1 image_data = [100 120 25 ...]}
]

The tool could even allow us to edit the readable data and save it back out as a binary file.

A formal language would also allow debuggers to display more useful information. By writing data definition files, we could make the debugger understand all our types and display them nicely. And it would be a lot cleaner than the hackery that is autoexp.dat.

Just to toss something out there, here's an idea of what a data definition might look like:

typedef uint32_t StringHash;

struct Light
{
 StringHash name;
 Vector3  color;
 float  falloff_start;
 float  falloff_end;
};

struct Level
{
 uint32_t version;
 uint32_t num_lights;
 uoffset32_t light_data_offset;

light_data_offset:
 Light lights[num_lights];
};

This is a C-inspired approach, with some additions. Array lengths can be parametrized on earlier data in the file, and labels can be used to generate offsets to different sections in the file.
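
As a rough illustration of what a tool would have to do with such a definition, here is a hand-written Python sketch that unpacks a Level. It rests on several assumptions the definition doesn't state: little-endian byte order, no padding, Vector3 being three 32-bit floats, and the offset being counted in bytes from the start of the file. The function name and the sample data are hypothetical.

```python
import struct

# Assumed layouts: little-endian, tightly packed.
LEVEL_HEADER = struct.Struct("<III")   # version, num_lights, light_data_offset
LIGHT = struct.Struct("<I3fff")        # name hash, color (3 floats), falloff_start, falloff_end

def unpack_level(buf):
    """Read the fixed header, then follow the offset to the light array."""
    version, num_lights, light_data_offset = LEVEL_HEADER.unpack_from(buf, 0)
    lights = []
    for i in range(num_lights):
        # The array length comes from an earlier field, the position from the offset.
        name, r, g, b, start, end = LIGHT.unpack_from(
            buf, light_data_offset + i * LIGHT.size)
        lights.append({"name": name, "color": [r, g, b],
                       "falloff_start": start, "falloff_end": end})
    return {"version": version, "num_lights": num_lights, "lights": lights}

# Hypothetical file: header, 4 bytes of unrelated data, then two lights at offset 16.
sample = (LEVEL_HEADER.pack(1, 2, 16) + b"\x00" * 4 +
          LIGHT.pack(0xDEADBEEF, 1.0, 0.5, 0.25, 2.0, 10.0) +
          LIGHT.pack(0x12345678, 0.0, 1.0, 0.0, 1.0, 5.0))
level = unpack_level(sample)
```

The point of the definition language would be that nobody has to write this code by hand for every format.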

I'm still tossing around ideas in my head about what the best way would be to make a language like this a reality. Some of the things I'm thinking about are:

Use Case

I don't think it would do much good to just define a language. I want to couple it with something that makes it immediately useful. First, for my own motivation. Second, to provide a "reality check" to make sure that the choices I make for the language are the right ones. And third, as a reference implementation for anyone else who might want to make use of the language.

My current idea is to write a binary-to-JSON converter: a program that, given a data definition file, can automatically convert back and forth between a binary and a JSON representation of the same data.
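
Here is a very small Python sketch of what that round trip could look like for a flat record. The `DEFINITION` list is a hypothetical stand-in for a parsed data definition file, the field names are borrowed from the cat animation example, and little-endian byte order is assumed. Strings, arrays and nested structs are deliberately left out.

```python
import json
import struct

# Hypothetical parsed definition: (field name, struct format code) in file order.
DEFINITION = [("height", "I"), ("width", "I"), ("display_time", "f")]

def _layout(definition):
    """Build the binary layout for a flat definition (little-endian assumed)."""
    return struct.Struct("<" + "".join(code for _, code in definition))

def binary_to_json(definition, data):
    """Unpack the binary record and emit it as a JSON object."""
    values = _layout(definition).unpack(data)
    names = [name for name, _ in definition]
    return json.dumps(dict(zip(names, values)))

def json_to_binary(definition, text):
    """Read a JSON object and pack its fields in definition order."""
    obj = json.loads(text)
    return _layout(definition).pack(*(obj[name] for name, _ in definition))

binary = struct.pack("<IIf", 128, 128, 0.5)
text = binary_to_json(DEFINITION, binary)
round_trip = json_to_binary(DEFINITION, text)
```

Because both directions are driven by the same definition, the binary format can change without touching the converter itself.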

Syntax

The syntax in the example is very "C like". The advantage of that is that it will automatically understand C structs if you just paste them into the data definition file, which reduces the work required to set up a file.

The disadvantage is that it can be confusing with a language that is very similar to C, but not exactly C. It is easy to make mistakes. Also, C++ (we probably want some kind of template support) is quite tricky to parse. If we want to add our own enhancements on top of that, we might just make a horrible mess.

So maybe it would be better to go for something completely different. Something Lisp-like perhaps. (Because: Yay, Lisp! But also: Ugh, Lisp.)

I'm still not 100% decided, but I'm leaning towards a restricted variant of C: something that retains the basic syntactic elements, but is easier to parse.

Completeness

Should this system be able to describe any possible binary format out there?

Completeness would be nice of course. It is kind of annoying to have gone through all the trouble of defining a language and creating the tools and still not be able to handle all forms of binary data.

On the other hand, there are a lot of different formats out there and some of them have a complexity that is borderline insane. The only way to be able to describe everything is to have a data definition language that is Turing complete and procedural (in other words, a detailed list of the instructions required to pack and unpack the data).

But if we go down that route, we haven't really raised the abstraction level. In that case, why even bother creating a new language? The format description could just be a list of the C instructions needed to unpack the data. That doesn't feel like a step forward.

Perhaps some middle ground could be found. Maybe we could make a language that was simple and readable for "normal" data, but still had the power to express more esoteric constructs. One approach would be to regard the "declarative statements" as syntactic sugar in a procedural language. With this approach, the declaration:

struct LightCollection
{
 unsigned num_lights;
 LightData lights[num_lights];
};

Would just be syntactic sugar for:

function unpack_light_collection(stream)
 local res = {}
 res.num_lights = unpack_unsigned(stream)
 res.lights = {}
 for i=1,res.num_lights do
  res.lights[i] = unpack_light_data(stream)
 end
 return res
end

This would allow the declarative syntax to be used in most places, but we could drop out to full-featured Turing complete code whenever needed.
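
A rough Python sketch of that idea: the declarative definition is just a list of fields that gets turned into an ordinary unpack function. The `make_unpacker` helper, the tuple format, and the choice of a single 32-bit float for LightData are all assumptions made for illustration, as is the little-endian byte order.

```python
import io
import struct

def unpack_unsigned(stream):
    """Read one 32-bit unsigned integer (little-endian is an assumption)."""
    return struct.unpack("<I", stream.read(4))[0]

def unpack_light_data(stream):
    """Hypothetical LightData: a single 32-bit float."""
    return struct.unpack("<f", stream.read(4))[0]

def make_unpacker(fields):
    """Turn a declarative field list into a procedural unpack function.

    Each field is (name, unpack_fn, count). count is None for a scalar,
    or the name of an earlier field that holds the array length.
    """
    def unpack(stream):
        res = {}
        for name, unpack_fn, count in fields:
            if count is None:
                res[name] = unpack_fn(stream)
            else:
                res[name] = [unpack_fn(stream) for _ in range(res[count])]
        return res
    return unpack

# The "declaration" of LightCollection, expanded into a function.
unpack_light_collection = make_unpacker([
    ("num_lights", unpack_unsigned, None),
    ("lights", unpack_light_data, "num_lights"),
])

result = unpack_light_collection(io.BytesIO(struct.pack("<Iff", 2, 1.5, 2.5)))
```

Since every field is unpacked by a plain function, the escape hatch comes for free: any hand-written function that takes a stream can be dropped in where the declarative form isn't expressive enough.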

Comments

  1. Years ago, I wrote a tool that made some steps towards this. It was a very simple hierarchical, block-structured language, like a minimal XML, and trivial to parse. In it you could write both the data to be converted to binary, and a 'rules' file that would tell a compiler how to do the conversion.

    The compiler would walk the structure of both the rules and the data file simultaneously, translating it as it went. There was a set of special rules for common things like adding a header, or saving the size of a block (including those in the future), or the length of an array.

    The basic point of it was to minimise the amount of crap we had to write when converting data from our tools to game ready binary data. We could change the binary format independently of the data and the tools, which was nice.

    It was also a big pain in the ass, mostly because that's all it did. It didn't help with versioning or backwards compatibility (which was a nightmare), or generate your binary loading code for you, and frankly I wrote it when I was still only one year into the industry, so it wasn't exactly my finest piece of code.

    These things are all fixable, but it's definitely a problem with a lot of details that need carefully unpicking. Code gen and backwards compatibility are definitely two biggies. Being able to avoid having to put your data into a text file first is another.

    It's an interesting problem though. We also have a need for this, so I'd be interested in talking to you about it. It just appears that standardised but ad hoc methods have been easier to date.


  2. FWIW, there's an existing standard called ASN.1 that provides a formal language for describing binary data layouts. I think it's generally used for defining messaging protocols.

    It's verbose and nowhere near as readable as your pseudocode though, so even if ASN.1 provides some of the functionality you're looking for I suspect it's more heavyweight than most game developers would want.

  3. I can see merit in having a more formal way of defining structure.
    One approach that comes to mind is a hex editor called Synalyze It[1], where you can define
    grammars for file formats etc. and it can highlight and understand parts of the binary.
    The grammar isn't quite as human readable, being XML created by the grammar editor, but it has the mechanics needed for many file formats laid out.


  4. Google's protocol buffers might interest you...

  5. Hi,

    I think 010Editor [1] does something similar to what you describe, but they have a language close to C in syntax which allows you to describe "every" case. So it's less simple than what you are proposing, but it's very powerful.

    By the way, I totally love your blog. Believe it or not, I was working at a game company that was trying to do exactly what you are doing at bitsquid (but only for internal use), and I think we made all the errors you list in this blog (all-in-one editor, complex XML format, complex serialization system, everything is an object, etc.). I was very disappointed with the technical decisions and... a colleague showed me your blog and it blew my mind!!

    Now I don't work in the game industry anymore, nor do I code in C++, but I think your blog is one of the main reasons (maybe also the book "Coders at Work") I'm still coding for a living.

    Keep up the good work,
    Andreas, a true fan


    1. Indeed, 010Editor's templates are what I have been using. After talking to a lot of friends who do reverse engineering, this is basically the standard.

    2. Thanks for the nice words!

      I had a quick look at the 010Editor data templates. And you are right, it looks very similar to what I was looking for.

      I'll investigate it further.

    3. You're welcome :)

      When I read your blog for the first time I was so amazed, because it was like you answered all our problems with a very simple, understandable and yet extremely powerful and modular solution. On our side we had some hyper-bloated tech that was almost unusable despite five years of R&D... What a waste of time and resources!

      I want to write something about that, because it's the exact opposite (in terms of design) of your engine and I think it will be a great example of what to avoid :).

  6. There are examples of reading such data if you look at any open-source Halo map editor. We had a very similar approach to what you're looking at when writing updates to the editor, Entity. Though this approach would of course need some way to detect the different sets of data.

  7. The answer is "Protocol buffers". Already mentioned, but worth a look.