bitsquid: development blog: 2012

Tuesday, December 11, 2012

Four meditations on bad design decisions

I've recently been doing a major rewrite of one of our core engine systems, the graph that we use for our visual scripting language Flow. Taking it from something that looks like this:

To something that looks like this:

A major rewrite like this is always a humbling experience. When you have to rewrite your own code, every bad decision you made comes back to haunt you. And you don't have anybody else to blame them on.

As if facing your own inadequacy wasn't enough -- rewriting an existing system is always harder than writing one from scratch. When you write a new system you start with a blank slate and can do whatever you want. When you rewrite, you are constrained by what the old system did -- at least if you want to maintain any kind of backwards compatibility.

In addition, a new system can be written iteratively. You can start with a very small, simple system, release early and get feedback. Based on that feedback you can tweak the system. You don't have to think about adding features until you have a good stable base.

When you are doing a rewrite you can't release the new system until it is at least as good as the old one. Otherwise, your users will question why you have spent all that time working on a system that is worse than what you had before. And they will be right.

So a rewrite forces you away from the comfortable land of early releases and quick iterations and into the ugly old waterfall model.

With the power of hindsight, I'd like to reflect a bit on four design mistakes I made when I wrote the first version of the system that made this rewrite a lot harder than it could have been.

Don't use strings for non-text things

Strings have one really good use -- to hold pieces of text that either gets displayed to or inputted by the user. All other use of strings should be regarded as suspicious.

Strings are scary because they are both ambiguous and powerful. Does "a/b.txt" and "A//b.txt" represent the same path? Hard to tell. But maybe you can use case conversion, search and replace and some regular expression monstrosity to figure that out.

If you are doing that kind of string manipulation in any part of the code that is not directly related to user input or output, it is a clear warning sign that your code might be too "stringified".

The most obvious example stringified code is the use of "stringly typed" data, for example, storing a date as the string "2012-12-09". But the problem with strings can also manifest more subtle ways.

The original version of Flow used strings to identify connectors, both internally (as a representation of the connection) and visually (to show the name of the connector):

As a consequence, a Flow node couldn't have two connectors with the same name, and a connector couldn't be renamed (even visually) without breaking all existing connections.

In retrospect, rather than having a single Name property, it would be much better to have separate Id and DisplayName properties. The Id would be a GUID that uniquely identified the property, and the DisplayName would be a (localizable) name, suitable for displaying to the end user.

Using names/strings as identifiers has bitten me in other ways as well. In one system I knew that the names had to be unique (because that is how the script would refer to the objects) so I thought it would be safe to use them as identifiers. What I didn't consider was that there could be situations when there temporarily were two objects that had the same name. For example, if the user had created a rock object, and wanted to create a rock_small object -- as she was half-way through typing that name, there would suddenly be two objects named rock. This created problems for the system.

Lesson learned, I now avoid using strings as identifiers.

When in doubt, you should opt-out

Every system acquires features over time. That is good of course. Those features make the system more powerful and easier to work with.

But among the good features there are usually a few that don't feel quite right. That don't really fit into the design of the system. You can do them of course. You can do anything.

But usually it is best not to. Most of the time when I have added a feature that didn't quite feel right, I have regretted it later. In retrospect it would have been better to try to find a different way of doing what the users wanted that was more natural to the ideas behind the system.

An example: Users of Flow wanted some way to specify the order in which events were triggered, when multiple connections are connected to the same Out connector. This is needed in some situations, for example you may want to make sure that a unit is spawned before it is used.

In the old version of Flow, this was implemented with a context menu on the connection where you could select if it should be a "Do First", "Do Last" or "Do Normal" connection.

This solution never felt 100 % right to me. It was hard to find a good intuitive way to visually represent the "Do First" and "Do Last" connections, and as a result the Flow graphs became harder to understand.

In retrospect, it would have been much better to avoid this feature and wait until I had come up with the more elegant alternative: a sequence node that triggers each of its outputs sequentially:

Be explicit or you'll miss it

Writing code where a lot of things happen implicitly feels great -- to begin with. It is amazing how much you are able to do with just a few lines of code.

But in my experience, implicit code almost always ends up more costly in the long run. It is harder to understand, harder to debug and harder to change. It tends to lock you down in a "local minimum" that can be tricky to come out of.

In Flow, a lot of things are done implicitly. The definition of a Flow node is just a simple C# class:

[Category("Animation")]
public class AnimationEvent : Node
{
    public InVariableUnit Unit;
    public StringVariable Event;
    public InEvent In;
    public OutEvent Out;
}

Through reflection, Flow finds out the members in the class and their types and automatically generates Flow nodes for them. This process involves some ugly string processing (bad decision #1), such as stripping In and Variable from the type name to find the underlying type of members. Reflection is also used to serialize the graphs.

While it is nice to be able to express a node so concisely, there are also a lot of problematic consequences. For example, since the class names get serialized, we can't change the names of classes or properties without breaking the ability to load old files. Also, we have to use some really ugly C# hacks to make sure that the reflection system always returns the members of a class in the order they are declared in the file (so that we can control the order of the connectors).

In retrospect, it would been much better to avoid all this clever reflection stuff and instead just define the node types in configuration files.

Avoid the road of complex code

There is some code that needs to be complex, because it is dealing with fundamentally tricky stuff (like computational geometry) or because it needs to run really, really fast. But in all other cases, complexity is just a cost.

If your code starts to feel complex and hard to keep track of, it is a sign that you are probably doing something wrong. And if you are not careful, you may lock yourself in, so that when you write the next version of the system, you have to recreate all that complex behavior in your new, simpler system. You have to deliberately make your code uglier.

The old version of Flow had a way of "folding" nodes. You could select a number of nodes, group them, and then "fold" the group, collapse it to a single node.

The system had a lot of really hairy code for dealing with this. The code takes a bunch of nodes and creates a new node from them, with connectors matching only the external connectors of the collapsed nodes. it also keeps track of the internal nodes and their connections so they can be recreated if the node is later "expanded".

As you might imagine, this was complicated further by the need for connector names to be unique (see bad decision #1), which meant that some of the external connectors in the new node had to be renamed (since they could come from different internal nodes that had connectors with the same name). So a mapping table was needed to keep track of these renames. Obviously a bad idea, but once you have started down the path of wrongness, it can be hard to turn around.

The new version handles this a lot better. Collapse and expansion is just a visual feature. There are no new nodes created and no other strange things happening to the data, the visualizer just chooses to draw the data in a different way when it is collapsed. In retrospect, that is a much better choice.

That is all, four simple lessons
to guide your future coding sessions
now let your code be light and merry
until its time for Charon's ferry

Wednesday, November 21, 2012

A Formal Language for Data Definitions

Lately, I've started to think again about the irritating problem that there is no formal language for describing binary data layouts (at least not that I know of). So when people attempt to describe a file format or a network protocol they have to resort to vague and nondescript things like:

Each section in the file starts with a header with the format:

4 bytes   header identifier
2 bytes   header length
0--20 bytes  extra data in header

The extra data is described below.

As anyone who has tried to decipher such descriptions can testify, they are not always clear-cut, which leads to a lot of unnecessary work when trying to coax data out of a document.

It is even worse when I create my own data formats (for our engine's runtime data). I would like to document those format in a clear and unambiguous way, so that others can understand them. But since I have no standardized way of doing that, I too have to resort to ad-hoc methods.

This whole thing reminds me of the state of mathematics before formal algebraic notation was introduced. When you had to write things like: the sum of the square of these two numbers equals the square of the previous number. Formal notation can bring a lot of benefits (just look at what it has done for mathematics, music, and chess).

For data layouts, a formal definition language would allow us to write a tool that could open any binary file (that we had a data definition) for and display its content in a human readable way:

height = 128
width = 128
comment = "A funny cat animation"
frames = [
 {display_time = 0.1 image_data = [100 120 25 ...]}
 ...
]

The tool could even allow us to edit the readable data and save it back out as a binary file.

A formal language would also allow debuggers to display more useful information. By writing data definition files, we could make the debugger understand all our types and display them nicely. And it would be a lot cleaner than the hackery that is autoexp.dat.

Just to toss something out there, here's an idea of what a data definition might look like:

typdedef uint32_t StringHash;

struct Light
{
 StringHash name;
 Vector3  color;
 float  falloff_start;
 float   falloff_end;
};

struct Level
{
 uint32_t version;
 uint32_t num_lights;
 uoffset32_t light_data_offset;

 ...

light_data_offset:
 Light lights[num_lights];
};

This is a C-inspired approach, with some additions. Array lengths can be parametrized on earlier data in the file and a labels can be used to generate offsets to different sections in the file..

I'm still tossing around ideas in my head about what the best way would be to make a language like this a reality. Some of the things I'm thinking about are:

Use Case

I don't think it would do much good to just define a langauge. I want to couple it with something that makes it immediately useful. First, for my own motivation. Second, to provide a "reality check" to make sure that the choices I make for the language are the right ones. And third, as a reference implementation for anyone else who might want to make use of the language.

My current idea is to write a binary-to-JSON converter. I.e., a program that given a data definition file can automatically convert back and forth between a binary and a JSON-representation of that same data.

Syntax

The syntax in the example is very "C like". The advantage of that is that it will automatically understand C structs if you just paste them into the data definition file, which reduces the work required to set up a file.

The disadvantage is that it can be confusing with a language that is very similar to C, but not exactly C. It is easy to make mistakes. Also, C++ (we probably want some kind of template support) is quite tricky to parse. If we want to add our own enhancements on top of that, we might just make a horrible mess.

So maybe it would be better to go for something completely different. Something Lisp-like perhaps. (Because: Yay, Lisp! But also: Ugh, Lisp.)

I'm still not 100 % decided, but I'm leaning towards a restricted variant of C. Something that retains the basic syntatic elements, but is easier to parse.

Completeness

Should this system be able to describe any possible binary format out there?

Completeness would be nice of course. It is kind of annoying to have gone through all the trouble of defining language and creating the tools and still not be able to handle all forms of binary data.

On the other hand, there are a lot of different formats out there and some of them have a complexity that is borderline insane. The only way to be able to describe everything is to have a data definition language that is Turing complete and procedural (in other words, a detailed list of the instructions required to pack and unpack the data).

But if we go down that route, we haven't really raised the abstraction level. In that case, why even bothering with creating a new language. The format description could just be a list of the C instructions needed to unpack the data. That doesn't feel like a step forward.

Perhaps some middle ground could be found. Maybe we could make language that was simple and readable for "normal" data, but still had the power to express more esoteric constructs. One approach would be to regard the "declarative statements" as syntactic sugar in a procedural language. With this approach, the declaration:

struct LightCollection
{
 unsigned num_lights;
 LightData lights[num_lights];
};

Would just be syntactic sugar for:

function unpack_light_collection(stream)
 local res = {}
 res.num_lights = unpack_unsigned(stream)
 res.lights = []
 for i=1,res.num_lights do
  res.lights[i] = unpack_light_data(stream)
 end
end

This would allow the declarative syntax to be used in most places, but we could drop out to full-featured Turing complete code whenever needed.

Thursday, November 1, 2012

Bitsquid Foundation Library

Today I want to talk a bit about the Bitsquid Foundation Library that we recently released on Bitbucket (under the permissive MIT license).

It's a minimalistic "foundation" library with things like memory management and collection classes. The idea is to have something that can be used as a reasonable starting-off point for other open source projects.

The library makes some interesting design choices that touches on topics that I have already talked about in this blog and that I think are worth elaborating on a bit further. It also serves an example on how these techniques can be used in "real world" code.

Separation of data and code

The foundation library implements the idea of separating data definitions and function implementation, that I talked about in this article.

Data is stored in structs with public members (prefixed with an underscore to indicate that you should not mess with them unless you know what you are doing) that are found in *_types.h files. Functions that operate on the data are written outside of the struct, in separate *.h files (and organized into namespaces).

For example, the data definition for the dynamic Array<T> class is found in collection_types.h:

template<typename T> struct Array
{
    Array(Allocator &a);
    ~Array();
    Array(const Array &other);
    Array &operator=(const Array &other);
    
    T &operator[](uint32_t i);
    const T &operator[](uint32_t i) const;

    Allocator *_allocator;
    uint32_t _size;
    uint32_t _capacity;
    T *_data;
};

The struct contains only the data used by the array and the operators which C++ forces us to implement as member functions.

The implementation of these functions, as well as the declaration and definition of all other functions that operate on the arrays are found in the array.h file. It contains things like:

namespace array
{
    template<typename T>
    inline uint32_t size(const Array<T> &a)
    {return a._size;}
    
    template<typename T>
    inline bool any(const Array<T> &a)
    {return a._size != 0;}
    
    template<typename T>
    inline bool empty(const Array<T> &a)
    {return a._size == 0;}
}

template <typename T>
inline Array<T>::Array(Allocator &allocator) :
    _allocator(&allocator), _size(0), _capacity(0), _data(0)
{}

This way of arranging data and code fills two purposes.

First, it improves compile times by reducing header inclusion. Header files that want to make use of arrays only need to include collection_types.h, which just contains a few struct definitions. They don't have to drag in array.h, with all its inline code.

Headers including other headers indiscriminately because they need their types is what leads to exploding compile times. By only including the minimal thing we need (the type definitions), compile times are minimized.

Second, and more importantly, this design allows the collection types to be freely extended. Is there anything you miss in the array interface? Perhaps you would like shift() and unshift() methods? Or binary_search()?

No problem. If you want them you can just add them, and you don't even need to modify array.h. Just create your own file array_extensions.h or whatever, and add some new functions to the array namespace, that manipulate the data in the Array<T> interface. The functions you create will be just as good as the functions I have created.

Note that this isn't true for traditional class designs, where you have first-class citizens (methods) and second-class citizens (external functions).

The foundation library has some interesting examples of this. For example, the string_stream functions don't operate on any special StringStream class, they just directly use an Array<char>. Also, the hash and multi_hash interfaces both work on the same underlying Hash<T> struct.

I believe that this design leads to simpler, more orthogonal code that is easier to extend and reuse.

Memory management

The library implements the allocator system mentioned in this article. There is an abstract Allocator interface, and implementations of that interface can provide different allocation strategies (e.g. ArenaAllocator, HeapAllocator, SlotAllocator, etc).

Since I want to keep the library platform independent, I haven't implemented a PageAllocator. Instead, the MallocAllocator is used as the lowest allocator level. If you want to, you can easily add a PageAllocator for your target platform.

For the same reason, I haven't added any critical section locking to the allocators, so they aren't thread safe. (I'm thinking about adding an interface for that though, so that you can plug in a critical section implementation if needed.)

The system for temporary allocations is kind of interesting and deserves a bit further explanation.

Most games have a need for temporary memory. For example, you might need some temporary memory to hold the result of a computation until it is done, or to allow a function to return an array of results.

Allocating such memory using the ordinary allocation system (i.e., malloc) puts a lot of unnecessary stress on the allocators. It can also create fragmentation, when long lived allocations that need to stay resident in memory are mixed with short lived temporary allocations.

The foundation library has two allocators for dealing with such temporary allocations, the ScratchAllocator and the TempAllocator.

The ScratchAllocator services allocation requests using a fixed size ring buffer. An allocate pointer advances through the buffer as memory is allocated, and a corresponding free pointer advances as memory is freed. Memory can thus be allocated and deallocated with simple pointer arithmetic. No calls need to be made to the underlying memory management system.

If the scratch buffer is exhausted (the allocate pointer wraps around and catches up with the free pointer), the ScratchAllocator will revert to using the ordinary MallocAllocator to service requests. So it won't crash or run out of memory. But it will run slower, so try to avoid this by making sure that your scratch buffer is large enough.

If you forget to free something allocated with the ScratchAllocator, or if you accidentally mix in a long-lived allocation among the short-lived ones, that allocation will block the free pointer from advancing, which will eventually exhaust your scratch buffer, so keep an eye out for such situations.

TempAllocator<BYTES> is a scoped allocator that automatically frees all its allocated memory when it is destroyed (meaning you don't have to explicitly call deallocate(), you can just let the allocator fall out of scope). This means you can use it everywhere where you need a little extra memory in a function scope:

void test()
{
     TempAllocator1024 ta;
     Array<char> message(ta);
     ...
}

The BYTES argument to TempAllocator<BYTES> specifies how much stack space the allocator should reserve. The TempAllocator contains char buffer[BYTES] that gets allocated on the stack together with the TempAllocator.

Allocation requests are first serviced from the stack buffer, then (if the stack buffer is exhausted) from the ScratchAllocator.

This means that TempAllocator gives you an allocator that can be used by all collection classes and will use the fastest allocation method possible (local stack memory, followed by scratch buffer memory, followed by malloc() if all else fails).

Minimalistic collection types

The collection classes in the library are distinctly anti-STL. Some of the important differences:

They use the allocation system described above (taking an Allocator as argument to the constructor). They can thus be used sensibly with different allocators (unlike STL types).
The use the data/function separation also described above, which means that the headers are cheap to include, and that you can extend them with your own functionality.
They use a minimalistic design. They assume that the stored data consists of plain-old-data objects (PODs). Constructors and destructors are not called for the stored objects and they are moved with raw memmove() operations rather than with copy constructors.

This simplifies the code and improves the performance (calling constructors and destructors is not free). It also saves us the headache of dealing with storing objects that must be constructed with Allocators.

Personally I like this minimalistic approach. If I want to keep non-POD data in a collection, I prefer to store it as pointers anyway, so I have control over when and how the data is constructed and destroyed. I don't like those things happening "behind my back". You may disagree of course, but in that case you are probably happy to use STL (or boost).

Another example of choosing minimalism is the Hash<T> class. The hash uses a fixed key type which is a uint64_t. If you want to use a key that doesn't fit into 64 bits, you have to hash it yourself before using it to access the data.

And more?

I'm planning to add some basic math code to the library, but haven't gotten around to it yet.

Is there anything else you'd like to see in a library like this?

Wednesday, October 17, 2012

A Data-Oriented, Data-Driven System for Vector Fields -- Part 3

In this post, I'll finish my series on vector fields (see part 1 and part 2) by tying up some loose ends.

Quick recap of what has happened so far:

I've decided to represent my vector fields in functional form, as a superposition of individual effect functions G_i(p).
I represent these functions in bytecode format, as a piece of bytecode that given an input position p computes a vector field strength F_i.
By running each step of the virtual machine over a thousands of input points, the cost of decoding and interpreting the bytecode instructions is amortized over all those points.
This means that we get the bytecode decoding "for free" -- the bytecode can run at nearly native speed.

Bytecode format

In the last article I didn't say much about what format I used for the bytecode. Generally speaking, designing a bytecode format can be tricky, because you have to balance the compactness (keeping programs short) against the decoding cost (keeping bytecode fast).

Lucky for us, we don't care about either of these things. Compactness doesn't matter, because our programs will be very short anyway (just a few instructions). Decoding cost doesn't matter (much), because it is amortized.

When it doesn't really matter I always pick the simplest thing I can think of. In this case it is something like:

(instruction) (result) (argument-1) (argument-2)

Here, instruction is a 4-byte instruction identifier. result is a 4-byte channel identifier that tells us which channel the result should be written to. argument-1 and argument-2 are either channel identifiers or Vector4's with constant arguments. (Instructions of higher arity would have more arguments.)

Note that using 4 bytes for instructions and registers is beyond overkill, but it is the simplest option.

One annoyance with this representation is that I need different instructions depending on whether argument-1 or argument-2 is constant. For a 2-arity instruction, I need four variants to cover all cases. For a 4-arity instruction (such as select), I would need 16 variants.

There are two ways of dealing with this. First, I could make the code that executes each instruction a bit more complex, so that it can handle both constant and register arguments. Second, I could make all instructions operate only on registers and have a single instruction for loading constants into registers.

Unfortunately, both of these option results in significantly slower bytecode. In the first case, the extra logic in each bytecode executor makes it slower. In the second case, we need extra instructions for loading constants, which increases the execution time.

So at least for two argument functions, the best option seems to be to have separate code for handling each argument combination. For four argument functions, it might be better to use one of the other options.

Just to give you some example of how the bytecode works, here is some raw byte code and the corresponding disassembled bytecode instructions:

05000000 02000000 00000000 00000000000020410000000000000000
r2 = sub          r0       (0,10,0,0)

16000000 03000000 00000000000000000000803f00000000 02000000
r3 = cross        (0,0,1,0)                        r2

0a000000 04000000 00002041000020410000204100002041 03000000
r4 = mul          (10,10,10,10)                    r3

10000000 03000000 02000000 02000000
r3 = dot          r2       r2

0c000000 05000000 04000000 03000000
r5 = div          r4       r3

09000000 03000000 05000000 0000a0400000a0400000a0400000a040
r3 = mul          r5       (5,5,5,5)

00000000  01000000  01000000  03000000
r1 = add            r1        r3

High-level language

You can't really expect people to author their effects in raw bytecode, or even in our "bytecode assembly language". Effect authors will be a lot more productive if they can use a more comfortable language.

I decided to create such a language and model it after HLSL, since it serves a similar purpose (fast processing of vectorized data). Programmers interested in writing vector field effects are probably already used to working with HLSL. Plus, if at some point we want to move some of this work to the GPU we can reuse the code.

To show what the high level language looks like, here is an implementation of a whirl effect:

const float4 center = float4(0,10,0,0);
const float4 up = float4(0,0,1,0);
const float4 speed = float4(10,10,10,10);
const float4 radius = float4(5,5,5,5);

struct vf_in
{
    float4 position : CHANNEL0;
    float4 wind : CHANNEL1;
};

struct vf_out
{
    float4 wind : CHANNEL1;
};

void whirl(in vf_in in, out vf_out out)
{
    float4 r = in.position - center;
    out.wind = in.wind + speed * cross(up, r) / dot(r,r) * radius;
}

If you squint, you may notice that this high level code exactly corresponds to the low level bytecode in the previous example.

Just as with HLSL, although this looks like C it actually isn't C. Things that work in C may not work in this language and vice versa. I'm quite strict when I parse this. I figure it is better to be start by being strict rather than permissive. This gives you more leeway to extend or modify the language later while keeping backwards compatibility. A strict syntax can always be loosened later, but if you design the language with a too permissive syntax you can paint yourself in a corner (case in point: Ruby).

I usually don't bother with Lex or Yacc when I write a parser. They are OK tools, I guess, but if I can get by without them I prefer not to have the extra precompile step and to have code that is a bit more straightforward to read and debug.

Instead I tend to use a recursive descent parser (a predictive variant, with no backtracking) or some variation of Dijkstra's shunting yard algorithm. Or sometimes a combination of both.

For this language I parse the overall structure with recursive descent, and then use Dijkstra's algorithm to process each statement in the function body.

I generate the bytecode directly from the shunting yard algorithm. When I pop an operator from the operator stack I generate the bytecode for computing that operator and storing the result in a temporary register. I then push that register to the value stack so that the result can be used in other computations. Temporary channels are recycled after they are popped of the value stack to minimize the channel count.

Constant patching

Constants in the bytecode can be changed when an effect is played. I do this by directly patching the bytecode with the new constant values.

When I generate the bytecode I keep track of where in the bytecode different global constants can be found. This patch list is a simple array of entries like:

(hashed constant name) (offset in bytecode)

When playing a vector field effect, the gameplay programmer specifies the constant values with a table:

VectorField.add(vf, "whirl", {radius = 10})

I look through the patch list, find all the offsets of constants named "radius" and replace them with the value(s) supplied by the gameplay programmer.

Since globals can be patched later, I can't do constant folding when I generate the bytecode. (Without global patching, I could just check if both arguments were constants when I popped an operator, and in that case, compute the constant result and push that directly to the value stack, instead of generating a bytecode instruction.)

I could reduce the instruction count somewhat and improve performance by doing a constant folding pass on the bytecode after the globals have been patched, but I haven't implemented that yet.

Physics integration

In my physics system I maintain a list of all awake (non-sleeping) actors. I apply wind from a vector field with an explicit call:

void apply_wind(const VectorField &field, const CollisionFilter &filter);

This extracts the position of every awake actor that matches the collision filter and sends that list to the vector field for evaluation. It then does a second loop through the actors to apply wind forces from the returned wind velocities.

I've chosen to have an explicit step for applying wind, so that you don't have to pay anything for the wind support unless you actually use it. Having an explicit step also opens up the possibility to have other types of vector fields. For example, there could be a vector field representing gravity forces and a corresponding function:

void apply_acceleration(const VectorField &field, const CollisionFilter &filter);

The fact that the wind is only applied to awake actors is important. Without that check, the wind forces would keep every actor in the world awake all the time, which would be really expensive for the physics engine. Just as with gravity, we want physics objects to come to rest and go to "sleep" when the wind forces are in balance with other forces on the actor.

This of course creates a problem when the wind forces are varying. An actor may be in balance now, but a change in the wind direction could change that. A leaf that is resting on the ground may be lifted by a sudden updraft. Since we don't apply the wind forces to sleeping object we can't get that behavior. Once a leaf has come to rest, it will stay put.

This problem is most noticeable when you have drastic effects like explosions in the vector field. It looks really strange when actors are completely immobile and "sleep through" a big explosion.

I deal with this by having a function for explicitly waking actors in an AABB:

wake_actors(const Vector3 &min, const Vector3 &max, const CollisionFilter &filter)

If you want to play a drastic wind effect (like an explosion), you should first wake the nearby actors with a call to wake_actors(). This ensures that all nearby actors will get the wind forces from the explosion (since they are now awake).

I apply the wind force with the standard formula:

F = 1/2 r v^2 C A

Where r is the density of air, v is the relative velocity of the air with respect to the object (so v = v_wind - v_object, where v_wind is the wind speed and v_object is the object's speed). C is a drag coefficient that depends on the object's shape and A is the object's reference area.

For C and A, I actually loop through all the physics shapes in the actor and estimate C and A based on those shapes. This is by no means a perfect approach. There are many situations where C might be really different from what such an estimation gives. For example, an object that is heavily perforated would receive much less wind force.

However, I want to have something in place that gives decent behavior in most cases, so that it only very rarely has to be changed. The less artists have to mess around with physical parameters, the smaller is the chance that anything gets messed up.

Note that the wind force is just air resistance with a velocity for the air. So by implementing wind you get the "air resistance" behavior "for free".

Rotation

If you compute the drag force using the formula above and apply it to a physics actor, it won't add any rotation to the actor. This is actually correct. The drag force, as we compute it here, has no rotational component.

Yet it feels counter-intuitive. We expect objects to rotate when they are blown about by the wind. Leafs and papers certainly swirl around a lot when the wind blows.

What happens in that case is actually a second order effect. When the wind blows around an object you get zones of high and low pressure as well as turbulence, and it is the forces from these interactions that affects the object's rotation.

These interactions are tricky to model accurately and they depend a lot on the object's shape. Right now, I'm not even trying. Instead I use a much simpler approach: I apply the drag force a bit above the object's actual center of mass so that it produces a torque and makes the object rotate. This is a complete hack that has no basis at all in physical reality, but it does add some rotation. At least it looks a lot better than applying the wind force without any rotation.

It should be possible to do better -- to make some kind of estimate of what rotational forces wind induces when it blows against typical physics shapes: boxes, spheres, capsules, etc. Just give my a couple of days in a wind tunnel and I'll try to come up with something.

Tuesday, October 2, 2012

A Data-Oriented, Data-Driven System for Vector Fields -- Part 2

In Part 1 we decided to represent a vector field as a superposition of individual effects:

G(p) = G_0(p) + G_1(p) + ... + G_n(p)

Here, each G_i(p) is a function that represents some effect, such as wind, an explosion or the updraft from an air vent.

The next step is to find a way of quickly evaluating the function G(p), a general function that could be almost anything, for lots of different positions p_i. This is quite tricky to do well in C++.

Of course, evaluating specific functions is not hard. If we want to evaluate a specific function, such as:

Vector3(sin(p.x), sin(p.y), 0);

we can just type it up:

inline Vector3 f(const Vector3 &p)
{
 return vector3(sin(p.x), sin(p.y), 0);
}

But if we don't know beforehand what G(p) will be we don't have that option.

We could write our system so that it supported a limited set of specific effects, with hardcoded C++ implementations. For example, there could be an "explosion" effect with some parameters (radius, strength, etc), an "updraft" effect, a "whirl" effect, etc. Similarly we could have support for a variety of standard shapes, such as "sphere", "cylinder", "capsule", etc. And perhaps some different types of falloffs ("linear", "quadratic"). Perhaps also some temporal effects ("attack-sustain-release", "ease-in-ease-out").

But it is hard to know where to draw the limit with this approach. Exactly what effects and shapes and falloffs and time curves should the system support? The more things we add, the more cluttered the system becomes. And the system is still not completely general. No matter how much we add, there will still be some things that the user just can't do, without disturbing a programmer and get her to add a new effect to the system. This means that the system is not truly data-driven.

Whether this is a problem or not depends a lot on your development style. If you are a single artist-programmer working on a single game you may not even care. To you code and data is the same thing. Who cares if you have to add something to the code to make a special effect. That is what the code is for.

At Bitsquid, however, we are in a different position. We are making a general purpose engine to be used on multiple platforms for all kinds of tasks. We can't put game specific code in the engine or everything will end up a total mess. Sure, our licensees could modify their cloned copy of the source to add their own effects. But that is not an ideal solution. It forces them to learn our code, it makes it harder for us to reproduce their bugs, since our code bases have now diverged and it makes it harder for us to modify and optimize the source code without putting our licensees in merge hell.

So our aim is always to be completely data-driven.

But how can we represent a general function as data? There are really only two possibilities:

As a piece of executable machine code.
As a piece of bytecode that gets executed by a virtual machine.

The first approach is the fastest of course, but it has two drawbacks. First, machine code is platform dependent. Writing a system that can dynamically generate machine code for a lot of different targets is no small undertaking (though it could be simplified by using LLVM). Second, and more serious, many systems simply don't allow us execute dynamically generated machine code.

The inevitable conclusion is that we have to use bytecode (perhaps coupled with a machine code compiler on the platforms where that is feasible).

Unfortunately, as everybody who has used a dynamic language without a JIT compiler knows, bytecode is slow. Usually, at least a factor 10 slower than machine code. And remember that one of our design goals for this system was that it should be fast. We said in the beginning that it should be able to handle at least 10 000 queries per frame.

So what can we do?

The Massively Vectorized Virtual Machine

At this point it makes sense to stop and think a bit about why bytecode is slow. If you look at the code of a virtual machine, it is essentially a tight loop that repeatedly does three things:

Decode the next bytecode instruction into operation + arguments.
Jump to the code that performs the operation.
Execute the operation.

The third step is usually just as fast as handwritten machine code would be. Computing a+b is not more expensive because it was triggered by an OP_ADD bytecode instruction.

So all the overhead of bytecode, the thing that makes it "slow", is found in the first two steps.

Well then here is an idea: what if we could reuse the computations that we make in those two steps?

Remember that our goal is to compute G(p) for a lot of points p_i. We want to evaluate the same function, the same bytecode instructions, for a lot of different data points. In that case, why repeat the expensive operation of decoding the bytecode instructions again and again for each point? Why not just decode the instruction once and then execute it for all data points?

So, with that change, our virtual machine loop now becomes:

Decode the next bytecode instruction.
Jump to the code that executes it.
Execute that single instruction for all the input data.

With this change, the cost of decoding the bytecode is now amortized over all the query points. The more query points we have, the less time (proportionally) we will spend on decoding bytecode. With enough points (>1024) that time should be nearly negligible . In other worlds, our bytecode should be able to run at nearly the same speed as native machine code.

In a quick test I made, the overhead of a bytecode implementation compared to native code was just 16 % -- a far cry from the 10x slowdown we have come to expect.

Fleshing out the Details

Since we are computing a vector function on vector input and we want it to run as fast as possible, it makes sense to use SSE (or its equivalent on other platforms) and represent all our data as vector4 intrinsics.

Virtual machines can be stack-based or register-based. Stack-based machines produce more compact bytecode since the arguments are implicit. Register-based machines need fewer instructions to accomplish a task, since they don't have to juggle things around on the stack. In our case, compact bytecode doesn't buy us much, since our programs are short and the decoding cost is amortized. On the other hand, accomplishing the same thing with fewer instructions means less code to execute for each query point. So a register-based virtual machine seems to be a clear win.

Here is what the code for an explosion effect could look like in a made-up intermediate language for our virtual machine. The effect produces a wind of 50 m/s outwards from the center of a sphere of radius 5 m located at (2,4,0):

direction = sub position, (2,4,0,0)
lensqr = dot direction, direction
direction = normalize direction
direction = mul direction, (50,50,50,50)
direction = select_lt lensqr, (25,25,25,25), direction, (0,0,0,0)
output = add output, direction

Here position is the input query position and output is the output result of the function. direction and lensqr are temporary variables.

Note that the final operation adds the result to the output register instead of overwriting it. This allows us to merge multiple effects by simply concatenating their bytecode. So to evaluate G(p) for a large number of points, we can first intersect the AABB of the points with the AABB of each individual effect G_i(p). Then we merge the bytecodes of each intersecting effect into a single bytecode function G'(p) that we finally evaluate for each point.

We can feed position and output to the virtual machine as arrays of intrinsics:

void evaluate(void *bytecode, unsigned n, Vector4I *positions, Vector4I *output)

Note that since we are running the bytecode one instruction at a time for all the data, the local variables (direction and lensqr) need to be arrays too, since we need to remember their value for each of the input positions.

We could allocate arrays for these local variables and pass them to evaluate just as we do for positions and output. But that seems a bit wasteful. A complicated function could have twenty global variables or more, meaning that with 10 000 particles we would need to allocate 3.2 MB of temporary memory. The amount needed will vary widely, depending on how complicated the function is, which is driven by the data. This makes it hard to do a memory budget for the system.

So let's use an alternative approach. We allocate all local variable buffers from a "scratch space" which is provided by the caller:

void evaluate(void *bytecode, unsigned n, Vector4I *positions, Vector4I *output, unsigned scratch_bytes, void *scratch_space)

Now the caller has complete control over the amount of temporary memory the system uses. It is predictable and can be made to fit any desired memory budget.

To make this work, we need to chop this scratch memory up into areas for each local variable. The size of those buffers then determine how many input positions we can process at a time.

For example, suppose we have 256 K of scratch memory and 8 local variables. Each local variable then gets 32 K of memory, which can hold 2 K Vector4I's. So this means that instead of processing all 10 000 particles at the same time when we execute an opcode, we process the particles in 5 chunks, handling 2 048 particles each time. The cost of decoding the bytecode gets amortized over 2 048 particles, instead of over 10 000, but it is still negligible.

The nice thing about this approach is that we always use a constant, predictable amount of scratch space, regardless of how many query points we process and how complicated the function is. Instead we scale down how many particles we process at a time.

Since both input data and local variables are now Vector4I buffers, the inner loop of the virtual machine is simple to write, it will look something like:

void run_vm(const void *bytecode, unsigned n, Vector4I **registers)
{
 const void *pc = bytecode;
 while (true) {
  unsigned op = DECODE_OP(pc);
  switch(op) {
   case OP_ADD:
    Vector4I *a = registers[DECODE_REGISTER(pc)];
    Vector4I *b = registers[DECODE_REGISTER(pc)];
    Vector4I *c = registers[DECODE_REGISTER(pc)];
    Vector4I *ae = a + n;
    while (a != ae) {
     *a++ = addi(*b++, *c++);
    }
    break;
   ...
  }
 }
}

An Example

Here is a YouTube video that shows a vector field implemented using this method. Unfortunately, the YouTube compression is not very nice to a video that contains this much high-frequency information. But at least it gives some idea of the effect.

The video shows 20 000 particles being animated by the vector field at a query cost of about 0.4 ms on a single thread (of course, parallelization is trivial, so you can divide that by the number of available cores).

Monday, September 17, 2012

A Data-Oriented, Data-Driven System for Vector Fields - Part 1

A vector field is a function that assigns a vector value to each point in 3D space. Vector fields can be used to represent things like wind (the vector field specifies the wind velocity at each point in space), water, magnetism, etc.

To me, wind is the most interesting use case. I want a system that can be used for physics (trees, tumble weed, paper cups), particles (leaves, sparks, smoke) and graphics (grass). I also want the system to be capable of handling both global effects (wind blowing through the entire level) and local effects (explosions, air vents, landing helicopters, rising hot air from fires, etc). But I don't want to limit the system to only handling wind. I imagine that once the system is in place, it could be put to other interesting uses as well.

There are a number of things that make this an interesting non-trivial design challenge:

Vector fields represent a global shared state. All systems (particles, physics, etc) should react to the same wind. This can create strong couplings between unrelated systems, which we want to avoid.
The system must be fast. We want to be able to make large particle effects that are affected by wind. As a design goal, let's say that it should be able to handle at least 10 000 queries / frame.
As stated above, the system must be flexible enough to handle both global wind and a large variety of different local effects (air vents, fans, etc).

I'll outline the system in a series of articles. Let's start by thinking a bit about how we can represent the vector field in a way that allows for fast queries.

1. Use a functional representation

Storing the vector value for every point in 3D space at a decent resolution would require huge amounts of memory. It would also be very expensive to update. If we wanted to change the global wind direction, we would have to loop over all those points and change the value.

So, instead, we will use a functional representation. We will express the field as some closed function F(p, t) that gives us the field vector at point p in space at the time t.

For example, we could express a global wind that oscillates in the x-direction as:

F(p, t) = Vector3(sin(t), 0, 0)

The closed function form allows us to evaluate the vector field at any point in space and time.

Note that even with a functional form as the main representation, we can still interact with grid based representations. For example, we can render some section of the F(p, t) function to a texture for use on a GPU. Similarly, if we have some grid based wind data that we want to add to the simulation, we could use that as part of the F(p, t) expression:

F(p, t) = Vector3(sin(t), 0, 0) + sample_grid(grid, p)

2. Ignore the time coordinate

The vector field function F(p, t) is a function of both space and time. The wind varies throughout the level and if we look at any one point, the wind at that point varies over time.

But in practice, we treat the p and t coordinates very differently. We start at some time t_0 and then evaluate F(p, t_0) for thousands of different p values. Then we move on to t_1 and do the same thing.

We can make use of the fact that t remains constant for a large number of evaluations to simplify the function. For example at t=0.5 the function:

F(p, t) = sin(p.x) * sin(p.y) * cos(t)

simplifies to:

G(p) = sin(p.x) * sin(p.y) * 0.8776

which is cheaper to evaluate.

Taking this approach a step further, it makes sense to split our system in two parts -- a high level system that knows about time and every frame produces a new G(p) for the current time, and a low level system that ignores time completely and just computes G(p). Since the high level system only runs once per frame it can afford to do all kinds of complicated but interesting stuff, like constant folding, optimization, etc.

For the low level system we have reduced the problem to evaluating G(p).

3. Express the field as a superposition of individual effects

To make it possible for the field to contain both global effects (world wind) and local effects (air vents, explosions) we express it as a superposition of individual effect functions:

G(p) = G_1(p) + G_2(p) + ... + G_n(p)

Here G_i(p) represents each individual effect. A base wind could be expressed as just a constant:

G_0(p) = Vector3(2.1, 1.4, 0)

A turbulence function could add a random component

G_1(p) = turbulence(seed, p, 4)

An explosion effect could create a wind with a speed of 100 m/s outwards from the center of the explosion in a sphere with radius 4.0 meter around the explosion center:

G_2(p) = sphere(p,c,4) * normalize(p-c) * 100

Here sphere(p,c,4) is a spherical support function that defines the range of the effect. It is 1 if ||p - c|| <= 4.0 and 0 otherwise.

Note again that we have stripped out the time component. At the higher level, this might be an expanding sphere with decreasing wind speeds, but at the low level we only care what it looks like at this instance.

Similar functions can be added for other local effects.

4. Use the AABB to cull local fields

If we have a lot of local effects (explosions, etc), evaluating G(p) will be pretty expensive.

We can reduce the cost by only evaluating the local effects that are close enough to our particle system to matter.

I.e., instead of evaluating G(p) for all particles, we first intersect the AABB of each G_i(p)'s support with the AABB of our particle system.

That gives us a simpler function G'(p) that we can then evaluate for each particle.

If we wanted to, we could use the wavelength of the field for further simplifications. If the scale at which a field effect changes is much larger than our AABB, we can replace that effect with a Taylor series expansion. Similarly, if an effect oscillates at a scale much smaller than the size of our particles, we can replace it with its average value.

Next time

Next time I will look at how we can efficiently evaluate arbitrary functions, such as:

G(p) = Vector3(1,1,0) + turbulence(seed, p, 2) + sphere(p, c, 4)

for a huge number of particle positions p.

This has also been posted to The Bitsquid blog.

Monday, September 3, 2012

A new way of organizing header files

Recently, I've become increasingly dissatisfied with the standard C++ way of organizing header files (one .h file and one .cpp file per class) and started experimenting with alternatives.

I have two main problems with the ways headers are usually organized.

First, it leads to long compile times, especially when templates and inline functions are used. Fundamental headers like array.h and vector3.h get included by a lot of other header files that need to use the types they define. These, in turn, get included by other files that need their types. Eventually you end up with a messy nest of header files that get included in a lot more translation units than necessary.

Sorting out such a mess once it has taken root can be surprisingly difficult. You remove an #include statement somewhere and are greeted by 50 compile errors. You have to fix these one by one by inserting missing #include statements and forward declarations. Then you notice that the Android release build is broken and needs additional fixes. This introduces a circular header dependency that needs to be resolved. Then it is on to the next #include line -- remove it, rinse and repeat. After a day of this mind-numbingly boring activity you might have reduced your compile time by four seconds. Hooray!

Compile times have an immediate and important effect on programmer productivity and through general bit rot they tend to grow over time. There are many things that can increase compile times, but relatively few forces that work in the opposite direction.

It would be a lot better if we could change the way we work with headers, so that we didn't get into this mess to begin with.

My second problem is more philosophical. The basic idea behind object-oriented design is that data and the functions that operate on it should be grouped together (in the same class, in the same file). This idea has some merits -- it makes it easier to verify that class constraints are not broken -- but it also leads to problems. Classes get coupled tightly with concepts that are not directly related to them -- for example things like serialization, endian-swapping, network synchronization and script access. This pollutes the class interface and makes reuse and refactoring harder.

Class interfaces also tend to grow indefinitely, because there is always "more useful stuff" that can be added. For example, a string class (one of my pet peeves) could be extended with functionality for tokenization, path manipulation, number parsing, etc. To prevent "class bloat", you could write this code as external functions instead, but this leads to a slightly strange situation where a class has some "canonized" members and some second-class citizens. It also means that the class must export enough information to allow any kind of external function to be written, which kind of breaks the whole encapsulation idea.

In my opinion, it is much cleaner to organize things by functionality than by type. Put the serialization code in one place, the path manipulation code in another place, etc.

My latest idea about organization is to put all type declarations for all structs and classes in a single file (say types.h):

struct Vector3 {
 float x, y, z;
};

template <class T>
class Array<T> {
public:
 Array() : _capacity(0), _size(0), _data(0) {}
 ~Array() {free(_data);}
 unsigned _capacity;
 unsigned _size;
 T *_data;
};

class IFileSystem;
class INetwork;

Note that types.h has no function declarations, but it includes the full data specification of any struct or class that we want to use "by value". It also has forward declarations for classes that we want to use "by reference". (These classes are assumed to have pure virtual interfaces. They can only be created by factory functions.)

Since types.h only contains type definitions and not a ton of inline code, it ends up small and fast to compile, even if we put all our types there.

Since it contains all type definitions, it is usually the only file that needs to be included by external headers. This means we avoid the hairy problem with a big nest of headers that include other headers. We also don’t have to bother with inserting forward declarations in every header file, since the types we need are already forward declared for us in types.h.

We put the function declarations (along with any inline code) in the usual header files. So vector3.h would have things like:

inline Vector3 operator+(const Vector3 &a, const Vector3 &b)
{
 Vector3 res;
 res.x = a.x + b.x;
 res.y = a.y + b.y;
 res.z = a.z + b.z;
 return res;
}

.cpp files that wanted to use these operations would include vector3.h. But .h files and other .cpp files would not need to include the file. The file gets included where it is needed and not anywhere else.

Similarly, array.h would contain thinks like:

template <class T>
void push_back(Array<T> &a, const T &item)
{
 if (a._size + 1 > a._capacity)
  grow(a);
 a._data[a._size++] = item;
}

Note that types.h only contains the constructor and the destructor for Array<T>, not any other member functions.

Furthermore, I prefer to design classes so that the "zero-state" where all members are zeroed is always a valid empty state for the class. That way, the constructor becomes trivial, it just needs to zero all member variables. We can also construct arrays of objects with a simple memset().

If a class needs a more complicated empty state, then perhaps it should be an abstract interface-class instead of a value class.

For IFileSystem, file_system.h defines the virtual interface:

class IFileSystem
{
 virtual bool exists(const char *path) = 0;
 virtual IFile *open_read(const char *path) = 0;
 virtual IFile *open_write(const char *path) = 0;
 ...
};

IFileSystem *make_file_system(const char *root);
void destroy_file_system(IFileSystem *fs);

Since the “open structs” in types.h can be accessed from anywhere, we can grop operations by what they do rather than by what types they operate on. For example, we can put all the serialization code in serialization.h and serialization.cpp. We can create a file path.h that provides path manipulation functions for strings.

An external project can also "extend" any of our classes by just writing new methods for it. These methods will have the same access to the Vector3 data and be called in exactly the same way as our built-in ones.

The main drawback of this model is that internal state is not as "protected" as in standard object-oriented design. External code can "break" our objects by manipulating members directly instead of using methods. For example, a stupid programmer might try to change the size of an array by manipulating the _size field directly, instead of using the resize() method.

Naming conventions can be used to mitigate this problem. In the example above, if a type is declared with class and the members are preceded by an underscore, the user should not manipulate them directly. If the type is declared as a struct, and the members do not start with an underscore, it is OK to manipulate them directly. Of course, a stupid programmer can still ignore this and go ahead and manipulate the members directly anyway. On the other hand, there is no end to the things a stupid programmer can do to destroy code. The best way to protect against stupid programmers is to not hire them.

I haven’t yet written anything really big in this style, but I've started to nudge some files in the Bitsquid codebase in this direction, and so far the experience has been positive.

Saturday, August 18, 2012

Cleaning bad code

Guess what! You've just inherited a stinking, steaming pile of messy old code. Congratulations! It's all yours.

Bad code can code can come from all kinds of places. Middleware, the internet, perhaps even your own company.

You know that nice guy in the corner that nobody had time to check up on? Guess what he was doing all that time. Churning out bad code.

Or remember that module someone wrote years ago, just before she left the company. That module that twenty different people have then added hacks, patches and bug fixes to, without really understanding what they were doing. Yup, that one.

Or what about that open source thing you downloaded that you knew was horrible, but it solved a very specific and quite hairy problem that would have taken you ages to do by yourself.

Bad code doesn't have to be a problem, as long as it's not misbehaving, and nobody pokes their bloody nose in it. Unfortunately, that state of ignorant bliss rarely lasts. A bug will be discovered. A feature requested. A new platform released. Now you have to dig into that horrible mess and try to clean it up. This article offers some humble advice for that unfortunate situation.

0. Is it worth doing?

The first thing you need to ask yourself is whether the code is worth cleaning. I'm of the opinion that when it comes to code cleaning you should either karate do "yes", or karate do "no". Either you assume full responsibility for the code and rework it until you end up with something that you are actually happy to maintain and proud to have in your codebase.

Or you decide that even though the code looks horrible, it isn't cost-effective to take time out of your busy schedule to fix it. So instead you just do the smallest change possible that solves your current problem.

In other words, you either regard the code as yours or theirs.

There are merits to both alternatives. Good programmers get an itch when they see bad code. They bring out their torches and pitchforks and chant: "Unclean! Unclean!" And that is a good instinct.

But cleaning code is also a lot of work. It is easy to underestimate the time it takes. It can be nearly as time consuming as rewriting the whole thing from scratch. And it doesn't bring any short term benefits. Two weeks cleaning code won't add any new features to the game, but it might give you some new bugs.

On the other hand, the long term effects of never cleaning your code can be devastating. Entropy is the code-killer.

So, never an easy choice. Some things to consider are:

How many changes do you expect to make to the code?

Is it just this one small bug that you need to fix, or is this code that you expect to return to many times to tweak and tune and add new features. If it's just this one bug, then perhaps it is best to let sleeping dogs lie. However, if this is a module that you will need to mess around with a lot, then spending some time to clean it up now, will save a lot of headache later.
Will you need/want to import upstream changes?

Is this an open source project that is under active development? If so, and you want to pull the changes made upstream you can't make any big changes to the code or you will be in merge hell every time you pull. So just be a nice team player, accept its idiosyncrasies and send patches with your bug fixes to the maintainer.
How much work is it?

How many lines of code can you realistically clean in a day? An order of magnitude estimate says more than 100 and less than 10 000, so let's say 1 000. So if the module has 30 000 lines, you might be looking at a month of work. Can you spend that? Is it worth it?
Is it a part of your core functionality?

If what the module does is something peripheral, like say font rendering or image loading, you might not care that it is messy. You might swap out the whole thing for something else in the future, who knows. But you should own the code that relates to your core competence.
How bad is it?

If the code is just slightly bad, then perhaps you can live with it. If it is mind-numbingly, frustratingly incomprehensibly bad, then perhaps something needs to be done.

1. Get a test case

Seriously cleaning a piece of code means messing around with it a lot. You will break things.

If you have a decent test case with good coverage you will immediately know what has broken and you can usually quite quickly figure out what stupid mistake you just made. The time and anxiety this saves over the course of the cleaning process is just ridiculous. Get a test case. It's the first thing you should do.

Unit tests are best, but all code is not amenable to to unit testing. (Test fanatics, send your hate mail now!) If unit tests are too cumbersome, use an integration test instead. For example, fire up a game level and run the character through a specific set of actions related to the code you are cleaning.

Since such tests ate more time consuming, it might not make sense to run it after every change you make, which would be ideal. But as you put every single change you make into source control, it's not so bad. Run the test every once in a while (e.g., every five changes). When it discovers a problem you can do a binary search of those last few commits to find out which one caused the problem.

If you discover an issue that wasn't detected by your test, make sure that you add that to the test, so that you capture it in the future.

2. Use source control

Do people still have to be told to use source control? I sure hope not.

For cleaning work it is absolutely crucial. You will be making lots and lots of small changes to the code. If something breaks you want to be able to look back in the revision history and find out where it broke.

Also, if you are anything like me, you will sometimes start down a refactoring path (like removing a stupid class) and realize after a while that it wasn't such a good idea, or, that it was a good idea, but that everything would be a lot simpler if you did something else first. So you want to be able to quickly revert everything you just did and begin anew.

Your company should have a source control system in-place that allows you to do these changes in a separate branch and commit as much as you like without disturbing anybody else.

But even if it doesn't, you should still use source control. In that case, download mercurial (or git), create a new repository and put the code that you checked out of your company's stupid system there. Do your changes in that repository, committing as you go. When you are done you can merge everything back into the stupid system.

Cloning the repository into a sensible source control system only takes a few minutes. It is absolutely worth it. If you don't know mercurial, spend an hour to learn it. You will be happy you did. Or if you prefer, spend 30 hours to learn git instead. (I kid! Not really. Nerd fight now!)

3. Make one (small) change at a time

There are two ways of improving bad code: revolution and reform. The revolution method is to burn everything with fire and rewrite it from scratch. The reform method is to refactor the code with one small change at a time without ever breaking it.

This article is about the reform method. I'm not saying that revolutions never are necessary. Sometimes things are so bad that they just need to go. But people who get frustrated with the slow pace of reform and advocate revolution often fail to realize the full complexity of the problem and thus don't give the existing system enough credit for the things it does.

Joel Spolsky has written a classic article about this without falling into the trap of making strained political metaphors.

The best way of reforming code is to make one minimal change at a time, test it and commit it. When the change is small it is easier to understand its consequences and make sure that it doesn't affect the existing functionality. If something goes wrong, you only have a small amount of code that you need to check. If you start doing a change and realize that it is bad, you won't loose much work by reverting to the last commit. If you notice after a while that something has gone subtly wrong, a binary search in the revision history will let you find the small change that introduced the problem.

A common mistake is to do more than one thing at the same time. For example, while getting rid of an unnecessary level of inheritance you might notice that the API methods are not as orthogonal as you would like them to be and start to rearrange them. Don't! Get rid of the inheritance first, commit that and then fix the API.

Smart programmers organize the way they work so that they don't have to be that smart.

Try to find a path that takes you from what the code is now to what you want it to be in a sequence of small steps. For example, in one step you might rename the methods to give them more sane names. In the next, you might change some member variables to function parameters. Then you reorder some algorithms so that they are clearer. And so on.

If you start doing a change and realize that it was a bigger change than you originally thought, don't be afraid to revert and find a way of doing the same thing in smaller, simpler steps.

4. Don't clean and fix at the same time

This is a corollary to (3), but important enough to get its own point.

It is a common problem. You start to look at a module because you want to add some new functionality. Then you notice that the code is really badly organized, so you start reorganizing it at the same time as you are adding the new functionality.

The problem with this is that cleaning and fixing has diametrically opposite goals. When you clean, you want to make the code look better without changing its functionality. When you fix, you want to change its functionality to something better. If you clean and fix at the same time it becomes very hard to make sure that your cleaning didn't indadvertedly change something.

Do the cleaning first. Then, when you have a nice clean base to work with, add the new functionality.

5. Remove any functionality that you are not using

The time it takes to clean is proportional to the amount of code, its complexity and its messiness.

If there is any functionality in the code that you are currently not using and don't plan to be using in the foreseeable future -- get rid of it. That will both reduce the amount of code you will have to go through and its complexity (by getting rid of unnecessary concepts and dependencies). You will be able to clean faster and the end result will be simpler.

Don't save code because "who knows, you might need it some day". Code is costly -- it needs to be ported, bug checked, read and understood. The less code you have, the better. In the unlikely event that you do need the old code, you can always find it in the source repository.

6. Delete most of the comments

Bad code rarely has good comments. Instead, they are often:

// Pointless:

 // Set x to 3
 x = 3;

// Incomprehensible:

 // Fix for CB (aug)
 pos += vector3(0, -0.007, 0);

// Sowing fear and doubt:

 // Really we shouldn't be doing this
 t = get_latest_time();

// Downright lying:

 // p cannot be NULL here
 p->set_speed(0.7);

Read through the code. If a comment doesn't make sense to you and doesn't further your understanding of the code -- get rid of it. Otherwise you will just waste mental energy on trying to understand that comment on each future reading of the code.

The same goes for dead code that has been commented or #ifdef'ed out. Get rid of it. It's there in the source repository if you need it.

Even when comments are correct and useful, remember that you will be doing a lot of refactoring of the code. The comments may no longer be correct when you are done. And there is no unit test in world that can tell you if you have "broken the comments".

Good code needs few comments because the code itself is clearly written and easy to understand. Variables with good names do not need comments explaining their purpose. Functions with clear inputs and outputs and no special cases or gotchas require little explanation. Simple, well written algorithms can be understood without comments. Asserts document expectations and preconditions.

In many cases the best thing to do is just to get rid of all old comments, focus on making the code clear and readable, and then add back whatever comments are needed -- now reflecting the new API and your own understanding of the code.

7. Get rid of shared mutable state

Shared mutable state is the single biggest problem when it comes to understanding code, because it allows for spooky "action at a distance", where one piece of code changes how a completely different piece of code behaves. People often say that multithreading is difficult. But really, it is the fact that the threads share mutable state that is the problem. If you get rid of that, multithreading is not so complex.

Since your goal is to write high-performant software, you won't be able to get rid of all mutable state, but your code can still benefit enormously from reducing it as much as possible. Strive for programs that are "almost functional" and make sure you know exactly what state you are mutating where and why.

Shared mutable state can come from several different places:

Global variables. The classic example. By now everybody surely knows that global variables are bad. But note (and this is a distinction that people sometimes fail to make), that it is only shared mutable state that is problematic. Global constants are not bad. Pi is not bad. Sprintf is not bad.
Objects -- big bags of fun. Objects are a way for a large number of functions (the methods) to implicitly share a big bag of mutable state (the members). If a lazy programmer needs to pass some information around between methods, she can just make a new member that they can read and write as they please. It's almost like a global variable. How fun! The more members and the more methods an object has, the bigger this problem is.
Megafunctions. You have heard about them. These mythic creatures that dwell in the deepest recesses of the darkest codebases. Broken programmers talk about them in dusky bars, their sanity shattered by their encounters: "I just kept scrolling and scrolling. I couldn't believe my eyes. It was 12 000 lines long."

When functions are big enough, their local variables are almost as bad as global variables. It becomes impossible to tell what effect a change to a local variable might have 2 000 lines further down in the code.
Reference and pointer parameters. Reference and pointer parameters that are passed without const can be used to subtly share mutable state between the caller, the callee and anyone else who might be passed the same pointer.

Here are some practical ideas for getting rid of shared mutable state:

Split big functions into smaller ones.
Split big objects into smaller ones by grouping members that belong together.
Make members private.
Change methods to be const and return the result instead of mutating state.
Change methods to be static and take their arguments as parameters instead of reading them from shared state.
Get rid of objects entirely and implement the functionality as pure functions without side effects.
Make local variables const.
Change pointer and reference arguments to const.

8. Get rid of unnecessary complexity

Unnecessary complexity is often a result of over-engineering -- where the support structures (for serialization, reference counting, virtualized interfaces, abstract factories, visitors, etc) dwarf the code that performs the actual functionality.

Sometimes over-engineering occurs because software projects start out with a lot more ambitious goals than what actually gets implemented. More often, I think, it reflects the ambitions/esthetics of a programmer who has read books on design patterns and the waterfall model and believes that over-engineering makes a product "solid" and "high-quality".

Often, the heavy, rigid, overly complex model that results is unable to adapt to feature requests that were not anticipated by the designer. Those features are then implemented as hacks, bolt-ons and backdoors on top of the ivory tower resulting in a schizophrenic mix of absolute order and utter chaos.

The cure against over-engineering is YAGNI -- you are not gonna need it! Only build the things that you know you need. Add more complicated stuff when you need it, not before.

Some practical ideas for cleaning out of unnecessary complexity:

Remove the functionality you are not using (as suggested above).
Simplify necessary concepts, and get rid of unneeded ones.
Remove unnecessary abstractions, replace with concrete implementations.
Remove unnecessary virtualization and simplify object hierarchies.
If only one setting is ever used, get rid of the possibility of running the module in other configurations.

9. That is all

Now go clean your room!

Saturday, August 4, 2012

A simpler design for asynchronous APIs

Accessing Internet services, e.g. to fetch a web page or to store data on a leaderboard, requires an asynchronous API. You send a request and then, at some later point, you receive a reply.

Asynchronous APIs are trickier to design than synchronous ones. You can't simply return the result of the operation, since it isn't ready yet. Instead you have to wait until it is done and then send it to the caller through some other channel. This often results in designs that are needlessly complicated and cumbersome to work with.

Callbacks

The most common approach is perhaps to use callbacks. You make the asynchronous request and when it completes the callback is called. The callback can either be a global system-wide callback, or (which is nicer) a callback that you supply when you make the asynchronous call.

leaderboard->set_score(100, set_score_cb, my_user_data);

void set_score_cb(SetScoreResult *result, void *user_data)
{
   ...
}

I have already mentioned in a previous article that I'm not too fond of callbacks and that I prefer polling in most cases. Badly designed polling can be expensive, but in the case of asynchronous network operations we wouldn't expect to have more than a dozen or so in-flight at any one time, which means the cost of polling is negligible.

Callbacks tend to make code worse. There are several reasons.

First, you usually have little control over when a callback happens. This means that it can happen at a time that isn't very suitable to you. For cleanliness, you may want to do all your leaderboard processing in your update_leaderboard() function. But the callback might be called outside update_leaderboard(), messing up all your carefully laid plans.

Second, it can be tricky to know what you can and cannot do in a callback. The code that calls you might make some assumptions that you inadvertently violate. These things can sometimes be really tricky to spot. Consider something as simple as:

int n = _leaderboard_operations.size();
for (int i=0; i!=n; ++i) {
 if (done(_leaderboard_operations[i]))
  do_callback(_leaderboard_operations[i]);
}

This looks perfectly innocent. But if the callback happens to do something that changes the _leaderboard_operations vector, for example by posting a new request or removing an old one, the code can blow up with memory access errors. I have been bitten by things like this many times. By now, every time I see a callback a warning clock goes off in my head: "danger, danger -- there is a callback here, remember that when you make a callback anything can happen".

Sometimes it can be necessary to double buffer data to get rid of bugs like this.

Third, callbacks always happen in the wrong context. You get the callback in some "global", "top-level" context, and from there you have to drill down to the code that actually knows what to do with the information. (Typically by casting the user_data pointer to some class and calling a member function on it.) This makes the code hard to follow.

In other words, callbacks lead to hard-to-read code, hard-to-follow code flow, subtle bugs, redundant boilerplate forwarding stubs and instruction cache misses. Bleh!

Request objects

Another common approach is to have some sort of request object that represents the asynchronous operation. Something like:

SetScoreRequest *request = _leaderboard->set_score(100);
...
if (request->is_done()) {
 bool success = request->result();
 delete request;
}

Or perhaps, using the C++11 concepts of promises and futures (I have only a passing acquaintance with C++11, so forgive me if I mess something up):

std::promise<bool> *promise = new std::promise<bool>();
_leaderboard->set_score(100, promise);
...
std::future<bool> future = promise->get_future();
if (future.valid()) {
 bool success = future.get();
 _leaderboard->forget_promise(promise);
 delete promise;
}

This is a lot better than the callback approach, but still in my view, overly complicated. It is clearly a design based on the object-oriented philosophy of -- when in doubt, make more objects.

But these extra objects don't really do much. They just act as pointless intermediaries that pass some information back and forth between our code and the _leaderboard object. And they are a hassle for the caller to keep track of. She must store them somewhere and make sure to delete them when she is done to avoid memory leaks.

Furthermore, if we want to expose this API to a scripting language, such as Lua, we have to expose these extra objects as well.

ID tokens

As readers of my previous articles know, I'm a big fan of using IDs. Instead of exposing internal system objects to the caller of an API, I prefer to give the caller IDs that uniquely identifies the objects and provide functions for obtaining information about them.

That way, I am free to organize my internal data however I like. And it is easier to see when the state of my objects might mutate, since all calls go through a single API.

With this approach the interface would look something like this:

unsigned set_score(int value);
enum SetScoreResult {SSR_IN_PROGRESS, SSR_SUCCESS, SSR_FAILURE};
SetScoreResult set_score_result(unsigned id);

Note that there are no objects that the user must maintain and release. The ID can easily be manipulated by a scripting layer. If the user doesn't need to know if the operation succeeded, she can just throw away the returned ID.

In this API I don't have any method for freeing tokens. I don't want to force the user to do that, since it is both a hassle (the user must track all IDs and decide who owns them) and error prone (easy to forget to release an ID).

But obviously, we must free tokens somehow. We can't store the results of the set_score() operations forever. If we did, we would eventually run out of memory.

There are several ways you could approach this problem. My preferred solution in this particular case is to just have a fixed limit on the number of operations that we remember. Since we don't expect more than a dozen simultaneous operations, if we make room for 64, we have plenty of slack and still use only 64 bytes of memory. A modest amount by any standard.

We can keep the results in a round-robin buffer:

/// Maximum number of requests whose result we remember.
static const int MAX_IN_FLIGHT = 64;

/// The result of the last MAX_IN_FLIGHT requests.
char results[MAX_IN_FLIGHT];

/// Number of requests that have been made.
unsigned num_requests;

SetScoreResult set_score_result(unsigned id)
{
 // If more than MAX_IN_FLIGHT requests have been made after this one,
 // the information about it is lost.
 if (num_requests - id > MAX_IN_FLIGHT)
  return SSR_NO_INFORMATION;

 return results[id % MAX_IN_FLIGHT];
}

This means that you can only ask about the result of the last 64 operations. On the other hand, this solution uses very little memory, does not allocate anything, has very quick lookups and doesn't require the user to explicitly free tokens.

To me, this added simpleness and flexibility outweighs the disadvantage of having a limit on the maximum number of in flight operations that we support.

Implicit APIs

In many cases, the best solution to asynchronous conundrums is to redesign the API to abstract away the entire concept of asynchronous operations, so that the user doesn't even have to bother with it.

This can require some creative rethinking in order to focus on what it is the user really wants to do. For example, for our example, we might come up with this:

/// Sets the score to the specified value. This is an asynchronous operation.
/// You can use acknowledged_score() to find out when it has completed.
void set_score(int score);

/// Returns the last score that has been acknowledged by the server.
int acknowledged_score();

This is probably all that the user needs to know.

Now we have really simplified the API. The user still needs to be aware that set_score() isn't propagated to the server immediately, but she doesn't at all have to get involved in what asynchronous operations are performed and how they progress.

This kind of radical rewrite might not be possible (or even desirable) for all asynchronous systems. You have to balance the value of high-level abstractions and simplifications against the need for low-level control. But it is almost always worth exploring the possibility since it can lead to interesting ideas and dramatically simplified APIs.

For example, the interface for an asynchronous web fetcher could be as simple as:

const char *fetch(const char *url);

If called with an URL that hadn't been fetched yet, the function would issue a request for the URL and return NULL. Once the data was available, the function would return it. On the next call, the data would be freed. To fetch a web page, you would just repeatedly call the function with an URL until you got a reply.

Quite fetching, wouldn't you say?

Tuesday, July 3, 2012

Matrices, Rotation, Scale and Drifting

If you are using Matrix4x4s to store your node transforms and want to support scaling you are facing an annoying numerical problem: rotating a node causes its scale to drift from the original value.

The cause of drifting

Drifting happens because in a Matrix4x4 the rotation and the scale are stored together in the upper left 3x3 part of the matrix:

This means that if we want to change the rotation of a Matrix4x4 without affecting the scale we must extract the scale and reapply it:

void set_rotation(Matrix4x4 &pose, const Quaternion &rot)
{
     Vector3 s = scale(pose);
     Matrix3x3 rotm = matrix3x3(rot);
     scale(rotm, s);
     set_3x3(pose, rotm);
}

The problem here is that since floating point computation is imprecise, scale(pose) is not guaranteed to be exactly the same before this operation as after. Numerical errors will cause a very small difference. So even though we only intended to rotate the node we have inadvertently made it ever so slightly bigger (or smaller).

Does it matter? Sure, it is annoying that an object that we didn't want to have any scaling at all suddenly has a scale of 1.0000001, but surely such a small change would be impercievable and couldn't affect gameplay.

True, if we only rotated the object once. However, if we are dealing with an animated or spinning object we will be changing its rotation every frame. So if the error is 0.0000001 the first frame, it might be 0.0000002 the second frame and 0.0000003 the third frame.

Note that the error growth is linear rather than geometric because the error in each iteration is proportional to the current scale, not to the current error. I. e., to (1 + e) rather than e. We can assume that 1 >> e, because otherwise we already have a clearly visible error.

I ran a test using our existing math code. Rotating a transform using the method described above yields the following result:

Error	Frames	Time (at 60 Hz)
0.000001	202	3 s
0.000002	437	7 s
0.000005	897	15 s
0.000010	1654	28 s
0.000020	3511	58 s
0.000050	8823	2 min
0.000100	14393	4 min
0.000200	24605	7 min
0.000500	52203	15 min
0.001000	100575	28 min

As you can see, after 28 minutes we have an error of 0.1 %. At this point, it starts to get noticeable.

You could debate if this is something that needs fixing. Maybe you can live with the fact that objects grow by 0.1 % every half hour, because your game sessions are short and the small scale differences will never be noted. However, since Bitsquid is a general purpose engine, we need a better solution to the problem.

At this point, you might be asking yourself why this problem only happens when we introduce scaling. Don't we have the same issue with just translation and rotation? No, because translation and rotation are stored in completely separate parts of the matrix:

Setting the rotation doesn't touch any of the position elements and can't introduce errors in them, and vice versa.

Solutions to scale drifting

I can think of four possible solutions to this problem:

Store rotation and scale separately

Always set rotation and scale together

Quantize the scale values

Prevent systematic errors

Solution 1: Store rotation and scale separately

The root cause of the problem is that rotation and scale are jumbled together in the Matrix4x4. We can fix that by separating them. So instead of using a Matrix4x4 we would store our pose as:

struct Pose {
      Vector3 translation;
      Matrix3x3 rotation;
      Vector3 scale;
};

With the pose stored like this, changing the rotation does not touch the scale values, so we have eliminated the problem of drifting.

Note that this representation is actually using slightly less memory than a Matrix4x4 -- 15 floats instead of 16. (We could reduce the storage space even further by storing the rotation as a quaternion, but then it would be more expensive to convert it to matrix form.)

However, the representation is not as convenient as a Matrix4x4. We can't compose it or compute its inverse with regular matrix operations, as we can do for a Matrix4x4. We could write custom operations for that, or we could just convert this representation to a temporary Matrix4x4 whenever we needed those operations.

Converting to a Matrix4x4 requires initializing the 16 floats (some with values from the pose) and 9 floating point multiplications (to apply the scale). What kind of a performance impact would this have?

I would guess that the part of the codebase that would be most affected would be the scene graph local-to-world transformation. With this solution, you would want to store the local transform as a Pose and the world transform as a Matrix4x4. The local-to-world transform requires about 36 multiplications and 36 additions (says my quick estimate). So adding a temp Matrix4x4 conversion would take you from 72 to 81 FLOPS.

So a very rough estimate is that this change would make your scene graph transforms about 12 % more expensive. Likely, the real value is less than that since you probably have additional overhead costs that are the same for both methods. And of course, the scene graph transforms are just one small (and parallelizable) part of what your engine does. We rarely spend more than 2 % of our frame time there, meaning the total performance hit is something like 0.2 %.

I think that is a quite reasonable price to pay for a neat solution to the problem of drifting, but you may disagree of course. Also, perhaps the use of Matrix4x4s is so ingrained in your code base that it is simply not possible to change it. So let's look at the other possible solutions.

Solution 2: Always set rotation and scale together

The fundamental problem with set_rotation() is that we try to change just the orientation of the node without affecting the scale. Extracting the scale and reapplying it is what causes the drifting.

If we don't allow the user to just change the rotation, but force him to always set the scale and the rotation together, the problem disappears:

void set_rotation_and_scale(Matrix4x4 &pose, const Quaternion &rot, const Vector3 &s)
{
    Matrix3x3 rotm = matrix3x3(rot);
    scale(rotm, s);
    set_3x3(pose, rotm);
}

Since we have eliminated the step where we extract the scale and feed it back, we have rid ourselves of the feedback loop that caused runaway drifting. Of course, we haven't completely eliminated the problem, because nothing prevents the user from emulating what we did in set_rotation() and recreating the feedback loop:

Vector3 s = scale(pose);
set_rotation_and_scale(pose, new_rotation, s);

Now the drifting problem is back with a vengeance, reintroduced by the user.

To prevent drifting the user must take care not to create such feedback loops. I.e., she can never extract the scale from the matrix. Instead she must store the scale at some other place (separate from the matrix) so that she can always feed the matrix with the correct scale value.

What we have done is essentially to move the burden of keeping track of the scale of objects from the transform (the Matrix4x4) to the user of the transform. This prevents drifting and doesn't have any performance costs, but it is pretty inconvenient for the user to have to track the scale of objects manually. Also, it is error prone, since the user who is not 100 % certain of what she is doing can accidentally recreate the feedback loop that causes drifting.

Solution 3: Quantize the scale values

If none of the two options presented so far seem palpable to you, there is actually a third possibility.

Consider what would happen if we changed the Vector3 scale(const Matrix4x4 &) function so that it always returned integer values.

Calling set_rotation() as before would introduce an error to the scale and set it to, say 1.0000001. But the next time we ran set_rotation() and asked for the scale it would be rounded to the nearest integer value, so it would be returned as 1 -- the correct value. Applying the new rotation would again introduce an error and change the value to 1.0000001, but then again, the next time the function ran, the value returned would be snapped back to 1.

So by rounding the returned scale to fixed discrete values we prevent the feedback loop. We still get small errors in the scale, but without the runaway effect they are unnoticeable. (Small errors occur everywhere, for example in the scene graph transforms. That's the nature of floating point computation. It is not the small errors that are the problem but the mechanisms that can cause them to result in visible effects.)

Of course, if we round to integer values we can only scale an object by 1, 2, 3, etc. Not by 0.5, for instance. But we can fix that by using some other set of discrete numbers for the scale. For example, we could round to the nearest 0.0001. This would let us have scales of 0.9998, 0.9999, 1.0000, 1.0001, 1.0002, … Hopefully that is enough precision to cover all the different scales that our artists might want to use.

Drifting won't happen in this scheme, because the floating point errors will never be big enough to change the number to the next discrete value. (Unless you used really large scale values. If you want to support that -- typically not interesting, because things like texture and vertex resolution start to look wonky -- you could use a geometric quantization scheme instead of an arithmetic one.)

Snapping the scale values in this way might be OK for static scaling. But what if you want to smoothly change the scaling with an animation? Won't the discrete steps cause visible jerks in the movement?

Actually not. Remember that it is only the value returned by scale() that is quantized, the user is still free to set_scale() to any non-quantized value. When the scale is driven by an animation, it is fed from an outside source. We don't need to read it from the matrix and reapply it. So the quantization that happens in scale() never comes into play.

So amazingly enough, this hacky solution of snapping the scale to a fixed set of discrete values actually seems to work for most real world problems. There might be situations where it would cause trouble, but I can't really come up with any.

Solution 4: Prevent systematic errors

A final approach is to try to address how the numbers are drifting instead of stopping them from drifting. If you look at the table above you see that the errors are growing linearly. That is not what you would expect if the errors were completely random.

If the errors in each iteration were completely random, you would get a random walk process where the total error would be e * sqrt(N) rather than e * N, where e is the error from one iteration and N the number of iterations. The fact that the error grows linearly tells us that our computation has a systematic bias -- the error is always pushed in one particular direction.

If we could get rid of this systematic bias and get a truly random error, the accumulated error would grow much more slowly, the square root makes all the difference. For example, for the error to grow to 0.1 % it would take 5.2 years rather than 28 minutes. At that point, we might be ok with the drifting.

I haven't thought that much about what would be needed to get rid of the systematic bias in the set_rotation() function. It's a pretty tricky problem that requires a deep understanding of what happens to all the floating point numbers as they travel through the equations.

Conclusion

In the Bitsquid engine we have so far gone with #2, as a make-shift until we decided on the best permanent solution to this problem. After reviewing the options in this article I think we will most likely go with #1. #3 is an interesting hack and I think it would work almost everywhere, but I'm willing to pay the slight performance price for the cleaner and clearer solution of #1.