Tuesday, December 11, 2012

Four meditations on bad design decisions

I've recently been doing a major rewrite of one of our core engine systems, the graph that we use for our visual scripting language Flow. Taking it from something that looks like this:

To something that looks like this:

A major rewrite like this is always a humbling experience. When you have to rewrite your own code, every bad decision you made comes back to haunt you. And you don't have anybody else to blame them on.

As if facing your own inadequacy wasn't enough -- rewriting an existing system is always harder than writing one from scratch. When you write a new system you start with a blank slate and can do whatever you want. When you rewrite, you are constrained by what the old system did -- at least if you want to maintain any kind of backwards compatibility.

In addition, a new system can be written iteratively. You can start with a very small, simple system, release early and get feedback. Based on that feedback you can tweak the system. You don't have to think about adding features until you have a good stable base.

When you are doing a rewrite you can't release the new system until it is at least as good as the old one. Otherwise, your users will question why you have spent all that time working on a system that is worse than what you had before. And they will be right.

So a rewrite forces you away from the comfortable land of early releases and quick iterations and into the ugly old waterfall model.

With the power of hindsight, I'd like to reflect a bit on four design mistakes I made when I wrote the first version of the system that made this rewrite a lot harder than it could have been.

Don't use strings for non-text things

Strings have one really good use -- to hold pieces of text that either gets displayed to or inputted by the user. All other use of strings should be regarded as suspicious.

Strings are scary because they are both ambiguous and powerful. Does "a/b.txt" and "A//b.txt" represent the same path? Hard to tell. But maybe you can use case conversion, search and replace and some regular expression monstrosity to figure that out.

If you are doing that kind of string manipulation in any part of the code that is not directly related to user input or output, it is a clear warning sign that your code might be too "stringified".

The most obvious example stringified code is the use of "stringly typed" data, for example, storing a date as the string "2012-12-09". But the problem with strings can also manifest more subtle ways.

The original version of Flow used strings to identify connectors, both internally (as a representation of the connection) and visually (to show the name of the connector):

As a consequence, a Flow node couldn't have two connectors with the same name, and a connector couldn't be renamed (even visually) without breaking all existing connections.

In retrospect, rather than having a single Name property, it would be much better to have separate Id and DisplayName properties. The Id would be a GUID that uniquely identified the property, and the DisplayName would be a (localizable) name, suitable for displaying to the end user.

Using names/strings as identifiers has bitten me in other ways as well. In one system I knew that the names had to be unique (because that is how the script would refer to the objects) so I thought it would be safe to use them as identifiers. What I didn't consider was that there could be situations when there temporarily were two objects that had the same name. For example, if the user had created a rock object, and wanted to create a rock_small object -- as she was half-way through typing that name, there would suddenly be two objects named rock. This created problems for the system.

Lesson learned, I now avoid using strings as identifiers.

When in doubt, you should opt-out

Every system acquires features over time. That is good of course. Those features make the system more powerful and easier to work with.

But among the good features there are usually a few that don't feel quite right. That don't really fit into the design of the system. You can do them of course. You can do anything.

But usually it is best not to. Most of the time when I have added a feature that didn't quite feel right, I have regretted it later. In retrospect it would have been better to try to find a different way of doing what the users wanted that was more natural to the ideas behind the system.

An example: Users of Flow wanted some way to specify the order in which events were triggered, when multiple connections are connected to the same Out connector. This is needed in some situations, for example you may want to make sure that a unit is spawned before it is used.

In the old version of Flow, this was implemented with a context menu on the connection where you could select if it should be a "Do First", "Do Last" or "Do Normal" connection.

This solution never felt 100 % right to me. It was hard to find a good intuitive way to visually represent the "Do First" and "Do Last" connections, and as a result the Flow graphs became harder to understand.

In retrospect, it would have been much better to avoid this feature and wait until I had come up with the more elegant alternative: a sequence node that triggers each of its outputs sequentially:

Be explicit or you'll miss it

Writing code where a lot of things happen implicitly feels great -- to begin with. It is amazing how much you are able to do with just a few lines of code.

But in my experience, implicit code almost always ends up more costly in the long run. It is harder to understand, harder to debug and harder to change. It tends to lock you down in a "local minimum" that can be tricky to come out of.

In Flow, a lot of things are done implicitly. The definition of a Flow node is just a simple C# class:

[Category("Animation")]
public class AnimationEvent : Node
{
    public InVariableUnit Unit;
    public StringVariable Event;
    public InEvent In;
    public OutEvent Out;
}

Through reflection, Flow finds out the members in the class and their types and automatically generates Flow nodes for them. This process involves some ugly string processing (bad decision #1), such as stripping In and Variable from the type name to find the underlying type of members. Reflection is also used to serialize the graphs.

While it is nice to be able to express a node so concisely, there are also a lot of problematic consequences. For example, since the class names get serialized, we can't change the names of classes or properties without breaking the ability to load old files. Also, we have to use some really ugly C# hacks to make sure that the reflection system always returns the members of a class in the order they are declared in the file (so that we can control the order of the connectors).

In retrospect, it would been much better to avoid all this clever reflection stuff and instead just define the node types in configuration files.

Avoid the road of complex code

There is some code that needs to be complex, because it is dealing with fundamentally tricky stuff (like computational geometry) or because it needs to run really, really fast. But in all other cases, complexity is just a cost.

If your code starts to feel complex and hard to keep track of, it is a sign that you are probably doing something wrong. And if you are not careful, you may lock yourself in, so that when you write the next version of the system, you have to recreate all that complex behavior in your new, simpler system. You have to deliberately make your code uglier.

The old version of Flow had a way of "folding" nodes. You could select a number of nodes, group them, and then "fold" the group, collapse it to a single node.

The system had a lot of really hairy code for dealing with this. The code takes a bunch of nodes and creates a new node from them, with connectors matching only the external connectors of the collapsed nodes. it also keeps track of the internal nodes and their connections so they can be recreated if the node is later "expanded".

As you might imagine, this was complicated further by the need for connector names to be unique (see bad decision #1), which meant that some of the external connectors in the new node had to be renamed (since they could come from different internal nodes that had connectors with the same name). So a mapping table was needed to keep track of these renames. Obviously a bad idea, but once you have started down the path of wrongness, it can be hard to turn around.

The new version handles this a lot better. Collapse and expansion is just a visual feature. There are no new nodes created and no other strange things happening to the data, the visualizer just chooses to draw the data in a different way when it is collapsed. In retrospect, that is a much better choice.

That is all, four simple lessons
to guide your future coding sessions
now let your code be light and merry
until its time for Charon's ferry

Wednesday, November 21, 2012

A Formal Language for Data Definitions

Lately, I've started to think again about the irritating problem that there is no formal language for describing binary data layouts (at least not that I know of). So when people attempt to describe a file format or a network protocol they have to resort to vague and nondescript things like:

Each section in the file starts with a header with the format:

4 bytes   header identifier
2 bytes   header length
0--20 bytes  extra data in header

The extra data is described below.

As anyone who has tried to decipher such descriptions can testify, they are not always clear-cut, which leads to a lot of unnecessary work when trying to coax data out of a document.

It is even worse when I create my own data formats (for our engine's runtime data). I would like to document those format in a clear and unambiguous way, so that others can understand them. But since I have no standardized way of doing that, I too have to resort to ad-hoc methods.

This whole thing reminds me of the state of mathematics before formal algebraic notation was introduced. When you had to write things like: the sum of the square of these two numbers equals the square of the previous number. Formal notation can bring a lot of benefits (just look at what it has done for mathematics, music, and chess).

For data layouts, a formal definition language would allow us to write a tool that could open any binary file (that we had a data definition) for and display its content in a human readable way:

height = 128
width = 128
comment = "A funny cat animation"
frames = [
 {display_time = 0.1 image_data = [100 120 25 ...]}
 ...
]

The tool could even allow us to edit the readable data and save it back out as a binary file.

A formal language would also allow debuggers to display more useful information. By writing data definition files, we could make the debugger understand all our types and display them nicely. And it would be a lot cleaner than the hackery that is autoexp.dat.

Just to toss something out there, here's an idea of what a data definition might look like:

typdedef uint32_t StringHash;

struct Light
{
 StringHash name;
 Vector3  color;
 float  falloff_start;
 float   falloff_end;
};

struct Level
{
 uint32_t version;
 uint32_t num_lights;
 uoffset32_t light_data_offset;

 ...

light_data_offset:
 Light lights[num_lights];
};

This is a C-inspired approach, with some additions. Array lengths can be parametrized on earlier data in the file and a labels can be used to generate offsets to different sections in the file..

I'm still tossing around ideas in my head about what the best way would be to make a language like this a reality. Some of the things I'm thinking about are:

Use Case

I don't think it would do much good to just define a langauge. I want to couple it with something that makes it immediately useful. First, for my own motivation. Second, to provide a "reality check" to make sure that the choices I make for the language are the right ones. And third, as a reference implementation for anyone else who might want to make use of the language.

My current idea is to write a binary-to-JSON converter. I.e., a program that given a data definition file can automatically convert back and forth between a binary and a JSON-representation of that same data.

Syntax

The syntax in the example is very "C like". The advantage of that is that it will automatically understand C structs if you just paste them into the data definition file, which reduces the work required to set up a file.

The disadvantage is that it can be confusing with a language that is very similar to C, but not exactly C. It is easy to make mistakes. Also, C++ (we probably want some kind of template support) is quite tricky to parse. If we want to add our own enhancements on top of that, we might just make a horrible mess.

So maybe it would be better to go for something completely different. Something Lisp-like perhaps. (Because: Yay, Lisp! But also: Ugh, Lisp.)

I'm still not 100 % decided, but I'm leaning towards a restricted variant of C. Something that retains the basic syntatic elements, but is easier to parse.

Completeness

Should this system be able to describe any possible binary format out there?

Completeness would be nice of course. It is kind of annoying to have gone through all the trouble of defining language and creating the tools and still not be able to handle all forms of binary data.

On the other hand, there are a lot of different formats out there and some of them have a complexity that is borderline insane. The only way to be able to describe everything is to have a data definition language that is Turing complete and procedural (in other words, a detailed list of the instructions required to pack and unpack the data).

But if we go down that route, we haven't really raised the abstraction level. In that case, why even bothering with creating a new language. The format description could just be a list of the C instructions needed to unpack the data. That doesn't feel like a step forward.

Perhaps some middle ground could be found. Maybe we could make language that was simple and readable for "normal" data, but still had the power to express more esoteric constructs. One approach would be to regard the "declarative statements" as syntactic sugar in a procedural language. With this approach, the declaration:

struct LightCollection
{
 unsigned num_lights;
 LightData lights[num_lights];
};

Would just be syntactic sugar for:

function unpack_light_collection(stream)
 local res = {}
 res.num_lights = unpack_unsigned(stream)
 res.lights = []
 for i=1,res.num_lights do
  res.lights[i] = unpack_light_data(stream)
 end
end

This would allow the declarative syntax to be used in most places, but we could drop out to full-featured Turing complete code whenever needed.

Thursday, November 1, 2012

Bitsquid Foundation Library

Today I want to talk a bit about the Bitsquid Foundation Library that we recently released on Bitbucket (under the permissive MIT license).

It's a minimalistic "foundation" library with things like memory management and collection classes. The idea is to have something that can be used as a reasonable starting-off point for other open source projects.

The library makes some interesting design choices that touches on topics that I have already talked about in this blog and that I think are worth elaborating on a bit further. It also serves an example on how these techniques can be used in "real world" code.

Separation of data and code

The foundation library implements the idea of separating data definitions and function implementation, that I talked about in this article.

Data is stored in structs with public members (prefixed with an underscore to indicate that you should not mess with them unless you know what you are doing) that are found in *_types.h files. Functions that operate on the data are written outside of the struct, in separate *.h files (and organized into namespaces).

For example, the data definition for the dynamic Array<T> class is found in collection_types.h:

template<typename T> struct Array
{
    Array(Allocator &a);
    ~Array();
    Array(const Array &other);
    Array &operator=(const Array &other);
    
    T &operator[](uint32_t i);
    const T &operator[](uint32_t i) const;

    Allocator *_allocator;
    uint32_t _size;
    uint32_t _capacity;
    T *_data;
};

The struct contains only the data used by the array and the operators which C++ forces us to implement as member functions.

The implementation of these functions, as well as the declaration and definition of all other functions that operate on the arrays are found in the array.h file. It contains things like:

namespace array
{
    template<typename T>
    inline uint32_t size(const Array<T> &a)
    {return a._size;}
    
    template<typename T>
    inline bool any(const Array<T> &a)
    {return a._size != 0;}
    
    template<typename T>
    inline bool empty(const Array<T> &a)
    {return a._size == 0;}
}

template <typename T>
inline Array<T>::Array(Allocator &allocator) :
    _allocator(&allocator), _size(0), _capacity(0), _data(0)
{}

This way of arranging data and code fills two purposes.

First, it improves compile times by reducing header inclusion. Header files that want to make use of arrays only need to include collection_types.h, which just contains a few struct definitions. They don't have to drag in array.h, with all its inline code.

Headers including other headers indiscriminately because they need their types is what leads to exploding compile times. By only including the minimal thing we need (the type definitions), compile times are minimized.

Second, and more importantly, this design allows the collection types to be freely extended. Is there anything you miss in the array interface? Perhaps you would like shift() and unshift() methods? Or binary_search()?

No problem. If you want them you can just add them, and you don't even need to modify array.h. Just create your own file array_extensions.h or whatever, and add some new functions to the array namespace, that manipulate the data in the Array<T> interface. The functions you create will be just as good as the functions I have created.

Note that this isn't true for traditional class designs, where you have first-class citizens (methods) and second-class citizens (external functions).

The foundation library has some interesting examples of this. For example, the string_stream functions don't operate on any special StringStream class, they just directly use an Array<char>. Also, the hash and multi_hash interfaces both work on the same underlying Hash<T> struct.

I believe that this design leads to simpler, more orthogonal code that is easier to extend and reuse.

Memory management

The library implements the allocator system mentioned in this article. There is an abstract Allocator interface, and implementations of that interface can provide different allocation strategies (e.g. ArenaAllocator, HeapAllocator, SlotAllocator, etc).

Since I want to keep the library platform independent, I haven't implemented a PageAllocator. Instead, the MallocAllocator is used as the lowest allocator level. If you want to, you can easily add a PageAllocator for your target platform.

For the same reason, I haven't added any critical section locking to the allocators, so they aren't thread safe. (I'm thinking about adding an interface for that though, so that you can plug in a critical section implementation if needed.)

The system for temporary allocations is kind of interesting and deserves a bit further explanation.

Most games have a need for temporary memory. For example, you might need some temporary memory to hold the result of a computation until it is done, or to allow a function to return an array of results.

Allocating such memory using the ordinary allocation system (i.e., malloc) puts a lot of unnecessary stress on the allocators. It can also create fragmentation, when long lived allocations that need to stay resident in memory are mixed with short lived temporary allocations.

The foundation library has two allocators for dealing with such temporary allocations, the ScratchAllocator and the TempAllocator.

The ScratchAllocator services allocation requests using a fixed size ring buffer. An allocate pointer advances through the buffer as memory is allocated, and a corresponding free pointer advances as memory is freed. Memory can thus be allocated and deallocated with simple pointer arithmetic. No calls need to be made to the underlying memory management system.

If the scratch buffer is exhausted (the allocate pointer wraps around and catches up with the free pointer), the ScratchAllocator will revert to using the ordinary MallocAllocator to service requests. So it won't crash or run out of memory. But it will run slower, so try to avoid this by making sure that your scratch buffer is large enough.

If you forget to free something allocated with the ScratchAllocator, or if you accidentally mix in a long-lived allocation among the short-lived ones, that allocation will block the free pointer from advancing, which will eventually exhaust your scratch buffer, so keep an eye out for such situations.

TempAllocator<BYTES> is a scoped allocator that automatically frees all its allocated memory when it is destroyed (meaning you don't have to explicitly call deallocate(), you can just let the allocator fall out of scope). This means you can use it everywhere where you need a little extra memory in a function scope:

void test()
{
     TempAllocator1024 ta;
     Array<char> message(ta);
     ...
}

The BYTES argument to TempAllocator<BYTES> specifies how much stack space the allocator should reserve. The TempAllocator contains char buffer[BYTES] that gets allocated on the stack together with the TempAllocator.

Allocation requests are first serviced from the stack buffer, then (if the stack buffer is exhausted) from the ScratchAllocator.

This means that TempAllocator gives you an allocator that can be used by all collection classes and will use the fastest allocation method possible (local stack memory, followed by scratch buffer memory, followed by malloc() if all else fails).

Minimalistic collection types

The collection classes in the library are distinctly anti-STL. Some of the important differences:

  • They use the allocation system described above (taking an Allocator as argument to the constructor). They can thus be used sensibly with different allocators (unlike STL types).

  • The use the data/function separation also described above, which means that the headers are cheap to include, and that you can extend them with your own functionality.

  • They use a minimalistic design. They assume that the stored data consists of plain-old-data objects (PODs). Constructors and destructors are not called for the stored objects and they are moved with raw memmove() operations rather than with copy constructors.

This simplifies the code and improves the performance (calling constructors and destructors is not free). It also saves us the headache of dealing with storing objects that must be constructed with Allocators.

Personally I like this minimalistic approach. If I want to keep non-POD data in a collection, I prefer to store it as pointers anyway, so I have control over when and how the data is constructed and destroyed. I don't like those things happening "behind my back". You may disagree of course, but in that case you are probably happy to use STL (or boost).

Another example of choosing minimalism is the Hash<T> class. The hash uses a fixed key type which is a uint64_t. If you want to use a key that doesn't fit into 64 bits, you have to hash it yourself before using it to access the data.

And more?

I'm planning to add some basic math code to the library, but haven't gotten around to it yet.

Is there anything else you'd like to see in a library like this?

Wednesday, October 17, 2012

A Data-Oriented, Data-Driven System for Vector Fields -- Part 3

In this post, I'll finish my series on vector fields (see part 1 and part 2) by tying up some loose ends.

Quick recap of what has happened so far:

  • I've decided to represent my vector fields in functional form, as a superposition of individual effect functions G_i(p).

  • I represent these functions in bytecode format, as a piece of bytecode that given an input position p computes a vector field strength F_i.

  • By running each step of the virtual machine over a thousands of input points, the cost of decoding and interpreting the bytecode instructions is amortized over all those points.

  • This means that we get the bytecode decoding "for free" -- the bytecode can run at nearly native speed.

Bytecode format

In the last article I didn't say much about what format I used for the bytecode. Generally speaking, designing a bytecode format can be tricky, because you have to balance the compactness (keeping programs short) against the decoding cost (keeping bytecode fast).

Lucky for us, we don't care about either of these things. Compactness doesn't matter, because our programs will be very short anyway (just a few instructions). Decoding cost doesn't matter (much), because it is amortized.

When it doesn't really matter I always pick the simplest thing I can think of. In this case it is something like:

(instruction) (result) (argument-1) (argument-2) 

Here, instruction is a 4-byte instruction identifier. result is a 4-byte channel identifier that tells us which channel the result should be written to. argument-1 and argument-2 are either channel identifiers or Vector4's with constant arguments. (Instructions of higher arity would have more arguments.)

Note that using 4 bytes for instructions and registers is beyond overkill, but it is the simplest option.

One annoyance with this representation is that I need different instructions depending on whether argument-1 or argument-2 is constant. For a 2-arity instruction, I need four variants to cover all cases. For a 4-arity instruction (such as select), I would need 16 variants.

There are two ways of dealing with this. First, I could make the code that executes each instruction a bit more complex, so that it can handle both constant and register arguments. Second, I could make all instructions operate only on registers and have a single instruction for loading constants into registers.

Unfortunately, both of these option results in significantly slower bytecode. In the first case, the extra logic in each bytecode executor makes it slower. In the second case, we need extra instructions for loading constants, which increases the execution time.

So at least for two argument functions, the best option seems to be to have separate code for handling each argument combination. For four argument functions, it might be better to use one of the other options.

Just to give you some example of how the bytecode works, here is some raw byte code and the corresponding disassembled bytecode instructions:

05000000 02000000 00000000 00000000000020410000000000000000
r2 = sub          r0       (0,10,0,0)

16000000 03000000 00000000000000000000803f00000000 02000000
r3 = cross        (0,0,1,0)                        r2

0a000000 04000000 00002041000020410000204100002041 03000000
r4 = mul          (10,10,10,10)                    r3

10000000 03000000 02000000 02000000
r3 = dot          r2       r2

0c000000 05000000 04000000 03000000
r5 = div          r4       r3

09000000 03000000 05000000 0000a0400000a0400000a0400000a040
r3 = mul          r5       (5,5,5,5)

00000000  01000000  01000000  03000000
r1 = add            r1        r3

High-level language

You can't really expect people to author their effects in raw bytecode, or even in our "bytecode assembly language". Effect authors will be a lot more productive if they can use a more comfortable language.

I decided to create such a language and model it after HLSL, since it serves a similar purpose (fast processing of vectorized data). Programmers interested in writing vector field effects are probably already used to working with HLSL. Plus, if at some point we want to move some of this work to the GPU we can reuse the code.

To show what the high level language looks like, here is an implementation of a whirl effect:

const float4 center = float4(0,10,0,0);
const float4 up = float4(0,0,1,0);
const float4 speed = float4(10,10,10,10);
const float4 radius = float4(5,5,5,5);

struct vf_in
{
    float4 position : CHANNEL0;
    float4 wind : CHANNEL1;
};

struct vf_out
{
    float4 wind : CHANNEL1;
};

void whirl(in vf_in in, out vf_out out)
{
    float4 r = in.position - center;
    out.wind = in.wind + speed * cross(up, r) / dot(r,r) * radius;
}

If you squint, you may notice that this high level code exactly corresponds to the low level bytecode in the previous example.

Just as with HLSL, although this looks like C it actually isn't C. Things that work in C may not work in this language and vice versa. I'm quite strict when I parse this. I figure it is better to be start by being strict rather than permissive. This gives you more leeway to extend or modify the language later while keeping backwards compatibility. A strict syntax can always be loosened later, but if you design the language with a too permissive syntax you can paint yourself in a corner (case in point: Ruby).

I usually don't bother with Lex or Yacc when I write a parser. They are OK tools, I guess, but if I can get by without them I prefer not to have the extra precompile step and to have code that is a bit more straightforward to read and debug.

Instead I tend to use a recursive descent parser (a predictive variant, with no backtracking) or some variation of Dijkstra's shunting yard algorithm. Or sometimes a combination of both.

For this language I parse the overall structure with recursive descent, and then use Dijkstra's algorithm to process each statement in the function body.

I generate the bytecode directly from the shunting yard algorithm. When I pop an operator from the operator stack I generate the bytecode for computing that operator and storing the result in a temporary register. I then push that register to the value stack so that the result can be used in other computations. Temporary channels are recycled after they are popped of the value stack to minimize the channel count.

Constant patching

Constants in the bytecode can be changed when an effect is played. I do this by directly patching the bytecode with the new constant values.

When I generate the bytecode I keep track of where in the bytecode different global constants can be found. This patch list is a simple array of entries like:

(hashed constant name) (offset in bytecode)

When playing a vector field effect, the gameplay programmer specifies the constant values with a table:

VectorField.add(vf, "whirl", {radius = 10})

I look through the patch list, find all the offsets of constants named "radius" and replace them with the value(s) supplied by the gameplay programmer.

Since globals can be patched later, I can't do constant folding when I generate the bytecode. (Without global patching, I could just check if both arguments were constants when I popped an operator, and in that case, compute the constant result and push that directly to the value stack, instead of generating a bytecode instruction.)

I could reduce the instruction count somewhat and improve performance by doing a constant folding pass on the bytecode after the globals have been patched, but I haven't implemented that yet.

Physics integration

In my physics system I maintain a list of all awake (non-sleeping) actors. I apply wind from a vector field with an explicit call:

void apply_wind(const VectorField &field, const CollisionFilter &filter);

This extracts the position of every awake actor that matches the collision filter and sends that list to the vector field for evaluation. It then does a second loop through the actors to apply wind forces from the returned wind velocities.

I've chosen to have an explicit step for applying wind, so that you don't have to pay anything for the wind support unless you actually use it. Having an explicit step also opens up the possibility to have other types of vector fields. For example, there could be a vector field representing gravity forces and a corresponding function:

void apply_acceleration(const VectorField &field, const CollisionFilter &filter);

The fact that the wind is only applied to awake actors is important. Without that check, the wind forces would keep every actor in the world awake all the time, which would be really expensive for the physics engine. Just as with gravity, we want physics objects to come to rest and go to "sleep" when the wind forces are in balance with other forces on the actor.

This of course creates a problem when the wind forces are varying. An actor may be in balance now, but a change in the wind direction could change that. A leaf that is resting on the ground may be lifted by a sudden updraft. Since we don't apply the wind forces to sleeping object we can't get that behavior. Once a leaf has come to rest, it will stay put.

This problem is most noticeable when you have drastic effects like explosions in the vector field. It looks really strange when actors are completely immobile and "sleep through" a big explosion.

I deal with this by having a function for explicitly waking actors in an AABB:

wake_actors(const Vector3 &min, const Vector3 &max, const CollisionFilter &filter)

If you want to play a drastic wind effect (like an explosion), you should first wake the nearby actors with a call to wake_actors(). This ensures that all nearby actors will get the wind forces from the explosion (since they are now awake).

I apply the wind force with the standard formula:

F = 1/2 r v^2 C A

Where r is the density of air, v is the relative velocity of the air with respect to the object (so v = v_wind - v_object, where v_wind is the wind speed and v_object is the object's speed). C is a drag coefficient that depends on the object's shape and A is the object's reference area.

For C and A, I actually loop through all the physics shapes in the actor and estimate C and A based on those shapes. This is by no means a perfect approach. There are many situations where C might be really different from what such an estimation gives. For example, an object that is heavily perforated would receive much less wind force.

However, I want to have something in place that gives decent behavior in most cases, so that it only very rarely has to be changed. The less artists have to mess around with physical parameters, the smaller is the chance that anything gets messed up.

Note that the wind force is just air resistance with a velocity for the air. So by implementing wind you get the "air resistance" behavior "for free".

Rotation

If you compute the drag force using the formula above and apply it to a physics actor, it won't add any rotation to the actor. This is actually correct. The drag force, as we compute it here, has no rotational component.

Yet it feels counter-intuitive. We expect objects to rotate when they are blown about by the wind. Leafs and papers certainly swirl around a lot when the wind blows.

What happens in that case is actually a second order effect. When the wind blows around an object you get zones of high and low pressure as well as turbulence, and it is the forces from these interactions that affects the object's rotation.

These interactions are tricky to model accurately and they depend a lot on the object's shape. Right now, I'm not even trying. Instead I use a much simpler approach: I apply the drag force a bit above the object's actual center of mass so that it produces a torque and makes the object rotate. This is a complete hack that has no basis at all in physical reality, but it does add some rotation. At least it looks a lot better than applying the wind force without any rotation.

It should be possible to do better -- to make some kind of estimate of what rotational forces wind induces when it blows against typical physics shapes: boxes, spheres, capsules, etc. Just give my a couple of days in a wind tunnel and I'll try to come up with something.

Tuesday, October 2, 2012

A Data-Oriented, Data-Driven System for Vector Fields -- Part 2

In Part 1 we decided to represent a vector field as a superposition of individual effects:

G(p) = G_0(p) + G_1(p) + ... + G_n(p)

Here, each G_i(p) is a function that represents some effect, such as wind, an explosion or the updraft from an air vent.

The next step is to find a way of quickly evaluating the function G(p), a general function that could be almost anything, for lots of different positions p_i. This is quite tricky to do well in C++.

Of course, evaluating specific functions is not hard. If we want to evaluate a specific function, such as:

Vector3(sin(p.x), sin(p.y), 0);

we can just type it up:

inline Vector3 f(const Vector3 &p)
{
 return vector3(sin(p.x), sin(p.y), 0);
}

But if we don't know beforehand what G(p) will be we don't have that option.

We could write our system so that it supported a limited set of specific effects, with hardcoded C++ implementations. For example, there could be an "explosion" effect with some parameters (radius, strength, etc), an "updraft" effect, a "whirl" effect, etc. Similarly we could have support for a variety of standard shapes, such as "sphere", "cylinder", "capsule", etc. And perhaps some different types of falloffs ("linear", "quadratic"). Perhaps also some temporal effects ("attack-sustain-release", "ease-in-ease-out").

But it is hard to know where to draw the limit with this approach. Exactly what effects and shapes and falloffs and time curves should the system support? The more things we add, the more cluttered the system becomes. And the system is still not completely general. No matter how much we add, there will still be some things that the user just can't do, without disturbing a programmer and get her to add a new effect to the system. This means that the system is not truly data-driven.

Whether this is a problem or not depends a lot on your development style. If you are a single artist-programmer working on a single game you may not even care. To you code and data is the same thing. Who cares if you have to add something to the code to make a special effect. That is what the code is for.

At Bitsquid, however, we are in a different position. We are making a general purpose engine to be used on multiple platforms for all kinds of tasks. We can't put game specific code in the engine or everything will end up a total mess. Sure, our licensees could modify their cloned copy of the source to add their own effects. But that is not an ideal solution. It forces them to learn our code, it makes it harder for us to reproduce their bugs, since our code bases have now diverged and it makes it harder for us to modify and optimize the source code without putting our licensees in merge hell.

So our aim is always to be completely data-driven.

But how can we represent a general function as data? There are really only two possibilities:

  • As a piece of executable machine code.

  • As a piece of bytecode that gets executed by a virtual machine.

The first approach is the fastest of course, but it has two drawbacks. First, machine code is platform dependent. Writing a system that can dynamically generate machine code for a lot of different targets is no small undertaking (though it could be simplified by using LLVM). Second, and more serious, many systems simply don't allow us execute dynamically generated machine code.

The inevitable conclusion is that we have to use bytecode (perhaps coupled with a machine code compiler on the platforms where that is feasible).

Unfortunately, as everybody who has used a dynamic language without a JIT compiler knows, bytecode is slow. Usually, at least a factor 10 slower than machine code. And remember that one of our design goals for this system was that it should be fast. We said in the beginning that it should be able to handle at least 10 000 queries per frame.

So what can we do?

The Massively Vectorized Virtual Machine

At this point it makes sense to stop and think a bit about why bytecode is slow. If you look at the code of a virtual machine, it is essentially a tight loop that repeatedly does three things:

  • Decode the next bytecode instruction into operation + arguments.

  • Jump to the code that performs the operation.

  • Execute the operation.

The third step is usually just as fast as handwritten machine code would be. Computing a+b is not more expensive because it was triggered by an OP_ADD bytecode instruction.

So all the overhead of bytecode, the thing that makes it "slow", is found in the first two steps.

Well then here is an idea: what if we could reuse the computations that we make in those two steps?

Remember that our goal is to compute G(p) for a lot of points p_i. We want to evaluate the same function, the same bytecode instructions, for a lot of different data points. In that case, why repeat the expensive operation of decoding the bytecode instructions again and again for each point? Why not just decode the instruction once and then execute it for all data points?

So, with that change, our virtual machine loop now becomes:

  • Decode the next bytecode instruction.

  • Jump to the code that executes it.

  • Execute that single instruction for all the input data.

With this change, the cost of decoding the bytecode is now amortized over all the query points. The more query points we have, the less time (proportionally) we will spend on decoding bytecode. With enough points (>1024) that time should be nearly negligible . In other worlds, our bytecode should be able to run at nearly the same speed as native machine code.

In a quick test I made, the overhead of a bytecode implementation compared to native code was just 16 % -- a far cry from the 10x slowdown we have come to expect.

Fleshing out the Details

Since we are computing a vector function on vector input and we want it to run as fast as possible, it makes sense to use SSE (or its equivalent on other platforms) and represent all our data as vector4 intrinsics.

Virtual machines can be stack-based or register-based. Stack-based machines produce more compact bytecode since the arguments are implicit. Register-based machines need fewer instructions to accomplish a task, since they don't have to juggle things around on the stack. In our case, compact bytecode doesn't buy us much, since our programs are short and the decoding cost is amortized. On the other hand, accomplishing the same thing with fewer instructions means less code to execute for each query point. So a register-based virtual machine seems to be a clear win.

Here is what the code for an explosion effect could look like in a made-up intermediate language for our virtual machine. The effect produces a wind of 50 m/s outwards from the center of a sphere of radius 5 m located at (2,4,0):

direction = sub position, (2,4,0,0)
lensqr = dot direction, direction
direction = normalize direction
direction = mul direction, (50,50,50,50)
direction = select_lt lensqr, (25,25,25,25), direction, (0,0,0,0)
output = add output, direction

Here position is the input query position and output is the output result of the function. direction and lensqr are temporary variables.

Note that the final operation adds the result to the output register instead of overwriting it. This allows us to merge multiple effects by simply concatenating their bytecode. So to evaluate G(p) for a large number of points, we can first intersect the AABB of the points with the AABB of each individual effect G_i(p). Then we merge the bytecodes of each intersecting effect into a single bytecode function G'(p) that we finally evaluate for each point.

We can feed position and output to the virtual machine as arrays of intrinsics:

void evaluate(void *bytecode, unsigned n, Vector4I *positions, Vector4I *output)

Note that since we are running the bytecode one instruction at a time for all the data, the local variables (direction and lensqr) need to be arrays too, since we need to remember their value for each of the input positions.

We could allocate arrays for these local variables and pass them to evaluate just as we do for positions and output. But that seems a bit wasteful. A complicated function could have twenty global variables or more, meaning that with 10 000 particles we would need to allocate 3.2 MB of temporary memory. The amount needed will vary widely, depending on how complicated the function is, which is driven by the data. This makes it hard to do a memory budget for the system.

So let's use an alternative approach. We allocate all local variable buffers from a "scratch space" which is provided by the caller:

void evaluate(void *bytecode, unsigned n, Vector4I *positions, Vector4I *output, unsigned scratch_bytes, void *scratch_space)

Now the caller has complete control over the amount of temporary memory the system uses. It is predictable and can be made to fit any desired memory budget.

To make this work, we need to chop this scratch memory up into areas for each local variable. The size of those buffers then determine how many input positions we can process at a time.

For example, suppose we have 256 K of scratch memory and 8 local variables. Each local variable then gets 32 K of memory, which can hold 2 K Vector4I's. So this means that instead of processing all 10 000 particles at the same time when we execute an opcode, we process the particles in 5 chunks, handling 2 048 particles each time. The cost of decoding the bytecode gets amortized over 2 048 particles, instead of over 10 000, but it is still negligible.

The nice thing about this approach is that we always use a constant, predictable amount of scratch space, regardless of how many query points we process and how complicated the function is. Instead we scale down how many particles we process at a time.

Since both input data and local variables are now Vector4I buffers, the inner loop of the virtual machine is simple to write, it will look something like:

void run_vm(const void *bytecode, unsigned n, Vector4I **registers)
{
 const void *pc = bytecode;
 while (true) {
  unsigned op = DECODE_OP(pc);
  switch(op) {
   case OP_ADD:
    Vector4I *a = registers[DECODE_REGISTER(pc)];
    Vector4I *b = registers[DECODE_REGISTER(pc)];
    Vector4I *c = registers[DECODE_REGISTER(pc)];
    Vector4I *ae = a + n;
    while (a != ae) {
     *a++ = addi(*b++, *c++);
    }
    break;
   ...
  }
 }
}

An Example

Here is a YouTube video that shows a vector field implemented using this method. Unfortunately, the YouTube compression is not very nice to a video that contains this much high-frequency information. But at least it gives some idea of the effect.

The video shows 20 000 particles being animated by the vector field at a query cost of about 0.4 ms on a single thread (of course, parallelization is trivial, so you can divide that by the number of available cores).

Monday, September 17, 2012

A Data-Oriented, Data-Driven System for Vector Fields - Part 1

A vector field is a function that assigns a vector value to each point in 3D space. Vector fields can be used to represent things like wind (the vector field specifies the wind velocity at each point in space), water, magnetism, etc.

To me, wind is the most interesting use case. I want a system that can be used for physics (trees, tumble weed, paper cups), particles (leaves, sparks, smoke) and graphics (grass). I also want the system to be capable of handling both global effects (wind blowing through the entire level) and local effects (explosions, air vents, landing helicopters, rising hot air from fires, etc). But I don't want to limit the system to only handling wind. I imagine that once the system is in place, it could be put to other interesting uses as well.

There are a number of things that make this an interesting non-trivial design challenge:

  • Vector fields represent a global shared state. All systems (particles, physics, etc) should react to the same wind. This can create strong couplings between unrelated systems, which we want to avoid.

  • The system must be fast. We want to be able to make large particle effects that are affected by wind. As a design goal, let's say that it should be able to handle at least 10 000 queries / frame.

  • As stated above, the system must be flexible enough to handle both global wind and a large variety of different local effects (air vents, fans, etc).

I'll outline the system in a series of articles. Let's start by thinking a bit about how we can represent the vector field in a way that allows for fast queries.

1. Use a functional representation

Storing the vector value for every point in 3D space at a decent resolution would require huge amounts of memory. It would also be very expensive to update. If we wanted to change the global wind direction, we would have to loop over all those points and change the value.

So, instead, we will use a functional representation. We will express the field as some closed function F(p, t) that gives us the field vector at point p in space at the time t.

For example, we could express a global wind that oscillates in the x-direction as:

F(p, t) = Vector3(sin(t), 0, 0)

The closed function form allows us to evaluate the vector field at any point in space and time.

Note that even with a functional form as the main representation, we can still interact with grid based representations. For example, we can render some section of the F(p, t) function to a texture for use on a GPU. Similarly, if we have some grid based wind data that we want to add to the simulation, we could use that as part of the F(p, t) expression:

F(p, t) = Vector3(sin(t), 0, 0) + sample_grid(grid, p)

2. Ignore the time coordinate

The vector field function F(p, t) is a function of both space and time. The wind varies throughout the level and if we look at any one point, the wind at that point varies over time.

But in practice, we treat the p and t coordinates very differently. We start at some time t_0 and then evaluate F(p, t_0) for thousands of different p values. Then we move on to t_1 and do the same thing.

We can make use of the fact that t remains constant for a large number of evaluations to simplify the function. For example at t=0.5 the function:

F(p, t) = sin(p.x) * sin(p.y) * cos(t)

simplifies to:

G(p) = sin(p.x) * sin(p.y) * 0.8776

which is cheaper to evaluate.

Taking this approach a step further, it makes sense to split our system in two parts -- a high level system that knows about time and every frame produces a new G(p) for the current time, and a low level system that ignores time completely and just computes G(p). Since the high level system only runs once per frame it can afford to do all kinds of complicated but interesting stuff, like constant folding, optimization, etc.

For the low level system we have reduced the problem to evaluating G(p).

3. Express the field as a superposition of individual effects

To make it possible for the field to contain both global effects (world wind) and local effects (air vents, explosions) we express it as a superposition of individual effect functions:

G(p) = G_1(p) + G_2(p) + ... + G_n(p)

Here G_i(p) represents each individual effect. A base wind could be expressed as just a constant:

G_0(p) = Vector3(2.1, 1.4, 0)

A turbulence function could add a random component

G_1(p) = turbulence(seed, p, 4)

An explosion effect could create a wind with a speed of 100 m/s outwards from the center of the explosion in a sphere with radius 4.0 meter around the explosion center:

G_2(p) = sphere(p,c,4) * normalize(p-c) * 100

Here sphere(p,c,4) is a spherical support function that defines the range of the effect. It is 1 if ||p - c|| <= 4.0 and 0 otherwise.

Note again that we have stripped out the time component. At the higher level, this might be an expanding sphere with decreasing wind speeds, but at the low level we only care what it looks like at this instance.

Similar functions can be added for other local effects.

4. Use the AABB to cull local fields

If we have a lot of local effects (explosions, etc), evaluating G(p) will be pretty expensive.

We can reduce the cost by only evaluating the local effects that are close enough to our particle system to matter.

I.e., instead of evaluating G(p) for all particles, we first intersect the AABB of each G_i(p)'s support with the AABB of our particle system.

That gives us a simpler function G'(p) that we can then evaluate for each particle.

If we wanted to, we could use the wavelength of the field for further simplifications. If the scale at which a field effect changes is much larger than our AABB, we can replace that effect with a Taylor series expansion. Similarly, if an effect oscillates at a scale much smaller than the size of our particles, we can replace it with its average value.

Next time

Next time I will look at how we can efficiently evaluate arbitrary functions, such as:

G(p) = Vector3(1,1,0) + turbulence(seed, p, 2) + sphere(p, c, 4)

for a huge number of particle positions p.

This has also been posted to The Bitsquid blog.

Monday, September 3, 2012

A new way of organizing header files

Recently, I've become increasingly dissatisfied with the standard C++ way of organizing header files (one .h file and one .cpp file per class) and started experimenting with alternatives.

I have two main problems with the ways headers are usually organized.

First, it leads to long compile times, especially when templates and inline functions are used. Fundamental headers like array.h and vector3.h get included by a lot of other header files that need to use the types they define. These, in turn, get included by other files that need their types. Eventually you end up with a messy nest of header files that get included in a lot more translation units than necessary.

Sorting out such a mess once it has taken root can be surprisingly difficult. You remove an #include statement somewhere and are greeted by 50 compile errors. You have to fix these one by one by inserting missing #include statements and forward declarations. Then you notice that the Android release build is broken and needs additional fixes. This introduces a circular header dependency that needs to be resolved. Then it is on to the next #include line -- remove it, rinse and repeat. After a day of this mind-numbingly boring activity you might have reduced your compile time by four seconds. Hooray!

Compile times have an immediate and important effect on programmer productivity and through general bit rot they tend to grow over time. There are many things that can increase compile times, but relatively few forces that work in the opposite direction.

It would be a lot better if we could change the way we work with headers, so that we didn't get into this mess to begin with.

My second problem is more philosophical. The basic idea behind object-oriented design is that data and the functions that operate on it should be grouped together (in the same class, in the same file). This idea has some merits -- it makes it easier to verify that class constraints are not broken -- but it also leads to problems. Classes get coupled tightly with concepts that are not directly related to them -- for example things like serialization, endian-swapping, network synchronization and script access. This pollutes the class interface and makes reuse and refactoring harder.

Class interfaces also tend to grow indefinitely, because there is always "more useful stuff" that can be added. For example, a string class (one of my pet peeves) could be extended with functionality for tokenization, path manipulation, number parsing, etc. To prevent "class bloat", you could write this code as external functions instead, but this leads to a slightly strange situation where a class has some "canonized" members and some second-class citizens. It also means that the class must export enough information to allow any kind of external function to be written, which kind of breaks the whole encapsulation idea.

In my opinion, it is much cleaner to organize things by functionality than by type. Put the serialization code in one place, the path manipulation code in another place, etc.

My latest idea about organization is to put all type declarations for all structs and classes in a single file (say types.h):

struct Vector3 {
 float x, y, z;
};

template <class T>
class Array<T> {
public:
 Array() : _capacity(0), _size(0), _data(0) {}
 ~Array() {free(_data);}
 unsigned _capacity;
 unsigned _size;
 T *_data;
};

class IFileSystem;
class INetwork;

Note that types.h has no function declarations, but it includes the full data specification of any struct or class that we want to use "by value". It also has forward declarations for classes that we want to use "by reference". (These classes are assumed to have pure virtual interfaces. They can only be created by factory functions.)

Since types.h only contains type definitions and not a ton of inline code, it ends up small and fast to compile, even if we put all our types there.

Since it contains all type definitions, it is usually the only file that needs to be included by external headers. This means we avoid the hairy problem with a big nest of headers that include other headers. We also don’t have to bother with inserting forward declarations in every header file, since the types we need are already forward declared for us in types.h.

We put the function declarations (along with any inline code) in the usual header files. So vector3.h would have things like:

inline Vector3 operator+(const Vector3 &a, const Vector3 &b)
{
 Vector3 res;
 res.x = a.x + b.x;
 res.y = a.y + b.y;
 res.z = a.z + b.z;
 return res;
}

.cpp files that wanted to use these operations would include vector3.h. But .h files and other .cpp files would not need to include the file. The file gets included where it is needed and not anywhere else.

Similarly, array.h would contain thinks like:

template <class T>
void push_back(Array<T> &a, const T &item)
{
 if (a._size + 1 > a._capacity)
  grow(a);
 a._data[a._size++] = item;
}

Note that types.h only contains the constructor and the destructor for Array<T>, not any other member functions.

Furthermore, I prefer to design classes so that the "zero-state" where all members are zeroed is always a valid empty state for the class. That way, the constructor becomes trivial, it just needs to zero all member variables. We can also construct arrays of objects with a simple memset().

If a class needs a more complicated empty state, then perhaps it should be an abstract interface-class instead of a value class.

For IFileSystem, file_system.h defines the virtual interface:

class IFileSystem
{
 virtual bool exists(const char *path) = 0;
 virtual IFile *open_read(const char *path) = 0;
 virtual IFile *open_write(const char *path) = 0;
 ...
};

IFileSystem *make_file_system(const char *root);
void destroy_file_system(IFileSystem *fs);

Since the “open structs” in types.h can be accessed from anywhere, we can grop operations by what they do rather than by what types they operate on. For example, we can put all the serialization code in serialization.h and serialization.cpp. We can create a file path.h that provides path manipulation functions for strings.

An external project can also "extend" any of our classes by just writing new methods for it. These methods will have the same access to the Vector3 data and be called in exactly the same way as our built-in ones.

The main drawback of this model is that internal state is not as "protected" as in standard object-oriented design. External code can "break" our objects by manipulating members directly instead of using methods. For example, a stupid programmer might try to change the size of an array by manipulating the _size field directly, instead of using the resize() method.

Naming conventions can be used to mitigate this problem. In the example above, if a type is declared with class and the members are preceded by an underscore, the user should not manipulate them directly. If the type is declared as a struct, and the members do not start with an underscore, it is OK to manipulate them directly. Of course, a stupid programmer can still ignore this and go ahead and manipulate the members directly anyway. On the other hand, there is no end to the things a stupid programmer can do to destroy code. The best way to protect against stupid programmers is to not hire them.

I haven’t yet written anything really big in this style, but I've started to nudge some files in the Bitsquid codebase in this direction, and so far the experience has been positive.