Wednesday, October 20, 2010

A* is Overrated


Open any textbook on AI and you are sure to find a description of the A* algorithm. Why? Because it is a provably correct solution to a well-defined, narrow problem. And that just makes it so... scientific and... teachable. Universities teach Computer Science, because nobody knows how to teach Computer Artistry or Computer Craftsmanship. Plus science is important, everybody knows that.

But I digress.

A* is a useful algorithm. You should know it. But when we are talking about AI navigation, A* is just an imperfect solution to a tiny and not very difficult part of the problem. Here are three reasons why you shouldn't give A* so much credit:

1. Pathfinding is often not that important.

In many games, the average enemy is only on-screen a couple of seconds before it gets killed. The player doesn't even have a chance to see if it is following a path or not. Make sure it does something interesting that the player can notice instead.

2. A* is not necessarily the best solution to path finding problems.

A* finds the shortest path, but that is usually not that important. As long as the path "looks reasonable" and is "short enough" it works fine. This opens the floor for a lot of other algorithms.

For long-distance searches you get the best performance by using a hierarchical search structure. A good design of the hierarchy structure is more important for performance than the algorithm you use to search at each level.

Path-finding usually works on a time scale of seconds, which means that we can allow up to 30 frames of latency in answering search queries. To distribute the work evenly, we should use a search algorithm that runs incrementally. And it should of course parallelize to make the most use of modern hardware.

So what we really want is an incremental, parallelizable, hierarchical algorithm to find a shortish path, not cookie cutter A*.

3. Even if we use A*, there are a lot of other navigational issues that are more important and harder to solve than path finding.

Implementation, for starters. A* only reaches its theoretical performance when the data structures are implemented efficiently. You can still find A* implementations where the open nodes are kept in a sorted list rather than a heap. Ouch. Some implementation details are non-trivial. How do you keep track of visited nodes? A hashtable? (Extra memory and CPU.) A flag in the nodes themselves? (Meaning you can only run one search at a time over the nodes.)
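
To make the heap-versus-sorted-list and visited-tracking points concrete, here is a minimal A* sketch over an abstract graph (the graph type, node ids and heuristic are placeholders invented for this example, not engine code). The open list is a binary heap via std::priority_queue, and the cost/visited bookkeeping lives in hash maps, which costs extra memory but keeps the search re-entrant:

#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

typedef uint32_t NodeId;

struct Edge { NodeId to; float cost; };

// Placeholder graph interface for the sketch.
struct Graph {
    std::unordered_map<NodeId, std::vector<Edge>> edges;
    const std::vector<Edge> &neighbors(NodeId n) const {
        static const std::vector<Edge> none;
        auto it = edges.find(n);
        return it == edges.end() ? none : it->second;
    }
};

// Placeholder heuristic; on a grid or navmesh this would be a distance estimate.
float heuristic(NodeId, NodeId) { return 0.0f; }

std::vector<NodeId> a_star(const Graph &g, NodeId start, NodeId goal)
{
    struct Open {
        float f;
        NodeId node;
        bool operator<(const Open &o) const { return f > o.f; }  // min-heap via inverted compare
    };
    std::priority_queue<Open> open;                 // binary heap, not a sorted list
    std::unordered_map<NodeId, float> best_cost;    // g-cost of the best known path to each node
    std::unordered_map<NodeId, NodeId> came_from;   // for path reconstruction

    open.push({heuristic(start, goal), start});
    best_cost[start] = 0.0f;

    while (!open.empty()) {
        NodeId current = open.top().node;
        open.pop();
        if (current == goal) {
            std::vector<NodeId> path(1, goal);
            while (current != start) {
                current = came_from[current];
                path.push_back(current);
            }
            return std::vector<NodeId>(path.rbegin(), path.rend());
        }
        for (const Edge &e : g.neighbors(current)) {
            float cost = best_cost[current] + e.cost;
            auto it = best_cost.find(e.to);
            if (it == best_cost.end() || cost < it->second) {
                best_cost[e.to] = cost;
                came_from[e.to] = current;
                open.push({cost + heuristic(e.to, goal), e.to});
            }
        }
    }
    return std::vector<NodeId>();                   // no path found
}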

How is the graph created? Hand edited in the editor? How much work is it to redo it every time the level changes? Automatic? What do you do when the automatic generation fails? Can you modify it? Will these modifications stick if the level is changed and the graph is regenerated? Can you give certain nodes special properties when the graph is auto-generated?

How do you handle dynamic worlds where paths can be blocked and new paths can be opened? Can you update the graph dynamically in an efficient way? What happens to running queries? How do past queries get invalidated when their path is no longer valid?

Once you have a path, how do you do local navigation? I.e. how do you follow the path smoothly without colliding with other game agents or dynamic objects? Do AI units collide against the graph or against real physics? What happens if the two don't match up? How do you match animations to the path following movement?

In my opinion, local navigation is a lot more important to the impression a game AI makes than path finding. Nobody will care that much if an AI doesn't follow the 100 % best path towards the target. To err is human. But everybody will notice if the AI gets stuck running against a wall because its local navigation system failed.

In the next blog post I will talk a bit about how local navigation is implemented in the BitSquid engine.

Monday, October 18, 2010

Time Step Smoothing

Today I'm going to argue for smoothing your update delta time, i.e. to apply a simple low pass filter to it before using it to update the game state:

float dt = elapsed_time();
dt = filter(dt);
update_game(dt);

This assumes of course that you are using a variable time step. A fixed time step is always constant and doesn't need smoothing.

Some people will say that you should always use a fixed time step, and there is a lot of merit to that. A stable frame rate always looks better and with a fixed time step the gameplay code behaves more deterministically and is easier to test. You can avoid a whole class of bugs that "only appear when the frame rate drops below 20 Hz".

But there are situations where using a fixed frame rate might not be possible. If your game is dynamic and player driven, the player can always create situations where the frame rate drops below 30 fps (for example, by gathering all the physics objects in the same place or aggroing all the enemies on the level). If you are using a fixed time step, this will force the game into slow motion, which may not be acceptable.

On the PC you have to deal with a wide range of hardware. From the über-gamer that wants her million dollar rig to run the game in 250 fps (even though the monitor only refreshes at 90 Hz) to the poor min spec user who hopes that he can at least get the game to boot and chug along at 15 fps. It is hard to satisfy both customers without using a variable time step.

The BitSquid engine does not dictate a time step policy, that is entirely up to the game. You can use fixed, variable, smoothed or whatever else fits your game.

So, why am I arguing for smoothing variable time steps? A basic update loop without time smoothing looks something like this:

float dt = elapsed_time();
update_game(dt);

Each frame, we measure the time that has elapsed and use that to advance the game timer.

This seems straightforward enough. Every frame we advance the game clock by the time that has elapsed since the last frame. Even if the time step oscillates you would expect this to result in objects moving in a smooth and natural manner:



Here the dots represents the times (x-axis) at which we sample an object's position (y-axis). As you can see the samples result in a straight line for objects moving at a constant velocity, even though the time steps vary. All seems well.

Except this graph doesn't tell the entire truth. The time shown in this graph is the time when we start simulating a frame, not when that frame is drawn on screen. For each frame we simulate, there is an amount of latency from the point when we start simulating the frame until it is drawn. In fact, the variation in that latency is what accounts for the varying elapsed times that we measure. (Well not entirely, there can be GPU latency that doesn't show up in the delta time, but for the sake of simplicity, let's assume that the latency of the current frame is the delta time we measure in the next frame.)

If we take the latency into account and offset each time value in the graph above with the latency (the elapsed time of the next frame), we get a truer representation of what the player actually sees in the game:



Oops, the line is no longer straight and there is a good chance that the user will notice this as jerky motion.

If we could predict perfectly what the latency of the next frame would be, we could use that rather than the elapsed time of the last frame to advance our timers. This would compensate perfectly for the latency and the user would again see a straight line.

Unfortunately, perfectly predicting how long the next frame will take to simulate and render is impossible. But we can make a guess and that is one of the goals of our time step filter:
  • To make a better guess of how long it will take to update and draw the next frame.
When we are using the standard update loop, we are guessing that the next frame will take as long to draw as the current frame. This is a decent guess, but we can do better. For example, if we calculate the mean delta time of the last few frames and use that instead, we can reduce the average error by about 30 %. (Says my quick and dirty guesstimate calculation. If you want a real figure you have to presume a delta time probability distribution and do the math. Because I'm too lazy to.)

Here is the same graph again, but here we have smoothed the time step that we use to evaluate the object's position.



As you can see, we get a straighter line than we got by using the "raw" variable time steps.

Can we do anything to improve the prediction even further? Yes, if we know more about the data we are predicting. For example, on the PC, you can get occasional "glitch" frames with very long update times if you get interrupted by other processes. If you are using a straight mean, any time you encounter such a glitch frame you will predict that the next few frames will also take very long to update. But the glitch frame was an anomaly. You will get a better predictor by ignoring such outliers when you calculate the mean.

Another pattern in the frame rate that is quite common is a zig-zag pattern where the game oscillates between two different frame rates:



You could make a predictor for this. The predictor could detect if there is a strong correlation between the delta times for even and odd frames and if so, it could use only the even/odd frames to calculate the mean.

But I don't recommend doing that. For two reasons. First, the zig-zag pattern is usually a result of a bad setup in the engine (or, if you are unfortunate, in the driver). It is better to fix that problem and get rid of the oscillations than to work around them. Second, heavily oscillating time steps, which you would get if you tried to follow the zig-zag pattern, tend to make gameplay code behave badly.

So this gives us a second goal of time step filtering:
  • Keep the time step reasonably stable from frame to frame.
Why do I say that oscillating time steps make gameplay code behave badly? Certainly, at any good studio, the gameplay code is written to be time step independent. Whenever an object is moved, the length of the time step is taken into account, so that it moves with the same speed regardless of the frame rate:

s = s + v*dt

Yes, it is easy to make the update code for a single object frame rate independent in this manner. But once you start to have complex interactions between multiple objects, it gets a lot more difficult.

An example: Suppose you want a camera to follow behind a moving object, but you don't want it to be completely stiff, you want some smoothness to it, like maybe lerping it to the right position, or using a dampened spring. Now try to write that in a completely time step independent manner. I.e., an oscillation in the time step should not cause the camera to oscillate with respect to the object it is following.

Not so easy.
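
For the simple lerp-to-target case, one common approach (shown here as an illustration, not necessarily how our gameplay code does it) is to phrase the smoothing as exponential decay, which composes the same way no matter how the elapsed time is chopped into frames. A proper damped spring is considerably harder to get right.

#include <cmath>

// A plain per-frame lerp like
//     camera = lerp(camera, target, 0.1f);
// is frame rate dependent: its effective stiffness changes with dt. Rewritten
// as exponential decay, the result depends only on the total elapsed time,
// no matter how that time is divided into frames.
float damp(float current, float target, float halflife, float dt)
{
    // After `halflife` seconds, half of the remaining distance is closed.
    float t = 1.0f - std::pow(0.5f, dt / halflife);
    return current + (target - current) * t;
}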

Game play code faces many similar issues. With 20+ game play programmers banging away at the code base you can be quite certain that there are several places where oscillations in the time step lead to bad-looking oscillating visuals in the game. And the best way to prevent that, in my opinion, is to smooth the time step so that it doesn't oscillate.

In summary, this is our default method for time step smoothing in the BitSquid engine (though as I said above, you are of course free to use whatever time step you want):
  • Keep a history of the time step for the last 11 frames.
  • Throw away the outliers, the two highest and the two lowest values.
  • Calculate the mean of the remaining 7 values.
  • Lerp from the time step for the last frame to the calculated mean (adding more smoothness). A sketch of this filter is shown below.
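
A minimal sketch of that filter, assuming the history length and outlier counts listed above (the lerp factor is made up and the actual engine code may well differ):

#include <algorithm>
#include <cstddef>

class TimeStepFilter
{
public:
    TimeStepFilter() : _count(0), _next(0), _last_dt(1.0f / 60.0f) {}

    float filter(float dt)
    {
        // Keep a history of the last HISTORY raw time steps.
        _history[_next] = dt;
        _next = (_next + 1) % HISTORY;
        if (_count < HISTORY)
            ++_count;

        // Sort a copy and throw away the OUTLIERS highest and lowest values.
        float sorted[HISTORY];
        std::copy(_history, _history + _count, sorted);
        std::sort(sorted, sorted + _count);
        std::size_t begin = _count > 2 * OUTLIERS ? OUTLIERS : 0;
        std::size_t end = _count > 2 * OUTLIERS ? _count - OUTLIERS : _count;

        // Mean of the remaining values.
        float sum = 0.0f;
        for (std::size_t i = begin; i != end; ++i)
            sum += sorted[i];
        float mean = sum / float(end - begin);

        // Lerp from last frame's smoothed time step towards the mean.
        const float LERP = 0.1f;   // smoothing factor, made up for this sketch
        _last_dt += (mean - _last_dt) * LERP;
        return _last_dt;
    }

private:
    static const std::size_t HISTORY = 11;
    static const std::size_t OUTLIERS = 2;
    float _history[HISTORY];
    std::size_t _count, _next;
    float _last_dt;
};
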
Here is an example of the filter in action. The yellow line in this graph shows the raw time step. The red line is the smoothed time step.



One last thing to mention. This calculation can cause your game clock to drift from the world clock. If you need them to be synchronized (for network play for example) you need to keep track of your "time debt" -- i.e., how far your clock has drifted from the world clock and "pay off" that debt over a number of frames by increasing or decreasing your time step, until you are back at a synchronized state.
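
A minimal sketch of that bookkeeping, with an arbitrary pay-off rate:

// Track how far the smoothed game clock has drifted from the wall clock and
// pay that debt off gradually, a small fraction per frame.
float apply_time_debt(float smoothed_dt, float raw_dt, float &time_debt)
{
    time_debt += raw_dt - smoothed_dt;   // positive debt: the game clock is behind
    float payoff = time_debt * 0.1f;     // pay off 10 % of the debt per frame (made-up rate)
    time_debt -= payoff;
    return smoothed_dt + payoff;
}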

Tuesday, October 5, 2010

The Dependency Checker

Maintaining referential integrity between resources in a big game project can be challenging:

  • Someone might accidentally delete an entity that is used somewhere in a level.
  • Someone might change a texture to improve the look of one object, without knowing that that texture is shared by two other objects.
  • A resource may have a misspelled name, but no one dares change it, because it is used in too many places.
  • There may be "dead" resources in the project, that aren't used anywhere, but no one knows how to find them, so that they can be deleted.

To help with these issues, I've created a new tool for the BitSquid tool chain, the Dependency Checker. The Dependency Checker understands all BitSquid file formats and knows how they can refer to other resources. By parsing the source tree it is thus able to create the complete dependency graph of a project.

This isn't as complicated as it sounds, because we don't have that many different file formats, they are all based on SJSON and they use a standardized way of referring to other resources (type, name). The entire code for parsing and understanding all the different file formats is just 500 lines long.

Once we have the dependency graph, we can do lots of interesting things with it. The first is to find all missing and dangling resources:


Missing resources are resources that are referred to somewhere, but that don't actually exist in the source tree. Dangling resources are existing resources that aren't referred to anywhere.

We can click on any resource in the list (or any other resource in the project) to see its dependencies.


But the really interesting thing is the ability to patch dependencies. The Dependency Checker not only knows how to parse dependencies, it also knows how to modify them. (Yes, that is included in the 500 lines of code.) That means that it can replace, move and copy resources.

For example, we can replace the missing font texture with something else.


All files that used the font texture will be patched up to refer to the new resource.


Move is useful when we have given something a bad name and just want to clean up our resources a bit:


Copy can be used to quickly clone resources. But since the dependency checker lets you decide which (if any) references you want to redirect to the new copy, it can also be used as a quick way of splitting resources. For example, if you decide that you want to use a different key entity on the Himalaya level, you can make a copy of the key entity for just that level.

Friday, October 1, 2010

Static Hash Values

We use 32-bit string hashes instead of strings in many places to save memory and improve performance. (When there is a risk for collision we use 64-bit hashes instead.)

At a number of places in the code we want to check these hashes against predefined values. For example, we may want to check if a certain object is the "root_point". With a straightforward implementation, you get code that looks like this:
const char *root_point_str = "root_point";
static unsigned root_point_id = murmur_hash(root_point_str, 
    strlen(root_point_str), 0);
if (object.name() == root_point_id)
    ...
We use a static variable to avoid having to hash the string more than once, but this is still pretty inefficient. There is the extra data in the application, the computation of the hash the first time the function is run and, on subsequent invocations, the check to see if the static variable has been initialized.

It would be a lot more efficient if we could precompute the hashes somehow to avoid that cost in the runtime. I can see three ways:
  • We could run a code generation pass in a pre-build step that generates the hash values and patches the code with them.
  • We could use the preprocessor to generate the values.
  • We could compute the values offline and hard-code them in the code.
I'm not too fond of code generation. It is nice in theory, but to me it always seems kind of messy the way it interacts with the build system, the debugger, etc.

Rewriting the murmur hash algorithm in the preprocessor requires me to bring out some serious preprocessor-fu. But it is fun. It is almost like functional programming. With a set of HASH_STR_N macros in place, we can now write:
if (object.name() == HASH_STR_10('r','o','o','t','_','p','o','i','n','t'))
    ...
Having completed this task I feel a bit empty. That is certainly a lot of macro code for an end result that still is kind of meh.

I disregarded hard coding the values to begin with because no one wants to look at code like this:
if (object.name() == 0x5e43bd96)
    ...
Even dressed up in comments, it is still kind of scary:
unsigned root_point_id = 0x5e43bd96; // hash of "root_point"
if (object.name() == root_point_id)
    ...
What if someone types in the wrong value? What if we decide to change the hash algorithm at some later point? Scary. But maybe we can ameliorate those fears:
#ifdef _DEBUG
    inline unsigned static_hash(const char *s, unsigned value) {
        assert( murmur_hash(s, strlen(s), 0) == value );
        return value;
    }
#else
    #define static_hash(s,v) (v)
#endif

...

if (object.name() == static_hash("root_point", 0x5e43bd96))
    ...
That looks better and is completely safe. If something goes wrong, the assert will trigger in the debug builds.

I think I like this better than the preprocessor solution. It will make the debug builds run a bit slower, but that's what debug builds are for, right?

Thursday, September 30, 2010

BitBucket for BitSquid

I have made some improvements to the Json Merge tool. It can now display diffs and merges visually:


To make the distribution of our public tools easier I've uploaded them as bitbucket repositories. You can find them at:


Currently there are three projects available: our Json Merger, our Distance Field Font Generator and our Motion Builder Exporter.

Tuesday, September 21, 2010

Custom Memory Allocation in C++

For console development, memory is a very precious resource. You want good locality of reference and as little fragmentation as possible. You also want to be able to track the amount of memory used by different subsystems and eliminate memory leaks. To do that, you want to write your own custom memory allocators. But the standard ways of doing that in C++ leave a lot to be desired.

You can override global new and replace it with something else. This way you can get some basic memory tracking, but you still have to use the same allocation strategy for all allocations, which is far from ideal. Some systems work better with memory pools. Some can use simple frame allocation (i.e., pointer bump allocation).  You really want each system to be able to have its own custom allocators.

The other option in C++ is to override new on a per-class basis. This has always seemed kind of strange to me. Pretty much the only thing you can use it for is object pools. Global, per-class object pools. If you want one pool per thread, or one pool per streaming chunk -- you run into problems.

Then you have the STL solution, where containers are templated on their allocator, so containers that use different allocators have different types. It also has fun things such as rebind(). But the weirdest thing is that all instances of the allocator class must be equivalent. So you must put all your data in static variables. And if you want to create two separate memory pools you have to have two different allocator classes.

I must admit that every time I run into something in STL that seems completely bonkers I secretly suspect that I have missed something. Because obviously STL has been created by some really clever people who have thought long and hard about these things. But I just don't understand the idea behind the design of the custom allocator interface at all. Can anyone explain it to me? Does anyone use it? Find it practical? Sane?

If it weren't for the allocator interface I could almost use STL. Almost. There is also the pretty inefficient map implementation. And the fact that deque is not a simple ring buffer, but some horrible beast. And that many containers allocate memory even if they are empty... So my own version of everything it is. Boring, but what's a poor gal gonna do?

Back to allocators. In conclusion, all the standard C++ ways of implementing custom allocators are (to me) strange and strangely useless. So what do I do instead? I use an abstract allocator interface and implement it with a bunch of concrete classes that allocate  memory in different ways:


class Allocator
{
public:
    virtual void *allocate(size_t size, size_t align) = 0;
    virtual void deallocate(void *p) = 0;
    virtual size_t allocated_size(void *p) = 0;
};


I think this is about as sane as an allocator API can get. One possible point of contention is the allocated_size() method. Some allocators (e.g., the frame allocator) do not automatically know the sizes of their individual allocations, and would have to use extra memory to store them. However, being able to answer questions about allocation sizes is very useful for memory tracking, so I require all allocators to provide that information, even if it means that a frame allocator will have to use a little extra memory to store it.

I use an abstract interface with virtual functions, because I don't want to template my classes on the allocator type. I like my allocators to be actual objects that I can create more than one of, thank you very much. Memory allocation is expensive anyway, so I don't care about the cost of a virtual function call.

In the BitSquid engine, you can only allocate memory through an Allocator object. If you call malloc or new the engine will assert(false).

Also, in the BitSquid engine all allocators keep track of the total number of allocations they have made, and the total size of those allocations. The numbers are decreased on deallocate(). In the allocator destructor we assert(_size == 0 && _allocations == 0) and when we shut down the application we tear down all allocators properly. So we know that we don't have any memory leaks in the engine. At least not along any code path that has ever been run.

Since everything must be allocated through an Allocator, all our collection classes (and a bunch of other low-level classes) take an Allocator & in the constructor and use that for all their allocations. Higher level classes either create their own allocator or use one of the globals, such as memory_globals::default_allocator().

With this interface set, we can implement a number of different allocators. A HeapAllocator that allocates from a heap. A PoolAllocator that uses an object pool. A FrameAllocator that pointer bumps. A PageAllocator that allocates raw virtual memory. And so on.

Most of the allocators are set up to use a backing allocator to allocate large chunks of memory which they then chop up into smaller pieces. The backing allocator is also an Allocator. So a pool allocator could use either the heap or the virtual memory to back up its allocations.

We use proxy allocators for memory tracking. For example, the sound system uses:


ProxyAllocator("sound", memory_globals::default_allocator());


which forwards all allocations to the default allocator, but keeps track of how much memory has been allocated by the sound system, so that we can display it in nice memory overviews.
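
Such a proxy can be very thin. A sketch of what it might look like, simplified and ignoring thread safety (not the actual engine class):

// Forwards all calls to a backing allocator, but tracks the total number of
// bytes allocated under a given name so it can be shown in memory overviews.
class ProxyAllocator : public Allocator
{
public:
    ProxyAllocator(const char *name, Allocator &backing)
        : _name(name), _backing(backing), _total(0) {}

    virtual void *allocate(size_t size, size_t align)
    {
        void *p = _backing.allocate(size, align);
        _total += _backing.allocated_size(p);
        return p;
    }

    virtual void deallocate(void *p)
    {
        if (!p)
            return;
        _total -= _backing.allocated_size(p);
        _backing.deallocate(p);
    }

    virtual size_t allocated_size(void *p) {return _backing.allocated_size(p);}

    const char *name() const {return _name;}
    size_t total_allocated() const {return _total;}

private:
    const char *_name;
    Allocator &_backing;
    size_t _total;
};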

If we have a hairy memory leak in some system, we can add a TraceAllocator, another proxy allocator which records a stack trace for each allocation. Though, truth be told, we haven't actually had to use that much. Since our assert triggers as soon as a memory leak is introduced, and the ProxyAllocator tells us in which subsystem the leak occurred, we usually find them quickly.

To create and destroy objects using our allocators, we have to use placement new and friends:


void *memory = allocator.allocate( sizeof(MyClass), alignof(MyClass) );
MyClass *m = new (memory) MyClass(10);

if (m) {
    m->~MyClass();
    allocator.deallocate(m);
}


My eyes! The pain! You certainly don't want to type or read that a lot. Thanks C++ for making my code so pretty. I've tried to make it less hurtful with some template functions in the allocator class:


class Allocator
{
public:
    template <class T, class P1> T *make_new(const P1 &p1) {return new (allocate(sizeof(T), alignof(T))) T(p1);}

    template <class T> void make_delete(T *p) {
        if (p) {
            p->~T();
            deallocate(p);
        }
    }

    // ... more make_new overloads for other argument counts ...
};


Add a bunch of other templates for constructors that take a different number of arguments that can be const or non-const and now you can at least write:


MyClass *m = allocator.make_new<MyClass>(10);

allocator.make_delete(m);


That's not too bad.

One last interesting thing to talk about. Since we use the allocators to assert on memory leaks, we really want to make sure that we set them up and tear them down in a correct, deterministic order. Since we are not allowed to allocate anything without using allocators, this raises an interesting chicken-and-egg problem: who allocates the allocators? How does the first allocator get allocated?

The first allocator could be static, but I want deterministic creation and destruction. I don't want the allocator to be destroyed by some random _exit() callback god knows when.

The solution -- use a chunk of raw memory and new the first allocator into that:


char _buffer[BUFFER_SIZE];

HeapAllocator *_static_heap = 0;
PageAllocator *_page_allocator = 0;
HeapAllocator *_heap_allocator = 0;

void init()
{
    _static_heap = new (_buffer)
        HeapAllocator(NULL, _buffer + sizeof(HeapAllocator), BUFFER_SIZE - sizeof(HeapAllocator));
           
    _page_allocator = _static_heap->make_new<PageAllocator>("page_allocator");
    _heap_allocator = _static_heap->make_new<HeapAllocator>("heap_allocator", *_page_allocator);
    ...
}

void shutdown()
{
    ...
    _static_heap->make_delete(_heap_allocator);
    _heap_allocator = 0;
   
    _static_heap->make_delete(_page_allocator);
    _page_allocator = 0;
   
    _static_heap->~HeapAllocator();
    _static_heap = 0;
}


Note how this works. _buffer is initialized statically, but since that doesn't call any constructors or destructors, we are fine with that. Then we placement new a HeapAllocator at the start of that buffer. That heap allocator is a static heap allocator that uses a predefined memory block to create its heap in. And the memory block that it uses is the rest of the _buffer -- whatever remains after _static_heap has been placed in the beginning.

Now we have our bootstrap allocator, and we can go on creating all the other allocators, using the bootstrap allocator to create them.

Friday, September 17, 2010

Visual Scripting the Data-Oriented Way

The BitSquid engine has two separate scripting systems. The first is Lua. The entire engine is exposed to Lua in such a way that you can (and are encouraged to) write an entire game in Lua without touching the C++ code at all.

The second system, which is the focus of this post, is a visual scripting system that allows artists to add behaviors to levels and entities. We call this system Flow.

Lua and Flow have different roles and complement each other. Lua allows gameplay programmers to quickly test designs and iterate over gameplay mechanics. Flow lets artists enrich their objects by adding effects, destruction sequences, interactivity, etc. When the artists can do this themselves, without the help of a programmer, they can iterate much faster and the quality is raised.

Flow uses a pretty standard setup with nodes representing actions and events.  Black links control how events cascade through the graph causing actions to be taken. Blue links represent variables that are fed into the flow nodes:


In this graph, the green nodes represent events and the red node represents an action. The yellow node is an action implemented in Lua. The gameplay programmers can set up such actions on a per-project basis and make them available to the artists.

Each unit (= entity) type can have its own flow graph, specifying that unit's behavior, e.g. an explosion sequence for a barrel. There is also a separate flow graph for the entire level that the level designers can use to set up the level logic.

Since Flow will be used a lot I wanted to implement it as efficiently as possible, both in terms of memory and performance. This means that I don't use a standard object-oriented design where each node is a separate heap-allocated object that inherits from an abstract Node class with a virtual do_action() method. That way lies heap allocation and pointer chasing madness.

Sure, we might use such a design in the editor, where we don't care about performance and where the representation needs to be easy to use, modifiable, stable to version changes, etc. In the runtime, where the graph is static and compiled for a particular platform, we can do a lot better.

(This is why you should keep your editor (source) data format completely separate from your in-game data format. They have very different requirements. One needs to be dynamic, multi-platform and able to handle file format version changes. The other is static, compiled for a specific platform and does not have to care about versioning -- since we can just recompile from source if we change the format. If you don't already -- just use JSON for all your source data, and binary blobs for your in-game data, it's the only sane option.)

The runtime data doesn't even have to be a graph at all. There are at least two other options:

  • We could convert the entire graph into a Lua program, representing the nodes as little snippets or functions of Lua code. This would be a very flexible approach that would be easy to extend. However, I don't like it because it would introduce significant overhead in both CPU and memory usage.
  • We could "unroll" the graph. For each possible external input event we could trace how it flows through the graph and generate a list of triggered actions as a sort of "bytecode". The runtime data would then just be a bytecode snippet for each external event. This has the potential of being very fast but can be complicated by nodes with state that may cause branching or looping. Also, it could potentially use a lot of extra memory if there are many shared code paths.

So I've decided to use a graph. But not a stupid object-oriented graph. A data-oriented graph, where we put the entire graph in a single blob of cohesive memory:


Oh, these blobs, how I love them. They are really not that complicated. I just concatenate all the node data into a single memory chunk. The data for each node begins with a node type identifier that lets me switch on the node type and do the appropriate action for each node. (Yes, I'll take that over your virtual calls any day.) Pointers to nodes are stored as offsets within the blob. To follow a pointer, just add the offset to the blob's start address, cast the pointer to something useful and presto!

Yes, it is ok to cast pointers. You can do it. You don't have to feel bad about it. You know that there is a struct ParticleEffectNode at that address. Just cast the pointer and get it over with.
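
A sketch of what following such an offset-based "pointer" can look like; the node layouts and the enum here are invented for the example, not the actual Flow node formats:

#include <cstdint>

enum NodeType { PARTICLE_EFFECT_NODE = 1 /* , ... other node types ... */ };

// Every node in the blob starts with a small header.
struct NodeHeader {
    uint32_t type;                  // which switch case to take
    uint32_t out_event_offset;      // offset from blob start to the next node, 0 if none
};

struct ParticleEffectNode {
    NodeHeader header;
    uint32_t effect_name_hash;      // hashed resource name
    uint32_t dynamic_data_offset;   // offset into the entity's dynamic data blob
};

// "Following a pointer" is just adding an offset to the blob's base address
// and casting to the node type we know lives there.
inline const NodeHeader *node_at(const char *blob, uint32_t offset)
{
    return reinterpret_cast<const NodeHeader *>(blob + offset);
}

void trigger_node(const char *blob, uint32_t offset)
{
    const NodeHeader *node = node_at(blob, offset);
    switch (node->type) {
    case PARTICLE_EFFECT_NODE: {
        const ParticleEffectNode *pe =
            reinterpret_cast<const ParticleEffectNode *>(node);
        (void)pe;   // a real implementation would spawn the effect here
        break;
    }
    default:
        break;
    }
    if (node->out_event_offset)
        trigger_node(blob, node->out_event_offset);   // cascade to the next node
}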

The many nice things about blobs with offsets deserve to be repeated:

  • We can allocate the entire structure with a single memory allocation. Efficient! And we know exactly how much memory it will use.
  • We can make mem-copies of the blob without having to do any pointer patching, because there are no pointers, only offsets.
  • We can even DMA these copies over to a SPU. Everything is still valid.
  • We can write the data to disk and read it back with a single call. No pointer patching or other fixups needed.

In fact, doesn't this whole blob thing look suspiciously like a file format? Yes! That is exactly what it is. A file format for memory. And it shouldn't be surprising. The speed difference between RAM and CPU means that memory is the new disk!

If you are unsure about how to do data-oriented design, thinking "file formats for memory" is not a bad place to start.


Note that there is a separate blob for dynamic data. It is used for the nodes' state data, such as which actor collided with us for a collision event (the blue data fields in the flow graph). We build that blob in the same way. When we compile the graph, any node that needs dynamic data reserves a little space in the dynamic blob and stores the offset to that space from the start of the dynamic block.

When we clone a new entity, we allocate a new dynamic data block and memcopy in the template data from the dynamic data block that is stored in the compiled file.

Sharing of dynamic data (the blue links in the graph) is implemented by just letting nodes point to the same offset in the dynamic data blob.

Since we are discussing performance, I should perhaps also say something about multithreading. How do we parallelize the execution of flow graphs?

The short answer: we don't.

The flow graph is a high level system that talks to a lot of other high level systems. A flow graph may trigger an effect, play a sound, start an animation, disable a light, etc., etc. At the same time there isn't really any heavy CPU processing going on in the flow graph itself. In fact it doesn't really do anything other than calling out to other systems. Multithreading this wouldn't gain a lot (since the compute cost is low to begin with) and would add significant costs (since all the external calls would have to be synchronized in some way).

Wednesday, September 8, 2010

Code Share: Motion Builder Exporter

Writing exporters is boring. It mostly involves navigating huge unwieldy and poorly documented APIs.

I'd like to spend as little of my time as possible writing exporters, and I'm guessing most other developers feel the same. So, in that spirit I'm sharing the code of the simple Motion Builder exporter I just wrote:

BitSquid Python Motion Builder Exporter

Feel free to cannibalize the code for your own exporter. Hopefully you can save a little time and spend it doing something more interesting than writing exporters.

Wednesday, August 25, 2010

BitSquid's Dual Mode GUIs

The BitSquid engine uses a dual mode GUI system. That is, the GUI system can be run both in retained and immediate mode. For GUIs with lots of static data, retained mode can be used for increased efficiency. For smaller or more dynamic GUIs it is simpler to work in immediate mode.

The retained mode and the immediate mode use the same API and the same implementation with just a simple flag that controls the mode. To see how that is possible, it is easiest to begin by looking at how our GUIs are rendered.

Despite their simplicity, ordinary 2D GUIs can be quite taxing to a renderer. The reason is that they often contain many small individual objects. You can easily have a GUI with 500 little icons, text strings, radar blips, etc. If you render them as individual objects your batch count will go through the roof. The key to efficient GUI rendering is thus to batch together similar objects into larger buffers to get fewer draw calls.

In the BitSquid engine, the GUI batching works like this: When the main thread wants to render a GUI object it generates three pieces of data and sends them to the renderer.

  • An id that uniquely identifies the object.
  • A batch key consisting of (gui layer, material). Objects in the same layer with the same material can be batched together.
  • The vertex data (positions, normals, vertex colors, uv-coordinates) for the object to be rendered.
The renderer finds an existing batch with a matching batch key and appends the vertex data to the vertex buffer of that batch. If no matching batch exists a new batch is created.
When it is time to render, the renderer just renders all its batches with their corresponding data.

The main thread can modify an object by resending the same id with a new batch key and new vertex data. The renderer will delete the old data from the batch buffers and insert the new data. The main thread can also send an id to the renderer and request the object to be deleted. The renderer will delete the object's vertex data from the batch buffers.
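
A rough sketch of the data involved and how the renderer batches it (all names and layouts below are made up for illustration):

#include <cstdint>
#include <map>
#include <vector>

// What the main thread sends to the renderer for each GUI object.
struct GuiBatchKey {
    uint32_t layer;
    uint32_t material;
    bool operator<(const GuiBatchKey &o) const {
        return layer != o.layer ? layer < o.layer : material < o.material;
    }
};

struct GuiVertex { float x, y; float u, v; uint32_t color; };

struct GuiObjectUpdate {
    uint64_t id;                       // uniquely identifies the object
    GuiBatchKey key;                   // objects with the same key can be batched
    std::vector<GuiVertex> vertices;   // complete vertex data for the object
};

// On the render side: one vertex buffer per batch key; objects are appended to
// the batch that matches their key, or a new batch is created.
struct GuiBatches {
    std::map<GuiBatchKey, std::vector<GuiVertex>> batches;

    void add(const GuiObjectUpdate &update)
    {
        std::vector<GuiVertex> &buffer = batches[update.key];
        buffer.insert(buffer.end(), update.vertices.begin(), update.vertices.end());
        // A real implementation would also remember (id -> range) so the data
        // can be removed or replaced when the object is updated or destroyed.
    }
};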

To higher level systems, the GUI exposes an interface that looks something like this:

create_text(pos, text, font, color) : id
update_text(id, pos, text, font, color)
destroy_text(id)

create_text() creates a new id, generates the vertex data for the text object and sends it to the renderer. update_text() generates new vertex data and sends it to the renderer to replace the old data. destroy_text() tells the renderer to delete the vertex data corresponding to the object.

What is interesting about this API is that there are no separate move(), set_text(), set_font() and set_color() functions. If you want to change the text object, you have to provide all the necessary data to the update_text() function. This means that update_text() has all the data required to generate the object's data from scratch, so we don't have to retain any information about the objects in the main thread. The only data that is retained anywhere are the batch vertex buffers kept by the renderer. In this way we save memory, reduce the number of functions in the API and make the implementation a lot simpler. It also becomes easy to add new object types to the GUI, you just have to write a function that generates the batch key and the vertex data for the object.

You could argue that the API makes things more complicated for the user, since she now has to supply all the parameters to configure the text even if she just wants to change one of them (the color, for instance).  In my experience, that is usually not a problem. Typically, the user already has all the needed data stored somewhere and can just pass it to the update() function. For instance, the text to be displayed might be stored in a player_name variable. Retaining the data in the GUI would just mean that the data would be stored in two different places and add the burden of keeping them synchronized.

With all this in place, it is easy to see how we can support both retained mode and immediate mode in the same implementation.

In retained mode everything works as described above. The user calls create() to create an object, update() to modify it and destroy() to destroy it.

In immediate mode, only the create() function is used and the renderer is set to clear its batches every frame. Thus, any object drawn with create() will be drawn for exactly one frame and then cleared by the renderer.

Wednesday, August 18, 2010

A new data storage model

In my head, I am toying with an idea of a new data storage model that combines the flexibility and simplicity of JSON with the multi-user friendliness of a traditional database.

The basic idea is as follows:
  • A database is a collection of objects.
  • Each object is identified by a GUID.
  • An object consists of a set of key-value pairs.
  • The keys are always strings.
  • A value can be one of:
    • null (this is the same as the key not existing)
    • true/false
    • a number
    • a string
    • a generic data blob (texture, vertex data, etc)
    • a reference to another object (GUID)
    • a set of references to other objects (GUIDs)
  • A special object with GUID 0000-00000000-0000 acts as the root object of the database.
This simple setup has many nice properties.

It is easy to map objects back and forth between this storage representation and an in-memory representation in C++, C#, Lua, etc.

We can perform referential integrity checks on the GUIDs to easily locate "dangling pointers" or "garbage objects".

We can add new fields to objects and still be "backwards compatible" with old code.

It is easy to write batch scripts that operate on the database. For example, we can lookup the key "textures" in the root object to find all texture objects and then loop over them and examine their "height" and "width" to find any non-power-of-two textures.

All modifications to the data can be represented by a small set of operations:

    create(guid)
    destroy(guid)
    change_key(guid, key, value)
    add_to_set(guid, key, object_guid)
    remove_from_set(guid, key, object_guid)

These operations can also be used to represent a diff between two different versions of the database.
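
For example (the GUID and values below are made up), a local change that adds a new 512x512 texture object and registers it in the root object's "textures" set could be recorded as:

    create(7f3a-4b2c9d10-91e2)
    change_key(7f3a-4b2c9d10-91e2, "width", 512)
    change_key(7f3a-4b2c9d10-91e2, "height", 512)
    add_to_set(0000-00000000-0000, "textures", 7f3a-4b2c9d10-91e2)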

A user of the database can have a list of such operations that represents her local changes to the data (for testing purposes, etc). She can then commit all or some of these local changes to the central database. The database can thus be used in both online and offline mode. Versioning and branching systems can be built on this without too much effort.

Merge conflicts are eliminated in this system. The only possible conflict is when two users have changed the same key of the same object to two different values. In that case we resolve the conflict by letting the later change overwrite the value of the earlier one.

Note that this model only supports sets, not arrays. The reason is that array reordering operations are tricky to merge and in most cases the order of objects does not matter. In the few cases where order really does matter, you can use a key in the objects to specify the sort order.

Thursday, June 3, 2010

Avoiding Content Locks and Conflicts -- 3-way Json Merge

Locking content files in a version control system is annoying, doesn't scale well and prevents multiple people from working on different parts of the same level (unless you split the level into many small files which have to be locked individually -- which is even more annoying).

But having content conflicts is no fun either. A level designer wants to work in the level editor, not manage strange content conflicts in barely understandable XML-files. The level designer should never have to mess with WinMerging the engine's file formats.

And conflicts shouldn't be necessary. Most content conflicts are not actual conflicts. It is not that often that two people have moved the exact same object or changed the exact same settings parameter. Rather, the conflicts occur because a line-based merge tool tries to merge hierarchical data (XML or JSON) and messes up the structure.

In those rare cases when there is an actual conflict, the content people don't want to resolve it in WinMerge. If two level designers have moved the same object, we don't really help them address the issue by bringing up a dialog box with a ton of XML mumbo-jumbo. Instead, it is much better to just pick one of the two locations and go ahead with merging the file. Then, the level designers can fix any problems that might have occurred in the level editor -- the right tool for the job.

At BitSquid we use JSON for all our content files (actually, a slightly simplified version of JSON that we call SJSON). So to get rid of our conflict issues, I have written a 3-way merger that understands the structure of JSON files and resolves any remaining actual conflicts by always picking the right-hand branch.

If we disregard arrays for the moment, merging JSON files is quite simple. A diff between two JSON files can be expressed as a list of object[key] = value operations. Deleting a key is represented by changing its value to null. Adding a key is represented by changing a null value to something else. Merging these operations is simple. We only have trouble when the same key in the same object is changed to two different values, but then we just pick one of the values, as explained above.

Arrays are trickier because without context, it is impossible to tell what a change to an array means semantically. If the array [1, 2, 3] is changed to [1, 2, 4], is that a single operation that changed the last value from 3 to 4? Or is it two operations: deleting the 3 from the array and inserting a 4? How we interpret it will affect the result of our 3-way merges. For example, the 3-way merge of [1, 2, 3], [1, 2, 4] and [1, 2, 5] can give either the result [1, 2, 5] or [1, 2, 4, 5].

I have resolved this by adding extra information to the arrays in our source files. Most of our arrays are arrays of objects. For such arrays, I require that the objects have an "id"-field with a GUID that uniquely identifies the object. With such an id in our array [ {x = 1, id = a}, {x = 2, id = b}, {x = 3, id = c} ] it becomes possible to distinguish between updating an existing value  [ {x = 1, id = a}, {x = 2, id = b}, {x = 4, id = c} ] and removing + adding a value [ {x = 1, id = a}, {x = 2, id = b}, {x = 4, id = d} ].

The 3-way merge algorithm I'm using applies some heuristics to guess array transformations even when no id-field is present, but the recommendation is to always add id-fields to array elements to get perfect merges.

You can download my 3-way Json merger here. I just wrote it today, so it hasn't received much testing yet. But it is public domain software, so feel free to fix the bugs and do whatever else you like with it.

Friday, April 23, 2010

Our Tool Architecture

The BitSquid tool architecture is based on two main design principles:
  • Tools should use the "real" engine for visualization.
  • Tools should not be directly linked or otherwise strongly coupled to the engine.
Using the real engine for visualization means that everything will look and behave exactly the same in the tools as it does in-game. It also saves us the work of having to write a completely separate "tool visualizer" as well as the nightmare of trying to keep it in sync with changes to the engine.

By decoupling the tools from the engine we achieve freedom and flexibility, both in the design of the tools and in the design of the engine. The tools can be written in any language (C#, Ruby, Java, Lisp, Lua, Python, C++, etc), using any methodology and design philosophy. The engine can be optimized and the runtime data formats changed without affecting the tools.

What we envision is a Unix-like environment with a plethora of special purpose tools (particle editor, animation editor, level editor, material editor, profiler, lua debugger, etc) rather than a single monolithic Mega-Editor. We want it to be easy for our licensees to supplement our standard tool set with their own in-house tools, custom written to fit the requirements of their particular games. For example, a top-down 2D game may have a custom written tile editor. Another programmer may want to hack together a simple batch script that drops a MIP-step from all vegetation textures.

At first glance, our two design goals may appear conflicting. How can we make our tools use the engine for all visualization without strongly coupling the tools to the engine? Our solution is shown in the image below:


Note that there is no direct linkage between the tool and the engine. The tool only talks to the engine through the network. All messages on the network connection are simple JSON structs, such as:

{
    "type" : "message",
    "level" : "info",
    "system" : "D3DRenderDevice",
    "message" : "Resizing swap chain: 1626 1051"
}

This applies for all tools. When the lua debugger wants to set a breakpoint, it sends a message to the engine with the lua file and line number. When the breakpoint is hit, the engine sends a message back. (So you can easily swap in your own lua debugger integrated with your favorite editor, by simply receiving and sending these messages.) When the engine has gathered a bunch of profiling data, it sends a profiler message. Et cetera.

For visualization, the tool creates a window where it wants the engine to render and sends the window handle to the engine. The engine then creates a swap chain for that window and renders into it.

(In the future we may also add support for a VNC-like mode where we instead let the engine send the content of the frame buffer over the network. This would allow the tools to work directly against consoles, letting the artists see, directly in their editors, how everything will look on the lead platform.)

A tool typically boots the engine in a special mode where it runs a custom lua script designed to collaborate with that particular tool. For example, the particle editor boots the engine with particle_editor_slave.lua which sets up a default scene for viewing particle effects with a camera, skydome, lights, etc. The tool then sends script commands over the network connection that tells the engine what to do, for example to display a particular effect:

{
    type = "script",
    script = "ParticleEditorSlave:test_effect('fx/grenade/explosion')"
}

These commands are handled by the slave script. The slave script can also send messages back if the tool is requesting information.

The slave scripts are usually quite simple. The particle editor slave script is just 120 lines of lua code.

To make the tools independent of the engine data formats we have separated the data into human-readable, extensible and backwards compatible generic data and fast, efficient, platform specific runtime data. The tools always work with the generic data, which is pretty much all in JSON (exceptions are textures and WAVs). Thus, they never need to care about how the engine represents its runtime data and the engine is free to change and optimize the runtime format however it likes.

When the tool has changed some data and wants to see the change in-engine, it launches the data compiler to generate the runtime data. (The data compiler is in fact just the regular Win32 engine started with a -compile flag, so the engine and the data compiler are always in sync. Any change of the runtime formats triggers a recompile.) The data compiler is clever about just compiling the data that has actually changed.

When the compile is done, the tool sends a network message to the engine, telling it to reload the changed data file at which point you will see the changes in-game. All this happens nearly instantaneously allowing very quick tweaking of content and gameplay (by reloading lua files).

This system has worked out really well for us. The decoupling has allowed for fast development of both the tools and the engine. Today we have about ten different tools that use this system and we have been able to make many optimizations to the engine and the runtime formats without affecting the tools or the generic data.

Friday, April 9, 2010

Distance Field Based Rendering of AngelCode Fonts

This morning, we added support for distance field based font rendering to the BitSquid engine (from Valve's paper http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf). An example is shown below:


The top row shows the original font, below that is the original font rendered with alpha test. The third row is a distance field representation of the font and the final row shows the distance field representation rendered with alpha test. Note that the distance field version gives better quality in the diagonal lines.

(Note: The last row looks thicker than the second row, because it was generated from a large font size and scaled down, while the second row was generated from a small font size. Because of true type font hinting at small sizes, the result is different. The last row gives a truer representation of the "actual" thickness of the font.)

A quick Google search didn't show any good tools for generating distance field font maps, so I decided to write my own. We use the excellent AngelCode Bitmap Font Generator (http://www.angelcode.com/products/bmfont/) to generate our font maps, so I decided to make a tool that works with the files generated by AngelCode:


The tool takes a high resolution AngelCode .fnt file as input. It scales it down by the specified scale factor and converts it to a distance field. The spread specifies how many pixels the distance field should extend outside the character outline before it clamps to zero. (It is useful if you want to add things such as glow effects to the font rendering.) After the conversion, the tool outputs new scaled down .tga images of the fonts and a new .fnt file with all measurements converted to work with the scaled down textures.

So to use it, you first generate a font bitmap and .fnt file using AngelCode at 8 x the font size and 8 x the texture size you want in the final image. (Make sure to add 8 x spread pixels of padding around the characters or else the distance fields will bleed into each other.) Then you run the tool to convert it to a distance field texture.

The tool is a bit limited -- it only works with monochrome uncompressed .tga files. It only reads and writes the XML version of the AngelCode font format. The distance field generation isn't particularly clever or fast. But I thought I should share it anyway since I couldn't find any other tools for generating distance field based font maps. Modifying it to support more formats shouldn't be much work.
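
For reference, a naive way to compute such a distance field from a monochrome bitmap could look like the sketch below (brute force, without the downsampling step, and not the code the tool actually uses):

#include <algorithm>
#include <cmath>
#include <vector>

// For each output pixel, search a (2*spread+1)^2 neighborhood for the nearest
// pixel of the opposite color and map the signed distance to [0, 255], with
// 128 on the character outline.
std::vector<unsigned char> distance_field(const std::vector<bool> &inside,
    int w, int h, int spread)
{
    std::vector<unsigned char> out(w * h);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            bool in = inside[y * w + x];
            float best = float(spread);
            for (int dy = -spread; dy <= spread; ++dy) {
                for (int dx = -spread; dx <= spread; ++dx) {
                    int sx = x + dx, sy = y + dy;
                    if (sx < 0 || sy < 0 || sx >= w || sy >= h)
                        continue;
                    if (inside[sy * w + sx] != in)
                        best = std::min(best, std::sqrt(float(dx * dx + dy * dy)));
                }
            }
            float signed_dist = in ? best : -best;   // positive inside the glyph
            float v = 128.0f + signed_dist * 128.0f / float(spread);
            out[y * w + x] = (unsigned char)std::max(0.0f, std::min(255.0f, v));
        }
    }
    return out;
}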

Grab a binary version here:


Or the C# project files here:


Feel free to do whatever you want with it!

Thursday, March 25, 2010

Task Management -- A Practical Example

I've spent the last couple of days rewriting the task manager in the BitSquid engine. Task management is an important topic in our glorious multicore future, but it is hard to find good practical information about it. GDC was also a bit of a disappointment in this regard. So I thought I should share some of my thoughts and experiences.

The previous iteration of our task scheduler was based on Vista ThreadPools and mainly supported data parallelism. (Though we still had a degree of task parallelism from running two main threads -- an update thread and a render thread -- which both posted batches of jobs to the task manager.)

For the rewrite, I had a number of goals:
  • Move away from Vista Thread Pools. We want complete control over our job threads.
  • Minimize context switching. This is a guessing game on Windows, since the OS will do what the OS will do, but minimizing oversubscription of threads should help.
  • Make a system that can run completely task based. I.e., everything in the system is run as a task and there are no explicit wait() calls. Instead the entire code flow is controlled by task dependencies. Such a design allows us to exploit all possibilities for parallelism in the code which leads to maximum core utilization.
  • Still be "backwards compatible" with a system that uses one or more "main threads" that wait() for data parallel jobs to complete, so that we can move incrementally to a more and more task based code flow.
  • Support tasks that run on external processors, such as SPUs or GPUs.
  • Support hierarchical decomposition of tasks.
By hierarchical decomposition I mean that it should be possible to analyze the system in terms of tasks and subtasks. So that, at a higher level, we can regard the animation system as a single task that runs in parallel to other system tasks:

But then we can zoom in on the animation task and see that it is in fact composed of a number of subtasks which in turn parallelize:


Hierarchical decomposition makes it possible to analyze systems and subsystems at different levels of abstraction rather than having to keep the entire task dependency graph in our heads. This is good because my head just isn't big enough.

A task in the new implementation is a simple data structure:
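
A sketch of what that structure might look like, based on the fields described below (the placeholder types are mine, not the engine's):

// Placeholder types for the sketch.
typedef unsigned TaskId;
typedef unsigned ThreadAffinity;
struct WorkItem { void (*function)(void *data); void *data; };

struct Task
{
    WorkItem work;            // work to perform on a CPU, SPU or GPU
    ThreadAffinity affinity;  // set for work that must run on a particular thread
    TaskId parent;            // parent task, if this is a subtask
    TaskId dependency;        // single task that must complete before this one can start
    int priority;             // higher priority work is picked first
    int open_work_items;      // own work item + number of uncompleted children
};
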
Here work is a work item to be performed on an SPU, CPU or GPU. affinity can be set for items that must be performed on particular threads.

parent specifies child/parent relationships between tasks. A task can have any number of children/subtasks. A task is considered completed when its work has been executed and all its children have completed. In practice this is implemented by the open_work_items counter. The counter is initially set to the number of child tasks + 1 (for the task's own work item). When a task completes, it reduces the open_work_items count of its parent and when that figure reaches zero, the parent task is completed.

I do not explicitly track completed tasks. Instead I keep a list of all open (i.e. not completed) tasks. Any task that is not in the open list is considered completed. Note that the open list is separate from the queue of work items that need to be performed. Items are removed from the queue when they are scheduled to a worker thread and removed from the open list when they have completed.

The dependency field specifies a task that the task depends on. The task is not allowed to start until its dependency task has completed. Note that a task can only have a single dependency. The reason for this is that I wanted the task structure to be a simple POD type and not include any arrays or other external memory references.

Having a single dependency is not a limitation, because if we want to depend on more than one task we can just introduce an anonymous task with no work item that has all the tasks we want to depend on as children. That task will complete when all its children have completed, so depending on that task gives us the wanted dependencies.

The priority field specifies the importance of the task. When several tasks are available, we will pick the one with the highest priority. I will discuss this a bit more in a minute.

The Task Manager has a number of threads for processing tasks. Some of these are "main threads" that are created by other parts of the system and registered with the thread manager (in our case, an update thread and a render thread). The rest are worker threads created internally by the task manager. The number of worker threads is:

worker_thread_count = number_of_cores - main_thread_count

The total number of threads managed by the task manager thus equals the number of cores in the system, so we have no over- or undersubscription.

The worker threads are in a constant loop where they check the task manager for work items to perform. If a work item is available, they perform it and then notify the task manager of its completion. If no work items are available, they sleep and are woken by the task manager when new work items become available.

The main threads run their normal serial code path. As part of that code path, they can create tasks and subtasks that get queued with the task manager. They can also wait() for tasks to complete. When a thread waits for a task it doesn't go idle. Instead it loops and helps the task manager with completing tasks. Only when there are no more tasks in the queue does the thread sleep. It wakes up again when there are more tasks to perform or when the task it originally waited for has completed.

The main threads can also process tasks while waiting for other events by calling a special function in the task manager do_work_while_waiting_for(Event &). For example, the update thread calls this to wait for the frame synchronization event from the render thread.
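
In rough pseudo-C++ (is_open(), pop_work_item() and the other calls are hypothetical names, not the actual task manager interface), the waiting loop amounts to:

// A main thread that waits for a task keeps processing queued work items
// instead of blocking, and only sleeps when the queue is empty.
void TaskManager::wait(TaskId id)
{
    while (is_open(id)) {               // the task or one of its children is not done
        WorkItem item;
        if (pop_work_item(item)) {      // grab any queued work item
            execute(item);
            notify_completion(item);    // may close the item's task and its parents
        } else {
            sleep_until_work_or_completion(id);
        }
    }
}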

So, as long as there are jobs to perform and the threads don't get preempted by the OS, all task manager threads are either running their serial code paths or processing jobs. With lots of jobs and few sync points we thus achieve 100 % core utilization.

This approach also allows us to freely mix serial code with a completely task based approach. We can start out with a serial main loop (with data parallelization in the update() functions):


void World::update()
{
  _animation->update();
  _scene_graph->update();
  _gui->update();
  render();
  _sound->update();
}


And gradually convert it to fully braided parallelism (this code corresponds to the task graph shown above):


void World::update()
{
  TaskId animation = _tasks->add( animation_task(_animation) );
  TaskId scene_graph = _tasks->add( scene_graph_task(_scene_graph) );
  _tasks->depends_on(scene_graph, animation);
  TaskId gui = _tasks->add( gui_task(_gui) );
  
  TaskId gui_scene = _tasks->add_empty();
  _tasks->add_child(gui_scene, scene_graph);
  _tasks->add_child(gui_scene, gui);
  
  TaskId render = _tasks->add( render_task(this) );
  _tasks->depends_on(render, gui_scene);
  
  TaskId sound = _tasks->add( sound_update_task(_sound) );
  
  TaskId done = _tasks->add_empty();
  _tasks->add_child(done, render);
  _tasks->add_child(done, sound);
  
  _tasks->wait(done);
}


Note that tasks, subtasks and dependencies are created dynamically as part of the execution of serial code or other tasks. I believe this "immediate mode" approach is more flexible and easier to work with than some sort of "retained" or "static" task graph building.

A screenshot from our profiler shows this in action for a scene with 1000 animated characters with state machines:

Notice how the main and render threads help with processing tasks while they are waiting for tasks to be completed.

Once we have a task graph we want to make sure that our scheduler runs it as fast as possible. Theoretically, we would do this by finding the critical path of the graph and making sure that tasks along the critical path are prioritized over other tasks. It's the classical task scheduling problem.

In a game, the critical path can vary a lot over different scenes. Some scenes are render bound, others are CPU bound. Of the CPU bound scenes, some may be bounded by script, others by animation, etc.

To achieve maximum performance in all situations we would have to dynamically determine the critical path and prioritize the tasks accordingly. This is certainly feasible, but I am a bit wary of dynamically reconfiguring the priorities in this way, because it makes the engine harder to profile, debug and reason about. Instead I have chosen a simpler solution for now. Each job is given a priority and the highest priority jobs are performed first. The priorities are not fixed by the engine but configured per-game to match its typical performance loads.

This seems like a reasonable first approach. When we have more actual game performance data it would be interesting to compare this with the performance of a completely dynamic scheduler.

In the current implementation, all tasks are posted to and fetched from a global task queue. There are no per thread task queues and thus no task stealing. At our current level of task granularity (heavy jobs are split into a maximum of 5 * thread_count tasks) the global task queue should not be a bottleneck. And a finer task granularity won't improve core utilization. When we start to have >32 cores the impact of the global queue may start to become significant, but until then I'd rather keep the system as simple as possible.

OS context switching still hits us occasionally in this system. For example one of the animation blending tasks in the profiler screenshot takes longer than it should:

I have an idea for minimizing the impact of such context switches that I may try out in the future. If a task is purely functional (idempotent) then it doesn't matter how many times we run the task. So if we detect a situation where a large part of the system is waiting for a task on the critical path (that has been switched out by the OS) we can allocate other threads to run the same task. As soon as any of the threads has completed the task we can continue.

I haven't implemented this because it complicates the model by introducing two different completion states for tasks. One where some thread has completed the task (and dependent jobs can run) and another where all threads that took on the task have completed it (and buffers allocated for the task can be freed). Also, context switching is mainly a problem on PC which isn't our most CPU constrained platform anyway.