Friday, February 12, 2010

The Blob and I

Having resource data in a single binary blob has many advantages over keeping it in a collection of scattered objects:
  • Shorter load times. We can just stream the entire blob from disk to memory.
  • Cache friendly. Related objects are at close locations in memory.
  • DMA friendly. An entire blob can easily be transferred to a co-processor.
In past engines I've used placement new and pointer patching to initialize C++ objects from a loaded blob. To save a resource with this system all the objects are allocated after each other in memory, then their pointers are converted to local pointers (offsets from the start of the blob). Finally all the allocated data is written raw to disk.

When loading, first the raw data blob is loaded from disk. Then placement new is used with a special constructor to create the root object at the start of the blob. The constructor takes care of pointer-patching, converting the offsets back to pointers. Let's look at an example:

class A
{
  int _x;
  B *_b;

public:
  A(int x, B *b) : _x(x), _b(b) {}
  A(char* base) {
    _b = (B*)( (char *)_b + (base - (char *)0) );
    new (_b) B(base);
  }
  ...
};
...
A *a = new (blob) A(blob);

Note that the constructor does not initialize _x. a is placement new:ed into an area that already contains an A object with the right value for _x (the saved value). By not initializing _x we make sure that it keeps its saved value. The constructor does three things:
  • Initializes the vtable pointer of a. This is done "behind the scenes" by C++ when we call new. It is necessary for us to be able to use a as an A object, since the vtable pointer of A saved in the file during data compilation will typically not match the vtable pointer of A in the runtime.
  • Pointer patches _b, converting it from an offset from the blob base to its actual memory location.
  • Placement new:s B into place so that B also gets the correct vtable, patched pointers, etc. Of course B's constructor may in turn create other objects.
Like many "clever" C++ constructs this solution gives a smug sense of satisfaction. Imagine that we are able to do this using our knowledge of vtables, placement new, etc. Truly, we are Gods that walk the earth!

Of course it doesn't stay this simple. For the solution to be complete it must also be able to handle base class pointers (call a different new based on the "real" derived class of the object, which must be stored somewhere), arrays and collection classes (we can't use std::vector, etc because they don't fit into our clever little scheme).

Lately, I've really come to dislike these kinds of C++ "framework" solutions that require that every single class in a project conform to a particular world view (implement a special constructor, a special save() function, etc.). It tends to make the code very coupled and rigid. God forbid you ever have to change anything in the serialization system, because now the entire WORLD depends on it. The special little placement constructors creep in everywhere and pollute a lot of classes that don't really want to care about serialization. This makes the entire code base complicated and ugly.

Also, it should be noted that naively "blobbing" a collection of scattered objects by just concatenating them in memory does not necessarily lead to optimal memory access patterns. If the memory access order does not match the serialization order there can still be a lot of jumping around in memory. The serialization order with this kind of solution tends to be depth-first and can be tricky to change. (Since the entire WORLD depends on the serialization system!)

In the BitSquid engine I use a much simpler approach to resource blobs. The BitSquid engine is data-centric rather than class-centric. The data design is done first -- laid out in simple structs, optimized for the typical access patterns and DMA transfers. Then functions are defined that operate on the data. Classes are used to organize higher level systems, not in the low level processing intensive systems or resource definitions. Inheritance is very rarely used. (Virtual function calls are always cache unfriendly since they resolve to different code locations for each object. It is better to keep objects sorted by type and then you don't really need virtual calls.)

I believe this "old-school" C-like approach not only gives better performance, but also in many cases a better design. A looser coupling between data and processing makes it easier to modify things and move them around. And deep, bad inheritance structures are the main source of unnecessary coupling in C++ programs.

Since the resource data is just simple structs, not classes with virtual functions, we can just write it to disk and read it back as we please. We don't need to initialize any vtable pointers, so we don't need to call new on the data.

The problem with pointer patching is solved in the simplest way possible -- I don't use pointers in the resource data. Instead, I just use offsets all the time, both in memory and on disk. For example, the resource data for our particle systems looks something like this (simplified):



Yes, having offsets in the resource data instead of pointers means that I occasionally need to do a pointer add to find the memory location of an object. I'm sure someone will balk at this "unnecessary" computation, but I can't see it having any significant performance impact whatsoever. (If you have to do it a lot, then you are jumping around in memory a lot and then that is the main source of your performance problem.)

The advantage is that since I'm only storing offsets I don't need to do any pointer patching at all. I can move the data around in memory as I like, make copies of it, concatenate it to other blobs to make bigger blobs, save it to disk and read it back with a single operation and no need for pre- or post-processing. There is no complicated "serialization framework". No system in the engine needs to care about how any other system stores or reads its data.
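To make the offset idea concrete, here is a minimal sketch. The struct and function names (ParticleSystemHeader, build_blob, size_at) are hypothetical and this is not the engine's actual particle format; it only shows a header that stores a byte offset instead of a pointer, so the blob can be copied with no patching.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical resource layout: a header followed by an array of floats.
// The header stores a byte offset from the start of the blob, never a pointer.
struct ParticleSystemHeader {
    std::uint32_t num_sizes;     // number of floats in the array
    std::uint32_t sizes_offset;  // byte offset from blob start to the array
};

// Packs the header and the values into one contiguous blob.
std::vector<char> build_blob(const std::vector<float> &values) {
    ParticleSystemHeader h = {
        static_cast<std::uint32_t>(values.size()),
        static_cast<std::uint32_t>(sizeof(ParticleSystemHeader))};
    std::vector<char> blob(sizeof(h) + values.size() * sizeof(float));
    std::memcpy(blob.data(), &h, sizeof(h));
    if (!values.empty())
        std::memcpy(blob.data() + h.sizes_offset, values.data(),
                    values.size() * sizeof(float));
    return blob;
}

// Reads element i: one add on the blob base, no pointer patching needed.
float size_at(const char *blob, std::uint32_t i) {
    ParticleSystemHeader h;
    std::memcpy(&h, blob, sizeof(h));
    assert(i < h.num_sizes);
    float v;
    std::memcpy(&v, blob + h.sizes_offset + i * sizeof(float), sizeof(v));
    return v;
}
```

Because nothing in the blob refers to an absolute address, a plain byte copy of it (to another buffer, or to disk and back) can be read with size_at right away.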

As in many other cases the data-centric approach gives a solution that is simpler, faster, more flexible and more modular.

Tuesday, January 19, 2010

Content Repositories and Databases

I've been toying with the idea of replacing game content repositories (Perforce, Subversion) with something else. After all, nobody really likes content repositories -- they are slow, non-intuitive, give rise to merge problems, etc. Version control systems were primarily designed for code, not for content, and that shows. So what could replace them? One option is to use a central database. There are a number of superficial advantages to that approach:

  • Simpler -- no need to update or check-in.
  • Changes are immediately visible to everyone.
  • No merge issues.
  • Collaborative editing (several designers working on the same level) is possible.
But we would lose all the nice features of version control:
  • Accountability, history tracking and reversion.
  • Branching and tagging.
  • Having local, uncommitted changes in a working copy.
How necessary are those features? I would say that they are essential. But I also have a small nagging doubt that maybe this opinion is just the result of my own prejudices as a programmer. After all, people in many industries do lots of serious collaborative work using databases without branching, reversion or working copies. Still, I'm not ready to take the plunge and give up on version control features. (Though if anyone has tried it, I would certainly like to hear about it.)

Having those features by necessity implies some of the complexities associated with version control. For example, if we want a local working copy we need some explicit check-in/update mechanism. If we don't need a local copy we can just make the editor do svn update, svn commit on each change and the repository will be as "immediate" as a database.

Collaborative editing depends more on how the editor is implemented than on the storage backend. Regardless of whether we are using a database or a repository the editor will at some point have to fetch and display the changes made by other users as well as submit the changes made by the local user. With a repository backend, svn update and svn commit could be used for that purpose.

The only issue then is to avoid merge conflicts as much as possible, since they force the user to interact with the svn update command and ruin the collaborative editing experience. Fortunately, that should be relatively easy. At BitSquid, we store most of our data in JSON-like structures. With a JSON-aware 3-way-merger, conflicts will only arise if the same field in the same JSON-object is changed, which should happen rarely.
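As a rough sketch of the field-level 3-way merge idea, assume each JSON object is flattened to a map from field name to value. The Object alias, merge3 and MergeResult are hypothetical names for illustration; a real merger would recurse into nested objects and arrays.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for a JSON object: field name -> value (both as strings).
using Object = std::map<std::string, std::string>;

struct MergeResult {
    Object merged;
    std::vector<std::string> conflicts;  // fields changed differently on both sides
};

// Returns a pointer to the field's value, or nullptr if the field is absent.
static const std::string *lookup(const Object &o, const std::string &key) {
    Object::const_iterator it = o.find(key);
    return it == o.end() ? nullptr : &it->second;
}

// Two fields are the same if both are absent or both hold equal values.
static bool same(const std::string *a, const std::string *b) {
    if (!a || !b) return a == b;
    return *a == *b;
}

MergeResult merge3(const Object &base, const Object &ours, const Object &theirs) {
    std::set<std::string> keys;
    for (const auto &kv : base) keys.insert(kv.first);
    for (const auto &kv : ours) keys.insert(kv.first);
    for (const auto &kv : theirs) keys.insert(kv.first);

    MergeResult r;
    for (const std::string &key : keys) {
        const std::string *b = lookup(base, key);
        const std::string *o = lookup(ours, key);
        const std::string *t = lookup(theirs, key);
        const std::string *pick;
        if (same(o, t)) pick = o;        // unchanged, or same change on both sides
        else if (same(o, b)) pick = t;   // only their side changed the field
        else if (same(t, b)) pick = o;   // only our side changed the field
        else {                           // both changed it differently
            r.conflicts.push_back(key);
            pick = o;                    // keep ours and report the conflict
        }
        if (pick) r.merged[key] = *pick; // nullptr means the field was deleted
    }
    return r;
}
```

With this rule, two users editing different fields of the same object merge cleanly; only edits to the same field are reported.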

So, no great new way of storing content. Instead I just have to write a 3-way JSON-merger to protect the content people from merge conflicts. And then start working on the collaborative level editor...

Friday, December 11, 2009

Events

An event system can be both useful and dangerous. Useful, because it allows you to create loose couplings between systems in the engine (an animation foot step generates a sound), which makes a more modular design possible and prevents different systems from polluting each other's interfaces.

Dangerous, because the loose coupling can sometimes hide the logical flow of the application and make it harder to understand, by obliterating call stacks and adding confusing layers of indirection. This is especially true the more "features" are added to the event system. For example, a typical nightmare event system could consist of:
  • A global EventDispatcher singleton where everyone can post events, and everyone can listen to events, provided they (multiply) inherit from the EventPublisher and EventSubscriber interface classes.
  • Multiple listeners per event with a priority order and an option for a listener to say that it has fully processed an event and that it shouldn't be sent to the other listeners.
  • An option for posting delayed events, that should be delivered "in the future".
  • The possibility to block all events of a certain type during the processing of an event.
  • Additional horrors...
So much is wrong here: Global objects with too much responsibility that everything needs to tie into. Forcing all classes into a heavy-handed inheritance structure (no I don't want all my objects to inherit EventPublisher, EventDispatcher, Serializable, GameObject, etc). Strange control flow affecting commands providing spooky "action at a distance" (who blocked my event this time?).

Instead, I believe that the key to a successful event system is to make it as simple and straightforward as possible. You really don't need the "advanced" and "powerful" features. Such complex functionality should be implemented in high-level C or script code, where it can be properly examined, debugged, analyzed, etc. Not in a low level event manager.

Note also that callbacks/delegates cannot completely replace events. While an event will probably generate some kind of callback as the final stage of its processing, we also need to be able to represent the event as an encapsulated data object. That is the only way to store it in a list for example. It is also the only way to pass it from one processing thread to another, which is crucial for a multithreaded engine.

So, with this background, let's look at how events are treated in the BitSquid engine. In the BitSquid engine an event is just a struct:

struct CollisionEvent
{
    Actor *actors[2];
    Vector3 where;
};

An event stream is a blob of binary data consisting of concatenated event structs. Each event struct in the blob is preceded by a header that specifies the event type (an integer uniquely identifying the event) and the size of the event struct:

[header 1][event 1][header 2][event 2] ... [header n][event n]

Since the size of each event is included, an event consumer that processes an event stream can simply skip over the events it doesn't understand or isn't interested in.
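A minimal sketch of such a stream follows. The header layout, the type constants and the helper names are assumptions made for illustration, and the example events use integer actor ids rather than Actor pointers so that the snippet is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Assumed header layout: the text only says it holds a type and a size.
struct EventHeader {
    std::uint32_t type;  // integer identifying the event type
    std::uint32_t size;  // size in bytes of the event struct that follows
};

using EventStream = std::vector<char>;

// Hypothetical event types for the example.
enum : std::uint32_t { COLLISION_EVENT = 1, TRIGGER_EVENT = 2 };
struct CollisionEvent { int actors[2]; float where[3]; };
struct TriggerEvent { int actor; };

// Appends [header][event] to the end of the stream.
template <typename T>
void post_event(EventStream &stream, std::uint32_t type, const T &event) {
    static_assert(std::is_trivially_copyable<T>::value, "events must be plain data");
    EventHeader h = {type, static_cast<std::uint32_t>(sizeof(T))};
    const char *hp = reinterpret_cast<const char *>(&h);
    const char *ep = reinterpret_cast<const char *>(&event);
    stream.insert(stream.end(), hp, hp + sizeof(h));
    stream.insert(stream.end(), ep, ep + sizeof(T));
}

// A consumer that only understands collisions; every other event is
// skipped by advancing past it using the size stored in its header.
std::vector<CollisionEvent> collect_collisions(const EventStream &stream) {
    std::vector<CollisionEvent> out;
    std::size_t pos = 0;
    while (pos + sizeof(EventHeader) <= stream.size()) {
        EventHeader h;
        std::memcpy(&h, stream.data() + pos, sizeof(h));
        pos += sizeof(h);
        assert(pos + h.size <= stream.size());
        if (h.type == COLLISION_EVENT && h.size == sizeof(CollisionEvent)) {
            CollisionEvent e;
            std::memcpy(&e, stream.data() + pos, sizeof(e));
            out.push_back(e);
        }
        pos += h.size;  // step to the next header, understood or not
    }
    return out;
}
```

The consumer never needs to know what a TriggerEvent looks like; the size field is enough to step over it.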

There is no global event dispatcher in the engine (globals are bad). Instead each system that can generate events produces its own event stream. So, each frame the physics system (for instance) generates a stream of physics events. A higher level system can extract the event stream and consume the events, taking appropriate actions for each event.

For example, the world manager connects physics events to script callbacks. It consumes the event list from the physics subsystem. For each event, it checks if the involved entity has a script callback mapped for the event type. If it has, the world manager converts the event struct to a Lua table and calls the callback. Otherwise, the event is skipped.

In this way we get the full flexibility and loose coupling of an event system without any of the drawbacks of traditional heavy-weight event systems. The system is completely modular (no global queues or dispatchers) and thread friendly (each thread can produce its own event stream and events can be posted to different threads for processing). It is also very fast, since event streams are just cache-friendly blobs of data that are processed linearly.

Friday, November 20, 2009

The BitSquid low level animation system

In the BitSquid engine we distinguish between the low level and the high level animation system. The low level system has a simple task: given animation data, find the bone poses at a time t. The high level system is responsible for blending animations, state machines, IK, etc.

Evaluation of animation data is a memory intensive task, so maximizing performance means:
  • Touch as little memory as possible (i.e., compress the animations as much as possible)
  • Touch memory in a cache friendly way (i.e., linearly)
In the BitSquid engine we do animation compression by curve fitting and data quantization.

There are a lot of different possible ways to do curve fitting. Since we are curve fitting for compression it doesn't really matter what method we use as long as (a) we can keep the error below a specified threshold, (b) the curve representation is small (good compression rate), (c) the curve is reasonably smooth and (d) it does not take too long to evaluate.

In the BitSquid engine we currently use a hermite spline with implicitly computed derivatives. I.e., we represent the curve with time and data points: (t_1, D_1), (t_2, D_2), ..., (t_n, D_n) and evaluate the curve at the time T in the interval t_i ... t_i+1, with t = (T - t_i) / (t_i+1 - t_i) by




This formulation gives pretty good compression rates, but I haven't investigated all the possible alternatives (there are a lot!). It is possible that you could achieve better rates with some other curve. An advantage of this formulation is that it only uses the original data points of the curve and scaling constants in the range 0-1, which makes it easy to understand the effects of quantization.
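For readers who want something to experiment with, here is a sketch of one common way to evaluate a cubic hermite spline whose derivatives are computed implicitly from the neighbouring data points (finite differences, one-sided at the ends). This is an assumption for illustration and not necessarily the exact formula used in the engine. It handles a single float channel and assumes strictly increasing time stamps, so it does not cover the discontinuity case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Key { float t; float d; };  // one curve point: time and data value

// Assumed tangent rule: slope between the neighbouring keys
// (falls back to a one-sided difference at the first and last key).
static float tangent(const std::vector<Key> &k, std::size_t i) {
    std::size_t a = i > 0 ? i - 1 : i;
    std::size_t b = i + 1 < k.size() ? i + 1 : i;
    return (k[b].d - k[a].d) / (k[b].t - k[a].t);
}

// Evaluates the curve at time T, clamping outside the key range.
float evaluate(const std::vector<Key> &k, float T) {
    assert(k.size() >= 2);
    if (T <= k.front().t) return k.front().d;
    if (T >= k.back().t) return k.back().d;

    std::size_t i = 0;
    while (T > k[i + 1].t) ++i;            // find interval t_i ... t_i+1

    float h = k[i + 1].t - k[i].t;         // interval length
    float t = (T - k[i].t) / h;            // normalized time in 0..1
    float t2 = t * t, t3 = t2 * t;

    // Standard cubic hermite basis functions.
    float h00 = 2 * t3 - 3 * t2 + 1;
    float h10 = t3 - 2 * t2 + t;
    float h01 = -2 * t3 + 3 * t2;
    float h11 = t3 - t2;

    return h00 * k[i].d + h10 * h * tangent(k, i) +
           h01 * k[i + 1].d + h11 * h * tangent(k, i + 1);
}
```

The curve passes through every data point, and data that lies on a straight line is reproduced exactly, which is a handy sanity check.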

To do the curve fitting we just check the error in all curve intervals, find the interval D_i D_i+1 with the largest error and split it in half by introducing a new data point at (t_i + t_i+1)/2. We repeat this until the error in all intervals is below a specified threshold value. Again, it is possible that more careful selection of split points could give slightly better compression rates, but we haven't bothered. Note also that we can support curve discontinuities by just inserting two different data points for the same time point.
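The splitting loop can be sketched as follows. To keep the example short it makes two simplifying assumptions: the source animation is a list of samples, so the "midpoint" is the middle sample index of the interval, and the fitted curve is plain linear interpolation between the kept points rather than the hermite spline. The names Sample, interval_error and fit are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Sample { float t; float d; };  // one sample of the source animation

// Stand-in curve: straight line between two kept samples.
static float lerp_at(const Sample &a, const Sample &b, float t) {
    float u = (t - a.t) / (b.t - a.t);
    return a.d + u * (b.d - a.d);
}

// Largest deviation of the source samples strictly between indices a and b.
static float interval_error(const std::vector<Sample> &src,
                            std::size_t a, std::size_t b) {
    float worst = 0.0f;
    for (std::size_t i = a + 1; i < b; ++i)
        worst = std::max(worst,
                         std::fabs(src[i].d - lerp_at(src[a], src[b], src[i].t)));
    return worst;
}

// Returns the indices of the samples kept as curve points.
std::vector<std::size_t> fit(const std::vector<Sample> &src, float threshold) {
    assert(src.size() >= 2 && threshold >= 0.0f);
    std::vector<std::size_t> keys = {0, src.size() - 1};
    for (;;) {
        // Check the error in every interval and remember the worst one.
        std::size_t worst_k = 0;
        float worst = 0.0f;
        for (std::size_t k = 0; k + 1 < keys.size(); ++k) {
            float e = interval_error(src, keys[k], keys[k + 1]);
            if (e > worst) { worst = e; worst_k = k; }
        }
        if (worst <= threshold) return keys;

        // Split the worst interval in half by keeping its middle sample.
        std::size_t mid = (keys[worst_k] + keys[worst_k + 1]) / 2;
        keys.insert(keys.begin() + worst_k + 1, mid);
    }
}
```

All intervals are re-checked after each split because, with implicitly computed derivatives, a new point can also change the shape of the neighbouring intervals. With the linear stand-in used here that re-check is redundant but harmless.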

Animation compression can be done either in local space or in global space. The advantage of keeping the animations in global space is that there is no error propagation through the bone hierarchy, which means that you can use larger error thresholds when compressing the animations. On the other hand, the movement of a bone in global space is typically more complicated. (For a closed fist on a moving arm, the fingers will have no movement in local space, but a lot of movement in global space.) Since a more complicated movement is harder to compress, it might be that the global representation is more expensive, even though you can use a higher threshold. (I haven't actually tried this and compared - so much to do, so little time.)

Also, if you are going to do any animation blending you will probably want to translate back to local space anyhow (unless you blend in global space). For this reason, the BitSquid engine does the compression in local space.

For Vector3 quantization we use 16 bits per component and the range -10 m to 10 m which gives a resolution of 0.3 mm.
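The arithmetic is easy to check: 20 m spread over 65 536 steps is about 0.0003 m per step. A small sketch of quantizing one component, with hypothetical function names and clamping of out-of-range values as an added assumption:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

const float RANGE_MIN = -10.0f;  // metres
const float RANGE_MAX = 10.0f;   // metres

// Maps a value in [-10, 10] to a 16 bit integer, rounding to the nearest step.
std::uint16_t quantize(float v) {
    float c = std::fmin(std::fmax(v, RANGE_MIN), RANGE_MAX);  // clamp to range
    float u = (c - RANGE_MIN) / (RANGE_MAX - RANGE_MIN);      // 0..1
    return static_cast<std::uint16_t>(std::lround(u * 65535.0f));
}

// Maps the 16 bit integer back to metres.
float dequantize(std::uint16_t q) {
    return RANGE_MIN + (q / 65535.0f) * (RANGE_MAX - RANGE_MIN);
}
```

With rounding to the nearest step, the round-trip error stays within about half a step, roughly 0.15 mm.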

For quaternions we use 2 bits to store the index of the largest component, then 10 bits each to store the value of the remaining three components. We use the knowledge that 1 = x^2 + y^2 + z^2 + w^2 to restore the largest component, so we don't actually have to store its value. Since we don't store the largest component we know that the remaining ones must be in the range (-1/sqrt(2), 1/sqrt(2)) (otherwise, one of them would be largest). So we use the 10 bits to quantize a value in that range, giving us a precision of 0.0014.
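A sketch of this packing scheme is shown below. The bit order and the names are assumptions. It also relies on the fact that q and -q represent the same rotation: the quaternion is negated when the largest component is negative, so the restored component can always be taken as the positive square root. The input is assumed to be normalized.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Quat { float v[4]; };  // x, y, z, w; assumed to be unit length

const float LIMIT = 0.70710678f;  // 1/sqrt(2), bound on the three stored components

// Packs into 32 bits: 2 bits for the index of the largest component,
// then 10 bits for each of the other three (assumed bit order).
std::uint32_t pack(const Quat &q) {
    int largest = 0;
    for (int i = 1; i < 4; ++i)
        if (std::fabs(q.v[i]) > std::fabs(q.v[largest])) largest = i;

    // Flip signs so the dropped component is positive (q and -q are equivalent).
    float sign = q.v[largest] < 0.0f ? -1.0f : 1.0f;

    std::uint32_t bits = static_cast<std::uint32_t>(largest);
    int shift = 2;
    for (int i = 0; i < 4; ++i) {
        if (i == largest) continue;
        float u = (sign * q.v[i]) / LIMIT * 0.5f + 0.5f;  // map to 0..1
        u = std::fmin(std::fmax(u, 0.0f), 1.0f);
        bits |= static_cast<std::uint32_t>(std::lround(u * 1023.0f)) << shift;
        shift += 10;
    }
    return bits;
}

Quat unpack(std::uint32_t bits) {
    int largest = static_cast<int>(bits & 3u);
    Quat q;
    float sum = 0.0f;
    int shift = 2;
    for (int i = 0; i < 4; ++i) {
        if (i == largest) continue;
        float u = ((bits >> shift) & 1023u) / 1023.0f;
        q.v[i] = (u * 2.0f - 1.0f) * LIMIT;
        sum += q.v[i] * q.v[i];
        shift += 10;
    }
    // Restore the dropped component from 1 = x^2 + y^2 + z^2 + w^2.
    q.v[largest] = std::sqrt(std::fmax(0.0f, 1.0f - sum));
    return q;
}
```

Note that unpacking may return -q instead of q, which is fine for rotations but matters if the caller compares components directly.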

So, to summarize, that gives us 48 bits per Vector3 curve point and 32 bits per quaternion curve point, plus 16 bits for the time stamp. Now the only thing remaining is to package all these curve points for all the bones in a cache friendly way. This will be the topic of another blog post, since this one is already long enough.

Friday, October 23, 2009

Picking a scripting language

We are planning to make the BitSquid engine largely scripting language agnostic. We will expose a generic scripting interface from the engine and it should be relatively easy to bind that to whatever scripting language you desire.

Still, we have to pick some language to use for our own internal projects and recommend to others. I'm currently considering three candidates:

C/C++

  • Use regular C/C++ for scripting.
  • Run it dynamically either by recompiling and relinking DLLs or by running an x86 interpreter in the game engine and loading compiled libs directly.
  • + Static typing
  • + Syntax checking & compiling can be done with an ordinary compiler
  • + When releasing the game we can compile to machine code and get full native speed
  • - C is not that nice for scripting
  • - Huge performance differences between "fully compiled" and "interactive" code make it difficult for the gameplay programmers to do performance estimates.
Lua
  • Lua has the same feature set as Python and Ruby, but is smaller, more elegant and faster.
  • Other scripting languages such as Squirrel and AngelScript offer reference counting and static typing, but are not as well known / used
  • + Dynamic, elegant, small
  • + Something of a standard as a game scripting language
  • + LuaJIT is very fast
  • - Non-native objects are forced to live on the heap
  • - Garbage collection can be costly for a realtime app
  • - Speed can be an issue compared to native code
  • - Cannot use LuaJIT on consoles
Mono
  • Use the Mono runtime and write scripts in C#, Boo, etc.
  • + Static typing
  • + Popular, fast
  • - Huge, scary runtime
  • - Garbage collection
  • - Requires license to run on console
  • - Can probably not JIT on console

First profiler screenshot



We now have the BitSquid thread profiler up and running. The profiler is a C# application that receives profiler events from the engine over a TCP pipe.

The screen shot above shows a screen capture from a test scene with 1 000 individually animated 90-bone characters running on a four core machine. The black horizontal lines are the threads. The bars are profiler scopes. Multiple bars below each other represent nested scopes (so Application::update is calling MyGame::update for instance). Color represents the core that the scope started running on (we do not detect core switches within scopes).

In the screen shot above, you can see AnimationPlayer::update starting up 10 animation_player_kernel jobs to evaluate the animations. Similarly SceneGraphManager::update runs five parallel jobs to update the scene graph. SceneGraphAnimators only copies the animation data from the animation output into the scene graphs. But even this takes some time, since we are copying 90 000 matrices.

(Of course, if we were to put a crowd of 1 000 people in a game we would use clever instancing, rather than run 1 000 animation and scene graph evaluations. This workload was just used to test the threading.)

Wednesday, October 14, 2009

Parallel rendering

I've spent the last week designing and implementing the low-level parts of the renderer used in our new engine. One of the key design principles of the engine is to go as wide / parallel as possible whenever possible. To be able to do that in a clean and efficient way a good data streaming model with minimal pointer chasing is key.


With the rendering I've tackled that by splitting the batch processing in three passes: batch gathering, merge-n-sort and display list building.


In the batch gathering pass we walk over the visible objects (objects that have survived visibility culling) and let them queue their draw calls to a RenderContext. A RenderContext is a platform independent package stream that holds all data needed for draw calls (and other render jobs/events/state changes etc). This step is easily divided into any number of jobs, by letting each job have its own RenderContext.


After the batch gathering is done we have all the data needed to draw the scene in n RenderContexts. The purpose of the merge-n-sort step is to take those RenderContexts and merge them into one, while at the same time sorting all batches into the desired order (with respect to "layers", minimizing state changes, depth sorting etc).
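As a simplified illustration of the merge-n-sort step, assume each package carries a single integer sort key that already encodes layer, state and depth. The Package struct and merge_and_sort are hypothetical names; the real RenderContext is a package stream, not a vector of structs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical package: a sort key plus a stand-in for the draw call data.
struct Package {
    std::uint64_t sort_key;  // assumed to encode layer, state, depth, etc.
    int draw_call;           // placeholder for the actual payload
};

using RenderContext = std::vector<Package>;

// Merges the per-job contexts into one and sorts it by key.
RenderContext merge_and_sort(const std::vector<RenderContext> &contexts) {
    std::size_t total = 0;
    for (const RenderContext &c : contexts) total += c.size();

    RenderContext merged;
    merged.reserve(total);
    for (const RenderContext &c : contexts)
        merged.insert(merged.end(), c.begin(), c.end());
    assert(merged.size() == total);

    // stable_sort keeps the submission order of packages with equal keys.
    std::stable_sort(merged.begin(), merged.end(),
                     [](const Package &a, const Package &b) {
                         return a.sort_key < b.sort_key;
                     });
    return merged;
}
```

Since each gathering job writes only to its own RenderContext, no locking is needed until this single merge point.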


We now have one sorted package stream containing all the draw calls that we can send off to the rendering back-end. At this point we can again go wide and build the display list in parallel. Here's a small sketch illustrating the data flow:





Red sections belong to the platform independent renderer. Blue sections belong to the rendering back-end (in this illustration D3D11).