Friday, November 20, 2009

The BitSquid low level animation system

In the BitSquid engine we differ between the low level and the high level animation system. The low level system has a simple task: given animation data, find the bone poses at a time t. The high level system is responsible for blending animations, state machines, IK, etc.

Evaluation of animation data is a memory intensive task, so to maximize performance means:
  • Touch as little memory as possible (i.e., compress the animations as much as possible)
  • Touch memory in a cache friendly way (i.e., linearly)
In the BitSquid engine we do animation compression by curve fitting and data quantization.

There are a lot of different possible ways to do curve fitting. Since we are curve fitting for compression it doesn't really matter what method we use as long as (a) we can keep the error below a specified threshold, (b) the curve representation is small (good compression rate), (c) the curve is reasonably smooth and (d) it does not take too long to evaluate.

In the BitSquid engine we currently use a hermite spline with implicitly computed derivatives. I.e., we represent the curve with time and data points: (t_1, D_1), (t_2, D_2), ..., (t_n, D_n) and evaluate the curve at the time T in the interval t_i ... t_i+1, with t = (T - t_i) / (t_i+1 - t_i) by




This formulation gives pretty good compression rates, but I haven't investigate all the possible alternatives (there are a lot!). It is possible that you could achieve better rates with some other curve. An advantage of this formulation is that it only uses the original data points of the curve and scaling constants in the range 0-1, which makes it easy to understand  the effects of quantization.

To do the curve fitting we just check the error in all curve intervals, find the interval D_i D_i+1 with the largest error and split it in half by introducing a new data point at (t_i + t_i+1)/2. We repeat this until the error in all intervals is below a specified threshold value. Again, it is possible that more careful selection of split points could give slightly better compression rates, but we haven't bothered. Note also that we can support curve discontinuities by just inserting two different data points for the same time point.

Animation compression can be done either in local space or in global space. The advantage of keeping the animations in global space is that there is no error propagation through the bone hierarchy, which means that you can use larger error thresholds when compressing the animations. On the other hand, the movement of a bone in global space is typically more complicated. (For a closed fist on a moving arm, the fingers will have no movement in local space, but a lot of movement in global space.) Since a more complicated movement is harder to compress, it might be that the global representation is more expensive, even though you can use a higher threshold. (I haven't actually tried this and compared - so much to do, so little time.)

Also, if you are going to do any animation blending you will probably want to translate back to local space anyhow (unless you blend in global space). For this reason, the BitSquid engine does the compression in local space.

For Vector3 quantization we use 16 bits per component and the range -10 m to 10 m which gives a resolution of 0.3 mm.

For quaternions we use 2 bits to store the index of the largest component, then 10 bits each to store the value of the remaining three components. We use the knowledge that 1 = x^2 + y^2 + z^2 + w^2 to restore the largest component, so we don't actually have to store its value. Since we don't store the largest component we know that the remaining ones must be in the range (-1/sqrt(2), 1/sqrt(2)) (otherwise, one of them would be largest). So we use the 10 bits to quantize a value in that range, giving us a precision of 0.0014.

So, to summarize, that gives us 48 bits per Vector3 curve point and 32 bits per quaternion curve point, plus 16 bits for the time stamp. Now the only thing remaining is to package all these curve points for all the bones in a cache friendly way. This will be the topic of another blog post, since this one is already long enough.

Friday, October 23, 2009

Picking a scripting language

We are planning to make the BitSquid engine largely scripting language agnostic. We will expose a generic scripting interface from the engine and it should be relatively easy to bind that to whatever scripting language you desire.

Still, we have to pick some language to use for our own internal projects and recommend to others. I'm currently considering three candidates:

C/C++

  • Use regular C/C++ for scripting.
  • Run it dynamically either by recompiling and relinking DLLs or by running an x86 interpreter in the game engine and loading compiled libs directly.
  • + Static typing
  • + Syntax checking & compiling can be done with an ordinary compiler
  • + When releasing the game we can compile to machine code and get full native speed
  • - C is not that nice for scripting
  • - Huge performance differences between "fully compiled" and "interactive" code makes it difficult for the gameplay programmers to do performance estimates.
Lua
  • Lua has the same feature set as Python and Ruby, but is smaller, more elegant and faster.
  • Other scripting langues such as Squirrel, AngelScript offer reference counting and static typing, but are not as well known / used
  • + Dynamic, elegant, small
  • + Something of a standard as a game scripting language
  • + LuaJIT is very fast
  • - Non-native objects are forced to live on the heap
  • - Garbage collection can be costly for a realtime app
  • - Speed can be an issue compared to native code
  • - Cannot use LuaJIT on consoles
Mono
  • Use the Mono runtime and write scripts in C#, Boo, etc.
  • + Static typing
  • + Popular, fast
  • - Huge, scary runtime
  • - Garbage collection
  • - Requires license to run on console
  • - Can probably not JIT on console

First profiler screenshot



We now have the BitSquid thread profiler up and running. The profiler is a C# application that receives profiler events from the engine over a TCP pipe.

The screen shot above shows a screen capture from a test scene with 1 000 individually animated 90-bone characters running on a four core machine. The black horizontal lines are the threads. The bars are profiler scopes. Multiple bars below each other represent nested scopes (so Application::update is calling MyGame::update for instance). Color represents the core that the scope started running on (we do not detect core switches within scopes).

In the screen shot above, you can see AnimationPlayer::update starting up 10 animation_player_kernel jobs to evaluate the animations. Similarly SceneGraphManager::update runs five parallel jobs to update the scene graph. SceneGraphAnimators only copies the animation data from the animation output into the scene graphs. But even this takes some time, since we are copying 90 000 matrices.

(Of course if we would make a 1 000 people crowd in a game we would use clever instancing, rather than run 1 000 animation and scene graph evaluations. This workload was just used to test the threading.)

Wednesday, October 14, 2009

Parallel rendering

I've spent the last week designing and implementing the low-level parts of the renderer used in our new engine. One of the key design principles of the engine is to go as wide / parallel as possible whenever possible. To be able to do that in a clean and efficient way a good data streaming model with minimal pointer chasing is key.


With the rendering I've tackled that by splitting the batch processing in three passes: batch gathering, merge-n-sort and display list building.


In the batch gathering pass we walk over the visible objects (objects that have survived visibility culling) and let them queue their draw calls to a RenderContext. A RenderContext is a platform independent package stream that holds all data needed for draw calls (and other render jobs/events/state changes etc). This step is easily divided into any number of jobs, by letting each job have its own RenderContext.


After the batch gathering is done we have all data needed to draw the scene in n number of RenderContexts. The purpose of the merge-n-sort step is to take those RenderContexts, merge them to one while at the same time sorting all batches into the desired order (with respect to "layers", minimizing state changes, depth sorting etc).


We now have one sorted package stream containing all the draw calls that we can send off to the rendering back-end. At this point we can again go wide and build the display list in parallel. Here's a small sketch illustrating the data flow:





Red sections belongs to the platform independent renderer. Blue sections belongs to the rendering back-end (in this illustration D3D11).

Tuesday, October 13, 2009

Simplified JSON notation

JSON is human-editable, but not necessarily human-friendly. A typical JSON configuration file:

{
    "ip" : "127.0.0.1",
    "port" : 666
}

A more Lua-inspired syntax is friendlier:

ip = "127.0.0.1"
port = 666

This syntax corresponds 1-1 with regular JSON syntax and can be trivially converted back and forth with the following rules:

  • Assume an object definition at the root level (no need to surround entire file with { } ).
  • Commas are optional
  • Quotes around object keys are optional if the keys are valid identifiers
  • Replace : with =

On the other hand, all syntax wars are pointless and will only send us into an early grave.

Multithreaded gameplay

How do we multithread gameplay without driving gameplay programmers insane?

My current idea is:
  • Do all gameplay processing as events reacting to stuff (such as collide_with_pickup_object), not through a generic update() call.
  • Each event concerns a number of entities (e.g., [player, ammo_pack]). The processing function for an event is allowed to touch the entities it concerns freely, but not any other entities.
  • Each frame, consider all events. Let two entities being in the same event define an equivalence relation between those two entities. The corresponding equivalence classes then define "islands" of entities that can be processed safely on separate cores.
  • Assign each island to a core, process the events for that island one by one on the core.
  • Provide a thread-safe interface to any global entitites that the event processors may need to touch for effect spawning, sound play, etc. (Preferrably through a queue so that the global entities don't have to be touched directly from the event processors.)
Some concerns:
  • Will the islands become "too big". I.e., if almost everything interacts with the player, there is a risk that everything ends up in a single big "player island".
  • Will it be reasonable for gameplay programmers to write code that follows these restrictions.

Friday, October 2, 2009

Two way serialization function

A trick to avoid having to keep the serialization code for input and output in sync is to use the same code for both input and output:

struct Object {
template <>
STREAM & serialize(STREAM & stream) {
return stream & a & b & c;
}
int a, b, c;
};

Here we have used & as our serialization operator. We could use any operator we like.

We then just implement the operator to do the right thing for our input and output streams:

template < > InputArchive & operator &(InputArchive &a, int &v) {
a.read(&v, sizeof(v));
return a;
}

template < > OutputArchive & operator & (OutputArchive &a, int &v) {
a.write(&v, sizeof(v));
return a;
}

These are both template specializations of a generic streaming template.

template <>
STREAM & operator &(STREAM & stream, T & t) {
t.serialize(stream);
}

Now we can stream all kinds of types either by implementing serialize in the type or by defining a template specialization of operator & for that type.