Monday, September 17, 2012

A Data-Oriented, Data-Driven System for Vector Fields - Part 1

A vector field is a function that assigns a vector value to each point in 3D space. Vector fields can be used to represent things like wind (the vector field specifies the wind velocity at each point in space), water, magnetism, etc.

To me, wind is the most interesting use case. I want a system that can be used for physics (trees, tumble weed, paper cups), particles (leaves, sparks, smoke) and graphics (grass). I also want the system to be capable of handling both global effects (wind blowing through the entire level) and local effects (explosions, air vents, landing helicopters, rising hot air from fires, etc). But I don't want to limit the system to only handling wind. I imagine that once the system is in place, it could be put to other interesting uses as well.

There are a number of things that make this an interesting non-trivial design challenge:

  • Vector fields represent a global shared state. All systems (particles, physics, etc) should react to the same wind. This can create strong couplings between unrelated systems, which we want to avoid.

  • The system must be fast. We want to be able to make large particle effects that are affected by wind. As a design goal, let's say that it should be able to handle at least 10 000 queries / frame.

  • As stated above, the system must be flexible enough to handle both global wind and a large variety of different local effects (air vents, fans, etc).

I'll outline the system in a series of articles. Let's start by thinking a bit about how we can represent the vector field in a way that allows for fast queries.

1. Use a functional representation

Storing the vector value for every point in 3D space at a decent resolution would require huge amounts of memory. It would also be very expensive to update. If we wanted to change the global wind direction, we would have to loop over all those points and change the value.

So, instead, we will use a functional representation. We will express the field as some closed function F(p, t) that gives us the field vector at point p in space at the time t.

For example, we could express a global wind that oscillates in the x-direction as:

F(p, t) = Vector3(sin(t), 0, 0)

The closed function form allows us to evaluate the vector field at any point in space and time.

Note that even with a functional form as the main representation, we can still interact with grid based representations. For example, we can render some section of the F(p, t) function to a texture for use on a GPU. Similarly, if we have some grid based wind data that we want to add to the simulation, we could use that as part of the F(p, t) expression:

F(p, t) = Vector3(sin(t), 0, 0) + sample_grid(grid, p)

2. Ignore the time coordinate

The vector field function F(p, t) is a function of both space and time. The wind varies throughout the level and if we look at any one point, the wind at that point varies over time.

But in practice, we treat the p and t coordinates very differently. We start at some time t_0 and then evaluate F(p, t_0) for thousands of different p values. Then we move on to t_1 and do the same thing.

We can make use of the fact that t remains constant for a large number of evaluations to simplify the function. For example at t=0.5 the function:

F(p, t) = sin(p.x) * sin(p.y) * cos(t)

simplifies to:

G(p) = sin(p.x) * sin(p.y) * 0.8776

which is cheaper to evaluate.

Taking this approach a step further, it makes sense to split our system in two parts -- a high level system that knows about time and every frame produces a new G(p) for the current time, and a low level system that ignores time completely and just computes G(p). Since the high level system only runs once per frame it can afford to do all kinds of complicated but interesting stuff, like constant folding, optimization, etc.

For the low level system we have reduced the problem to evaluating G(p).

3. Express the field as a superposition of individual effects

To make it possible for the field to contain both global effects (world wind) and local effects (air vents, explosions) we express it as a superposition of individual effect functions:

G(p) = G_1(p) + G_2(p) + ... + G_n(p)

Here G_i(p) represents each individual effect. A base wind could be expressed as just a constant:

G_0(p) = Vector3(2.1, 1.4, 0)

A turbulence function could add a random component

G_1(p) = turbulence(seed, p, 4)

An explosion effect could create a wind with a speed of 100 m/s outwards from the center of the explosion in a sphere with radius 4.0 meter around the explosion center:

G_2(p) = sphere(p,c,4) * normalize(p-c) * 100

Here sphere(p,c,4) is a spherical support function that defines the range of the effect. It is 1 if ||p - c|| <= 4.0 and 0 otherwise.

Note again that we have stripped out the time component. At the higher level, this might be an expanding sphere with decreasing wind speeds, but at the low level we only care what it looks like at this instance.

Similar functions can be added for other local effects.

4. Use the AABB to cull local fields

If we have a lot of local effects (explosions, etc), evaluating G(p) will be pretty expensive.

We can reduce the cost by only evaluating the local effects that are close enough to our particle system to matter.

I.e., instead of evaluating G(p) for all particles, we first intersect the AABB of each G_i(p)'s support with the AABB of our particle system.

That gives us a simpler function G'(p) that we can then evaluate for each particle.

If we wanted to, we could use the wavelength of the field for further simplifications. If the scale at which a field effect changes is much larger than our AABB, we can replace that effect with a Taylor series expansion. Similarly, if an effect oscillates at a scale much smaller than the size of our particles, we can replace it with its average value.

Next time

Next time I will look at how we can efficiently evaluate arbitrary functions, such as:

G(p) = Vector3(1,1,0) + turbulence(seed, p, 2) + sphere(p, c, 4)

for a huge number of particle positions p.

This has also been posted to The Bitsquid blog.

Monday, September 3, 2012

A new way of organizing header files

Recently, I've become increasingly dissatisfied with the standard C++ way of organizing header files (one .h file and one .cpp file per class) and started experimenting with alternatives.

I have two main problems with the ways headers are usually organized.

First, it leads to long compile times, especially when templates and inline functions are used. Fundamental headers like array.h and vector3.h get included by a lot of other header files that need to use the types they define. These, in turn, get included by other files that need their types. Eventually you end up with a messy nest of header files that get included in a lot more translation units than necessary.

Sorting out such a mess once it has taken root can be surprisingly difficult. You remove an #include statement somewhere and are greeted by 50 compile errors. You have to fix these one by one by inserting missing #include statements and forward declarations. Then you notice that the Android release build is broken and needs additional fixes. This introduces a circular header dependency that needs to be resolved. Then it is on to the next #include line -- remove it, rinse and repeat. After a day of this mind-numbingly boring activity you might have reduced your compile time by four seconds. Hooray!

Compile times have an immediate and important effect on programmer productivity and through general bit rot they tend to grow over time. There are many things that can increase compile times, but relatively few forces that work in the opposite direction.

It would be a lot better if we could change the way we work with headers, so that we didn't get into this mess to begin with.

My second problem is more philosophical. The basic idea behind object-oriented design is that data and the functions that operate on it should be grouped together (in the same class, in the same file). This idea has some merits -- it makes it easier to verify that class constraints are not broken -- but it also leads to problems. Classes get coupled tightly with concepts that are not directly related to them -- for example things like serialization, endian-swapping, network synchronization and script access. This pollutes the class interface and makes reuse and refactoring harder.

Class interfaces also tend to grow indefinitely, because there is always "more useful stuff" that can be added. For example, a string class (one of my pet peeves) could be extended with functionality for tokenization, path manipulation, number parsing, etc. To prevent "class bloat", you could write this code as external functions instead, but this leads to a slightly strange situation where a class has some "canonized" members and some second-class citizens. It also means that the class must export enough information to allow any kind of external function to be written, which kind of breaks the whole encapsulation idea.

In my opinion, it is much cleaner to organize things by functionality than by type. Put the serialization code in one place, the path manipulation code in another place, etc.

My latest idea about organization is to put all type declarations for all structs and classes in a single file (say types.h):

struct Vector3 {
 float x, y, z;
};

template <class T>
class Array<T> {
public:
 Array() : _capacity(0), _size(0), _data(0) {}
 ~Array() {free(_data);}
 unsigned _capacity;
 unsigned _size;
 T *_data;
};

class IFileSystem;
class INetwork;

Note that types.h has no function declarations, but it includes the full data specification of any struct or class that we want to use "by value". It also has forward declarations for classes that we want to use "by reference". (These classes are assumed to have pure virtual interfaces. They can only be created by factory functions.)

Since types.h only contains type definitions and not a ton of inline code, it ends up small and fast to compile, even if we put all our types there.

Since it contains all type definitions, it is usually the only file that needs to be included by external headers. This means we avoid the hairy problem with a big nest of headers that include other headers. We also don’t have to bother with inserting forward declarations in every header file, since the types we need are already forward declared for us in types.h.

We put the function declarations (along with any inline code) in the usual header files. So vector3.h would have things like:

inline Vector3 operator+(const Vector3 &a, const Vector3 &b)
{
 Vector3 res;
 res.x = a.x + b.x;
 res.y = a.y + b.y;
 res.z = a.z + b.z;
 return res;
}

.cpp files that wanted to use these operations would include vector3.h. But .h files and other .cpp files would not need to include the file. The file gets included where it is needed and not anywhere else.

Similarly, array.h would contain thinks like:

template <class T>
void push_back(Array<T> &a, const T &item)
{
 if (a._size + 1 > a._capacity)
  grow(a);
 a._data[a._size++] = item;
}

Note that types.h only contains the constructor and the destructor for Array<T>, not any other member functions.

Furthermore, I prefer to design classes so that the "zero-state" where all members are zeroed is always a valid empty state for the class. That way, the constructor becomes trivial, it just needs to zero all member variables. We can also construct arrays of objects with a simple memset().

If a class needs a more complicated empty state, then perhaps it should be an abstract interface-class instead of a value class.

For IFileSystem, file_system.h defines the virtual interface:

class IFileSystem
{
 virtual bool exists(const char *path) = 0;
 virtual IFile *open_read(const char *path) = 0;
 virtual IFile *open_write(const char *path) = 0;
 ...
};

IFileSystem *make_file_system(const char *root);
void destroy_file_system(IFileSystem *fs);

Since the “open structs” in types.h can be accessed from anywhere, we can grop operations by what they do rather than by what types they operate on. For example, we can put all the serialization code in serialization.h and serialization.cpp. We can create a file path.h that provides path manipulation functions for strings.

An external project can also "extend" any of our classes by just writing new methods for it. These methods will have the same access to the Vector3 data and be called in exactly the same way as our built-in ones.

The main drawback of this model is that internal state is not as "protected" as in standard object-oriented design. External code can "break" our objects by manipulating members directly instead of using methods. For example, a stupid programmer might try to change the size of an array by manipulating the _size field directly, instead of using the resize() method.

Naming conventions can be used to mitigate this problem. In the example above, if a type is declared with class and the members are preceded by an underscore, the user should not manipulate them directly. If the type is declared as a struct, and the members do not start with an underscore, it is OK to manipulate them directly. Of course, a stupid programmer can still ignore this and go ahead and manipulate the members directly anyway. On the other hand, there is no end to the things a stupid programmer can do to destroy code. The best way to protect against stupid programmers is to not hire them.

I haven’t yet written anything really big in this style, but I've started to nudge some files in the Bitsquid codebase in this direction, and so far the experience has been positive.

Saturday, August 18, 2012

Cleaning bad code

Guess what! You've just inherited a stinking, steaming pile of messy old code. Congratulations! It's all yours.

Bad code can code can come from all kinds of places. Middleware, the internet, perhaps even your own company.

You know that nice guy in the corner that nobody had time to check up on? Guess what he was doing all that time. Churning out bad code.

Or remember that module someone wrote years ago, just before she left the company. That module that twenty different people have then added hacks, patches and bug fixes to, without really understanding what they were doing. Yup, that one.

Or what about that open source thing you downloaded that you knew was horrible, but it solved a very specific and quite hairy problem that would have taken you ages to do by yourself.

Bad code doesn't have to be a problem, as long as it's not misbehaving, and nobody pokes their bloody nose in it. Unfortunately, that state of ignorant bliss rarely lasts. A bug will be discovered. A feature requested. A new platform released. Now you have to dig into that horrible mess and try to clean it up. This article offers some humble advice for that unfortunate situation.

0. Is it worth doing?

The first thing you need to ask yourself is whether the code is worth cleaning. I'm of the opinion that when it comes to code cleaning you should either karate do "yes", or karate do "no". Either you assume full responsibility for the code and rework it until you end up with something that you are actually happy to maintain and proud to have in your codebase.

Or you decide that even though the code looks horrible, it isn't cost-effective to take time out of your busy schedule to fix it. So instead you just do the smallest change possible that solves your current problem.

In other words, you either regard the code as yours or theirs.

There are merits to both alternatives. Good programmers get an itch when they see bad code. They bring out their torches and pitchforks and chant: "Unclean! Unclean!" And that is a good instinct.

But cleaning code is also a lot of work. It is easy to underestimate the time it takes. It can be nearly as time consuming as rewriting the whole thing from scratch. And it doesn't bring any short term benefits. Two weeks cleaning code won't add any new features to the game, but it might give you some new bugs.

On the other hand, the long term effects of never cleaning your code can be devastating. Entropy is the code-killer.

So, never an easy choice. Some things to consider are:

  • How many changes do you expect to make to the code?

    Is it just this one small bug that you need to fix, or is this code that you expect to return to many times to tweak and tune and add new features. If it's just this one bug, then perhaps it is best to let sleeping dogs lie. However, if this is a module that you will need to mess around with a lot, then spending some time to clean it up now, will save a lot of headache later.

  • Will you need/want to import upstream changes?

    Is this an open source project that is under active development? If so, and you want to pull the changes made upstream you can't make any big changes to the code or you will be in merge hell every time you pull. So just be a nice team player, accept its idiosyncrasies and send patches with your bug fixes to the maintainer.

  • How much work is it?

    How many lines of code can you realistically clean in a day? An order of magnitude estimate says more than 100 and less than 10 000, so let's say 1 000. So if the module has 30 000 lines, you might be looking at a month of work. Can you spend that? Is it worth it?

  • Is it a part of your core functionality?

    If what the module does is something peripheral, like say font rendering or image loading, you might not care that it is messy. You might swap out the whole thing for something else in the future, who knows. But you should own the code that relates to your core competence.

  • How bad is it?

    If the code is just slightly bad, then perhaps you can live with it. If it is mind-numbingly, frustratingly incomprehensibly bad, then perhaps something needs to be done.

1. Get a test case

Seriously cleaning a piece of code means messing around with it a lot. You will break things.

If you have a decent test case with good coverage you will immediately know what has broken and you can usually quite quickly figure out what stupid mistake you just made. The time and anxiety this saves over the course of the cleaning process is just ridiculous. Get a test case. It's the first thing you should do.

Unit tests are best, but all code is not amenable to to unit testing. (Test fanatics, send your hate mail now!) If unit tests are too cumbersome, use an integration test instead. For example, fire up a game level and run the character through a specific set of actions related to the code you are cleaning.

Since such tests ate more time consuming, it might not make sense to run it after every change you make, which would be ideal. But as you put every single change you make into source control, it's not so bad. Run the test every once in a while (e.g., every five changes). When it discovers a problem you can do a binary search of those last few commits to find out which one caused the problem.

If you discover an issue that wasn't detected by your test, make sure that you add that to the test, so that you capture it in the future.

2. Use source control

Do people still have to be told to use source control? I sure hope not.

For cleaning work it is absolutely crucial. You will be making lots and lots of small changes to the code. If something breaks you want to be able to look back in the revision history and find out where it broke.

Also, if you are anything like me, you will sometimes start down a refactoring path (like removing a stupid class) and realize after a while that it wasn't such a good idea, or, that it was a good idea, but that everything would be a lot simpler if you did something else first. So you want to be able to quickly revert everything you just did and begin anew.

Your company should have a source control system in-place that allows you to do these changes in a separate branch and commit as much as you like without disturbing anybody else.

But even if it doesn't, you should still use source control. In that case, download mercurial (or git), create a new repository and put the code that you checked out of your company's stupid system there. Do your changes in that repository, committing as you go. When you are done you can merge everything back into the stupid system.

Cloning the repository into a sensible source control system only takes a few minutes. It is absolutely worth it. If you don't know mercurial, spend an hour to learn it. You will be happy you did. Or if you prefer, spend 30 hours to learn git instead. (I kid! Not really. Nerd fight now!)

3. Make one (small) change at a time

There are two ways of improving bad code: revolution and reform. The revolution method is to burn everything with fire and rewrite it from scratch. The reform method is to refactor the code with one small change at a time without ever breaking it.

This article is about the reform method. I'm not saying that revolutions never are necessary. Sometimes things are so bad that they just need to go. But people who get frustrated with the slow pace of reform and advocate revolution often fail to realize the full complexity of the problem and thus don't give the existing system enough credit for the things it does.

Joel Spolsky has written a classic article about this without falling into the trap of making strained political metaphors.

The best way of reforming code is to make one minimal change at a time, test it and commit it. When the change is small it is easier to understand its consequences and make sure that it doesn't affect the existing functionality. If something goes wrong, you only have a small amount of code that you need to check. If you start doing a change and realize that it is bad, you won't loose much work by reverting to the last commit. If you notice after a while that something has gone subtly wrong, a binary search in the revision history will let you find the small change that introduced the problem.

A common mistake is to do more than one thing at the same time. For example, while getting rid of an unnecessary level of inheritance you might notice that the API methods are not as orthogonal as you would like them to be and start to rearrange them. Don't! Get rid of the inheritance first, commit that and then fix the API.

Smart programmers organize the way they work so that they don't have to be that smart.

Try to find a path that takes you from what the code is now to what you want it to be in a sequence of small steps. For example, in one step you might rename the methods to give them more sane names. In the next, you might change some member variables to function parameters. Then you reorder some algorithms so that they are clearer. And so on.

If you start doing a change and realize that it was a bigger change than you originally thought, don't be afraid to revert and find a way of doing the same thing in smaller, simpler steps.

4. Don't clean and fix at the same time

This is a corollary to (3), but important enough to get its own point.

It is a common problem. You start to look at a module because you want to add some new functionality. Then you notice that the code is really badly organized, so you start reorganizing it at the same time as you are adding the new functionality.

The problem with this is that cleaning and fixing has diametrically opposite goals. When you clean, you want to make the code look better without changing its functionality. When you fix, you want to change its functionality to something better. If you clean and fix at the same time it becomes very hard to make sure that your cleaning didn't indadvertedly change something.

Do the cleaning first. Then, when you have a nice clean base to work with, add the new functionality.

5. Remove any functionality that you are not using

The time it takes to clean is proportional to the amount of code, its complexity and its messiness.

If there is any functionality in the code that you are currently not using and don't plan to be using in the foreseeable future -- get rid of it. That will both reduce the amount of code you will have to go through and its complexity (by getting rid of unnecessary concepts and dependencies). You will be able to clean faster and the end result will be simpler.

Don't save code because "who knows, you might need it some day". Code is costly -- it needs to be ported, bug checked, read and understood. The less code you have, the better. In the unlikely event that you do need the old code, you can always find it in the source repository.

6. Delete most of the comments

Bad code rarely has good comments. Instead, they are often:

// Pointless:

 // Set x to 3
 x = 3;

// Incomprehensible:

 // Fix for CB (aug)
 pos += vector3(0, -0.007, 0);

// Sowing fear and doubt:

 // Really we shouldn't be doing this
 t = get_latest_time();

// Downright lying:

 // p cannot be NULL here
 p->set_speed(0.7);

Read through the code. If a comment doesn't make sense to you and doesn't further your understanding of the code -- get rid of it. Otherwise you will just waste mental energy on trying to understand that comment on each future reading of the code.

The same goes for dead code that has been commented or #ifdef'ed out. Get rid of it. It's there in the source repository if you need it.

Even when comments are correct and useful, remember that you will be doing a lot of refactoring of the code. The comments may no longer be correct when you are done. And there is no unit test in world that can tell you if you have "broken the comments".

Good code needs few comments because the code itself is clearly written and easy to understand. Variables with good names do not need comments explaining their purpose. Functions with clear inputs and outputs and no special cases or gotchas require little explanation. Simple, well written algorithms can be understood without comments. Asserts document expectations and preconditions.

In many cases the best thing to do is just to get rid of all old comments, focus on making the code clear and readable, and then add back whatever comments are needed -- now reflecting the new API and your own understanding of the code.

7. Get rid of shared mutable state

Shared mutable state is the single biggest problem when it comes to understanding code, because it allows for spooky "action at a distance", where one piece of code changes how a completely different piece of code behaves. People often say that multithreading is difficult. But really, it is the fact that the threads share mutable state that is the problem. If you get rid of that, multithreading is not so complex.

Since your goal is to write high-performant software, you won't be able to get rid of all mutable state, but your code can still benefit enormously from reducing it as much as possible. Strive for programs that are "almost functional" and make sure you know exactly what state you are mutating where and why.

Shared mutable state can come from several different places:

  • Global variables. The classic example. By now everybody surely knows that global variables are bad. But note (and this is a distinction that people sometimes fail to make), that it is only shared mutable state that is problematic. Global constants are not bad. Pi is not bad. Sprintf is not bad.

  • Objects -- big bags of fun. Objects are a way for a large number of functions (the methods) to implicitly share a big bag of mutable state (the members). If a lazy programmer needs to pass some information around between methods, she can just make a new member that they can read and write as they please. It's almost like a global variable. How fun! The more members and the more methods an object has, the bigger this problem is.

  • Megafunctions. You have heard about them. These mythic creatures that dwell in the deepest recesses of the darkest codebases. Broken programmers talk about them in dusky bars, their sanity shattered by their encounters: "I just kept scrolling and scrolling. I couldn't believe my eyes. It was 12 000 lines long."

    When functions are big enough, their local variables are almost as bad as global variables. It becomes impossible to tell what effect a change to a local variable might have 2 000 lines further down in the code.

  • Reference and pointer parameters. Reference and pointer parameters that are passed without const can be used to subtly share mutable state between the caller, the callee and anyone else who might be passed the same pointer.

Here are some practical ideas for getting rid of shared mutable state:

  • Split big functions into smaller ones.

  • Split big objects into smaller ones by grouping members that belong together.

  • Make members private.

  • Change methods to be const and return the result instead of mutating state.

  • Change methods to be static and take their arguments as parameters instead of reading them from shared state.

  • Get rid of objects entirely and implement the functionality as pure functions without side effects.

  • Make local variables const.

  • Change pointer and reference arguments to const.

8. Get rid of unnecessary complexity

Unnecessary complexity is often a result of over-engineering -- where the support structures (for serialization, reference counting, virtualized interfaces, abstract factories, visitors, etc) dwarf the code that performs the actual functionality.

Sometimes over-engineering occurs because software projects start out with a lot more ambitious goals than what actually gets implemented. More often, I think, it reflects the ambitions/esthetics of a programmer who has read books on design patterns and the waterfall model and believes that over-engineering makes a product "solid" and "high-quality".

Often, the heavy, rigid, overly complex model that results is unable to adapt to feature requests that were not anticipated by the designer. Those features are then implemented as hacks, bolt-ons and backdoors on top of the ivory tower resulting in a schizophrenic mix of absolute order and utter chaos.

The cure against over-engineering is YAGNI -- you are not gonna need it! Only build the things that you know you need. Add more complicated stuff when you need it, not before.

Some practical ideas for cleaning out of unnecessary complexity:

  • Remove the functionality you are not using (as suggested above).

  • Simplify necessary concepts, and get rid of unneeded ones.

  • Remove unnecessary abstractions, replace with concrete implementations.

  • Remove unnecessary virtualization and simplify object hierarchies.

  • If only one setting is ever used, get rid of the possibility of running the module in other configurations.

9. That is all

Now go clean your room!

Saturday, August 4, 2012

A simpler design for asynchronous APIs

Accessing Internet services, e.g. to fetch a web page or to store data on a leaderboard, requires an asynchronous API. You send a request and then, at some later point, you receive a reply.

Asynchronous APIs are trickier to design than synchronous ones. You can't simply return the result of the operation, since it isn't ready yet. Instead you have to wait until it is done and then send it to the caller through some other channel. This often results in designs that are needlessly complicated and cumbersome to work with.

Callbacks

The most common approach is perhaps to use callbacks. You make the asynchronous request and when it completes the callback is called. The callback can either be a global system-wide callback, or (which is nicer) a callback that you supply when you make the asynchronous call.

leaderboard->set_score(100, set_score_cb, my_user_data);

void set_score_cb(SetScoreResult *result, void *user_data)
{
   ...
}

I have already mentioned in a previous article that I'm not too fond of callbacks and that I prefer polling in most cases. Badly designed polling can be expensive, but in the case of asynchronous network operations we wouldn't expect to have more than a dozen or so in-flight at any one time, which means the cost of polling is negligible.

Callbacks tend to make code worse. There are several reasons.

First, you usually have little control over when a callback happens. This means that it can happen at a time that isn't very suitable to you. For cleanliness, you may want to do all your leaderboard processing in your update_leaderboard() function. But the callback might be called outside update_leaderboard(), messing up all your carefully laid plans.

Second, it can be tricky to know what you can and cannot do in a callback. The code that calls you might make some assumptions that you inadvertently violate. These things can sometimes be really tricky to spot. Consider something as simple as:

int n = _leaderboard_operations.size();
for (int i=0; i!=n; ++i) {
 if (done(_leaderboard_operations[i]))
  do_callback(_leaderboard_operations[i]);
}

This looks perfectly innocent. But if the callback happens to do something that changes the _leaderboard_operations vector, for example by posting a new request or removing an old one, the code can blow up with memory access errors. I have been bitten by things like this many times. By now, every time I see a callback a warning clock goes off in my head: "danger, danger -- there is a callback here, remember that when you make a callback anything can happen".

Sometimes it can be necessary to double buffer data to get rid of bugs like this.

Third, callbacks always happen in the wrong context. You get the callback in some "global", "top-level" context, and from there you have to drill down to the code that actually knows what to do with the information. (Typically by casting the user_data pointer to some class and calling a member function on it.) This makes the code hard to follow.

In other words, callbacks lead to hard-to-read code, hard-to-follow code flow, subtle bugs, redundant boilerplate forwarding stubs and instruction cache misses. Bleh!

Request objects

Another common approach is to have some sort of request object that represents the asynchronous operation. Something like:

SetScoreRequest *request = _leaderboard->set_score(100);
...
if (request->is_done()) {
 bool success = request->result();
 delete request;
}

Or perhaps, using the C++11 concepts of promises and futures (I have only a passing acquaintance with C++11, so forgive me if I mess something up):

std::promise<bool> *promise = new std::promise<bool>();
_leaderboard->set_score(100, promise);
...
std::future<bool> future = promise->get_future();
if (future.valid()) {
 bool success = future.get();
 _leaderboard->forget_promise(promise);
 delete promise;
}

This is a lot better than the callback approach, but still in my view, overly complicated. It is clearly a design based on the object-oriented philosophy of -- when in doubt, make more objects.

But these extra objects don't really do much. They just act as pointless intermediaries that pass some information back and forth between our code and the _leaderboard object. And they are a hassle for the caller to keep track of. She must store them somewhere and make sure to delete them when she is done to avoid memory leaks.

Furthermore, if we want to expose this API to a scripting language, such as Lua, we have to expose these extra objects as well.

ID tokens

As readers of my previous articles know, I'm a big fan of using IDs. Instead of exposing internal system objects to the caller of an API, I prefer to give the caller IDs that uniquely identifies the objects and provide functions for obtaining information about them.

That way, I am free to organize my internal data however I like. And it is easier to see when the state of my objects might mutate, since all calls go through a single API.

With this approach the interface would look something like this:

unsigned set_score(int value);
enum SetScoreResult {SSR_IN_PROGRESS, SSR_SUCCESS, SSR_FAILURE};
SetScoreResult set_score_result(unsigned id);

Note that there are no objects that the user must maintain and release. The ID can easily be manipulated by a scripting layer. If the user doesn't need to know if the operation succeeded, she can just throw away the returned ID.

In this API I don't have any method for freeing tokens. I don't want to force the user to do that, since it is both a hassle (the user must track all IDs and decide who owns them) and error prone (easy to forget to release an ID).

But obviously, we must free tokens somehow. We can't store the results of the set_score() operations forever. If we did, we would eventually run out of memory.

There are several ways you could approach this problem. My preferred solution in this particular case is to just have a fixed limit on the number of operations that we remember. Since we don't expect more than a dozen simultaneous operations, if we make room for 64, we have plenty of slack and still use only 64 bytes of memory. A modest amount by any standard.

We can keep the results in a round-robin buffer:

/// Maximum number of requests whose result we remember.
static const int MAX_IN_FLIGHT = 64;

/// The result of the last MAX_IN_FLIGHT requests.
char results[MAX_IN_FLIGHT];

/// Number of requests that have been made.
unsigned num_requests;

SetScoreResult set_score_result(unsigned id)
{
 // If more than MAX_IN_FLIGHT requests have been made after this one,
 // the information about it is lost.
 if (num_requests - id > MAX_IN_FLIGHT)
  return SSR_NO_INFORMATION;

 return results[id % MAX_IN_FLIGHT];
}

This means that you can only ask about the result of the last 64 operations. On the other hand, this solution uses very little memory, does not allocate anything, has very quick lookups and doesn't require the user to explicitly free tokens.

To me, this added simpleness and flexibility outweighs the disadvantage of having a limit on the maximum number of in flight operations that we support.

Implicit APIs

In many cases, the best solution to asynchronous conundrums is to redesign the API to abstract away the entire concept of asynchronous operations, so that the user doesn't even have to bother with it.

This can require some creative rethinking in order to focus on what it is the user really wants to do. For example, for our example, we might come up with this:

/// Sets the score to the specified value. This is an asynchronous operation.
/// You can use acknowledged_score() to find out when it has completed.
void set_score(int score);

/// Returns the last score that has been acknowledged by the server.
int acknowledged_score();

This is probably all that the user needs to know.

Now we have really simplified the API. The user still needs to be aware that set_score() isn't propagated to the server immediately, but she doesn't at all have to get involved in what asynchronous operations are performed and how they progress.

This kind of radical rewrite might not be possible (or even desirable) for all asynchronous systems. You have to balance the value of high-level abstractions and simplifications against the need for low-level control. But it is almost always worth exploring the possibility since it can lead to interesting ideas and dramatically simplified APIs.

For example, the interface for an asynchronous web fetcher could be as simple as:

const char *fetch(const char *url);

If called with an URL that hadn't been fetched yet, the function would issue a request for the URL and return NULL. Once the data was available, the function would return it. On the next call, the data would be freed. To fetch a web page, you would just repeatedly call the function with an URL until you got a reply.

Quite fetching, wouldn't you say?

Tuesday, July 3, 2012

Matrices, Rotation, Scale and Drifting

If you are using Matrix4x4s to store your node transforms and want to support scaling you are facing an annoying numerical problem: rotating a node causes its scale to drift from the original value.

The cause of drifting

Drifting happens because in a Matrix4x4 the rotation and the scale are stored together in the upper left 3x3 part of the matrix:

This means that if we want to change the rotation of a Matrix4x4 without affecting the scale we must extract the scale and reapply it:

void set_rotation(Matrix4x4 &pose, const Quaternion &rot)
{
     Vector3 s = scale(pose);
     Matrix3x3 rotm = matrix3x3(rot);
     scale(rotm, s);
     set_3x3(pose, rotm);
}

The problem here is that since floating point computation is imprecise, scale(pose) is not guaranteed to be exactly the same before this operation as after. Numerical errors will cause a very small difference. So even though we only intended to rotate the node we have inadvertently made it ever so slightly bigger (or smaller).

Does it matter? Sure, it is annoying that an object that we didn't want to have any scaling at all suddenly has a scale of 1.0000001, but surely such a small change would be impercievable and couldn't affect gameplay.

True, if we only rotated the object once. However, if we are dealing with an animated or spinning object we will be changing its rotation every frame. So if the error is 0.0000001 the first frame, it might be 0.0000002 the second frame and 0.0000003 the third frame.

Note that the error growth is linear rather than geometric because the error in each iteration is proportional to the current scale, not to the current error. I. e., to (1 + e) rather than e. We can assume that 1 >> e, because otherwise we already have a clearly visible error.

I ran a test using our existing math code. Rotating a transform using the method described above yields the following result:

Error Frames Time (at 60 Hz)
0.000001 202 3 s
0.000002 437 7 s
0.000005 897 15 s
0.000010 1654 28 s
0.000020 3511 58 s
0.000050 8823 2 min
0.000100 14393 4 min
0.000200 24605 7 min
0.000500 52203 15 min
0.001000 100575 28 min

As you can see, after 28 minutes we have an error of 0.1 %. At this point, it starts to get noticeable.

You could debate if this is something that needs fixing. Maybe you can live with the fact that objects grow by 0.1 % every half hour, because your game sessions are short and the small scale differences will never be noted. However, since Bitsquid is a general purpose engine, we need a better solution to the problem.

At this point, you might be asking yourself why this problem only happens when we introduce scaling. Don't we have the same issue with just translation and rotation? No, because translation and rotation are stored in completely separate parts of the matrix:

Setting the rotation doesn't touch any of the position elements and can't introduce errors in them, and vice versa.

Solutions to scale drifting

I can think of four possible solutions to this problem:

  • Store rotation and scale separately

  • Always set rotation and scale together

  • Quantize the scale values

  • Prevent systematic errors

Solution 1: Store rotation and scale separately

The root cause of the problem is that rotation and scale are jumbled together in the Matrix4x4. We can fix that by separating them. So instead of using a Matrix4x4 we would store our pose as:

struct Pose {
      Vector3 translation;
      Matrix3x3 rotation;
      Vector3 scale;
};

With the pose stored like this, changing the rotation does not touch the scale values, so we have eliminated the problem of drifting.

Note that this representation is actually using slightly less memory than a Matrix4x4 -- 15 floats instead of 16. (We could reduce the storage space even further by storing the rotation as a quaternion, but then it would be more expensive to convert it to matrix form.)

However, the representation is not as convenient as a Matrix4x4. We can't compose it or compute its inverse with regular matrix operations, as we can do for a Matrix4x4. We could write custom operations for that, or we could just convert this representation to a temporary Matrix4x4 whenever we needed those operations.

Converting to a Matrix4x4 requires initializing the 16 floats (some with values from the pose) and 9 floating point multiplications (to apply the scale). What kind of a performance impact would this have?

I would guess that the part of the codebase that would be most affected would be the scene graph local-to-world transformation. With this solution, you would want to store the local transform as a Pose and the world transform as a Matrix4x4. The local-to-world transform requires about 36 multiplications and 36 additions (says my quick estimate). So adding a temp Matrix4x4 conversion would take you from 72 to 81 FLOPS.

So a very rough estimate is that this change would make your scene graph transforms about 12 % more expensive. Likely, the real value is less than that since you probably have additional overhead costs that are the same for both methods. And of course, the scene graph transforms are just one small (and parallelizable) part of what your engine does. We rarely spend more than 2 % of our frame time there, meaning the total performance hit is something like 0.2 %.

I think that is a quite reasonable price to pay for a neat solution to the problem of drifting, but you may disagree of course. Also, perhaps the use of Matrix4x4s is so ingrained in your code base that it is simply not possible to change it. So let's look at the other possible solutions.

Solution 2: Always set rotation and scale together

The fundamental problem with set_rotation() is that we try to change just the orientation of the node without affecting the scale. Extracting the scale and reapplying it is what causes the drifting.

If we don't allow the user to just change the rotation, but force him to always set the scale and the rotation together, the problem disappears:

void set_rotation_and_scale(Matrix4x4 &pose, const Quaternion &rot, const Vector3 &s)
{
    Matrix3x3 rotm = matrix3x3(rot);
    scale(rotm, s);
    set_3x3(pose, rotm);
}

Since we have eliminated the step where we extract the scale and feed it back, we have rid ourselves of the feedback loop that caused runaway drifting. Of course, we haven't completely eliminated the problem, because nothing prevents the user from emulating what we did in set_rotation() and recreating the feedback loop:

Vector3 s = scale(pose);
set_rotation_and_scale(pose, new_rotation, s);

Now the drifting problem is back with a vengeance, reintroduced by the user.

To prevent drifting the user must take care not to create such feedback loops. I.e., she can never extract the scale from the matrix. Instead she must store the scale at some other place (separate from the matrix) so that she can always feed the matrix with the correct scale value.

What we have done is essentially to move the burden of keeping track of the scale of objects from the transform (the Matrix4x4) to the user of the transform. This prevents drifting and doesn't have any performance costs, but it is pretty inconvenient for the user to have to track the scale of objects manually. Also, it is error prone, since the user who is not 100 % certain of what she is doing can accidentally recreate the feedback loop that causes drifting.

Solution 3: Quantize the scale values

If none of the two options presented so far seem palpable to you, there is actually a third possibility.

Consider what would happen if we changed the Vector3 scale(const Matrix4x4 &) function so that it always returned integer values.

Calling set_rotation() as before would introduce an error to the scale and set it to, say 1.0000001. But the next time we ran set_rotation() and asked for the scale it would be rounded to the nearest integer value, so it would be returned as 1 -- the correct value. Applying the new rotation would again introduce an error and change the value to 1.0000001, but then again, the next time the function ran, the value returned would be snapped back to 1.

So by rounding the returned scale to fixed discrete values we prevent the feedback loop. We still get small errors in the scale, but without the runaway effect they are unnoticeable. (Small errors occur everywhere, for example in the scene graph transforms. That's the nature of floating point computation. It is not the small errors that are the problem but the mechanisms that can cause them to result in visible effects.)

Of course, if we round to integer values we can only scale an object by 1, 2, 3, etc. Not by 0.5, for instance. But we can fix that by using some other set of discrete numbers for the scale. For example, we could round to the nearest 0.0001. This would let us have scales of 0.9998, 0.9999, 1.0000, 1.0001, 1.0002, … Hopefully that is enough precision to cover all the different scales that our artists might want to use.

Drifting won't happen in this scheme, because the floating point errors will never be big enough to change the number to the next discrete value. (Unless you used really large scale values. If you want to support that -- typically not interesting, because things like texture and vertex resolution start to look wonky -- you could use a geometric quantization scheme instead of an arithmetic one.)

Snapping the scale values in this way might be OK for static scaling. But what if you want to smoothly change the scaling with an animation? Won't the discrete steps cause visible jerks in the movement?

Actually not. Remember that it is only the value returned by scale() that is quantized, the user is still free to set_scale() to any non-quantized value. When the scale is driven by an animation, it is fed from an outside source. We don't need to read it from the matrix and reapply it. So the quantization that happens in scale() never comes into play.

So amazingly enough, this hacky solution of snapping the scale to a fixed set of discrete values actually seems to work for most real world problems. There might be situations where it would cause trouble, but I can't really come up with any.

Solution 4: Prevent systematic errors

A final approach is to try to address how the numbers are drifting instead of stopping them from drifting. If you look at the table above you see that the errors are growing linearly. That is not what you would expect if the errors were completely random.

If the errors in each iteration were completely random, you would get a random walk process where the total error would be e * sqrt(N) rather than e * N, where e is the error from one iteration and N the number of iterations. The fact that the error grows linearly tells us that our computation has a systematic bias -- the error is always pushed in one particular direction.

If we could get rid of this systematic bias and get a truly random error, the accumulated error would grow much more slowly, the square root makes all the difference. For example, for the error to grow to 0.1 % it would take 5.2 years rather than 28 minutes. At that point, we might be ok with the drifting.

I haven't thought that much about what would be needed to get rid of the systematic bias in the set_rotation() function. It's a pretty tricky problem that requires a deep understanding of what happens to all the floating point numbers as they travel through the equations.

Conclusion

In the Bitsquid engine we have so far gone with #2, as a make-shift until we decided on the best permanent solution to this problem. After reviewing the options in this article I think we will most likely go with #1. #3 is an interesting hack and I think it would work almost everywhere, but I'm willing to pay the slight performance price for the cleaner and clearer solution of #1.

Tuesday, June 19, 2012

Hack Day Report

Last Friday, we had our second hack day (aka do-what-you-want day, aka google day) at the office.

Different companies seem to take different approaches to hack days. At some places it just means that you can spend a certain percentage of your working week on your own projects. We wanted something that was a bit more focused and felt more like a special event, so we used the following approach:

  • People were encouraged to pick tasks that could be completed, or taken to a "proof-of-concept" level in a single day. The goal was that at the end of the day you should have something interesting to show/tell your colleagues.

  • It is ok to fail of course. Failure is often interesting. Trying crazy ideas with a significant risk of spectacular failure is part of the charm of a hack day.

  • A couple of days before the event, everbody presented their projects. The idea was to get everybody to start thinking about the topics, so that we could help each other with ideas and suggestions.

  • We ate breakfast together in the morning to start the discussions and get everybody in the spirit of the event. At the end of the day, we rounded off with a couple of beers.

  • We avoided Skype, email and meetings during the day, so that we could focus 100 % on the projects.

  • A couple of days after the events we had a small show & tell, where everybody could present what they had learned.

Results

A number of interesting projects came out of this hack day:

  • Tobias and Mats created an improved highlighting system for indicating selected objects in the level editor. (Highlighting the OOBB works well for small objects, but for big things like landscapes and sub-levels, it is just confusing.)

  • Jim looked into a cross-platform solution for capturing screen shots and videos on target machines and transmitting them over the network.

  • Andreas created a Lua profiling tool, that can dynamically enable and disable profiling for any Lua function by hot-patching the code with profiler calls.

  • Finally, I rewrote the collision algorithm for our particle systems.

Being an egotistical bastard, I will focus on my own project.

Particle collision is one of those annoying things that it is difficult to find a good general solution to, for two reasons:

  • It ties together two completely different systems (particles and physics), creating an ugly coupling between them. Since the solution must have decent performance, the coupling must be done at a fairly low level, which makes it even worse.

  • Particles can have very different collision requirements. Some effects need a massive amount of particles (e. g., sparks), but don't care that much about collision quality. As long as most of them bounce somewhat accurately, it is OK. Other effects may have just a single particle (e. g., a bullet casing). Performance doesn't matter at all, but if it doesn't bounce right you will surely notice. Handling both effects in the same system is a challenge. Having different systems for different effects is another kind of challenge.

My previous attempts at implementing particle collision have all been based on first cutting out a slice of the physics world around the particle effect and then trying to find a fast representation of the collision shapes in that world slice.

The problem with this approach is that there are a lot of variables to tweak and tune:

  • How big should the world slice be?

  • How much detail should there be in the simplified representation? More detail is slower, but gives better collision results.

  • What kind of representation should we use?

  • How should we handle dynamic/moving objects? How often should the world slice be updated?

I've tried a lot of different representations: a triangle soup, a collection of half-spheres, a height field, but none of them has given completely satisfactory results. Often, parameters that work for one effect at one location fail for a different effect at a different location. Both performance and behavior are hard to predict.

The main idea for the new approach came from a Naughty Dog presentation at GDC. Instead of trying to create a shared collision model for all particles, we give each particle its own collision model, and we store it inside the particle itself, together with the other particle data.

Of course, it would be expensive to store a complicated collision model inside every particle, so we use the simplest model possible: a plane. We can represent that by a normal and an offset from origin. So with this approach, the data for a particle might look something like this:

struct Particle {
 Vector3 position;
 Vector3 velocity;
 Color8 color;
 Vector3 collision_plane_normal;
 float collision_plane_offset;
};

(Side note: Our particle data doesn't actually look like this, we use a "structure-of-arrays" approach rather than an "array-of-structures" and we don't have a fixed set of fields, each effect has its own set.)

Note that we don't bother with any flag for indicating whether there is plane or not. If there is no collision, we just put the collision plane far enough below the origin.

With this approach the collision test is super fast -- just a dot product and a compare. It is also really easy to parallelize the test or run it off-CPU, since it just uses local particle data and doesn't need to access any shared memory.

With this method we have divided the original collision problem into two simpler ones:

  • Collision test against a plane. (Trivial.)

  • Finding a suitable collision plane for each particle.

This means that if we want to, we can use different approaches for finding the collision planes for different effects. E.g., for static effects we could hard code the collision plane and avoid collision queries completely.

Generally, we can find a suitable collision plane for a particle by raycasting along its trajectory. If we didn't have any performance constraints, we could do a raycast for every particle every frame. That way we would always know what surface the particle would hit next, which means that we would get perfect collision behavior.

Of course, we can't actually do that. Raycasts are comparatively expensive and we want to be able to support large numbers of particles.

To control the performance, I exposed a parameter that lets the effect designer control how many raycasts per frame an effect is a allowed to make. A typical value of 1.0 means that every frame, one particle in the effect is picked at random, a raycast is performed along that particles trajectory and its collision plane is updated with the result.

Note that with this solution, the work is always evenly distributed over the duration of the effect. That is a lot nicer than what you typically get with the "world slice" approach where there is a big chunk of work in the beginning when you cut out the world slice.

Astute readers will have noticed a fatal flaw with the design as it has been presented so far: it can't possibly work for very many particles. If we have an effect with 1 000 particles and do a raycast every frame, it will take 33 seconds before every particle has found its collision normal. By then, they will long since have fallen through the floor.

So, if we want to use this approach for large numbers of particles we must be able to somehow reuse the collision results. Typically, an effect will have bundles of particles traveling in approximately the same direction. When one such particle has done a raycast and found a collision, we want to be able to share the result with its neighbors somehow.

I wanted to find a solution to this without having to create a complicated collision representation, because that would bring back many of the problems I had with the "world slice" approach. Eventually, I decided that since what we want to do is to cache a collision query of the form:

(position, direction) -> collision_plane

The simplest possible thing would be to store the results in a hash. Hashes are nice, predictable data structures with well known performance characteristics.

To be able to hash on position and direction we must quantize them to integer values. We can quantize the position by dividing the world into cells of a certain width and height:

 const float cell_side = 0.5f;
 const float cell_height = 2.0f;
 int ix = position.x / cell_side;
 int iy = position.y / cell_side;
 int iz = position.z / cell_height;
 uint64 key = HASH_3(ix, iy, iz);

In this example, I use a higher resolution along the xy-axes than along the z-axes, because typically that is where the more interesting features are. HASH_3() is a macro that performs the first three rounds of the murmur_hash algorithm.

To quantize the direction we can use a similar approach. I decided to quantize the direction to just six different values, depending on along which principal axis the particle is mostly traveling:

 unsigned id;
 if (fabsf(dir.x) >= fabsf(dir.y) && fabsf(dir.x) >= fabsf(dir.z))
  id = dir.x > 0 ? 0 : 1;
 else if (fabsf(dir.y) >= fabsf(dir.z))
  id = dir.y > 0 ? 2 : 3;
 else
  id = dir.z > 0 ? 4 : 5;
 key = key ^ id;

Now that we have computed a quantized representation of (position, direction), we can use that as lookup value into our hash, both for storing and fetching values:

 struct CollisionPlane {
  Vector3 normal;
  float offset;
 };
 HashMap<uint64, CollisionPlane> _cache;

(Side note: Unless I'm worried about hash function collisions, I prefer to hash my keys before I insert them in the HashMap and just use a HashMap<uint64,T> instead of HashMap<MyComplicatedKeyStruct,T>. That way the hash map uses less memory and lookups can be done with a simple modulo operation.)

Whenever I do a particle raycast I store the result in the cache. When particles are spawned they lookup their collision plane in the cache. Particles also query the cache every time they bounce, since that typically means they will be traveling in a new direction.

I have a maximum size that the cache is allowed to use. When the cache reaches the maximum size, older entries are thrown out.

Results

The system gives high quality results for effects with few particles (because you get lots of raycasts per particle) and is still able to handle massive amounts of particles. The performance load is evenly distributed and it doesn't need any special cases for dynamic objects.

There are some drawbacks. The cache requires some tweaking. Since it can only store one collision plane for each quantization cell it will miss important features if the cells are too big. On the other hand, if the cells are too small, we need lots of entries in the cache to represent the world, which means more memory and slower lookups.

Since we only have one collision normal per particle, there are some things that the particles just can't do. For example, they can never come to rest at the bottom of a V-shape, because they will always only be colliding with one of the planes in the V. Overall, they will behave pretty badly in corners, where several collision planes with different normals meet. Some of these issues could be fixed by storing more than one collision plane in the particle, but I don't think it is worth it. I prefer the simpler approach and having particles that in some tricky situations can fall through the ground.

Compared to the old collision code, the new code is simpler, runs faster and looks better.

All in all, I would say that the hack day was a success. We had great fun and produced some useful stuff. We will definitely do more days like this in the future. Not too often though. I think it is important that these days feel like a special treat. If they become too mundane, something important is lost. Once a month or so, would be ideal, I think.

Sunday, May 20, 2012

Playing (with) Video

So you want to play some video? Shouldn't be too hard, right? Just download some video playing library and call the play_video() function. Easy-peasy-lemon-squeezy.

Well, you have to make sure that the video is encoded correctly, that the library works on all platforms and plays nice with your memory, file, sound and streaming abstractions, and that the audio and video doesn't desynchronize, which for some inexplicable reason seems to be a huge problem.

But this is just technical stuff. We can deal with that. What is worse is that video playback is also a legal morass.

There are literally thousands of broad patents covering different aspects of video decompression. If you want to do some video coding experiments of your own you will have to read, understand and memorize all these patents so that you can carefully tip-toe your code and algorithms around them.

Of course, if you had a big enough pool of patents of your own you might not have to care as much, since if someone sued you, you could sue them right back with something from your own stockpile. Mutually assured destruction through lawyers. Ah, the wonderful world of software patents.

So, creating your own solution is pretty much out of the question. You have to pick one of the existing alternatives and do the best you can with it. In this article I'm going to look at some different options and discuss the advantages and drawbacks of each one:

  • Just say no

  • Bink

  • Platform specific

  • H.264

  • WebM

There are other alternatives that didn't make it to this list, such as Dirac, Theora, and DivX. I've decided to focus on these five, since in my view H.264 is the best of the commercial formats and WebM the most promising of the "free" ones.

An initial idea might be: Why not just do whatever it is VLC does? Everybody's favorite video player plays pretty much whatever you throw at it and is open source software.

Unfortunately that doesn't work, for two reasons. First, VLC:s code is a mix of GPL and LGPL stuff. Even if you just use the LGPL parts you will run into trouble on platforms that don't support dynamic linking. Second, the VLC team doesn't really care about patents and just infringe away. You can probably not afford to do the same. (As a result, there is a very real threat that VLC might be sued out of existence.)

A quick introduction

Before we start looking at the alternatives I want to say something short about what a video file is, since there is some confusion in the matter, even among educated people.

A video file has three main parts:

  • Video data (H.264, DivX, Theora, VP8, ...)

  • Audio data (MP3, AAC, Vorbis, ...)

  • A container format (Avi, Mkv, MP4, Ogg, ...)

The container format is just a way of packing together the audio and video data in a single file, together with some additional information.

The simplest possible container format would be to just concatenate the audio data to the video data and be done with it. But typically we want more functionality. We want to be able to stream the content, i. e. start playing it before we have downloaded the whole file, which means that audio and video data must be multiplexed. We also want to be able to quickly seek to specific time codes, so we may need an index for that. We might also want things like audio tracks in different languages, subtitling, commentary, DVD menus, etc. Container formats can become quite intricate once you start to add all this stuff.

A common source of confusion is that the extension of a video file (.avi, .mkv, .mp4, .ogg) only tells you the container format, not the codecs used for the audio and video data in the container. So a video player may fail to play a file even though it understands the container format (because it doesn't understand what's inside it).

Option 1: Just say no

Who says there has to be video in a game? The alternative is to do all cut scenes, splash screens, logos, etc in-game and use the regular renderer for everything. As technology advances and real-time visuals come closer and closer in quality to offline renders, this becomes an increasingly attractive option. It also has a number of advantages:

  • You can re-use the in-game content.

  • Production is simpler. If you change something you don't have to re-render the entire movie.

  • You don't have to decide on resolution and framerate, everything is rendered at the user's settings.

  • You can dynamically adapt the content, for example dress the players in their customized gear.

  • Having everything be "in-game visuals" is good marketing.

If I was making a game I would do everything in-game. But I'm not, I'm making an engine. And I can't really tell my customers what they can and cannot do. The fact is that there are a number of legitimate reasons for using video:

  • Some scenes are too complex to be rendered in-game.

  • Producing videos can be simpler than making in-game content, since it is easier to outsource. Anybody can make a video, but only the core team can make in-game content and they may not have much time left on their hands.

  • Playing a video while streaming in content can be used to hide loading times. An in-game scene could be used in the same way, but a high-fidelity in-game scene might require too much memory, not leaving enough for the content that is streaming in.

As engine developers it seems we should at least provide some way of playing video, even if we recommend to our customers to do their cutscenes in-game.

Option 2: Bink

Bink from RAD game tools is as close as you can get to a de facto standard in the games industry, being used in more than 5800 games on 14 different platforms.

The main drawback of Bink is the pricing. At $ 8500 per platform per game it is not exactly expensive, but for a smaller game targeting multiple platforms that is still a noticeable sum.

Many games have quite modest video needs. Perhaps they will just use the video player for a 30 second splash screen at the start of the game and nothing more. Paying $ 34 000 to get that on four platforms seems excessive.

At Bitsquid our goal has always been to develop an engine that works for both big budget and small budget titles. This means that all the essential functionality of an engine (animation, sound, gui, video, etc) should be available to the licensees without any additional licensing costs (above what they are already paying for an engine). Licensees who have special interest in one particular area may very well choose to integrate a special middleware package to fulfill their needs, but we don't want to force everybody to do that.

So, in terms of video, this means that we want to include a basic video player without the $ 8500 price tag of Bink. That video player may not be as performant as Bink in terms of memory and processor use, but it should work well enough for anyone who just wants to play a full screen cutscene or splash screen when the CPU isn't doing much else. People who want to play a lot of video in CPU taxing situations can still choose to integrate Bink. For them, the price and effort will be worth it.

Option 3: Platform specific

One approach to video playing is to not develop a platform-independent library but instead use the video playing capabilities inherent in each platform. For example, Windows has Windows Media Foundation, MacOS has QuickTime, etc.

Using the platform's own library has several advantages. It is free to use, even for proprietary formats, because the platform manufacturers have already payed the license fees for the codecs. (Note though, that for some formats you need a license not just for the player, but for the distribution of content as well.) The implementation is already there, even if the APIs are not the easiest to use.

The biggest advantage is that on low-end platforms, using the built-in platform libraries can give you access to special video decoding hardware. For example, many phones have built-in H.264 decoding hardware. This means you can play video nearly for free, something that otherwise would be very costly on a low-end CPU.

But going platform specific also has a lot of drawbacks. If you target many platforms you have your work cut out for you in integrating all their different video playing backends. It adds an additional chunk of work that you need to do whenever you want to add a new platform. Furthermore, it may be tricky to support the same capabilities on all different platforms. Do they all support the same codecs, or do you have to encode the videos specifically for each platform? Do all platforms support "play to texture" or can you only play the videos full screen? What about the sound? Can you extract that from the video and position it as a regular source that reverbs through your 3D sound world? Some platforms (i.e. Vista) have almost no codecs installed by default, forcing you to distribute codecs together with your content.

Since we are developing a generic engine we want to cover as many platforms as possible and minimize the effort required to move a project from one platform to another. For that reason, we need a platform independent library as the primary implementation. But we might want to complement it with platform specific libraries for low end platforms that have built-in decoding hardware.

Option 4: H.264 (MPEG-4, AVC)

Over the last few years H.264 has emerged as the most popular commercial codec. It is used in Blu-ray players, video cameras, on iTunes, YouTube, etc. If you want a codec with good tool support and high quality, H.264 is the best choice.

However, H.264 is covered by patents. Patents that need to be licensed if you want to use H.264 without risking a lawsuit.

The H.264 patents are managed by an entity known as MPEG LA. They have gathered all the patents that they believe pertain to H.264 in "patent pool" that you can license all at once, with a single agreement. That patent pool contains 1700 patents. Yes, you read that right. The act of encoding/decoding a H.264 file is covered by 1700 patents. You can find the list in all its 97 page glory at http://www.mpegla.com/main/programs/avc/Documents/avc-att1.pdf.

I am not a lawyer, as they say on Slashdot, but this is my best understanding of how this patent game works:

  • Buying a license from MPEG LA gives you the right to use the 1700 patents in the pool.

  • This doesn't mean you can't be sued for patent infringement. Anyone that holds a patent which is not one of the 1700 in the pool could claim that H.264 infringes on it and sue you. That seems unlikely, MPEG LA has made an effort to gather all relevant patents, but there is no way to be certain.

  • MPEG LA doesn't by itself go after people who use H.264 without a license, that is up to the holders of the 1700 patents in the pool.

The licensing terms of H.264 are irritating, but not necessarily a big financial burden:

  • If you distribute an encoder or decoder you can distribute 100 000 copies for free, then you have to pay $ 0.20 per unit.

  • If you distribute a H.264 encoded movie, it is free if it is shorter than 12 minutes, then you have to pay $ 0.02 per copy.

Note that unlike the case with other popular codecs such as MP3, it is not just the decoder/encoder that you need to license, you also need a license just for distributing H.264 content.

From what I've been able to discern, but don't take my word for it, a game that only plays a fixed set of movies/cutscenes would not be regarded as a general decoder (even though it contains decoding software), but rather as content, which means you would pay $0.02 per copy sold if you had more than 12 minutes of video in your game and nothing otherwise (you would still need to obtain a license though).

Of course, if you support H.264, you may also want to support AAC, the standard audio format that accompanies it. AAC is covered by a separate licensing body (Via Licensing) that has its own licensing terms. I haven't investigated them in any great detail.

You have to decide for yourself how well these terms sit with you. At Bitsquid we finally decided that if we should have a standard video playing facility, it should be one that people could use without thinking too much about patents and licensing (to the extent that is possible).

Option 5: VP8 (WebM)

VP8 is a "free" video codec owned by Google. It is covered by patents, but Google has granted free use of those patents and also provides a BSD licensed library libvpx for encoding and decoding video files. The format is also endorsed by the Free Software Foundation.

It is generally acknowledged that when it comes to quality, VP8 is not quite as good as H.264, though the differences are not enormous. So you are trading some visual quality for the convenience of a license free format.

There has been some discussion (most of it about a year ago) about whether VP8 really is unencumbered by patents. MPEG LA claimed that it knew several patents that VP8 was infringing and showed interest in creating a "patent pool" for VP8 similar to the one it holds for H.264. Nothing has come of that yet and MPEG LA has not disclosed which patents it thinks VP8 is infringing, which means that Google cannot really respond to the allegations. It is hard to know how much stock to put in this and whether anything will come of it.

You could argue that with this potential "threat" against VP8, it would be better to use another "free" alternative such as Dirac or Theora. However, there is not much evidence that they would fare better. Everyone who makes a "free" codec tries their best to make sure that they don't infringe on any patents. But with thousands of these patents around, each open to legal interpretation, there is just no way to be sure.

This is just the sad state of affairs of software patents. And you are not safe with the commercial formats either. Even if you have licensed the 1700 patents in the MPEG LA patent pool, someone can still sue you for violating patent number 1701. No one in this business offers indemnification against patent suits. Not Google. Not MPEG LA. Not Bink (I think).

It all becomes a question of risk. Bink has been around a long time without being sued, which is reassuring. VP8 hasn't been around that long.

Will there be patent claims made against VP8? Maybe. Who knows. Similar threats were made against Theora, but nothing happened there. If it does happen, Google will most likely fight back and the whole thing will drag on in the courts. Will there be patent claims against you for using VP8? Seems extremely unlikely. Games are not interesting enough targets. Video decoding is not our main business, and we can easily switch technology if needed. Phone manufacturers, YouTube and TV companies are more interesting targets.

Do you have to care about any of this at all? Up to you to decide.

Our conclusion

None of these alternatives are really attractive, it is more about picking the least worst than finding the best, which is frustrating for a perfectionist with self-worth issues. Good cases can be made for all of them. This is what we have decided:

  • We will provide VP8 decoding on all platforms through libvpx. All other things equal, the "most free" format will give us most flexibility in the long run.

  • We will not (at least not right now) support matroska or other advanced container formats. Instead, we will play the videos from simple IVF streams. Sound will be played as Vorbis files through our regular sound system so we get positioning, reverb, etc.

  • If needed, we will complement this basic approach with platform-specific libraries that take advantage of the hardware decoders on low-end platforms (phones and tablets).

  • Customers that need to play a lot of movies while doing other CPU intense tasks and that aren't happy with the performance of libvpx are recommended to look into Bink.

  • Customers that are worried about "patent risk" with VP8 are recommended to do whatever their lawyers tell them to do. (Use Bink, a platform specific library, obtain a H.264 license or avoid video all together.)