Saturday, August 18, 2012

Cleaning bad code

Guess what! You've just inherited a stinking, steaming pile of messy old code. Congratulations! It's all yours.

Bad code can code can come from all kinds of places. Middleware, the internet, perhaps even your own company.

You know that nice guy in the corner that nobody had time to check up on? Guess what he was doing all that time. Churning out bad code.

Or remember that module someone wrote years ago, just before she left the company. That module that twenty different people have then added hacks, patches and bug fixes to, without really understanding what they were doing. Yup, that one.

Or what about that open source thing you downloaded that you knew was horrible, but it solved a very specific and quite hairy problem that would have taken you ages to do by yourself.

Bad code doesn't have to be a problem, as long as it's not misbehaving, and nobody pokes their bloody nose in it. Unfortunately, that state of ignorant bliss rarely lasts. A bug will be discovered. A feature requested. A new platform released. Now you have to dig into that horrible mess and try to clean it up. This article offers some humble advice for that unfortunate situation.

0. Is it worth doing?

The first thing you need to ask yourself is whether the code is worth cleaning. I'm of the opinion that when it comes to code cleaning you should either karate do "yes", or karate do "no". Either you assume full responsibility for the code and rework it until you end up with something that you are actually happy to maintain and proud to have in your codebase.

Or you decide that even though the code looks horrible, it isn't cost-effective to take time out of your busy schedule to fix it. So instead you just do the smallest change possible that solves your current problem.

In other words, you either regard the code as yours or theirs.

There are merits to both alternatives. Good programmers get an itch when they see bad code. They bring out their torches and pitchforks and chant: "Unclean! Unclean!" And that is a good instinct.

But cleaning code is also a lot of work. It is easy to underestimate the time it takes. It can be nearly as time consuming as rewriting the whole thing from scratch. And it doesn't bring any short term benefits. Two weeks cleaning code won't add any new features to the game, but it might give you some new bugs.

On the other hand, the long term effects of never cleaning your code can be devastating. Entropy is the code-killer.

So, never an easy choice. Some things to consider are:

  • How many changes do you expect to make to the code?

    Is it just this one small bug that you need to fix, or is this code that you expect to return to many times to tweak and tune and add new features. If it's just this one bug, then perhaps it is best to let sleeping dogs lie. However, if this is a module that you will need to mess around with a lot, then spending some time to clean it up now, will save a lot of headache later.

  • Will you need/want to import upstream changes?

    Is this an open source project that is under active development? If so, and you want to pull the changes made upstream you can't make any big changes to the code or you will be in merge hell every time you pull. So just be a nice team player, accept its idiosyncrasies and send patches with your bug fixes to the maintainer.

  • How much work is it?

    How many lines of code can you realistically clean in a day? An order of magnitude estimate says more than 100 and less than 10 000, so let's say 1 000. So if the module has 30 000 lines, you might be looking at a month of work. Can you spend that? Is it worth it?

  • Is it a part of your core functionality?

    If what the module does is something peripheral, like say font rendering or image loading, you might not care that it is messy. You might swap out the whole thing for something else in the future, who knows. But you should own the code that relates to your core competence.

  • How bad is it?

    If the code is just slightly bad, then perhaps you can live with it. If it is mind-numbingly, frustratingly incomprehensibly bad, then perhaps something needs to be done.

1. Get a test case

Seriously cleaning a piece of code means messing around with it a lot. You will break things.

If you have a decent test case with good coverage you will immediately know what has broken and you can usually quite quickly figure out what stupid mistake you just made. The time and anxiety this saves over the course of the cleaning process is just ridiculous. Get a test case. It's the first thing you should do.

Unit tests are best, but all code is not amenable to to unit testing. (Test fanatics, send your hate mail now!) If unit tests are too cumbersome, use an integration test instead. For example, fire up a game level and run the character through a specific set of actions related to the code you are cleaning.

Since such tests ate more time consuming, it might not make sense to run it after every change you make, which would be ideal. But as you put every single change you make into source control, it's not so bad. Run the test every once in a while (e.g., every five changes). When it discovers a problem you can do a binary search of those last few commits to find out which one caused the problem.

If you discover an issue that wasn't detected by your test, make sure that you add that to the test, so that you capture it in the future.

2. Use source control

Do people still have to be told to use source control? I sure hope not.

For cleaning work it is absolutely crucial. You will be making lots and lots of small changes to the code. If something breaks you want to be able to look back in the revision history and find out where it broke.

Also, if you are anything like me, you will sometimes start down a refactoring path (like removing a stupid class) and realize after a while that it wasn't such a good idea, or, that it was a good idea, but that everything would be a lot simpler if you did something else first. So you want to be able to quickly revert everything you just did and begin anew.

Your company should have a source control system in-place that allows you to do these changes in a separate branch and commit as much as you like without disturbing anybody else.

But even if it doesn't, you should still use source control. In that case, download mercurial (or git), create a new repository and put the code that you checked out of your company's stupid system there. Do your changes in that repository, committing as you go. When you are done you can merge everything back into the stupid system.

Cloning the repository into a sensible source control system only takes a few minutes. It is absolutely worth it. If you don't know mercurial, spend an hour to learn it. You will be happy you did. Or if you prefer, spend 30 hours to learn git instead. (I kid! Not really. Nerd fight now!)

3. Make one (small) change at a time

There are two ways of improving bad code: revolution and reform. The revolution method is to burn everything with fire and rewrite it from scratch. The reform method is to refactor the code with one small change at a time without ever breaking it.

This article is about the reform method. I'm not saying that revolutions never are necessary. Sometimes things are so bad that they just need to go. But people who get frustrated with the slow pace of reform and advocate revolution often fail to realize the full complexity of the problem and thus don't give the existing system enough credit for the things it does.

Joel Spolsky has written a classic article about this without falling into the trap of making strained political metaphors.

The best way of reforming code is to make one minimal change at a time, test it and commit it. When the change is small it is easier to understand its consequences and make sure that it doesn't affect the existing functionality. If something goes wrong, you only have a small amount of code that you need to check. If you start doing a change and realize that it is bad, you won't loose much work by reverting to the last commit. If you notice after a while that something has gone subtly wrong, a binary search in the revision history will let you find the small change that introduced the problem.

A common mistake is to do more than one thing at the same time. For example, while getting rid of an unnecessary level of inheritance you might notice that the API methods are not as orthogonal as you would like them to be and start to rearrange them. Don't! Get rid of the inheritance first, commit that and then fix the API.

Smart programmers organize the way they work so that they don't have to be that smart.

Try to find a path that takes you from what the code is now to what you want it to be in a sequence of small steps. For example, in one step you might rename the methods to give them more sane names. In the next, you might change some member variables to function parameters. Then you reorder some algorithms so that they are clearer. And so on.

If you start doing a change and realize that it was a bigger change than you originally thought, don't be afraid to revert and find a way of doing the same thing in smaller, simpler steps.

4. Don't clean and fix at the same time

This is a corollary to (3), but important enough to get its own point.

It is a common problem. You start to look at a module because you want to add some new functionality. Then you notice that the code is really badly organized, so you start reorganizing it at the same time as you are adding the new functionality.

The problem with this is that cleaning and fixing has diametrically opposite goals. When you clean, you want to make the code look better without changing its functionality. When you fix, you want to change its functionality to something better. If you clean and fix at the same time it becomes very hard to make sure that your cleaning didn't indadvertedly change something.

Do the cleaning first. Then, when you have a nice clean base to work with, add the new functionality.

5. Remove any functionality that you are not using

The time it takes to clean is proportional to the amount of code, its complexity and its messiness.

If there is any functionality in the code that you are currently not using and don't plan to be using in the foreseeable future -- get rid of it. That will both reduce the amount of code you will have to go through and its complexity (by getting rid of unnecessary concepts and dependencies). You will be able to clean faster and the end result will be simpler.

Don't save code because "who knows, you might need it some day". Code is costly -- it needs to be ported, bug checked, read and understood. The less code you have, the better. In the unlikely event that you do need the old code, you can always find it in the source repository.

6. Delete most of the comments

Bad code rarely has good comments. Instead, they are often:

// Pointless:

 // Set x to 3
 x = 3;

// Incomprehensible:

 // Fix for CB (aug)
 pos += vector3(0, -0.007, 0);

// Sowing fear and doubt:

 // Really we shouldn't be doing this
 t = get_latest_time();

// Downright lying:

 // p cannot be NULL here
 p->set_speed(0.7);

Read through the code. If a comment doesn't make sense to you and doesn't further your understanding of the code -- get rid of it. Otherwise you will just waste mental energy on trying to understand that comment on each future reading of the code.

The same goes for dead code that has been commented or #ifdef'ed out. Get rid of it. It's there in the source repository if you need it.

Even when comments are correct and useful, remember that you will be doing a lot of refactoring of the code. The comments may no longer be correct when you are done. And there is no unit test in world that can tell you if you have "broken the comments".

Good code needs few comments because the code itself is clearly written and easy to understand. Variables with good names do not need comments explaining their purpose. Functions with clear inputs and outputs and no special cases or gotchas require little explanation. Simple, well written algorithms can be understood without comments. Asserts document expectations and preconditions.

In many cases the best thing to do is just to get rid of all old comments, focus on making the code clear and readable, and then add back whatever comments are needed -- now reflecting the new API and your own understanding of the code.

7. Get rid of shared mutable state

Shared mutable state is the single biggest problem when it comes to understanding code, because it allows for spooky "action at a distance", where one piece of code changes how a completely different piece of code behaves. People often say that multithreading is difficult. But really, it is the fact that the threads share mutable state that is the problem. If you get rid of that, multithreading is not so complex.

Since your goal is to write high-performant software, you won't be able to get rid of all mutable state, but your code can still benefit enormously from reducing it as much as possible. Strive for programs that are "almost functional" and make sure you know exactly what state you are mutating where and why.

Shared mutable state can come from several different places:

  • Global variables. The classic example. By now everybody surely knows that global variables are bad. But note (and this is a distinction that people sometimes fail to make), that it is only shared mutable state that is problematic. Global constants are not bad. Pi is not bad. Sprintf is not bad.

  • Objects -- big bags of fun. Objects are a way for a large number of functions (the methods) to implicitly share a big bag of mutable state (the members). If a lazy programmer needs to pass some information around between methods, she can just make a new member that they can read and write as they please. It's almost like a global variable. How fun! The more members and the more methods an object has, the bigger this problem is.

  • Megafunctions. You have heard about them. These mythic creatures that dwell in the deepest recesses of the darkest codebases. Broken programmers talk about them in dusky bars, their sanity shattered by their encounters: "I just kept scrolling and scrolling. I couldn't believe my eyes. It was 12 000 lines long."

    When functions are big enough, their local variables are almost as bad as global variables. It becomes impossible to tell what effect a change to a local variable might have 2 000 lines further down in the code.

  • Reference and pointer parameters. Reference and pointer parameters that are passed without const can be used to subtly share mutable state between the caller, the callee and anyone else who might be passed the same pointer.

Here are some practical ideas for getting rid of shared mutable state:

  • Split big functions into smaller ones.

  • Split big objects into smaller ones by grouping members that belong together.

  • Make members private.

  • Change methods to be const and return the result instead of mutating state.

  • Change methods to be static and take their arguments as parameters instead of reading them from shared state.

  • Get rid of objects entirely and implement the functionality as pure functions without side effects.

  • Make local variables const.

  • Change pointer and reference arguments to const.

8. Get rid of unnecessary complexity

Unnecessary complexity is often a result of over-engineering -- where the support structures (for serialization, reference counting, virtualized interfaces, abstract factories, visitors, etc) dwarf the code that performs the actual functionality.

Sometimes over-engineering occurs because software projects start out with a lot more ambitious goals than what actually gets implemented. More often, I think, it reflects the ambitions/esthetics of a programmer who has read books on design patterns and the waterfall model and believes that over-engineering makes a product "solid" and "high-quality".

Often, the heavy, rigid, overly complex model that results is unable to adapt to feature requests that were not anticipated by the designer. Those features are then implemented as hacks, bolt-ons and backdoors on top of the ivory tower resulting in a schizophrenic mix of absolute order and utter chaos.

The cure against over-engineering is YAGNI -- you are not gonna need it! Only build the things that you know you need. Add more complicated stuff when you need it, not before.

Some practical ideas for cleaning out of unnecessary complexity:

  • Remove the functionality you are not using (as suggested above).

  • Simplify necessary concepts, and get rid of unneeded ones.

  • Remove unnecessary abstractions, replace with concrete implementations.

  • Remove unnecessary virtualization and simplify object hierarchies.

  • If only one setting is ever used, get rid of the possibility of running the module in other configurations.

9. That is all

Now go clean your room!

Saturday, August 4, 2012

A simpler design for asynchronous APIs

Accessing Internet services, e.g. to fetch a web page or to store data on a leaderboard, requires an asynchronous API. You send a request and then, at some later point, you receive a reply.

Asynchronous APIs are trickier to design than synchronous ones. You can't simply return the result of the operation, since it isn't ready yet. Instead you have to wait until it is done and then send it to the caller through some other channel. This often results in designs that are needlessly complicated and cumbersome to work with.

Callbacks

The most common approach is perhaps to use callbacks. You make the asynchronous request and when it completes the callback is called. The callback can either be a global system-wide callback, or (which is nicer) a callback that you supply when you make the asynchronous call.

leaderboard->set_score(100, set_score_cb, my_user_data);

void set_score_cb(SetScoreResult *result, void *user_data)
{
   ...
}

I have already mentioned in a previous article that I'm not too fond of callbacks and that I prefer polling in most cases. Badly designed polling can be expensive, but in the case of asynchronous network operations we wouldn't expect to have more than a dozen or so in-flight at any one time, which means the cost of polling is negligible.

Callbacks tend to make code worse. There are several reasons.

First, you usually have little control over when a callback happens. This means that it can happen at a time that isn't very suitable to you. For cleanliness, you may want to do all your leaderboard processing in your update_leaderboard() function. But the callback might be called outside update_leaderboard(), messing up all your carefully laid plans.

Second, it can be tricky to know what you can and cannot do in a callback. The code that calls you might make some assumptions that you inadvertently violate. These things can sometimes be really tricky to spot. Consider something as simple as:

int n = _leaderboard_operations.size();
for (int i=0; i!=n; ++i) {
 if (done(_leaderboard_operations[i]))
  do_callback(_leaderboard_operations[i]);
}

This looks perfectly innocent. But if the callback happens to do something that changes the _leaderboard_operations vector, for example by posting a new request or removing an old one, the code can blow up with memory access errors. I have been bitten by things like this many times. By now, every time I see a callback a warning clock goes off in my head: "danger, danger -- there is a callback here, remember that when you make a callback anything can happen".

Sometimes it can be necessary to double buffer data to get rid of bugs like this.

Third, callbacks always happen in the wrong context. You get the callback in some "global", "top-level" context, and from there you have to drill down to the code that actually knows what to do with the information. (Typically by casting the user_data pointer to some class and calling a member function on it.) This makes the code hard to follow.

In other words, callbacks lead to hard-to-read code, hard-to-follow code flow, subtle bugs, redundant boilerplate forwarding stubs and instruction cache misses. Bleh!

Request objects

Another common approach is to have some sort of request object that represents the asynchronous operation. Something like:

SetScoreRequest *request = _leaderboard->set_score(100);
...
if (request->is_done()) {
 bool success = request->result();
 delete request;
}

Or perhaps, using the C++11 concepts of promises and futures (I have only a passing acquaintance with C++11, so forgive me if I mess something up):

std::promise<bool> *promise = new std::promise<bool>();
_leaderboard->set_score(100, promise);
...
std::future<bool> future = promise->get_future();
if (future.valid()) {
 bool success = future.get();
 _leaderboard->forget_promise(promise);
 delete promise;
}

This is a lot better than the callback approach, but still in my view, overly complicated. It is clearly a design based on the object-oriented philosophy of -- when in doubt, make more objects.

But these extra objects don't really do much. They just act as pointless intermediaries that pass some information back and forth between our code and the _leaderboard object. And they are a hassle for the caller to keep track of. She must store them somewhere and make sure to delete them when she is done to avoid memory leaks.

Furthermore, if we want to expose this API to a scripting language, such as Lua, we have to expose these extra objects as well.

ID tokens

As readers of my previous articles know, I'm a big fan of using IDs. Instead of exposing internal system objects to the caller of an API, I prefer to give the caller IDs that uniquely identifies the objects and provide functions for obtaining information about them.

That way, I am free to organize my internal data however I like. And it is easier to see when the state of my objects might mutate, since all calls go through a single API.

With this approach the interface would look something like this:

unsigned set_score(int value);
enum SetScoreResult {SSR_IN_PROGRESS, SSR_SUCCESS, SSR_FAILURE};
SetScoreResult set_score_result(unsigned id);

Note that there are no objects that the user must maintain and release. The ID can easily be manipulated by a scripting layer. If the user doesn't need to know if the operation succeeded, she can just throw away the returned ID.

In this API I don't have any method for freeing tokens. I don't want to force the user to do that, since it is both a hassle (the user must track all IDs and decide who owns them) and error prone (easy to forget to release an ID).

But obviously, we must free tokens somehow. We can't store the results of the set_score() operations forever. If we did, we would eventually run out of memory.

There are several ways you could approach this problem. My preferred solution in this particular case is to just have a fixed limit on the number of operations that we remember. Since we don't expect more than a dozen simultaneous operations, if we make room for 64, we have plenty of slack and still use only 64 bytes of memory. A modest amount by any standard.

We can keep the results in a round-robin buffer:

/// Maximum number of requests whose result we remember.
static const int MAX_IN_FLIGHT = 64;

/// The result of the last MAX_IN_FLIGHT requests.
char results[MAX_IN_FLIGHT];

/// Number of requests that have been made.
unsigned num_requests;

SetScoreResult set_score_result(unsigned id)
{
 // If more than MAX_IN_FLIGHT requests have been made after this one,
 // the information about it is lost.
 if (num_requests - id > MAX_IN_FLIGHT)
  return SSR_NO_INFORMATION;

 return results[id % MAX_IN_FLIGHT];
}

This means that you can only ask about the result of the last 64 operations. On the other hand, this solution uses very little memory, does not allocate anything, has very quick lookups and doesn't require the user to explicitly free tokens.

To me, this added simpleness and flexibility outweighs the disadvantage of having a limit on the maximum number of in flight operations that we support.

Implicit APIs

In many cases, the best solution to asynchronous conundrums is to redesign the API to abstract away the entire concept of asynchronous operations, so that the user doesn't even have to bother with it.

This can require some creative rethinking in order to focus on what it is the user really wants to do. For example, for our example, we might come up with this:

/// Sets the score to the specified value. This is an asynchronous operation.
/// You can use acknowledged_score() to find out when it has completed.
void set_score(int score);

/// Returns the last score that has been acknowledged by the server.
int acknowledged_score();

This is probably all that the user needs to know.

Now we have really simplified the API. The user still needs to be aware that set_score() isn't propagated to the server immediately, but she doesn't at all have to get involved in what asynchronous operations are performed and how they progress.

This kind of radical rewrite might not be possible (or even desirable) for all asynchronous systems. You have to balance the value of high-level abstractions and simplifications against the need for low-level control. But it is almost always worth exploring the possibility since it can lead to interesting ideas and dramatically simplified APIs.

For example, the interface for an asynchronous web fetcher could be as simple as:

const char *fetch(const char *url);

If called with an URL that hadn't been fetched yet, the function would issue a request for the URL and return NULL. Once the data was available, the function would return it. On the next call, the data would be freed. To fetch a web page, you would just repeatedly call the function with an URL until you got a reply.

Quite fetching, wouldn't you say?

Tuesday, July 3, 2012

Matrices, Rotation, Scale and Drifting

If you are using Matrix4x4s to store your node transforms and want to support scaling you are facing an annoying numerical problem: rotating a node causes its scale to drift from the original value.

The cause of drifting

Drifting happens because in a Matrix4x4 the rotation and the scale are stored together in the upper left 3x3 part of the matrix:

This means that if we want to change the rotation of a Matrix4x4 without affecting the scale we must extract the scale and reapply it:

void set_rotation(Matrix4x4 &pose, const Quaternion &rot)
{
     Vector3 s = scale(pose);
     Matrix3x3 rotm = matrix3x3(rot);
     scale(rotm, s);
     set_3x3(pose, rotm);
}

The problem here is that since floating point computation is imprecise, scale(pose) is not guaranteed to be exactly the same before this operation as after. Numerical errors will cause a very small difference. So even though we only intended to rotate the node we have inadvertently made it ever so slightly bigger (or smaller).

Does it matter? Sure, it is annoying that an object that we didn't want to have any scaling at all suddenly has a scale of 1.0000001, but surely such a small change would be impercievable and couldn't affect gameplay.

True, if we only rotated the object once. However, if we are dealing with an animated or spinning object we will be changing its rotation every frame. So if the error is 0.0000001 the first frame, it might be 0.0000002 the second frame and 0.0000003 the third frame.

Note that the error growth is linear rather than geometric because the error in each iteration is proportional to the current scale, not to the current error. I. e., to (1 + e) rather than e. We can assume that 1 >> e, because otherwise we already have a clearly visible error.

I ran a test using our existing math code. Rotating a transform using the method described above yields the following result:

Error Frames Time (at 60 Hz)
0.000001 202 3 s
0.000002 437 7 s
0.000005 897 15 s
0.000010 1654 28 s
0.000020 3511 58 s
0.000050 8823 2 min
0.000100 14393 4 min
0.000200 24605 7 min
0.000500 52203 15 min
0.001000 100575 28 min

As you can see, after 28 minutes we have an error of 0.1 %. At this point, it starts to get noticeable.

You could debate if this is something that needs fixing. Maybe you can live with the fact that objects grow by 0.1 % every half hour, because your game sessions are short and the small scale differences will never be noted. However, since Bitsquid is a general purpose engine, we need a better solution to the problem.

At this point, you might be asking yourself why this problem only happens when we introduce scaling. Don't we have the same issue with just translation and rotation? No, because translation and rotation are stored in completely separate parts of the matrix:

Setting the rotation doesn't touch any of the position elements and can't introduce errors in them, and vice versa.

Solutions to scale drifting

I can think of four possible solutions to this problem:

  • Store rotation and scale separately

  • Always set rotation and scale together

  • Quantize the scale values

  • Prevent systematic errors

Solution 1: Store rotation and scale separately

The root cause of the problem is that rotation and scale are jumbled together in the Matrix4x4. We can fix that by separating them. So instead of using a Matrix4x4 we would store our pose as:

struct Pose {
      Vector3 translation;
      Matrix3x3 rotation;
      Vector3 scale;
};

With the pose stored like this, changing the rotation does not touch the scale values, so we have eliminated the problem of drifting.

Note that this representation is actually using slightly less memory than a Matrix4x4 -- 15 floats instead of 16. (We could reduce the storage space even further by storing the rotation as a quaternion, but then it would be more expensive to convert it to matrix form.)

However, the representation is not as convenient as a Matrix4x4. We can't compose it or compute its inverse with regular matrix operations, as we can do for a Matrix4x4. We could write custom operations for that, or we could just convert this representation to a temporary Matrix4x4 whenever we needed those operations.

Converting to a Matrix4x4 requires initializing the 16 floats (some with values from the pose) and 9 floating point multiplications (to apply the scale). What kind of a performance impact would this have?

I would guess that the part of the codebase that would be most affected would be the scene graph local-to-world transformation. With this solution, you would want to store the local transform as a Pose and the world transform as a Matrix4x4. The local-to-world transform requires about 36 multiplications and 36 additions (says my quick estimate). So adding a temp Matrix4x4 conversion would take you from 72 to 81 FLOPS.

So a very rough estimate is that this change would make your scene graph transforms about 12 % more expensive. Likely, the real value is less than that since you probably have additional overhead costs that are the same for both methods. And of course, the scene graph transforms are just one small (and parallelizable) part of what your engine does. We rarely spend more than 2 % of our frame time there, meaning the total performance hit is something like 0.2 %.

I think that is a quite reasonable price to pay for a neat solution to the problem of drifting, but you may disagree of course. Also, perhaps the use of Matrix4x4s is so ingrained in your code base that it is simply not possible to change it. So let's look at the other possible solutions.

Solution 2: Always set rotation and scale together

The fundamental problem with set_rotation() is that we try to change just the orientation of the node without affecting the scale. Extracting the scale and reapplying it is what causes the drifting.

If we don't allow the user to just change the rotation, but force him to always set the scale and the rotation together, the problem disappears:

void set_rotation_and_scale(Matrix4x4 &pose, const Quaternion &rot, const Vector3 &s)
{
    Matrix3x3 rotm = matrix3x3(rot);
    scale(rotm, s);
    set_3x3(pose, rotm);
}

Since we have eliminated the step where we extract the scale and feed it back, we have rid ourselves of the feedback loop that caused runaway drifting. Of course, we haven't completely eliminated the problem, because nothing prevents the user from emulating what we did in set_rotation() and recreating the feedback loop:

Vector3 s = scale(pose);
set_rotation_and_scale(pose, new_rotation, s);

Now the drifting problem is back with a vengeance, reintroduced by the user.

To prevent drifting the user must take care not to create such feedback loops. I.e., she can never extract the scale from the matrix. Instead she must store the scale at some other place (separate from the matrix) so that she can always feed the matrix with the correct scale value.

What we have done is essentially to move the burden of keeping track of the scale of objects from the transform (the Matrix4x4) to the user of the transform. This prevents drifting and doesn't have any performance costs, but it is pretty inconvenient for the user to have to track the scale of objects manually. Also, it is error prone, since the user who is not 100 % certain of what she is doing can accidentally recreate the feedback loop that causes drifting.

Solution 3: Quantize the scale values

If none of the two options presented so far seem palpable to you, there is actually a third possibility.

Consider what would happen if we changed the Vector3 scale(const Matrix4x4 &) function so that it always returned integer values.

Calling set_rotation() as before would introduce an error to the scale and set it to, say 1.0000001. But the next time we ran set_rotation() and asked for the scale it would be rounded to the nearest integer value, so it would be returned as 1 -- the correct value. Applying the new rotation would again introduce an error and change the value to 1.0000001, but then again, the next time the function ran, the value returned would be snapped back to 1.

So by rounding the returned scale to fixed discrete values we prevent the feedback loop. We still get small errors in the scale, but without the runaway effect they are unnoticeable. (Small errors occur everywhere, for example in the scene graph transforms. That's the nature of floating point computation. It is not the small errors that are the problem but the mechanisms that can cause them to result in visible effects.)

Of course, if we round to integer values we can only scale an object by 1, 2, 3, etc. Not by 0.5, for instance. But we can fix that by using some other set of discrete numbers for the scale. For example, we could round to the nearest 0.0001. This would let us have scales of 0.9998, 0.9999, 1.0000, 1.0001, 1.0002, … Hopefully that is enough precision to cover all the different scales that our artists might want to use.

Drifting won't happen in this scheme, because the floating point errors will never be big enough to change the number to the next discrete value. (Unless you used really large scale values. If you want to support that -- typically not interesting, because things like texture and vertex resolution start to look wonky -- you could use a geometric quantization scheme instead of an arithmetic one.)

Snapping the scale values in this way might be OK for static scaling. But what if you want to smoothly change the scaling with an animation? Won't the discrete steps cause visible jerks in the movement?

Actually not. Remember that it is only the value returned by scale() that is quantized, the user is still free to set_scale() to any non-quantized value. When the scale is driven by an animation, it is fed from an outside source. We don't need to read it from the matrix and reapply it. So the quantization that happens in scale() never comes into play.

So amazingly enough, this hacky solution of snapping the scale to a fixed set of discrete values actually seems to work for most real world problems. There might be situations where it would cause trouble, but I can't really come up with any.

Solution 4: Prevent systematic errors

A final approach is to try to address how the numbers are drifting instead of stopping them from drifting. If you look at the table above you see that the errors are growing linearly. That is not what you would expect if the errors were completely random.

If the errors in each iteration were completely random, you would get a random walk process where the total error would be e * sqrt(N) rather than e * N, where e is the error from one iteration and N the number of iterations. The fact that the error grows linearly tells us that our computation has a systematic bias -- the error is always pushed in one particular direction.

If we could get rid of this systematic bias and get a truly random error, the accumulated error would grow much more slowly, the square root makes all the difference. For example, for the error to grow to 0.1 % it would take 5.2 years rather than 28 minutes. At that point, we might be ok with the drifting.

I haven't thought that much about what would be needed to get rid of the systematic bias in the set_rotation() function. It's a pretty tricky problem that requires a deep understanding of what happens to all the floating point numbers as they travel through the equations.

Conclusion

In the Bitsquid engine we have so far gone with #2, as a make-shift until we decided on the best permanent solution to this problem. After reviewing the options in this article I think we will most likely go with #1. #3 is an interesting hack and I think it would work almost everywhere, but I'm willing to pay the slight performance price for the cleaner and clearer solution of #1.

Tuesday, June 19, 2012

Hack Day Report

Last Friday, we had our second hack day (aka do-what-you-want day, aka google day) at the office.

Different companies seem to take different approaches to hack days. At some places it just means that you can spend a certain percentage of your working week on your own projects. We wanted something that was a bit more focused and felt more like a special event, so we used the following approach:

  • People were encouraged to pick tasks that could be completed, or taken to a "proof-of-concept" level in a single day. The goal was that at the end of the day you should have something interesting to show/tell your colleagues.

  • It is ok to fail of course. Failure is often interesting. Trying crazy ideas with a significant risk of spectacular failure is part of the charm of a hack day.

  • A couple of days before the event, everbody presented their projects. The idea was to get everybody to start thinking about the topics, so that we could help each other with ideas and suggestions.

  • We ate breakfast together in the morning to start the discussions and get everybody in the spirit of the event. At the end of the day, we rounded off with a couple of beers.

  • We avoided Skype, email and meetings during the day, so that we could focus 100 % on the projects.

  • A couple of days after the events we had a small show & tell, where everybody could present what they had learned.

Results

A number of interesting projects came out of this hack day:

  • Tobias and Mats created an improved highlighting system for indicating selected objects in the level editor. (Highlighting the OOBB works well for small objects, but for big things like landscapes and sub-levels, it is just confusing.)

  • Jim looked into a cross-platform solution for capturing screen shots and videos on target machines and transmitting them over the network.

  • Andreas created a Lua profiling tool, that can dynamically enable and disable profiling for any Lua function by hot-patching the code with profiler calls.

  • Finally, I rewrote the collision algorithm for our particle systems.

Being an egotistical bastard, I will focus on my own project.

Particle collision is one of those annoying things that it is difficult to find a good general solution to, for two reasons:

  • It ties together two completely different systems (particles and physics), creating an ugly coupling between them. Since the solution must have decent performance, the coupling must be done at a fairly low level, which makes it even worse.

  • Particles can have very different collision requirements. Some effects need a massive amount of particles (e. g., sparks), but don't care that much about collision quality. As long as most of them bounce somewhat accurately, it is OK. Other effects may have just a single particle (e. g., a bullet casing). Performance doesn't matter at all, but if it doesn't bounce right you will surely notice. Handling both effects in the same system is a challenge. Having different systems for different effects is another kind of challenge.

My previous attempts at implementing particle collision have all been based on first cutting out a slice of the physics world around the particle effect and then trying to find a fast representation of the collision shapes in that world slice.

The problem with this approach is that there are a lot of variables to tweak and tune:

  • How big should the world slice be?

  • How much detail should there be in the simplified representation? More detail is slower, but gives better collision results.

  • What kind of representation should we use?

  • How should we handle dynamic/moving objects? How often should the world slice be updated?

I've tried a lot of different representations: a triangle soup, a collection of half-spheres, a height field, but none of them has given completely satisfactory results. Often, parameters that work for one effect at one location fail for a different effect at a different location. Both performance and behavior are hard to predict.

The main idea for the new approach came from a Naughty Dog presentation at GDC. Instead of trying to create a shared collision model for all particles, we give each particle its own collision model, and we store it inside the particle itself, together with the other particle data.

Of course, it would be expensive to store a complicated collision model inside every particle, so we use the simplest model possible: a plane. We can represent that by a normal and an offset from origin. So with this approach, the data for a particle might look something like this:

struct Particle {
 Vector3 position;
 Vector3 velocity;
 Color8 color;
 Vector3 collision_plane_normal;
 float collision_plane_offset;
};

(Side note: Our particle data doesn't actually look like this, we use a "structure-of-arrays" approach rather than an "array-of-structures" and we don't have a fixed set of fields, each effect has its own set.)

Note that we don't bother with any flag for indicating whether there is plane or not. If there is no collision, we just put the collision plane far enough below the origin.

With this approach the collision test is super fast -- just a dot product and a compare. It is also really easy to parallelize the test or run it off-CPU, since it just uses local particle data and doesn't need to access any shared memory.

With this method we have divided the original collision problem into two simpler ones:

  • Collision test against a plane. (Trivial.)

  • Finding a suitable collision plane for each particle.

This means that if we want to, we can use different approaches for finding the collision planes for different effects. E.g., for static effects we could hard code the collision plane and avoid collision queries completely.

Generally, we can find a suitable collision plane for a particle by raycasting along its trajectory. If we didn't have any performance constraints, we could do a raycast for every particle every frame. That way we would always know what surface the particle would hit next, which means that we would get perfect collision behavior.

Of course, we can't actually do that. Raycasts are comparatively expensive and we want to be able to support large numbers of particles.

To control the performance, I exposed a parameter that lets the effect designer control how many raycasts per frame an effect is a allowed to make. A typical value of 1.0 means that every frame, one particle in the effect is picked at random, a raycast is performed along that particles trajectory and its collision plane is updated with the result.

Note that with this solution, the work is always evenly distributed over the duration of the effect. That is a lot nicer than what you typically get with the "world slice" approach where there is a big chunk of work in the beginning when you cut out the world slice.

Astute readers will have noticed a fatal flaw with the design as it has been presented so far: it can't possibly work for very many particles. If we have an effect with 1 000 particles and do a raycast every frame, it will take 33 seconds before every particle has found its collision normal. By then, they will long since have fallen through the floor.

So, if we want to use this approach for large numbers of particles we must be able to somehow reuse the collision results. Typically, an effect will have bundles of particles traveling in approximately the same direction. When one such particle has done a raycast and found a collision, we want to be able to share the result with its neighbors somehow.

I wanted to find a solution to this without having to create a complicated collision representation, because that would bring back many of the problems I had with the "world slice" approach. Eventually, I decided that since what we want to do is to cache a collision query of the form:

(position, direction) -> collision_plane

The simplest possible thing would be to store the results in a hash. Hashes are nice, predictable data structures with well known performance characteristics.

To be able to hash on position and direction we must quantize them to integer values. We can quantize the position by dividing the world into cells of a certain width and height:

 const float cell_side = 0.5f;
 const float cell_height = 2.0f;
 int ix = position.x / cell_side;
 int iy = position.y / cell_side;
 int iz = position.z / cell_height;
 uint64 key = HASH_3(ix, iy, iz);

In this example, I use a higher resolution along the xy-axes than along the z-axes, because typically that is where the more interesting features are. HASH_3() is a macro that performs the first three rounds of the murmur_hash algorithm.

To quantize the direction we can use a similar approach. I decided to quantize the direction to just six different values, depending on along which principal axis the particle is mostly traveling:

 unsigned id;
 if (fabsf(dir.x) >= fabsf(dir.y) && fabsf(dir.x) >= fabsf(dir.z))
  id = dir.x > 0 ? 0 : 1;
 else if (fabsf(dir.y) >= fabsf(dir.z))
  id = dir.y > 0 ? 2 : 3;
 else
  id = dir.z > 0 ? 4 : 5;
 key = key ^ id;

Now that we have computed a quantized representation of (position, direction), we can use that as lookup value into our hash, both for storing and fetching values:

 struct CollisionPlane {
  Vector3 normal;
  float offset;
 };
 HashMap<uint64, CollisionPlane> _cache;

(Side note: Unless I'm worried about hash function collisions, I prefer to hash my keys before I insert them in the HashMap and just use a HashMap<uint64,T> instead of HashMap<MyComplicatedKeyStruct,T>. That way the hash map uses less memory and lookups can be done with a simple modulo operation.)

Whenever I do a particle raycast I store the result in the cache. When particles are spawned they lookup their collision plane in the cache. Particles also query the cache every time they bounce, since that typically means they will be traveling in a new direction.

I have a maximum size that the cache is allowed to use. When the cache reaches the maximum size, older entries are thrown out.

Results

The system gives high quality results for effects with few particles (because you get lots of raycasts per particle) and is still able to handle massive amounts of particles. The performance load is evenly distributed and it doesn't need any special cases for dynamic objects.

There are some drawbacks. The cache requires some tweaking. Since it can only store one collision plane for each quantization cell it will miss important features if the cells are too big. On the other hand, if the cells are too small, we need lots of entries in the cache to represent the world, which means more memory and slower lookups.

Since we only have one collision normal per particle, there are some things that the particles just can't do. For example, they can never come to rest at the bottom of a V-shape, because they will always only be colliding with one of the planes in the V. Overall, they will behave pretty badly in corners, where several collision planes with different normals meet. Some of these issues could be fixed by storing more than one collision plane in the particle, but I don't think it is worth it. I prefer the simpler approach and having particles that in some tricky situations can fall through the ground.

Compared to the old collision code, the new code is simpler, runs faster and looks better.

All in all, I would say that the hack day was a success. We had great fun and produced some useful stuff. We will definitely do more days like this in the future. Not too often though. I think it is important that these days feel like a special treat. If they become too mundane, something important is lost. Once a month or so, would be ideal, I think.

Sunday, May 20, 2012

Playing (with) Video

So you want to play some video? Shouldn't be too hard, right? Just download some video playing library and call the play_video() function. Easy-peasy-lemon-squeezy.

Well, you have to make sure that the video is encoded correctly, that the library works on all platforms and plays nice with your memory, file, sound and streaming abstractions, and that the audio and video doesn't desynchronize, which for some inexplicable reason seems to be a huge problem.

But this is just technical stuff. We can deal with that. What is worse is that video playback is also a legal morass.

There are literally thousands of broad patents covering different aspects of video decompression. If you want to do some video coding experiments of your own you will have to read, understand and memorize all these patents so that you can carefully tip-toe your code and algorithms around them.

Of course, if you had a big enough pool of patents of your own you might not have to care as much, since if someone sued you, you could sue them right back with something from your own stockpile. Mutually assured destruction through lawyers. Ah, the wonderful world of software patents.

So, creating your own solution is pretty much out of the question. You have to pick one of the existing alternatives and do the best you can with it. In this article I'm going to look at some different options and discuss the advantages and drawbacks of each one:

  • Just say no

  • Bink

  • Platform specific

  • H.264

  • WebM

There are other alternatives that didn't make it to this list, such as Dirac, Theora, and DivX. I've decided to focus on these five, since in my view H.264 is the best of the commercial formats and WebM the most promising of the "free" ones.

An initial idea might be: Why not just do whatever it is VLC does? Everybody's favorite video player plays pretty much whatever you throw at it and is open source software.

Unfortunately that doesn't work, for two reasons. First, VLC:s code is a mix of GPL and LGPL stuff. Even if you just use the LGPL parts you will run into trouble on platforms that don't support dynamic linking. Second, the VLC team doesn't really care about patents and just infringe away. You can probably not afford to do the same. (As a result, there is a very real threat that VLC might be sued out of existence.)

A quick introduction

Before we start looking at the alternatives I want to say something short about what a video file is, since there is some confusion in the matter, even among educated people.

A video file has three main parts:

  • Video data (H.264, DivX, Theora, VP8, ...)

  • Audio data (MP3, AAC, Vorbis, ...)

  • A container format (Avi, Mkv, MP4, Ogg, ...)

The container format is just a way of packing together the audio and video data in a single file, together with some additional information.

The simplest possible container format would be to just concatenate the audio data to the video data and be done with it. But typically we want more functionality. We want to be able to stream the content, i. e. start playing it before we have downloaded the whole file, which means that audio and video data must be multiplexed. We also want to be able to quickly seek to specific time codes, so we may need an index for that. We might also want things like audio tracks in different languages, subtitling, commentary, DVD menus, etc. Container formats can become quite intricate once you start to add all this stuff.

A common source of confusion is that the extension of a video file (.avi, .mkv, .mp4, .ogg) only tells you the container format, not the codecs used for the audio and video data in the container. So a video player may fail to play a file even though it understands the container format (because it doesn't understand what's inside it).

Option 1: Just say no

Who says there has to be video in a game? The alternative is to do all cut scenes, splash screens, logos, etc in-game and use the regular renderer for everything. As technology advances and real-time visuals come closer and closer in quality to offline renders, this becomes an increasingly attractive option. It also has a number of advantages:

  • You can re-use the in-game content.

  • Production is simpler. If you change something you don't have to re-render the entire movie.

  • You don't have to decide on resolution and framerate, everything is rendered at the user's settings.

  • You can dynamically adapt the content, for example dress the players in their customized gear.

  • Having everything be "in-game visuals" is good marketing.

If I was making a game I would do everything in-game. But I'm not, I'm making an engine. And I can't really tell my customers what they can and cannot do. The fact is that there are a number of legitimate reasons for using video:

  • Some scenes are too complex to be rendered in-game.

  • Producing videos can be simpler than making in-game content, since it is easier to outsource. Anybody can make a video, but only the core team can make in-game content and they may not have much time left on their hands.

  • Playing a video while streaming in content can be used to hide loading times. An in-game scene could be used in the same way, but a high-fidelity in-game scene might require too much memory, not leaving enough for the content that is streaming in.

As engine developers it seems we should at least provide some way of playing video, even if we recommend to our customers to do their cutscenes in-game.

Option 2: Bink

Bink from RAD game tools is as close as you can get to a de facto standard in the games industry, being used in more than 5800 games on 14 different platforms.

The main drawback of Bink is the pricing. At $ 8500 per platform per game it is not exactly expensive, but for a smaller game targeting multiple platforms that is still a noticeable sum.

Many games have quite modest video needs. Perhaps they will just use the video player for a 30 second splash screen at the start of the game and nothing more. Paying $ 34 000 to get that on four platforms seems excessive.

At Bitsquid our goal has always been to develop an engine that works for both big budget and small budget titles. This means that all the essential functionality of an engine (animation, sound, gui, video, etc) should be available to the licensees without any additional licensing costs (above what they are already paying for an engine). Licensees who have special interest in one particular area may very well choose to integrate a special middleware package to fulfill their needs, but we don't want to force everybody to do that.

So, in terms of video, this means that we want to include a basic video player without the $ 8500 price tag of Bink. That video player may not be as performant as Bink in terms of memory and processor use, but it should work well enough for anyone who just wants to play a full screen cutscene or splash screen when the CPU isn't doing much else. People who want to play a lot of video in CPU taxing situations can still choose to integrate Bink. For them, the price and effort will be worth it.

Option 3: Platform specific

One approach to video playing is to not develop a platform-independent library but instead use the video playing capabilities inherent in each platform. For example, Windows has Windows Media Foundation, MacOS has QuickTime, etc.

Using the platform's own library has several advantages. It is free to use, even for proprietary formats, because the platform manufacturers have already payed the license fees for the codecs. (Note though, that for some formats you need a license not just for the player, but for the distribution of content as well.) The implementation is already there, even if the APIs are not the easiest to use.

The biggest advantage is that on low-end platforms, using the built-in platform libraries can give you access to special video decoding hardware. For example, many phones have built-in H.264 decoding hardware. This means you can play video nearly for free, something that otherwise would be very costly on a low-end CPU.

But going platform specific also has a lot of drawbacks. If you target many platforms you have your work cut out for you in integrating all their different video playing backends. It adds an additional chunk of work that you need to do whenever you want to add a new platform. Furthermore, it may be tricky to support the same capabilities on all different platforms. Do they all support the same codecs, or do you have to encode the videos specifically for each platform? Do all platforms support "play to texture" or can you only play the videos full screen? What about the sound? Can you extract that from the video and position it as a regular source that reverbs through your 3D sound world? Some platforms (i.e. Vista) have almost no codecs installed by default, forcing you to distribute codecs together with your content.

Since we are developing a generic engine we want to cover as many platforms as possible and minimize the effort required to move a project from one platform to another. For that reason, we need a platform independent library as the primary implementation. But we might want to complement it with platform specific libraries for low end platforms that have built-in decoding hardware.

Option 4: H.264 (MPEG-4, AVC)

Over the last few years H.264 has emerged as the most popular commercial codec. It is used in Blu-ray players, video cameras, on iTunes, YouTube, etc. If you want a codec with good tool support and high quality, H.264 is the best choice.

However, H.264 is covered by patents. Patents that need to be licensed if you want to use H.264 without risking a lawsuit.

The H.264 patents are managed by an entity known as MPEG LA. They have gathered all the patents that they believe pertain to H.264 in "patent pool" that you can license all at once, with a single agreement. That patent pool contains 1700 patents. Yes, you read that right. The act of encoding/decoding a H.264 file is covered by 1700 patents. You can find the list in all its 97 page glory at http://www.mpegla.com/main/programs/avc/Documents/avc-att1.pdf.

I am not a lawyer, as they say on Slashdot, but this is my best understanding of how this patent game works:

  • Buying a license from MPEG LA gives you the right to use the 1700 patents in the pool.

  • This doesn't mean you can't be sued for patent infringement. Anyone that holds a patent which is not one of the 1700 in the pool could claim that H.264 infringes on it and sue you. That seems unlikely, MPEG LA has made an effort to gather all relevant patents, but there is no way to be certain.

  • MPEG LA doesn't by itself go after people who use H.264 without a license, that is up to the holders of the 1700 patents in the pool.

The licensing terms of H.264 are irritating, but not necessarily a big financial burden:

  • If you distribute an encoder or decoder you can distribute 100 000 copies for free, then you have to pay $ 0.20 per unit.

  • If you distribute a H.264 encoded movie, it is free if it is shorter than 12 minutes, then you have to pay $ 0.02 per copy.

Note that unlike the case with other popular codecs such as MP3, it is not just the decoder/encoder that you need to license, you also need a license just for distributing H.264 content.

From what I've been able to discern, but don't take my word for it, a game that only plays a fixed set of movies/cutscenes would not be regarded as a general decoder (even though it contains decoding software), but rather as content, which means you would pay $0.02 per copy sold if you had more than 12 minutes of video in your game and nothing otherwise (you would still need to obtain a license though).

Of course, if you support H.264, you may also want to support AAC, the standard audio format that accompanies it. AAC is covered by a separate licensing body (Via Licensing) that has its own licensing terms. I haven't investigated them in any great detail.

You have to decide for yourself how well these terms sit with you. At Bitsquid we finally decided that if we should have a standard video playing facility, it should be one that people could use without thinking too much about patents and licensing (to the extent that is possible).

Option 5: VP8 (WebM)

VP8 is a "free" video codec owned by Google. It is covered by patents, but Google has granted free use of those patents and also provides a BSD licensed library libvpx for encoding and decoding video files. The format is also endorsed by the Free Software Foundation.

It is generally acknowledged that when it comes to quality, VP8 is not quite as good as H.264, though the differences are not enormous. So you are trading some visual quality for the convenience of a license free format.

There has been some discussion (most of it about a year ago) about whether VP8 really is unencumbered by patents. MPEG LA claimed that it knew several patents that VP8 was infringing and showed interest in creating a "patent pool" for VP8 similar to the one it holds for H.264. Nothing has come of that yet and MPEG LA has not disclosed which patents it thinks VP8 is infringing, which means that Google cannot really respond to the allegations. It is hard to know how much stock to put in this and whether anything will come of it.

You could argue that with this potential "threat" against VP8, it would be better to use another "free" alternative such as Dirac or Theora. However, there is not much evidence that they would fare better. Everyone who makes a "free" codec tries their best to make sure that they don't infringe on any patents. But with thousands of these patents around, each open to legal interpretation, there is just no way to be sure.

This is just the sad state of affairs of software patents. And you are not safe with the commercial formats either. Even if you have licensed the 1700 patents in the MPEG LA patent pool, someone can still sue you for violating patent number 1701. No one in this business offers indemnification against patent suits. Not Google. Not MPEG LA. Not Bink (I think).

It all becomes a question of risk. Bink has been around a long time without being sued, which is reassuring. VP8 hasn't been around that long.

Will there be patent claims made against VP8? Maybe. Who knows. Similar threats were made against Theora, but nothing happened there. If it does happen, Google will most likely fight back and the whole thing will drag on in the courts. Will there be patent claims against you for using VP8? Seems extremely unlikely. Games are not interesting enough targets. Video decoding is not our main business, and we can easily switch technology if needed. Phone manufacturers, YouTube and TV companies are more interesting targets.

Do you have to care about any of this at all? Up to you to decide.

Our conclusion

None of these alternatives are really attractive, it is more about picking the least worst than finding the best, which is frustrating for a perfectionist with self-worth issues. Good cases can be made for all of them. This is what we have decided:

  • We will provide VP8 decoding on all platforms through libvpx. All other things equal, the "most free" format will give us most flexibility in the long run.

  • We will not (at least not right now) support matroska or other advanced container formats. Instead, we will play the videos from simple IVF streams. Sound will be played as Vorbis files through our regular sound system so we get positioning, reverb, etc.

  • If needed, we will complement this basic approach with platform-specific libraries that take advantage of the hardware decoders on low-end platforms (phones and tablets).

  • Customers that need to play a lot of movies while doing other CPU intense tasks and that aren't happy with the performance of libvpx are recommended to look into Bink.

  • Customers that are worried about "patent risk" with VP8 are recommended to do whatever their lawyers tell them to do. (Use Bink, a platform specific library, obtain a H.264 license or avoid video all together.)

Saturday, May 5, 2012

Embracing Dynamism

Are you stuck in static thinking? Do you see your program as a fixed collection of classes and functions with unchanging behavior.

While that view is mostly true for old school languages such as C++ and Java, the game is different for dynamic languages: Lua, JavaScript, Python, etc. That can be easy to forget if you spend most of your time in the static world, so in this article I'm going to show some of the tricks you can apply when everything is fluid and malleable.

At Bitsquid our dynamic language of choice is Lua. Lua has the advantage of being fast, fully dynamic, small, simple and having a traditional (i.e. non-LISP-y) syntax. We use Lua for most gameplay code and it interfaces with the engine through an API with exposed C functions, such as World.render() or Unit.set_position().

I will use Lua in all the examples below, but the techniques can be used in most dynamic languages.

1. Read-eval-print-loop

Dynamic languages can compile and execute code at runtime. In Lua, it is as simple as:

loadstring("print(10*10)")()

This can be used to implement a command console where you can type Lua code and directly execute it in the running game. This can be an invaluable debugging and tuning tool. For example if you need to debug some problem with the bazooka:

World.spawn_unit("bazooka", Unit.position(player))

Or tune the player's run speed:

Unit.set_data(player, "run_speed", 4.3)

2. Reload code

The console can be used for more than giving commands, you can also use it to redefine functions. If the gameplay code defines a scoring rule for kills:

function Player.register_kill(self, enemy)
 self.score = self.score + 10
end

you can use the console to redefine the function and change the rules:

function Player.register_kill(self, enemy)
 if enemy.type == "boss" then
  self.score = self.score + 100
 else
  self.score = self.score + 10
 end
end

Executing this code will replace the existing Player.register_kill function with the new one. All code that previously called the old function will now call the new one and the new scoring rules will apply immediately.

If you take some care with how you use the global namespace you can write your Lua code so that all of it is reloadable using this technique. Then the gameplay programmer can just edit the Lua files on disk and press a key to reload them in-game. The game will continue to run with the new gameplay code, without any need for a reboot. Pretty cool.

You can even get this to work for script errors. If there is an error in the Lua code, don't crash the game, just freeze it and allow the gameplay programmer to fix the error, reload the code and continue running.

3. Override system functions

The functions in the engine API don't have any special privileges, they can be redefined just as other Lua functions. This can be used to add custom functionality or for debugging purposes.

Say, for example, that you have some units that are mysteriously popping up all over the level. You know they are being spawned somewhere in the gameplay code, but you can't find where. One solution would be to override the World.spawn_unit function and print a stack trace whenever the offending unit is spawned:

old_spawn_unit = World.spawn_unit
function World.spawn_unit(type, position)
 if type == "tribble" then
  print "Tribble spawned by:"
  print_stack_trace()
 end
 old_spawn_unit(type, position)
end

Now, whenever a tribble is spawned by the script, a call stack will be printed and we can easily find who is doing the spawning.

Note that before we replace World.spawn_unit, we save the original function in the variable old_spawn_unit. This enables us to call old_spawn_unit() to do the actual spawning.

This technique could also be used to find all (potentially expensive) raycasts being done by the script.

4. Handle deprecated functions

Sometimes we need to deprecate functions in the engine API. It can be annoying to the people using the engine of course, but backwards compatibilty is the mother of stagnation. If you never throw away old code, you will eventually have a huge ugly code mess on your hands.

Luckily, since the script can create functions in the engine namespace, the script can provide the backwards compatibility when needed.

For example, we used to have a function PhysicsWorld.clear_kinematic(world, actor). That naming was inconsistent with some of our other functions so we changed it to Actor.set_kinematic(actor, false).

One way of dealing with this change would be to go through all the code in the project, find all uses of PhysicsWorld.clear_kinematic and change them to use Actor.set_kinematic instead. But another way would be to just implement PhysicsWorld.clear_kinematic in the script:

function PhysicsWorld.clear_kinematic(world, actor)
 Actor.set_kinematic(actor, false)
end

Now the rest of the code can go on using PhysicsWorld.clear_kinematic without even caring that the function has been removed from the engine API. You could even use a combination of the two strategies -- implementing the deprecated function in Lua for a quick fix, and then looking into removing the uses of it.

5. Dynamically inserting profiling

Top-down profiling with explicit profiler scopes is a good way of finding out where a game is spending most of its time. However, to be useful, explicit profiler scopes need to be inserted in all the "right" places (all potentially expensive functions).

In C we need to guess where these right places are before compiling the program. In Lua, we can just insert the profiler scopes dynamically. We can even create a function that adds profiling to any function we want:

function profile(class_name, method_name)
 local f = _G[class_name][method_name]
 _G[class_name][method_name] = function (...)
  Profiler.start(class_name .. "." .. method_name)
  f(...)
  Profiler.stop()
 end
end

When we call this function as profile('Player', 'update') it will first save the existing Player.update function and then replace it with a function that calls Profiler.start("Player.update") before calling the original function and Profiler.stop() before returning.

Using this techinque, we can dynamically add profiling to any function we want during our optimization session.

6. Tab completion

If you implement an interactive Lua console, it is nice to support tab completion, so the user doesn't have to remember all function names. But how do you build the list of callable functions to use with tab completion?

Using Lua of course! Just find all tables (i.e., classes) in the global namespace and all functions stored in those tables:

t = {}

for class_name,class in pairs(_G) do
 if type(class) == 'table' then
  for function_name,function in pairs(class) do
   if type(function) == 'function' then
    t[#t+1] = class_name .. '.' .. function_name
   end
  end
 end
end

After running this, t will contain the full list of function names.

7. Looping through all objects

By recursing through _G you can enumerate all reachable objects in the Lua runtime.

function enumerate(f)
 local seen = {}
 local recurse = function(t)
  if type(t) ~= 'table' then return end
  if seen[t] == true then return end
  f(t)
  seen[t] = true
  recurse(getmetatable(t))
  for k,v in pairs(t) do
   recurse(k)
   recurse(v)
  end
 end
 recurse(_G)
end

Calling enumerate(f) will call f(o) on all objects o in the runtime. (Assuming they are reachable from _G. Potentially, there could also be objects only reachable through Lua references held in C.)

Such an enumeration could be used for many things. For example, you could use it to print the health of every object in the game.

function print_health(o)
 if o.health then print(o.health) end
end
enumerate(print_health)

The technique could also be used for memory optimizations. You could loop through all Lua objects and find the memory used by each object type. Then you could focus your optimization efforts on the resource hogs.

Friday, April 20, 2012

Inheriting Velocity in Ragdolls

After a slew of abstract articles about C++ and code structuring I'd like to get back to some more meaty game engine stuff. So today I'll talk about ragdolls. In particular, how to preserve the momentum of animated objects, so that when you switch over to the ragdoll it continues to stumble forward in the same direction that the animation was moving, before crashing to a gruesome death.

So this is a small, but important problem. We want to somehow get the velocities of the animated objects and then apply them to the bodies in the ragdoll. The only snag is that animated objects typically don't know anything about velocities. Also, we need some way of matching up the physics bodies with the animated objects.

First, some background information. In the Bitsquid engine, physics, scene graph and animation are completely separate systems. We strongly believe in minimizing the couplings between different systems since that makes the engine easier to understand, reason about, modify, optimize and rewrite.

  • The physics system simulates a number of bodies, possibly connected by joints.

  • The scene graph handles local-to-world transforms for a collection of nodes in a hierarchy.

  • The animation system evaluates and blends animation curves for bones.

Bones and bodies hold references (just integer indices, really) to nodes in the scene graph and this how the systems communicate. After the animation has been evaluated, the resulting local transforms are written to the bones' nodes in the scene graph.

For keyframed physics (animated hit bodies), the animation drives the physics, which means the physics' bodies will read their world transforms from the corresponding nodes in the scene graph. For ragdolled physics, the world transforms of the bodies are written to the scene graph after the simulation has completed.

For partial ragdolls (such as a non-functioning, but still attached limb) or powered ragdolls (ragdolls driven by motors to achieve animation poses) it gets a little more involved (perhaps a topic for a future post), but the basic setup is the same.

Given this setup there are two ways of calculating the animation velocities:

  • We can calculate the velocities directly by differentiating the animation curves.

  • We can record a node's transform at two different time steps and compute the velocity from the difference.

The first approach is doable, but not very practical. Not only do we have to differentiate all the animation curves, we must also take into account how those velocities are affected by the blend tree and local-to-world transforms. And even if we do all that, we still don't account for movements from other sources than animation, such as scripted movements, IK or interactions with the character controller.

The second option is the more reasonable one. Now all we need is a way of obtaining the transforms from two different time steps. There are a number of possible options:

  • We could add an array of Matrix4x4:s to our scene graph's last_world where we store the last world transform of every object. So whenever we want to go to ragdoll we always have a last_world transform to calculate velocities from.

  • We could simulate the character backwards in time when we want to go to ragdoll and obtain a last_world transform that way.

  • We could delay the transition to ragdoll one frame, so that we have enough time to gather two world transforms for computing the velocity.

The first approach is conceptually simple, but costly. We are increasing the size of all our scene graphs by about 50 % (previously they contained local and world transforms, now they will also need last_world). In addition we must memcpy(last_world, world) before we compute new world transforms. That's a significant cost to pay all the time for something that happens very seldom (transition to ragdoll).

The second appraoch sounds a bit crazy, but some games actually already have this functionality. Servers in competetive multi-player fps games often need to rewind players in time in order to accurately determine if they were able to hit each other. Still, I find the approach to be a bit too complicated and invovled just to get a velocity.

The third aproach seems simple and cheap, but it violates one of our Bitsquid principles: Thou Shalt Not Have Any Frame Delays. Delaying something a frame can be a quick fix to many hairy problems, but it puts your game in a very weird transitional state where it at the same time both is and isn't (yet) something. The character isn't really a ragdoll yet, but it will be the next frame, whether I want to or not.

This new slightly self-contradictory state invites a host of bugs and before you know it, little logic pieces will start to seep into the code base "do this unless you are in the special transition-to-ragdoll state". Congratulations, you have just made your codebase a lot more complicated and bug prone.

If this is not enough, consider the poor sucker who just wants to write a routine that does A, B, C and D, when A, B and C requires frame delays. Suddenly what was supposed to be simple function got turned into a state machine that needs to run for four frames to produce it result.

The simple rule that actions should take place immediately protects against such insanity.

So three options, none of them especially palpable.

I actually went with the first one, to always compute and store last_world in the scene graph, but with a flag so that this is only used for units that actually need it (characters that can go to ragdoll). When there is no clear winner, I always pick the simplest solution, because it is a lot easier to optimize later if the need should arise. (We could for example track last_world only for the nodes which have a corresponding ragdoll actor. Also we could store last_world as (p,q) instead of as a matrix.)

For completion, given the two transforms, the code for compting the velocities will look something like this:

Vector3 p0 = translation(tm_0);
Vector3 p1 = translation(tm_1);
Vector3 velocity = (p1 - p0) / dt

Quaternion q0 = rotation(tm_0);
Quaternion q1 = rotation(tm_1);
Quaternion q = q1 * inverse(q0);
AxisAngle aa = q.decompose();
Vector3 angular_velocity = aa.axis * aa.angle / dt;