Friday, January 6, 2012

5 Tips for Programmer Productivity

1. Embrace the now-principle


If something takes less than five minutes to do, do it immediately.

It seems like the lazy option, but postponing something actually takes a lot of effort. The task needs to be written down somewhere. Then you need to track it and prioritize it with respect to other tasks. You will probably think about doing it lots of times, before you actually get down to doing it. And then you have to understand what you meant when you wrote it down and try to get back in that same mindset.

For small tasks it is just not worth it.

Instead, just do it. Fix the issue right now while you are already thinking about it. It is faster, simpler and saves you the agony of an ever-growing to-do list.

2. Fix the cause, not just the symptom


Don’t just fix problems. Fix the processes that allowed the problems to occur, so that the same problems never occur again. See bugs not as nuisances, but as chances to improve your processes and increase the quality of your code.

If an artist tells you that she gets an ”Error when compiling unit” error, don’t just diagnose it and tell her: ”It’s because you have two nodes with the same name, that is not allowed.” At the very least fix the error message so that it says ”Error when compiling unit ‘bed’. Two nodes have the same name ‘pillow’. One of them must be renamed so that names are unique.” Even better, fix the exporter or the tool, so that it is impossible for the artist to create two nodes with the same name.

If you find an error that could have been caught by an assert, then add that assert so that it will find the error next time.

If someone asks you ”How can I configure the animation compression?”, don’t just answer them. Also write a short text that explains how it is done, and add that text to the documentation.

In this way, you are not just patching holes and fixing leaks, you are actively making things better. This not only pleases the people who come to you with problems, it also makes your work feel more meaningful.

3. Try not to break concentration while ”the computer is working”


The job of programming is filled with a lot of weird little micro pauses. The code is compiling. The console is rebooting. The level is loading. The client is connecting. Etc.

In the best of worlds, these pauses would not exist, and by all means, do all you can to get rid of them. Make your code build faster. Hot reload data and scripts. Make a tool for quickly setting up a bunch of PS3s for a network test. Etc.

But even with the best of efforts, some pauses will likely remain. The question is what to do with them. The temptation is to take a short break from programming and do something else: check mail, answer a Skype message, read two paragraphs of an interesting article, update the Twitter feed, etc.

For me, these constant mental context switches can put a real damper on productivity, since they make it impossible to maintain concentration and flow.

Nor are these micro-excursions particularly relaxing. Reading two paragraphs of a web page while constantly glancing at a progress bar on the other monitor is not something that soothes my mind. Quite the contrary, it is much more stressful than remaining in the zen-like state of concentrated work. It is much better to take one real break than a hundred micro breaks.

So for both productivity and peace of mind I now make a conscious effort to stay focused on the problem at hand while ”the computer is working”. I have a separate text editor, unaffected by IDE freezes, where I can work on related tasks, such as:

  • Adding documentation
  • Refactoring and code review
  • Planning the next stage of implementation
  • Writing script code that tests the system

It is still an ongoing battle against the Lure of the Internet, but I find that when I manage to stay focused I am both more productive and more relaxed.

4. Use source control even more than you think you should


Source control is not just for source code. With modern distributed source control systems such as Mercurial and Git it is dead simple to create a source repository anywhere and then later (if needed) push it to a server for backup/sharing.

Do you have configuration and settings files for your text editor, IDE, etc? Put them in source control so that you can easily share them between your different machines. Do they need to be installed in special locations? Put them in source control together with a script that installs them in the right place.

Do you use any third party libraries such as zlib, LuaJIT or stb_vorbis? Check them into source control. That way, if you have to do any modifications (bug fixes, fixes for compiler warnings, platform fixes, your own personal optimizations, etc) you will know exactly what you have changed. If a new version of the library is released you can use the source control diff to see what has changed upstream and merge it with your own local changes.

Does an API come with sample code? Before you start playing with that sample code, check it into source control. That way, you can always revert the samples back to their pristine state, without having to reinstall the API. And if you find a bug in the API and manage to reproduce it in one of the samples, you can use the source control tool to produce a .patch file for the sample that you can send to the API manufacturer as part of your bug report. That will keep both you and them happy.

5. Monitor your builds


Set up a build server that continuously builds all your executables (engine, tools, exporters, ...) in all configurations (debug, development, release) for all platforms, so you know as soon as possible if something breaks. Fixing a problem right away is much easier than doing it two months down the line.

The build server doesn’t have to be a complicated thing. It is more important that it exists than that it has all the bells and whistles. If you don’t have time to do something advanced just write a script that compiles everything and reports the result. You can expand on that later.

Do the same for content. Write a script that loads all levels and spawns all units.

Use the report system that works best for you. We use Skype for internal real-time communication, so it makes sense to report broken builds over Skype. If e-mail or IRC works better for you, use that instead.

Wednesday, December 28, 2011

Code Share: Patch link.exe to ignore LNK4099

By default, Visual Studio's link.exe does not let you ignore the linker warning LNK4099 (PDB file was not found).

This can be a real nuisance when you have to link with third party libraries that reference (but do not come with) PDBs. You can get hundreds of linker warnings that you have no way of getting rid of.

The only way I've found of fixing the problem is to patch link.exe so that it allows warning 4099 to be ignored. Luckily, it is not as scary as it sounds. You only need to patch a single location to remove 4099 from a list of warnings that cannot be ignored. An outline of the procedure can be found here.

Following my general philosophy to write-a-script-for-it I wrote a short Ruby script that does the patching. I'm sharing it here for everybody who wants to do voodoo on their link.exe and get rid of the warning.


# This ruby program will patch the linker executable (link.exe)
# so that linker warning LNK4099 is ignorable.
#
# Reference: http://www.bottledlight.com/docs/lnk4099.html

require "fileutils"

def link_exes()
 res = []
 res << File.join(ENV["VS90COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS90COMNTOOLS"]
 res << File.join(ENV["VS100COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS100COMNTOOLS"]
 res << File.join(ENV["XEDK"], "bin/win32/link.exe") if ENV["XEDK"]
 return res
end

def patch_link_exe(exe)
 data = nil
 File.open(exe, "rb") {|f| data = f.read}
 unpatched = [4088, 4099, 4105].pack("III")
 patched = [4088, 65535, 4105].pack("III")

 if data.scan(patched).size > 0
  puts "* Already patched #{exe}"
  return
 end

 num_unpatched = data.scan(unpatched).size
 raise "Multiple patch locations in #{exe}" if num_unpatched > 1
 raise "Patch location not found in #{exe}" if num_unpatched == 0

 offset = data.index(unpatched)
 puts "* Found patch location #{exe}:#{offset}"
 bak = exe + "-" + Time.now.strftime("%y%m%d-%H%M%S") + ".bak"
 puts "  Creating backup #{bak}"
 FileUtils.cp(exe, bak)
 puts "  Patching exe"
 data[offset,unpatched.size] = patched
 File.open(exe, "wb") {|f| f.write(data)}
 return true
end

link_exes.each do |exe|
 patch_link_exe(exe)
end

Thursday, December 22, 2011

Platform Specific Resources

I recently added a new feature to the BitSquid tool chain – support for source and destination platforms in the data compiler. What it means is that you can take the data for one platform (the source) and compile it to run on a different platform (the destination). So you can take the data for the mobile version of a game (with all its content optimizations) and compile it so that it runs on your development PC.

This is nice for two reasons. First, access to target hardware can be limited. In a perfect world, every artist would have a dev kit for every target platform. In practice, this might not be economically possible. It might not even be electrically possible (those main fuses can only take so much). Being able to preview and play console/handheld content on PC is better than nothing, in this less-than-perfect world.

Second, since all our editors use the engine for visualization, if we have specified a handheld device as our source platform, all the editors will automatically show the resources as they will appear on that device.

This new feature gives me a chance to talk a little bit about how we have implemented support for platform specific resources, something I haven’t touched on before in this blog.

The BitSquid Tech uses the regular file system for its source data. A resource is identified by its name and type, both of which are determined from the path to the source file:


Note that even though the name is a path, it is not treated as one, but as a unique identifier. It is hashed to a 64-bit integer by the engine and to refer to a resource you must always specify its full name (and get the same hash result). In the compiled data, the raw names don’t even exist anymore; the files are stored in flat directories indexed by the hash values.
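As a sketch of what that hashing step looks like (the engine’s actual hash function isn’t named here, so FNV-1a serves purely as a well-known stand-in, and hash_name is a made-up identifier):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 64-bit name hash. The engine's actual hash function is not
// specified in the text; FNV-1a is used here only as a simple stand-in.
// The point is the idea: the full path string is reduced once to a fixed
// 64-bit identifier, and only that identifier exists in the compiled data.
uint64_t hash_name(const char *name)
{
    uint64_t h = 14695981039346656037ull;   // FNV-1a offset basis
    for (const char *p = name; *p; ++p) {
        h ^= (uint64_t)(unsigned char)*p;
        h *= 1099511628211ull;              // FNV-1a prime
    }
    return h;
}
```

Since only the hash survives compilation, referring to a resource always means supplying its exact full name so that the same 64-bit value is produced.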

In addition to name and type a resource can also have a number of properties. Properties are dot-separated strings that appear before the type in the file name:


Properties are used to indicate different variants of the same resource. So all these files represent variants of the same resource:

buttons.texture
buttons.ps3.texture
buttons.en.x360.texture
buttons.fr.x360.texture

The two most important forms of properties are platforms and languages.

Platform properties (x360, ps3, android, win32, etc) are used to provide platform specific versions of resources. This can be used for platform optimized versions of units and levels. Another use is for controller and button images that differ from platform to platform. Since BitSquid is scripted in Lua and Lua files are just a resource like any other, this can also be used for platform specific gameplay code:

PlayerController.android.lua

Language properties (en, fr, jp, it, sv, etc) are used for localization. Since all resources have properties, all resources can be localized.

But the property system is not limited to platforms and languages. A developer can make up whatever properties she needs and use them to provide different variants of resources:

bullet_hit.noblood.particle_effect
foliage.withkittens.texture

Properties can be resolved either at data compile time or at runtime.

Platform properties are resolved at compile time. When we compile for PS3 and a resource has ps3 specific variants, only those variants are included in the compiled data. (If the resource doesn’t have any ps3 variants, we include all variants that do not have a specified platform.)

Language properties and other custom properties are resolved at runtime. All variants are compiled to the runtime data. When running, the game can specify what resource variants it wants with a property preference order. The property preference order specifies the variants it wants to use, in order of preference.

Application.set_property_preference_order {"withkittens", "noblood", "fr"}

This means that the game would prefer to get a resource that has lots of kittens, no blood and is in French. But if it can’t get all that, it will rather have something that is kitten-full than blood-free. And it prefers a bloodless English resource to a bloody French one.

In other words, if we requested the resource buttons.texture with these settings, the engine would look for variants in the order:

buttons.withkittens.noblood.fr.texture
buttons.withkittens.noblood.texture
buttons.withkittens.fr.texture
buttons.withkittens.texture
buttons.noblood.fr.texture
buttons.noblood.texture
buttons.fr.texture
buttons.texture
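This lookup order can be generated mechanically. The sketch below is illustrative (not the engine’s actual API): with n properties in the preference order, every subset of them is a candidate variant, and counting a bitmask down from 2^n - 1 to 0, treating earlier (more preferred) properties as more significant bits, yields exactly the list above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of generating the variant lookup order from a property preference
// order. Illustrative code, not the engine's actual implementation. Earlier
// properties act as more significant bits, so counting the mask down from
// 2^n - 1 to 0 enumerates subsets from most to least preferred.
std::vector<std::string> variant_lookup_order(const std::string &name,
    const std::vector<std::string> &prefs, const std::string &type)
{
    std::vector<std::string> order;
    const int n = (int)prefs.size();
    for (int mask = (1 << n) - 1; mask >= 0; --mask) {
        std::string variant = name;
        for (int i = 0; i < n; ++i)
            if (mask & (1 << (n - 1 - i)))   // is prefs[i] part of this subset?
                variant += "." + prefs[i];
        order.push_back(variant + "." + type);
    }
    return order;
}
```

For the preference order {withkittens, noblood, fr} this produces the eight variants listed above, ending with the plain buttons.texture.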

To add support for different source and destination platforms to this system all I had to do was to add a feature that lets the data compiler use one platform for resolving properties and a different platform as the format for the runtime files it produces.

Thursday, December 8, 2011

A Pragmatic Approach to Performance

Is premature optimization the root of all evil? Or is the fix-it-later attitude to performance turning programmers from proud ”computer scientists” to despicable ”script kiddies”?

These are questions without definite answers, but in this article I’ll try to describe my own approach to performance. How I go about to ensure that my systems run decently, without compromising other goals, such as modularity, maintainability and flexibility.

§1 Programmer time is a finite resource

If you are writing a big program, some parts of the code will not be as fast as theoretically possible. Sorry, let me rephrase. If you are writing a big program, no part of the code will be as fast as theoretically possible. Yes, I think it is reasonable to assume that every single line of your code could be made to run a little tiny bit faster.

Writing fast software is not about maximum performance all the time. It is about good performance where it matters. If you spend three weeks optimizing a small piece of code that only gets called once a frame, then that’s three weeks of work you could have spent doing something more meaningful. If you had spent it on optimizing code that actually mattered, you could even have made a significant improvement to the game’s frame rate.

There is never enough time to add all the features, fix all the bugs and optimize all the code, so the goal should always be maximum performance for minimum effort.

§2 Don’t underestimate the power of simplicity

Simple solutions are easier to implement than complex solutions. But that’s only the tip of the iceberg. The real benefits of simple solutions come in the long run. Simple solutions are easier to understand, easier to debug, easier to maintain, easier to port, easier to profile, easier to optimize, easier to parallelize and easier to replace. Over time, all these savings add up.

Using a simple solution can save so much time that even if it is slower than a more complex solution, as a whole your program will run faster, because you can use the time you saved to optimize other parts of the code. The parts that really matter.

I only use complex solutions when it is really justified. I.e. when the complex solution is significantly faster than the simple one (a factor of 2 or so) and when it is in a system that matters (one that consumes a significant percentage of the frame time).

Of course simplicity is in the eyes of the beholder. I think arrays are simple. I think POD data types are simple. I think blobs are simple. I don’t think class structures with 12 levels of inheritance are simple. I don’t think classes templated on 8 policy class parameters are simple. I don’t think geometric algebra is simple.

§3 Take advantage of the system design opportunity

Some people seem to think that to avoid ”premature optimization” you should design your systems without any regard to performance whatsoever. You should just slap something together and fix it later when you ”optimize” the code.

I wholeheartedly disagree. Not because I love performance for its own sake, but for purely pragmatic reasons.

When you design a system you have a clear picture in your head of how the different pieces fit together, what the requirements are and how often different functions get called. At that point, it is not much extra effort to take a few moments to think about how the system will perform and how you can set up the data structures so that it runs as fast as possible.

In contrast, if you build your system without considering performance and have to come in and ”fix it” at some later point, that will be much harder. If you have to rearrange the fundamental data structures or add multithreading support, you may have to rewrite the entire system almost from scratch. Only now the system is in production, so you may be restricted by the published API and dependencies to other systems. Also, you cannot break any of the projects that are using the system. And since it was several months since you (or someone else) wrote the code, you have to start by understanding all the thoughts that went into it. And all the little bug fixes and feature tweaks that have been added over time will most likely be lost in the rewrite. You will start again with a fresh batch of bugs.

So by just following our general guideline ”maximum performance for minimum effort”, we see that it is better to consider performance up front. Simply because it requires a lot less effort than fixing it later.

Within reason of course. The performance improvements we do up front are easier, but we are less sure that they matter in the big picture. Later, profile-guided fixes require more effort, but we know better where to focus our attention. As in whole life, balance is important.

When I design a system, I do a rough estimate of how many times each piece of code will be executed per frame and use that to guide the design:

  • 1-10: Performance doesn’t matter. Do whatever you want.
  • 100: Make sure it is O(n), data-oriented and cache friendly.
  • 1000: Make sure it is multithreaded.
  • 10000: Think really hard about what you are doing.

I also have a few general guidelines that I try to follow when writing new systems:

  • Put static data in immutable, single-allocation memory blobs
  • Allocate dynamic data in big contiguous chunks
  • Use as little memory as possible
  • Prefer arrays to complex data structures
  • Access memory linearly (in a cache friendly way)
  • Make sure procedures run in O(n) time
  • Avoid ”do nothing” updates -- instead, keep track of active objects
  • If the system handles many objects, support data parallelism

By now I have written so many systems in this ”style” that it doesn’t require much effort to follow these guidelines. And I know that by doing so I get a decent baseline performance. The guidelines focus on the most important low-hanging fruit: algorithmic complexity, memory access and parallelization and thus give good performance for a relatively small effort.
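As a small illustration of two of these guidelines (dynamic data in one contiguous chunk, no ”do nothing” updates), here is a sketch with made-up names; objects that cannot change are simply never visited by the update loop:

```cpp
#include <cassert>
#include <vector>

// A sketch (all names made up) of two of the guidelines: dynamic data in one
// contiguous chunk, and no "do nothing" updates -- only objects registered
// as active are ever visited, so resting objects cost nothing per frame.
struct Particle { float pos, vel; };

struct ParticleSystem
{
    std::vector<Particle> particles;   // one contiguous array of PODs
    std::vector<unsigned> active;      // indices of objects that need updating

    unsigned spawn(float pos, float vel)
    {
        particles.push_back({pos, vel});
        unsigned i = (unsigned)particles.size() - 1;
        if (vel != 0.0f)               // a resting particle never changes
            active.push_back(i);
        return i;
    }

    // O(number of active objects), linear access over a POD array.
    void update(float dt)
    {
        for (unsigned i : active)
            particles[i].pos += particles[i].vel * dt;
    }
};
```

The update loop touches only the active indices, so a level full of static objects adds nothing to the per-frame cost.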

Of course it is not always possible to follow all guidelines. For example, some algorithms really require more than O(n) time. But I know that when I go outside the guidelines I need to stop and think things through, to make sure I don’t trash the performance.

§4 Use top-down profiling to find bottlenecks

No matter how good your up front design is, your code will be spending time in unexpected places. The content people will use your system in crazy ways and expose bottlenecks that you’ve never thought about. There will be bugs in your code. Some of these bugs will not result in outright crashes, just bad performance. There will be things you haven’t really thought through.

To understand where your program is actually spending its time, a top down profiler is an invaluable tool. We use explicit profiler scopes in our code and pipe the data live over the network to an external tool that can visualize it in various ways:


An (old) screenshot of the BitSquid Profiler.


The top-down profiler tells you where your optimization efforts need to be focused. If you spend 60 % of the frame time in the animation system and 0.5 % in the Gui, then any optimizations you can make to the animations will really pay off, but what you do with the Gui won’t matter one iota.

With a top-down profiler you can insert narrower and narrower profiler scopes in the code to get to the root of a performance problem -- where the time is actually being spent.

I use the general design guidelines to get a good baseline performance for all systems and then drill down with the top-down profiler to find those systems that need a little bit of extra optimization attention.

§5 Use bottom-up profiling to find low-level optimization targets

I find that as a general tool, interactive top-down profiling with explicit scopes is more useful than a bottom-up sampling profiler.

But sampling profilers still have their uses. They are good at finding hotspot functions that are called from many different places and thus don’t necessarily show up in a top-down profiler. Such hotspots can be a target for low-level, instruction-by-instruction optimizations. Or they can be an indication that you are doing something bad.

For example if strcmp() is showing up as a hotspot, then your program is being very very naughty and should be sent straight to bed without any cocoa.

A hotspot that often shows up in our code is luaV_execute(). This is not surprising. That is the main Lua VM function, a big switch statement that executes most of Lua’s opcodes. But it does tell us that some low level, platform specific optimizations of that function might actually result in real measurable performance benefits.

§6 Beware of synthetic benchmarks

I don’t do much synthetic benchmarking, i.e., looping the code 10 000 times over some made-up piece of data and measuring the execution time.

If I’m at a point where I don’t know whether a change will make the code faster or not, then I want to verify that with data from an actual game. Otherwise, how can I be sure that I’m not just optimizing the benchmark in ways that won’t carry over to real world cases?

A benchmark with 500 instances of the same entity, all playing the same animation is quite different from the same scene with 50 different unit types, all playing different animations. The data access patterns are completely different. Optimizations that improve one case may not matter at all in the other.

§7 Optimization is gardening

Programmers optimize the engine. Artists put in more stuff. It has always been thus. And it is good.

Optimization is not an isolated activity that happens at a specific time. It is a part of the whole life cycle: design, maintenance and evolution. Optimization is an ongoing dialog between artists and programmers about what the capabilities of the engine should be.

Managing performance is like tending a garden, checking that everything is ok, rooting out the weeds and finding ways for the plants to grow better.

It is the job of the artists to push the engine to its knees. And it is the programmers’ job to bring it back up again, only stronger. In the process, a middle ground will be found where the games can shine as bright as possible.

Monday, November 7, 2011

An Example in Data-Oriented Design: Sound Parameters

The BitSquid sound system allows arbitrary parameters to be set on playing sounds:

force = 35.3
material = "wood"
weapon = "axe"

In the sound editor the sound designer can set up curves and switches that depend on these parameters. So, for example, the designer can choose to play different wav files for a weapon impact, depending on the weapon that was used and the material it hit. In addition the volume and pitch of the sound can be controlled by a curve connected to the force of the impact.

To implement this behavior, we need a way of representing such parameter sets in the engine. Since there can potentially be lots of playing sounds, we need a representation that is as efficient as possible.

If you did a by-the-book C++ design of this problem, you might end up with an abomination like this:

struct ParameterValue
{
 enum Type {STRING_TYPE, NUMERIC_TYPE};
 Type type;
 std::string string_value;
 float numeric_value;
};

typedef std::map<std::string, ParameterValue> Parameters;

struct SoundInstance
{
 // Other members...
 Parameters *parameters;
};

std::vector<SoundInstance> playing_sounds;

which would result in tons of pointer chasing, memory allocation and data copying.

So let’s fix it!

First, let’s get rid of the strings. Strings should almost only be used for text that is displayed to the end user. For everything else, they are usually a bad idea. In this case, since the only thing we need to do is match strings that are equal (find the parameter named ”material”, check if its value is ”wood”, etc) we can use a hash instead of the full string value:

struct ParameterValue
{
 enum Type {STRING_TYPE, NUMERIC_TYPE};
 Type type;
 union {
  IdString32 string_value;
  float numeric_value;
 };
};

typedef std::map<IdString32, ParameterValue> Parameters;

IdString32 is our type for representing hashed strings. It just stores a 4-byte string hash. Since it is a POD-type, we can put it in a union together with the numeric value. This takes the ParameterValue struct down to a manageable 8 bytes with no dynamic data allocation.

But we can actually make it even smaller, by just getting rid of the type:

union ParameterValue {
 IdString32 string_value;
 float numeric_value;
};

We can do this because when we access the parameter we know which type we want. If we are evaluating a curve, we want a numeric value. If we want to compare it to a hash, we want a string value. Getting rid of the type means we can’t assert() on type errors (if someone has done something silly like setting the ”material” to 3.5 or the ”force” to ”banana”). But other than that everything will work as before.

Next, let’s attack the map:

typedef std::map<IdString32, ParameterValue> Parameters;

Just like std::string, std::map should set off all kinds of warning bells in your head. std::map is almost never a good choice. Better alternatives are: linear search in a std::vector (for smallish maps), binary search in a sorted array (for larger, static maps) or hash_map.

In this case, we don’t expect there to be that many parameters set on a sound (<10 in the typical case), so linear search is fine:

struct Parameter {
 IdString32 key;
 union {
  IdString32 string_value;
  float numeric_value;
 };
};

typedef std::vector<Parameter> Parameters;

struct SoundInstance
{
 // Other members...
 Parameters *parameters;
};

std::vector<SoundInstance> playing_sounds;

A lot better than what we started with. But I’m still not 100 % satisfied.

I don’t like the fact that we have a vector of sound instances, and each of those contains a vector of parameters. Vectors-in-vectors raise performance warning flags for me. I like it when my data structures are just arrays of POD structs. Then I know that they are cache friendly and don’t put much strain on the memory system. 512 parameter vectors allocated on the heap for 512 playing sounds make me uneasy.

So what can we do? We could go to a fixed number of parameters:

struct SoundInstance
{
 // Other members...
 unsigned num_parameters;
 Parameter parameters[MAX_INSTANCE_PARAMETERS];
};

Now the SoundInstance is a POD and all the data is just one big happy blob.

The drawback of this approach is that you might need to set MAX_INSTANCE_PARAMETERS pretty high to be able to handle the most complicated sounds. This would waste some memory for all the sounds that use just one or two parameters.

Say you have 512 sounds and MAX_INSTANCE_PARAMETERS = 32, with 8 bytes in the Parameter struct that then totals to 131 K. Not terrible, but not a tuppence either.

There should be some way of doing better. But if we can’t use a dynamic vector, nor a static array, what can we then possibly use?

A linked list!

Regular linked lists have horrible cache behavior and are best avoided. But we can achieve the benefits of linked lists while still having decent cache performance by putting the list in an array:

struct ParameterNode {
 IdString32 key;
 union {
  IdString32 string_value;
  float numeric_value;
 };
 ParameterNode *next;
};

ParameterNode nodes[MAX_PARAMETERS];

struct SoundInstance
{
 // Other members...
 ParameterNode *parameters;
};

std::vector<SoundInstance> playing_sounds;

Now we have all the parameters stored in a single memory blob. And instead of having a maximum number of parameters per sound, we have a total limit on the number of set parameters (which works much better when most sounds have few parameters). We could get rid of that limit as well if we needed to, by using a vector instead of an array to store the nodes and indices instead of pointers for the ”links”.

You can use many different strategies for allocating nodes from the array. My favorite method is to walk over the array until the next free node is found:

unsigned last_allocated = MAX_PARAMETERS-1;

// Note: this assumes there is always at least one free node, otherwise the
// loop never terminates.
ParameterNode *allocate_node()
{
 while (true) {
  last_allocated = (last_allocated + 1) % MAX_PARAMETERS;
  if (nodes[last_allocated].key == 0)
   break;
 }
 return &nodes[last_allocated];
}

Here, an empty key is used to indicate free nodes.

The advantage of this method is that nodes that are allocated at the same time end up in adjacent array slots. This means that all the parameters of a particular sound (which tend to get set at the same time) get stored next to each other in memory, which means they can be accessed without cache misses.
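To sketch how such a node pool might be used (set_numeric and get_numeric are invented names for illustration, not the engine’s actual API), here is setting and reading a numeric parameter on an instance’s list, with the allocator repeated from above:

```cpp
#include <cassert>

// Sketch of using the node pool (set_numeric/get_numeric are invented names,
// not the engine's API). allocate_node is repeated from the article: it scans
// the shared array for a node whose key is 0 (meaning "free").
typedef unsigned IdString32;   // stand-in for the engine's hashed-string type

struct ParameterNode {
    IdString32 key;
    union {
        IdString32 string_value;
        float numeric_value;
    };
    ParameterNode *next;
};

const unsigned MAX_PARAMETERS = 1024;
ParameterNode nodes[MAX_PARAMETERS];          // zero-initialized: all free
unsigned last_allocated = MAX_PARAMETERS - 1;

ParameterNode *allocate_node()
{
    while (true) {
        last_allocated = (last_allocated + 1) % MAX_PARAMETERS;
        if (nodes[last_allocated].key == 0)
            break;
    }
    return &nodes[last_allocated];
}

// Update an existing parameter, or push a freshly allocated node onto the
// instance's list. Nodes set together end up adjacent in the array.
void set_numeric(ParameterNode *&head, IdString32 key, float value)
{
    for (ParameterNode *n = head; n; n = n->next)
        if (n->key == key) { n->numeric_value = value; return; }
    ParameterNode *n = allocate_node();
    n->key = key;
    n->numeric_value = value;
    n->next = head;
    head = n;
}

bool get_numeric(const ParameterNode *head, IdString32 key, float &out)
{
    for (const ParameterNode *n = head; n; n = n->next)
        if (n->key == key) { out = n->numeric_value; return true; }
    return false;
}
```

Since a sound rarely has more than a handful of parameters, the linear walk over the short list is cheap, and the walk stays mostly within one cache line thanks to the adjacent allocation.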

Sunday, October 23, 2011

Low Level Animation -- Part 2

Some time ago I wrote an article describing how animation compression is implemented in the BitSquid engine. In that article I made a vague promise that I would follow up with a description of how to pack the data in a cache-friendly way. Now, the time has come to deliver on that vague promise.

A quick recap: After curve fitting, each track of our animation consists of a number of curve points that describe the curve for each animation track:


By an animation track I mean the animation of a single parameter, typically the position or rotation of a bone.

The data for the track is a sequence of times and curve data:


Here t_i is the time of a curve point and A_i is the corresponding curve data.

To evaluate the curve at any particular point t we need the curve points both before and after the time t


Depending on what curve type you use (hermite, bezier, b-spline, etc) you might actually need more than two curve points to evaluate a segment, but that doesn’t really affect the discussion in this article, so for the sake of simplicity, let’s stick with two.

Note that the time points for the different tracks in the animation typically do not match up. For example, one curve may be completely flat and only require one sample at the start and one sample at the end. Another curve may be complicated and require lots of samples.

To simplify the discussion further, assume that the animation only contains two tracks (it is easy to generalize the solution to more tracks). We will call the curve points of one (t_i, A_i) and the curve points of the other (s_i, B_i):


How can we organize this data to be as cache friendly as possible?

The most natural approach is perhaps to sort the data first by track and then by time. Let’s see what this means for the cache. To evaluate the animation for some particular time t, we have to go into the data for each track at that time to look up the two neighboring curve points. Let’s assume that we have somehow cached our current position in each track, so that we don’t have to search for it, we will still have at least one cache miss for each track. A modern character can have over 100 bones, with two tracks per bone. That’s 200 cache misses for just a single frame of a single animation.

To do better, we need to organize the data by time somehow. But it is not immediately clear how. Just sorting the data by time won’t help, because then a flat curve with just two curve points, one at the beginning and one at the end, will have them at complete opposite ends of the data and no matter what we do we will get cache misses when touching them.

Let’s consider all the data we need to evaluate the tracks at time t. We need (t_i, A_i), (t_i+1, A_i+1) and (s_j, B_j), (s_j+1, B_j+1) where t_i <= t <= t_i+1 and s_j <= t <= s_j+1. This is our ”hot” data, because we will need to refer to it several times as we evaluate the curve at different points in time. In fact, we can keep using this same data until we reach whichever is smaller of t_i+1 and s_j+1. A general rule in memory access optimization is to keep the ”hot” data together, so let’s create an additional data structure: an array with the currently active curve points for a playing animation instance.
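Such an active array might look like the sketch below. The layout is my own illustration, with one float of curve data per point where a real engine would store larger payloads; the point is that evaluation only ever touches this one contiguous block:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "active list" entry: for one track, the two curve points
// that currently bracket the playhead.
struct ActiveSegment {
    float t0, a0; // curve point before the playhead
    float t1, a1; // curve point after the playhead
};

// Evaluate all tracks at time t, touching only the contiguous active array.
// Linear interpolation stands in for the real curve evaluation.
void evaluate_all(const std::vector<ActiveSegment> &active, float t,
                  std::vector<float> &out)
{
    out.resize(active.size());
    for (std::size_t i = 0; i < active.size(); ++i) {
        const ActiveSegment &s = active[i];
        float u = (t - s.t0) / (s.t1 - s.t0);
        out[i] = s.a0 + u * (s.a1 - s.a0);
    }
}
```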


Now we’re getting somewhere. Not only have we significantly improved the cache behavior; as long as we don’t need to fetch new curve points we only need to refer to the active array, a single memory access. We have also decomposed our animation evaluation problem into two simpler tasks: evaluating curves and fetching new curve points. This makes our code both simpler and more flexible.

Let’s look at the second issue, fetching new curve points. In the example above, when we reach the time t_i+1 we will need to fetch the new curve point (t_i+2, A_i+2) and when we reach the time s_j+1 we will need to fetch (s_j+2, B_j+2).


Generalizing, we always need to fetch the point (t_i, A_i) at the time t_i-1, and we always need to fetch the point (s_i, B_i) at the time s_i-1. This is excellent, because since we know the time when each curve point will be needed, we can put them all in a single stream of data, sorted by the time when they will be needed.


This means that our animation player only needs to keep a single pointer into the animation stream. That pointer will always point to the next curve point that needs to be moved to the active list. As time is advanced, curve points are copied from the animation data into the active list and then the curve is evaluated.
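The advance step might be sketched like this. The StreamEntry layout (a needed-at time plus a track index alongside each curve point) is my own guess at a workable format, not the engine's actual one, and curve data is again a single float:

```cpp
#include <vector>

// One track's bracketing curve points (hot data, as described above).
struct Segment { float t0, a0, t1, a1; };

// Hypothetical stream entry: a curve point tagged with the time at which
// it must enter the active list and the track it belongs to.
struct StreamEntry {
    float needed_at; // time when this point is moved to the active list
    int track;       // index of the track it updates
    float t, a;      // the curve point itself
};

struct Player {
    const StreamEntry *next, *end; // single pointer into the sorted stream
    std::vector<Segment> active;   // one segment per track

    void advance(float time)
    {
        // Stream access is strictly linear: just move the pointer forward,
        // copying curve points into the active list as they become needed.
        while (next != end && next->needed_at <= time) {
            Segment &s = active[next->track];
            s.t0 = s.t1; s.a0 = s.a1;       // old "after" point becomes "before"
            s.t1 = next->t; s.a1 = next->a; // install the new curve point
            ++next;
        }
    }
};
```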


Note the excellent cache behavior this gives us. To fetch new curve points, we just move a pointer forward in memory. And then, to evaluate the curves, we just need to access our active array, a single continuous memory block. This gives us a grand total of just two memory accesses.

Another nice property is that since we are now accessing the animation data as a stream (strictly linearly, from beginning to end) we can gzip it and get another factor of two of compression. We can also easily stream it from disk.

One drawback of this system is that it only supports playing an animation forward: you cannot jump to a particular time in an animation without ”fast forwarding” through all the intermediate curve points.

If you need support for jumping, the easiest way to achieve it is perhaps to add a separate index with jump frames. A jump frame consists of the state of the active array at some point in time, together with an offset into the data stream. In other words, all the state information that the animation player needs to jump to that time point and resume playing.

Using jump frames lets you balance performance and memory use. If you add more jump frames you will use more memory, but on the other hand you will be able to find a jump frame closer to the time you actually want to go to, which means less fast forwarding.
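A jump-frame index could be sketched as follows; the field names and the seek strategy (restore the closest earlier snapshot, then fast-forward) are illustrative, not the engine's actual code:

```cpp
#include <cstddef>
#include <vector>

// One track's bracketing curve points, as in the active list.
struct Segment { float t0, a0, t1, a1; };

// Hypothetical jump frame: a snapshot of the full player state, so that
// playback can be resumed at `time` without replaying the whole stream.
struct JumpFrame {
    float time;                  // time of the snapshot
    std::size_t stream_offset;   // resume position in the data stream
    std::vector<Segment> active; // copy of the active list at that time
};

// Find the last jump frame at or before t (frames are sorted by time).
// The player restores this state, then fast-forwards from frame.time to t.
const JumpFrame &find_jump_frame(const std::vector<JumpFrame> &frames, float t)
{
    std::size_t i = 0;
    while (i + 1 < frames.size() && frames[i + 1].time <= t)
        ++i;
    return frames[i];
}
```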

Saturday, October 8, 2011

Caring by Sharing: Header Hero


Compile times get worse over time, that is the second law of C++ programming dynamics. There are many small day-to-day changes that each exacerbate the problem slightly: The project grows. New header files get included. Clever templates get written. And so on. There are comparatively few forces that work in the other direction. Once an #include has been added, it stays.

The only exception is when some hero steps up, says Enough! and starts to crunch down on those header files. It is thankless menial work that offers few rewards, save the knowledge that you are contributing to the public good.

Today, I want to give something back to these unsung heroes, so I’ve made a small tool to make their drudgery a bit less... drudgery-ish? It is called Header Hero:


To run Header Hero you specify the directories where your .cpp files can be found as well as the directories to search for included headers. The program scans your .h and .cpp files to find all the include links. It presents the result in a summarized report that shows you what the worst headers are. You can think of it as a header file profiler.

You don’t need to specify all your include directories, but only the ones you have specified will be scanned.

I’ve focused on making the tool fast by caching as much information as possible and using a simple parser that just looks for #include patterns rather than running the real C preprocessor. The downside is that if you are using any fancy preprocessor tricks, they will most likely be missed. On the other hand, the tool can scan a huge project in seconds. And after the initial scan, new scans can be done in a fraction of that time.
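The core of such a scan can be surprisingly small. Header Hero itself is written in C#, but a minimal sketch of the pattern-matching idea, in C++ for consistency with the rest of this post, might look like this:

```cpp
#include <regex>
#include <string>
#include <vector>

// Minimal include scanner in the spirit described above: instead of
// running a real C preprocessor, just pattern-match #include lines.
// Fancy preprocessor tricks (computed or #ifdef'd includes) are
// deliberately missed; that is what makes the scan fast.
std::vector<std::string> find_includes(const std::string &source)
{
    static const std::regex re("#\\s*include\\s*[<\"]([^>\"]+)[>\"]");
    std::vector<std::string> result;
    for (std::sregex_iterator it(source.begin(), source.end(), re), end;
         it != end; ++it)
        result.push_back((*it)[1].str()); // the header name inside <> or ""
    return result;
}
```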

The program produces a report that looks something like this:


At the top are some statistics, such as the total number of files and lines in the project. Total Parsed counts how many lines would actually be parsed in a full recompile of the project, so a header that is included by several .cpp files adds to that number once for each include. The Blowup Factor is the last two items divided: it specifies how many times, on average, each line gets parsed. A value of 35 means that on average, each line in our project is parsed 35 times. That seems quite a lot.

Below the summary is a list of the header files, sorted by how many lines they contributed to the Total Parsed number. In other words, the size of each file multiplied by the number of times it was included.

Looking at the sample report above, it seems pretty reasonable. At the top we find big templated collection classes (map, set, string, vector) that have big header files and are used in a lot of places. Math (matrix4x4, vector3) and utility (critical_section, file_system) files also end up high on the list.

But when you dig into it, there are also things that seem a bit fishy. Set<T> is not a very popular collection class. Sets are used less than maps, and HashSet is usually preferable to Set. Why does it end up so high on the list? What is shader.h doing there? That seems too specialized to end up so high. And file_system.h? There shouldn’t be that much code that directly accesses the file system, only the resource loader needs to do that.

To answer those questions, you can click on any file in the report to get a detailed view of its relations:


In the middle we find the file we are looking at. To the left are the files that directly include it. The number after each file name specifies how many files directly or indirectly include that file. To the right are the files included by the file. There, the numbers count all the files directly or indirectly included by those files. You can double click on any file name in the view to refocus on it.

Here we clearly see that the main culprit is data_compiler.h. It includes set.h and is in turn included by 316 other files. To fix the compile times we can make data_compiler.h not include set.h or we can try to reduce the number of files that include data_compiler.h (that number also seems high). If we also fix scene_graph.h we can really make a difference.

Breaking dependencies is a whole topic in itself, especially when it comes to templates and inlined code. Here are some quick tips though:

1) Predeclare the structs and classes that you use instead of including the header file. Don’t forget that you can predeclare templates and typedefs as well as regular classes:

class MyClass;
typedef int Id;
template <class T> class Vector;

2) Predeclared types can only be used as pointers and references. You can’t have a member variable of a type whose actual size is unknown. So you may have to change your member variables to pointers in order to get rid of the header dependency. You can also use the pimpl idiom, if you can live with the extra indirection and lack of inlining.

3) Switching from in-place variables to pointers can lead to bad memory access patterns. One way of fixing that is to placement new the object directly into a raw memory buffer.

// a.h

class B;

class A {
public:
    A();

private:
    B *_b;
    static const int SIZE_OF_B = 20;
    // Raw storage for B. Real code must also make sure this is properly
    // aligned for B, e.g. with an alignment specifier.
    char _b_storage[SIZE_OF_B];
};

// a.cpp

#include "a.h"
#include "b.h"
#include <new>

A::A()
{
    XASSERT(sizeof(B) == SIZE_OF_B);
    _b = new (_b_storage) B();
}

With this technique, you get the data for B stored inside A, without having to include the b.h header in a.h. But the code isn’t exactly easy to read, so you should only use this in desperate situations.

4) For files with small type definitions, but lots of inlined methods (e.g., matrix4x4.h), a good strategy is to split the file, so you have just the type in one file and all the methods in the other. Header files can then include just the type definition, while .cpp files pull in the whole shebang.
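The split might look like the sketch below; the file names and the matrix type are hypothetical stand-ins:

```cpp
// matrix4x4_type.h -- just the type definition. Cheap to include, so
// other header files can pull in only this.
struct Matrix4x4 {
    float m[16]; // row-major 4x4 matrix
};

// matrix4x4.h -- would #include "matrix4x4_type.h" and add all the
// inlined methods; only .cpp files that actually operate on matrices
// need to pull this in.
inline Matrix4x4 identity()
{
    Matrix4x4 r = {}; // zero-initialize all elements
    for (int i = 0; i < 4; ++i)
        r.m[i * 4 + i] = 1.0f; // set the diagonal
    return r;
}
```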

Using these techniques you can get rid of the header dependencies one by one, until you are back at reasonable compile times. Since a rescan takes just a fraction of a second it is easy to see how your changes affect the compile time. Just make sure you have your integration test running; it is easy to break build configurations when you are fiddling around with the headers.

Here is the result of about a day and a half of header optimization in our code base:


From 6 million to 4.3 million lines, that’s not too shabby. We can now do a complete rebuild in 37 seconds on a reasonably modern machine. With this tool we can hopefully keep that number.

You can download the C# source code here. Feel free to do whatever you like with it.