Wednesday, December 28, 2011

Code Share: Patch link.exe to ignore LNK4099

By default, Visual Studio's link.exe does not let you ignore the linker warning LNK4099 (PDB file was not found).

This can be a real nuisance when you have to link with third party libraries that reference (but do not come with) PDBs. You can get hundreds of linker warnings that you have no way of getting rid of.

The only way I've found of fixing the problem is to patch link.exe so that it allows warning 4099 to be ignored. Luckily, it is not as scary as it sounds. You only need to patch a single location to remove 4099 from a list of warnings that cannot be ignored. An outline of the procedure can be found here.

Following my general write-a-script-for-it philosophy, I wrote a short Ruby script that does the patching. I'm sharing it here for everybody who wants to do voodoo on their link.exe and get rid of the warning.

(Click here for pastebin version.)

# This ruby program will patch the linker executable (link.exe)
# so that linker warning LNK4099 is ignorable.
#
# Reference: http://www.bottledlight.com/docs/lnk4099.html

require "fileutils"

def link_exes()
 res = []
 res << File.join(ENV["VS90COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS90COMNTOOLS"]
 res << File.join(ENV["VS100COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS100COMNTOOLS"]
 res << File.join(ENV["XEDK"], "bin/win32/link.exe") if ENV["XEDK"]
 return res
end

def patch_link_exe(exe)
 data = nil
 File.open(exe, "rb") {|f| data = f.read}
 unpatched = [4088, 4099, 4105].pack("III")
 patched = [4088, 65535, 4105].pack("III")

 if data.scan(patched).size > 0
  puts "* Already patched #{exe}"
  return
 end

 num_unpatched = data.scan(unpatched).size
 raise "Multiple patch locations in #{exe}" if num_unpatched > 1
 raise "Patch location not found in #{exe}" if num_unpatched == 0

 offset = data.index(unpatched)
 puts "* Found patch location #{exe}:#{offset}"
 bak = exe + "-" + Time.now.strftime("%y%m%d-%H%M%S") + ".bak"
 puts "  Creating backup #{bak}"
 FileUtils.cp(exe, bak)
 puts "  Patching exe"
 data[offset,unpatched.size] = patched
 File.open(exe, "wb") {|f| f.write(data)}
 return true
end

link_exes.each do |exe|
 patch_link_exe(exe)
end

Thursday, December 22, 2011

Platform Specific Resources

I recently added a new feature to the BitSquid tool chain – support for source and destination platforms in the data compiler. What it means is that you can take the data for one platform (the source) and compile it to run on a different platform (the destination). So you can take the data for the mobile version of a game (with all its content optimizations) and compile it so that it runs on your development PC.

This is nice for two reasons. First, access to target hardware can be limited. In a perfect world, every artist would have a dev kit for every target platform. In practice, this might not be economically possible. It might not even be electrically possible (those main fuses can only take so much). Being able to preview and play console/handheld content on PC is better than nothing, in this less-than-perfect world.

Second, since all our editors use the engine for visualization, if we have specified a handheld device as our source platform, all the editors will automatically show the resources as they will appear on that device.

This new feature gives me a chance to talk a little bit about how we have implemented support for platform specific resources, something I haven’t touched on before in this blog.

The BitSquid Tech uses the regular file system for its source data. A resource is identified by its name and type, both of which are determined from the path to the source file. For example, a source file at textures/units/grunt.texture (a made-up path) has the name textures/units/grunt and the type texture.

Note that even though the name is a path, it is not treated as one, but as a unique identifier. It is hashed to a 64-bit integer by the engine and to refer to a resource you must always specify its full name (and get the same hash result). In the compiled data, the raw names don’t even exist anymore; the files are stored in flat directories, indexed by the hash values.

In addition to name and type, a resource can also have a number of properties. Properties are dot-separated strings that appear before the type in the file name.


Properties are used to indicate different variants of the same resource. So all these files represent variants of the same resource:

buttons.texture
buttons.ps3.texture
buttons.en.x360.texture
buttons.fr.x360.texture

The two most important forms of properties are platforms and languages.

Platform properties (x360, ps3, android, win32, etc) are used to provide platform specific versions of resources. This can be used for platform optimized versions of units and levels. Another use is for controller and button images that differ from platform to platform. Since BitSquid is scripted in Lua and Lua files are just a resource like any other, this can also be used for platform specific gameplay code:

PlayerController.android.lua

Language properties (en, fr, jp, it, sv, etc) are used for localization. Since all resources have properties, all resources can be localized.

But the property system is not limited to platforms and languages. A developer can make up whatever properties she needs and use them to provide different variants of resources:

bullet_hit.noblood.particle_effect
foilage.withkittens.texture

Properties can be resolved either at data compile time or at runtime.

Platform properties are resolved at compile time. When we compile for PS3 and a resource has ps3 specific variants, only those variants are included in the compiled data. (If the resource doesn’t have any ps3 variants, we include all variants that do not have a specified platform.)

Language properties and other custom properties are resolved at runtime. All variants are compiled to the runtime data. When running, the game can specify what resource variants it wants with a property preference order. The property preference order specifies the variants it wants to use, in order of preference.

Application.set_property_preference_order {"withkittens", "noblood", "fr"}

This means that the game would prefer to get a resource that has lots of kittens, no blood and is in French. But if it can’t get all that, it will rather have something that is kitten-full than blood-free. And it prefers a bloodless English resource to a bloody French one.

In other words, if we requested the resource buttons.texture with these settings, the engine would look for variants in the order:

buttons.withkittens.noblood.fr.texture
buttons.withkittens.noblood.texture
buttons.withkittens.fr.texture
buttons.withkittens.texture
buttons.noblood.fr.texture
buttons.noblood.texture
buttons.fr.texture
buttons.texture
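
This enumeration order falls out naturally if you treat each property in the preference order as one bit. Below is a minimal sketch (my own illustration, not the engine’s actual code) that generates the list above by counting a bit mask down from all-bits-set to zero, with the most preferred property as the most significant bit:

#include <string>
#include <vector>

std::vector<std::string> variant_search_order(const std::string &name,
    const std::vector<std::string> &preference_order, const std::string &type)
{
    std::vector<std::string> order;
    unsigned n = (unsigned)preference_order.size();
    // Each bit in the mask says whether the corresponding property is
    // present; counting down from 2^n - 1 yields the preference order.
    for (int mask = (1 << n) - 1; mask >= 0; --mask) {
        std::string variant = name;
        for (unsigned i = 0; i < n; ++i)
            if (mask & (1 << (n - 1 - i)))
                variant += "." + preference_order[i];
        order.push_back(variant + "." + type);
    }
    return order;
}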

To add support for different source and destination platforms to this system all I had to do was to add a feature that lets the data compiler use one platform for resolving properties and a different platform as the format for the runtime files it produces.

Thursday, December 8, 2011

A Pragmatic Approach to Performance

Is premature optimization the root of all evil? Or is the fix-it-later attitude to performance turning programmers from proud “computer scientists” to despicable “script kiddies”?

These are questions without definite answers, but in this article I’ll try to describe my own approach to performance. How I go about to ensure that my systems run decently, without compromising other goals, such as modularity, maintainability and flexibility.

§1 Programmer time is a finite resource

If you are writing a big program, some parts of the code will not be as fast as theoretically possible. Sorry, let me rephrase. If you are writing a big program, no part of the code will be as fast as theoretically possible. Yes, I think it is reasonable to assume that every single line of your code could be made to run a little tiny bit faster.

Writing fast software is not about maximum performance all the time. It is about good performance where it matters. If you spend three weeks optimizing a small piece of code that only gets called once a frame, then that’s three weeks of work you could have spent doing something more meaningful. If you had spent it on optimizing code that actually mattered, you could even have made a significant improvement to the game’s frame rate.

There is never enough time to add all the features, fix all the bugs and optimize all the code, so the goal should always be maximum performance for minimum effort.

§2 Don’t underestimate the power of simplicity

Simple solutions are easier to implement than complex solutions. But that’s only the tip of the iceberg. The real benefits of simple solutions come in the long run. Simple solutions are easier to understand, easier to debug, easier to maintain, easier to port, easier to profile, easier to optimize, easier to parallelize and easier to replace. Over time, all these savings add up.

Using a simple solution can save so much time that even if it is slower than a more complex solution, as a whole your program will run faster, because you can use the time you saved to optimize other parts of the code. The parts that really matter.

I only use complex solutions when they are really justified, i.e., when the complex solution is significantly faster than the simple one (a factor of 2 or so) and when it is in a system that matters (one that consumes a significant percentage of the frame time).

Of course simplicity is in the eyes of the beholder. I think arrays are simple. I think POD data types are simple. I think blobs are simple. I don’t think class structures with 12 levels of inheritance are simple. I don’t think classes templated on 8 policy class parameters are simple. I don’t think geometric algebra is simple.

§3 Take advantage of the system design opportunity

Some people seem to think that to avoid “premature optimization” you should design your systems without any regard to performance whatsoever. You should just slap something together and fix it later when you “optimize” the code.

I wholeheartedly disagree. Not because I love performance for its own sake, but for purely pragmatic reasons.

When you design a system you have a clear picture in your head of how the different pieces fit together, what the requirements are and how often different functions get called. At that point, it is not much extra effort to take a few moments to think about how the system will perform and how you can set up the data structures so that it runs as fast as possible.

In contrast, if you build your system without considering performance and have to come in and “fix it” at some later point, that will be much harder. If you have to rearrange the fundamental data structures or add multithreading support, you may have to rewrite the entire system almost from scratch. Only now the system is in production, so you may be restricted by the published API and dependencies to other systems. Also, you cannot break any of the projects that are using the system. And since it was several months since you (or someone else) wrote the code, you have to start by understanding all the thoughts that went into it. And all the little bug fixes and feature tweaks that have been added over time will most likely be lost in the rewrite. You will start again with a fresh batch of bugs.

So by just following our general guideline of “maximum efficiency with minimum effort”, we see that it is better to consider performance up front, simply because that requires a lot less effort than fixing it later.

Within reason, of course. The performance improvements we make up front are easier, but we are less sure that they matter in the big picture. Later, profile-guided fixes require more effort, but we know better where to focus our attention. As in life in general, balance is important.

When I design a system, I do a rough estimate of how many times each piece of code will be executed per frame and use that to guide the design:

  • 1-10: Performance doesn’t matter. Do whatever you want.
  • 100: Make sure it is O(n), data-oriented and cache friendly.
  • 1000: Make sure it is multithreaded.
  • 10000: Think really hard about what you are doing.

I also have a few general guidelines that I try to follow when writing new systems:

  • Put static data in immutable, single-allocation memory blobs
  • Allocate dynamic data in big contiguous chunks
  • Use as little memory as possible
  • Prefer arrays to complex data structures
  • Access memory linearly (in a cache friendly way)
  • Make sure procedures run in O(n) time
  • Avoid “do nothing” updates -- instead, keep track of active objects (see the sketch after this list)
  • If the system handles many objects, support data parallelism
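
As promised above, here is a minimal sketch of the “do nothing” guideline (hypothetical code, not from the engine): instead of updating every object and letting dead or sleeping ones early-out, the system keeps a packed list of the indices of the objects that actually need updating.

struct ParticleSystems
{
    static const unsigned MAX_SYSTEMS = 1024;
    float time_left[MAX_SYSTEMS];   // per-object data, indexed by object index
    unsigned active[MAX_SYSTEMS];   // packed indices of awake objects
    unsigned num_active;

    // Only awake objects are visited; sleeping ones cost nothing.
    void update(float dt)
    {
        for (unsigned i = 0; i < num_active; ++i)
            time_left[active[i]] -= dt;
    }
};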

By now I have written so many systems in this “style” that it doesn’t require much effort to follow these guidelines. And I know that by doing so I get a decent baseline performance. The guidelines focus on the most important low-hanging fruit: algorithmic complexity, memory access and parallelization and thus give good performance for a relatively small effort.

Of course it is not always possible to follow all guidelines. For example, some algorithms really require more than O(n) time. But I know that when I go outside the guidelines I need to stop and think things through, to make sure I don’t trash the performance.

§4 Use top-down profiling to find bottlenecks

No matter how good your up front design is, your code will be spending time in unexpected places. The content people will use your system in crazy ways and expose bottlenecks that you’ve never thought about. There will be bugs in your code. Some of these bugs will not result in outright crashes, just bad performance. There will be things you haven’t really thought through.

To understand where your program is actually spending its time, a top down profiler is an invaluable tool. We use explicit profiler scopes in our code and pipe the data live over the network to an external tool that can visualize it in various ways:


An (old) screenshot of the BitSquid Profiler.


The top-down profiler tells you where your optimization efforts need to be focused. If you spend 60 % of the frame time in the animation system and 0.5 % in the Gui, then any optimizations you can make to the animations will really pay off, but what you do with the Gui won’t matter one iota.

With a top-down profiler you can insert narrower and narrower profiler scopes in the code to get to the root of a performance problem -- where the time is actually being spent.
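
To give an idea of what explicit profiler scopes can look like, here is a minimal sketch (hypothetical macro and type names, not the actual BitSquid API): an RAII object records the time spent between entering and leaving a scope. A real implementation would write these events to a per-thread buffer and stream them to the external tool instead of printing them.

#include <chrono>
#include <cstdio>

struct ProfilerScope
{
    const char *name;
    std::chrono::steady_clock::time_point start;

    ProfilerScope(const char *n)
        : name(n), start(std::chrono::steady_clock::now()) {}

    ~ProfilerScope()
    {
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        // Stand-in for recording an event in a per-thread buffer.
        printf("%s: %lld us\n", name, us);
    }
};

#define CONCAT2(a, b) a##b
#define CONCAT(a, b) CONCAT2(a, b)
#define PROFILE_SCOPE(name) ProfilerScope CONCAT(profiler_scope_, __LINE__)(name)

void update_animations()
{
    PROFILE_SCOPE("animation");
    // ... system update goes here ...
}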

I use the general design guidelines to get a good baseline performance for all systems and then drill down with the top-down profiler to find those systems that need a little bit of extra optimization attention.

§5 Use bottom-up profiling to find low-level optimization targets

I find that as a general tool, interactive top-down profiling with explicit scopes is more useful than a bottom-up sampling profiler.

But sampling profilers still have their uses. They are good at finding hotspot functions that are called from many different places and thus don’t necessarily show up in a top-down profiler. Such hotspots can be a target for low-level, instruction-by-instruction optimizations. Or they can be an indication that you are doing something bad.

For example if strcmp() is showing up as a hotspot, then your program is being very very naughty and should be sent straight to bed without any cocoa.

A hotspot that often shows up in our code is luaV_execute(). This is not surprising. That is the main Lua VM function, a big switch statement that executes most of Lua’s opcodes. But it does tell us that some low level, platform specific optimizations of that function might actually result in real measurable performance benefits.

§6 Beware of synthetic benchmarks

I don’t do much synthetic benchmarking, i.e., looping the code 10 000 times over some made-up piece of data and measuring the execution time.

If I’m at a point where I don’t know whether a change will make the code faster or not, then I want to verify that with data from an actual game. Otherwise, how can I be sure that I’m not just optimizing the benchmark in ways that won’t carry over to real world cases?

A benchmark with 500 instances of the same entity, all playing the same animation is quite different from the same scene with 50 different unit types, all playing different animations. The data access patterns are completely different. Optimizations that improve one case may not matter at all in the other.

§7 Optimization is gardening

Programmers optimize the engine. Artists put in more stuff. It has always been thus. And it is good.

Optimization is not an isolated activity that happens at a specific time. It is a part of the whole life cycle: design, maintenance and evolution. Optimization is an ongoing dialog between artists and programmers about what the capabilities of the engine should be.

Managing performance is like tending a garden, checking that everything is ok, rooting out the weeds and finding ways for the plants to grow better.

It is the job of the artists to push the engine to its knees. And it is the job of the programmers to bring it back up again, only stronger. In the process, a middle ground will be found where the games can shine as bright as possible.

Monday, November 7, 2011

An Example in Data-Oriented Design: Sound Parameters

The BitSquid sound system allows arbitrary parameters to be set on playing sounds:

force = 35.3
material = "wood"
weapon = "axe"

In the sound editor the sound designer can setup curves and switches that depend on these parameters. So, for example, the designer can choose to play different wav files for a weapon impact, depending on the weapon that was used and the material it hit. In addition the volume and pitch of the sound can be controlled by a curve connected to the force of the impact.

To implement this behavior, we need a way of representing such parameter sets in the engine. Since there can potentially be lots of playing sounds, we need a representation that is as efficient as possible.

If you did a by-the-book C++ design of this problem, you might end up with an abomination like this:

struct ParameterValue
{
 enum Type {STRING_TYPE, NUMERIC_TYPE};
 Type type;
 std::string string_value;
 float numeric_value;
};

typedef std::map<std::string, ParameterValue> Parameters;

struct SoundInstance
{
 // Other members...
 Parameters *parameters;
};

std::vector<SoundInstance> playing_sounds;

which would result in tons of pointer chasing, memory allocation and data copying.

So let’s fix it!

First, let’s get rid of the strings. Strings should almost only be used for text that is displayed to the end user. For everything else, they are usually a bad idea. In this case, since the only thing we need to do is match strings for equality (find the parameter named “material”, check if its value is “wood”, etc.) we can use a hash instead of the full string value:

struct ParameterValue
{
 enum Type {STRING_TYPE, NUMERIC_TYPE};
 Type type;
 union {
  IdString32 string_value;
  float numeric_value;
 };
};

typedef std::map<IdString32, ParameterValue> Parameters;

IdString32 is our type for representing hashed strings. It just stores a 4-byte string hash. Since it is a POD-type, we can put it in a union together with the numeric value. This takes the ParameterValue struct down to a manageable 8 bytes with no dynamic data allocation.

But we can actually make it even smaller, by just getting rid of the type:

union ParameterValue {
 IdString32 string_value;
 float numeric_value;
};

We can do this because when we access the parameter we know which type we want. If we are evaluating a curve, we want a numeric value. If we want to compare it to a hash, we want a string value. Getting rid of the type means we can’t assert() on type errors (if someone has done something silly like setting the “material” to 3.5 or the “force” to “banana”). But other than that everything will work as before.

Next, let’s attack the map:

typedef std::map<IdString32, ParameterValue> Parameters;

Just like std::string, std::map should set off all kinds of warning bells in your head. std::map is almost never a good choice. Better alternatives are: linear search in a std::vector (for smallish maps), binary search in a sorted array (for larger, static maps) or hash_map.

In this case, we don’t expect there to be that many parameters set on a sound (<10 in the typical case), so linear search is fine:

struct Parameter {
 IdString32 key;
 union {
  IdString32 string_value;
  float numeric_value;
 };
};

typedef std::vector<Parameter> Parameters;

struct SoundInstance
{
 // Other members...
 Parameters *parameters;
};

std::vector<SoundInstance> _playing_sounds;
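
The lookup itself is then just a linear scan over the vector. A sketch (assuming IdString32 supports operator==):

const Parameter *find_parameter(const Parameters &parameters, IdString32 key)
{
 // With < 10 parameters, a linear scan beats tree or hash lookups.
 for (unsigned i = 0; i < parameters.size(); ++i)
  if (parameters[i].key == key)
   return &parameters[i];
 return 0;  // parameter not set
}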

A lot better than what we started with. But I’m still not 100 % satisfied.

I don’t like the fact that we have a vector of sound instances, and each of those contains a vector of parameters. Vectors-in-vectors raise performance warning flags for me. I like it when my data structures are just arrays of POD structs. Then I know that they are cache friendly and don’t put much strain on the memory system. 512 parameter vectors allocated on the heap for 512 playing sounds make me uneasy.

So what can we do? We could go to a fixed number of parameters:

struct SoundInstance
{
 // Other members...
 unsigned num_parameters;
 Parameter parameters[MAX_INSTANCE_PARAMETERS];
};

Now the SoundInstance is a POD and all the data is just one big happy blob.

The drawback of this approach is that you might need to set MAX_INSTANCE_PARAMETERS pretty high to be able to handle the most complicated sounds. This would waste some memory for all the sounds that use just one or two parameters.

Say you have 512 sounds and MAX_INSTANCE_PARAMETERS = 32. With 8 bytes per Parameter struct, that totals 131 K. Not terrible, but not a tuppence either.

There should be some way of doing better. But if we can’t use a dynamic vector, nor a static array, what can we then possibly use?

A linked list!

Regular linked lists have horrible cache behavior and are best avoided. But we can achieve the benefits of linked lists while still having decent cache performance by putting the list in an array:

struct ParameterNode {
 IdString32 key;
 union {
  IdString32 string_value;
  float numeric_value;
 };
 ParameterNode *next;
};

ParameterNode nodes[MAX_PARAMETERS];

struct SoundInstance
{
 // Other members...
 ParameterNode *parameters;
};

std::vector<SoundInstance> playing_sounds;

Now we have all the parameters stored in a single memory blob. And instead of having a maximum number of parameters per sound, we have a total limit on the number of set parameters (which works much better when most sounds have few parameters). We could get rid of that limit as well if we needed to, by using a vector instead of an array to store the nodes and indices instead of pointers for the “links”.
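
For reference, the index-based variant could look something like this (my sketch, not the engine’s code). Since the links are indices rather than pointers, the node vector is free to grow and move in memory:

#include <vector>

static const unsigned END_OF_LIST = 0xffffffffu;

struct ParameterNodeIndexed {
 IdString32 key;
 union {
  IdString32 string_value;
  float numeric_value;
 };
 unsigned next;   // index of the next node; END_OF_LIST terminates
};

std::vector<ParameterNodeIndexed> nodes;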

You can use many different strategies for allocating nodes from the array. My favorite method is to walk over the array until the next free node is found:

unsigned last_allocated = MAX_PARAMETERS-1;

ParameterNode *allocate_node()
{
 while (true) {
  last_allocated = (last_allocated + 1) % MAX_PARAMETERS;
  if (nodes[last_allocated].key == 0)
   break;
 }
 return &nodes[last_allocated];
}

Here, an empty key is used to indicate free nodes.

The advantage of this method is that nodes that are allocated at the same time end up in adjacent array slots. This means that all the parameters of a particular sound (which tend to get set at the same time) get stored next to each other in memory, which means they can be accessed without cache misses.
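
Reading the parameters of a playing sound is then just a walk along the list, using the pointer-based node from above. A hypothetical example (again assuming IdString32 supports operator==):

float numeric_parameter(const SoundInstance &instance, IdString32 key,
 float default_value)
{
 // The nodes of one sound tend to sit next to each other in the
 // array, so this walk is usually cache friendly.
 for (const ParameterNode *p = instance.parameters; p; p = p->next)
  if (p->key == key)
   return p->numeric_value;
 return default_value;
}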

Sunday, October 23, 2011

Low Level Animation -- Part 2

Some time ago I wrote an article describing how animation compression is implemented in the BitSquid engine. In that article I made a vague promise that I would follow up with a description of how to pack the data in a cache-friendly way. Now, the time has come to deliver on that vague promise.

A quick recap: After curve fitting, each track of our animation consists of a number of curve points that describe the curve for each animation track:


By an animation track I mean the animation of a single parameter, typically the position or rotation of a bone.

The data for the track is a sequence of times and curve data:

t_1 A_1 t_2 A_2 t_3 A_3 ... t_n A_n

Here t_i is the time of a curve point and A_i is the corresponding curve data.

To evaluate the curve at any particular point t we need the curve points both before and after the time t.


Depending on what curve type you use (hermite, bezier, b-spline, etc) you might actually need more than two curve points to evaluate a segment, but that doesn’t really affect the discussion in this article, so for the sake of simplicity, let’s stick with two.

Note that the time points for the different tracks in the animation typically do not match up. For example, one curve may be completely flat and only require one sample at the start and one sample at the end. Another curve may be complicated and require lots of samples.

To simplify the discussion further, assume that the animation only contains two tracks (it is easy to generalize the solution to more tracks). We will call the curve points of one (t_i, A_i) and the curve points of the other (s_i, B_i):


How can we organize this data to be as cache friendly as possible?

The most natural approach is perhaps to sort the data first by track and then by time. Let’s see what this means for the cache. To evaluate the animation for some particular time t, we have to go into the data for each track at that time to look up the two neighboring curve points. Even if we assume that we have somehow cached our current position in each track, so that we don’t have to search for it, we will still have at least one cache miss for each track. A modern character can have over 100 bones, with two tracks per bone. That’s 200 cache misses for just a single frame of a single animation.

To do better, we need to organize the data by time somehow. But it is not immediately clear how. Just sorting the data by time won’t help, because then a flat curve with just two curve points, one at the beginning and one at the end, will have them at complete opposite ends of the data and no matter what we do we will get cache misses when touching them.

Let’s consider all the data we need to evaluate the tracks at time t. We need (t_i, A_i), (t_i+1, A_i+1) and (s_j, B_j), (s_j+1, B_j+1) where t_i <= t <= t_i+1 and s_j <= t <= s_j+1. This is our “hot” data, because we will need to refer to it several times as we evaluate the curve at different points in time. In fact, we can keep using this same data until we reach whichever is smallest of t_i+1 and s_j+1. A general rule in memory access optimization is to keep the “hot” data together, so let’s create an additional data structure, an array with the currently active curve points for a playing animation instance.
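
Concretely, the active array could hold one entry per track and look something like this (hypothetical types and sizes; CurveData stands in for whatever per-point data the curve format needs):

struct CurveData { float v[4]; };   // placeholder for the real curve data

static const unsigned NUM_TRACKS = 200;   // e.g., 100 bones * 2 tracks

struct ActiveSegment {
 float t0, t1;        // times of the surrounding curve points
 CurveData a0, a1;    // the corresponding curve data
};

// All the hot data, packed in one continuous memory block.
ActiveSegment active[NUM_TRACKS];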


Now we’re getting somewhere. Not only have we significantly improved the cache behavior; as long as we don’t need to fetch new curve points we only need to refer to the active array, a single memory access. We have also decomposed our animation evaluation problem into two simpler tasks: evaluating curves and fetching new curve points. This makes our code both simpler and more flexible.

Let’s look at the second issue, fetching new curve points. In the example above, when we reach the time t_i+1 we will need to fetch the new curve point (t_i+2, A_i+2) and when we reach the time s_j+1 we will need to fetch (s_j+2, B_j+2).


Generalizing, we always need to fetch the point (t_i, A_i) at the time t_i-1, and we always need to fetch the point (s_i, B_i) at the time s_i-1. This is excellent, because since we know the time when each of our curve points will be needed, we can put them all in a single stream of data, sorted by the time when they will be needed.


This means that our animation player only needs to keep a single pointer into the animation stream. That pointer will always point to the next curve point that needs to be moved to the active list. As time is advanced, curve points are copied from the animation data into the active list and then the curve is evaluated.
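
In code, the player’s update could look roughly like this, building on the ActiveSegment type from the sketch above (StreamPoint is a hypothetical layout for an entry in the sorted stream):

struct StreamPoint {
 float fetch_time;   // when this point must enter the active array
 unsigned track;     // the track the point belongs to
 float time;         // t_i
 CurveData data;     // A_i
};

void advance(const StreamPoint *&next, const StreamPoint *end,
 ActiveSegment *active, float t)
{
 // Fetch: copy curve points from the stream into the active array
 // until the next point in the stream isn't needed yet. This is
 // just a pointer moving forward in memory.
 while (next != end && next->fetch_time <= t) {
  ActiveSegment &s = active[next->track];
  s.t0 = s.t1;  s.a0 = s.a1;
  s.t1 = next->time;  s.a1 = next->data;
  ++next;
 }
 // Evaluate: each track can now be evaluated from the active array
 // alone, e.g. evaluate_track(active[i], t) for each track i.
}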


Note the excellent cache behavior this gives us. To fetch new curve points, we just move a pointer forward in memory. And then, to evaluate the curves, we just need to access our active array, a single continuous memory block. This gives us a grand total of just two memory accesses.

Another nice property is that since we are now accessing the animation data as a stream (strictly linearly, from beginning to end) we can gzip it and get another factor two of compression. We can also easily stream it from disk.

One drawback of this system is that it only supports playing an animation forward. You cannot jump to a particular time in an animation without “fast forwarding” through all intermediate curve points.

If you need support for jumping, the easiest way to achieve it is perhaps to add a separate index with jump frames. A jump frame consists of the state of the active array at some point in time, together with an offset into the data stream. In other words, all the state information that the animation player needs to jump to that time point and resume playing.
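
A jump frame is then nothing more than a snapshot of the player’s state, reusing the types from the sketches above (hypothetical layout):

struct JumpFrame {
 float time;                         // the time this frame represents
 unsigned stream_offset;             // where to resume reading the stream
 ActiveSegment active[NUM_TRACKS];   // the active array at that time
};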

Using jump frames lets you balance performance and memory use. If you add more jump frames you will use more memory, but on the other hand, you will be able to find a jump frame closer to the time you actually want to go to, which means less fast forwarding.

Saturday, October 8, 2011

Caring by Sharing: Header Hero


Compile times get worse over time -- that is the second law of C++ programming dynamics. There are many small day-to-day changes that each exacerbate the problem slightly: The project grows. New header files get included. Clever templates get written. And so on. There are comparatively few forces that work in the other direction. Once an #include has been added, it stays.

The only exception is when some hero steps up, says Enough! and starts to crunch down on those header files. It is thankless menial work that offers few rewards, save the knowledge that you are contributing to the public good.

Today, I want to give something back to these unsung heroes, so I’ve made a small tool to make their drudgery a bit less... drudgery-ish? It is called Header Hero:


To run Header Hero you specify the directories where your .cpp files can be found as well as the directories to search for included headers. The program scans your .h and .cpp files to find all the include links. It presents the result in a summarized report that shows you what the worst headers are. You can think of it as a header file profiler.

You don’t need to specify all your include directories, but only the ones you have specified will be scanned.

I’ve focused on making the tool fast by caching as much information as possible and using a simple parser that just looks for #include patterns rather than running the real C preprocessor. The downside is that if you are using any fancy preprocessor tricks, they will most likely be missed. On the other hand, the tool can scan a huge project in seconds. And after the initial scan, new scans can be done in a fraction of that time.

The program produces a report that looks something like this:


At the top are some statistics, such as the total number of files and lines in the project. Total Parsed counts how many lines would actually be parsed in a full recompile of the project. So a header that is included by several .cpp files adds to that number every time. The Blowup Factor is the last two items divided: it specifies how many times, on average, each line gets parsed. A value of 35 means that on average, each line in our project is parsed 35 times. That seems quite a lot.

Below the summary is a list of the header files, sorted by how many lines they contribute to the Total Parsed number -- in other words, the size of each file multiplied by the number of times it is included.

Looking at the sample report above, it seems pretty reasonable. At the top we find big templated collection classes (map, set, string, vector) that have big header files and are used in a lot of places. Math (matrix4x4, vector3) and utility (critical_section, file_system) files also end up high on the list.

But when you dig into it, there are also things that seem a bit fishy. Set<T> is not a very popular collection class. Sets are used less than maps, and HashSet is usually preferable to Set. Why does it end up so high on the list? What is shader.h doing there? That seems too specialized to end up so high. And file_system.h? There shouldn’t be that much code that directly accesses the file system, only the resource loader needs to do that.

To answer those questions, you can click on any file in the report to get a detailed view of its relations:


In the middle we find the file we are looking at. To the left are the files that directly include it. The number after each file name specifies how many files that directly or indirectly include that file. To the right are the files included by the file. The numbers are all the files directly or indirectly included by those files. You can double click on any file name in the view to refocus on it.

Here we clearly see that the main culprit is data_compiler.h. It includes set.h and is in turn included by 316 other files. To fix the compile times we can make data_compiler.h not include set.h or we can try to reduce the number of files that include data_compiler.h (that number also seems high). If we also fix scene_graph.h we can really make a difference.

Breaking dependencies is a whole topic in itself, especially when it comes to templates and inlined code. Here are some quick tips though:

1) Predeclare the structs and classes that you use instead of including the header file. Don’t forget that you can predeclare templates and typedefs as well as regular classes:

class MyClass;
typedef int Id;
template <class T> class Vector;

2) Predeclared types can only be used as pointers and references. You can’t have a member variable of a type whose actual size is unknown. So you may have to change your member variables to pointers in order to get rid of the header dependency. You can also use the pimpl idiom, if you can live with the extra indirection and lack of inlining.
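
As a reminder, the pimpl idiom looks something like this (hypothetical names). The header gets by with a predeclaration, and the heavy includes move into the .cpp:

// my_class.h

class MyClassImpl;

class MyClass {
public:
    MyClass();
    ~MyClass();
    int value() const;
private:
    MyClassImpl *_impl;   // all data members hide behind this pointer
};

// my_class.cpp

#include "my_class.h"
// #include "heavy_dependency.h"   // only the .cpp pays for this include

class MyClassImpl {
public:
    int v;
};

MyClass::MyClass() : _impl(new MyClassImpl()) {_impl->v = 0;}
MyClass::~MyClass() {delete _impl;}
int MyClass::value() const {return _impl->v;}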

3) Switching from in-place variables to pointers can lead to bad memory access patterns. One way of fixing that is to placement new the object directly into a raw memory buffer.

// a.h

class B;

class A {
    A();
    B *_b;
    static const int SIZE_OF_B = 20;
    char _b_storage[SIZE_OF_B];
};

// a.cpp

#include "b.h"

A::A()
{
    XASSERT(sizeof(B) == SIZE_OF_B);
    _b = new (_b_storage) B();
}

With this technique, you get the data for B stored inside A, without having to include the b.h header in a.h. But the code isn’t exactly easy to read, so you should only use this in desperate situations.

4) For files with small type definitions, but lots of inlined methods (e.g., matrix4x4.h), a good strategy is to split the file, so you have just the type in one file and all the methods in the other. Header files can then include just the type definition, while .cpp files pull in the whole shebang.
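
For example (hypothetical file names), the split could look like this:

// matrix4x4_type.h -- just the type, cheap to include anywhere

struct Matrix4x4 {
    float v[16];
};

// matrix4x4.h -- the type plus all the inlined methods

#include "matrix4x4_type.h"

inline Matrix4x4 matrix4x4_identity()
{
    Matrix4x4 m = {};
    m.v[0] = m.v[5] = m.v[10] = m.v[15] = 1.0f;
    return m;
}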

Using these techniques you can get rid of the header dependencies one by one, until you are back at reasonable compile times. Since a rescan takes just a fraction of a second it is easy to see how your changes affect the compile time. Just make sure you have your integration test running, it is easy to break build configurations when you are fiddling around with the headers.

Here is the result of about a day and a half of header optimization in our code base:


From 6 million to 4.3 million lines, that’s not too shabby. We can now do a complete rebuild in 37 seconds on a reasonably modern machine. With this tool we can hopefully keep it that way.

You can download the C# source code here. Feel free to do whatever you like with it.

Friday, September 23, 2011

Managing Decoupling Part 4 -- The ID Lookup Table

Today I am going to dig deeper into an important and versatile data structure that pops up all the time in the BitSquid engine -- the ID lookup table.

I have already talked about the advantages of using IDs to refer to objects owned by other systems, but let me just quickly recap.

IDs are better than direct pointers because we don’t get dangling references if the other system decides that the object needs to be destroyed.

IDs are better than shared_ptr<> and weak_ptr<> because they allow the other system to reorganize its objects in memory, delete them at will, and don’t require thread synchronization to maintain a reference count. They are also POD (plain old data) structures, so they can be copied and moved in memory freely, passed back and forth between C++ and Lua, etc.

By an ID I simply mean an opaque data structure of n bits. It has no particular meaning to us, we just use it to refer to an object. The system provides the mechanism for looking up an object based on it. Since we seldom create more than 4 billion objects, 32 bits is usually enough for the ID, so we can just use a standard integer. If a system needs a lot of objects, we can go to 64 bits.

In this post I’m going to look at what data structures a system might use to do the lookup from ID to system object. There are some requirements that such data structures need to fulfill:

  • There should be a 1-1 mapping between live objects and IDs.
  • If the system is supplied with an ID to an old object, it should be able to detect that the object is no longer alive.
  • Lookup from ID to object should be very fast (this is the most common operation).
  • Adding and removing objects should be fast.

Let’s look at three different ways of implementing this data structure, with increasing degrees of sophistication.

The STL Method


The by-the-book object oriented approach is to allocate objects on the heap and use a std::map to map from ID to object.

typedef unsigned ID;

struct System
{
	ID _next_id;
	std::map<ID, Object *> _objects;

	System() {_next_id = 0;}

	inline bool has(ID id) {
		return _objects.count(id) > 0;
	}
	
	inline Object &lookup(ID id) {
		return *_objects[id];
	}
	
	inline ID add() {
		ID id = _next_id++;
		Object *o = new Object();
		o->id = id;
		_objects[id] = o;
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		_objects.erase(id);
		delete &o;
	}
};

Note that if we create more than four billion objects, the _next_id counter will wrap around and we risk getting two objects with the same ID.

Apart from that, the only problem with this solution is that it is really inefficient. All objects are allocated individually on the heap, which gives bad cache behavior and the map lookup results in tree walking which is also bad for the cache. We can switch the map to a hash_map for slightly better performance, but that still leaves a lot of unnecessary pointer chasing.

Array With Holes


What we really want to do is to store our objects linearly in memory, because that will give us the best possible cache behavior. We can either use a fixed size array Object[MAX_SIZE] if we know the maximum number of objects that will ever be used, or we can be more flexible and use a std::vector.

Note: If you care about performance and use std::vector<T> you should make a variant of it (call it array<T> for example) that doesn’t call constructors or initializes memory. Use that for simple types, when you don’t care about initialization. A dynamic vector<T> buffer that grows and shrinks a lot can spend a huge amount of time doing completely unnecessary constructor calls.

To find an object in the array, we need to know its index. But just using the index as ID is not enough, because the object might have been destroyed and a new object might have been created at the same index. To check for that, we also need an id value, as before. So we make the ID type a combination of both:

struct ID {
	unsigned index;
	unsigned inner_id;
};

Now we can use the index to quickly look up the object and the inner_id to verify its identity.

Since the object index is stored in the ID which is exposed externally, once an object has been created it cannot move. When objects are deleted they will leave holes in the array.


When we create new objects we don’t just want to add them to the end of the array. We want to make sure that we fill the holes in the array first.

The standard way of doing that is with a free list. We store a pointer to the first hole in a variable. In each hole we store a pointer to the next hole. These pointers thus form a linked list that enumerates all the holes.


An interesting thing to note is that we usually don’t need to allocate any memory for these pointers. Since the pointers are only used for holes (i.e., dead objects) we can reuse the objects’ own memory for storing them. The objects don’t need that memory, since they are dead.

Here is an implementation. For clarity, I have used an explicit member next in the object for the free list rather than reusing the object’s memory:

struct System
{
	unsigned _next_inner_id;
	std::vector<Object> _objects;
	unsigned _freelist;

	System() {
		_next_inner_id = 0;
		_freelist = UINT_MAX;
	}

	inline bool has(ID id) {
		return _objects[id.index].id.inner_id == id.inner_id;
	}
	
	inline Object &lookup(ID id) {
		return _objects[id.index];
	}
	
	inline ID add() {
		ID id;
		id.inner_id = _next_inner_id++;
		if (_freelist == UINT_MAX) {
			Object o;
			id.index = _objects.size();
			o.id = id;
			_objects.push_back(o);
		} else {
			id.index = _freelist;
			_freelist = _objects[_freelist].next;
		}
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		o.id.inner_id = UINT_MAX;
		o.next = _freelist;
		_freelist = id.index;
	}
};

This is a lot better than the STL solution. Insertion and removal is O(1). Lookup is just array indexing, which means it is very fast. In a quick-and-dirty-don’t-take-it-too-seriously test this was 40 times faster than the STL solution. In real-life it all depends on the actual usage patterns, of course.

The only part of this solution that is not an improvement over the STL version is that our ID structs have increased from 32 to 64 bits.

There are things that can be done about that. For example, if you never have more than 64 K objects live at the same time, you can get by with 16 bits for the index, which leaves 16 bits for the inner_id. Note that the inner_id doesn’t have to be globally unique, it is enough if it is unique for that index slot. So a 16 bit inner_id is fine if we never create more than 64 K objects in the same index slot.

If you want to go down that road you probably want to change the implementation of the free list slightly. The code above uses a standard free list implementation that acts as a LIFO stack. This means that if you create and delete objects in quick succession they will all be assigned to the same index slot which means you quickly run out of inner_ids for that slot. To prevent that, you want to make sure that you always have a certain number of elements in the free list (allocate more if you run low) and rewrite it as a FIFO. If you always have N free objects and use a FIFO free list, then you are guaranteed that you won’t see an inner_id collision until you have created at least N * 64 K objects.

Of course you can slice and dice the 32 bits in other ways if you have different limits on the maximum number of objects. You have to crunch the numbers for your particular case to see if you can get by with a 32 bit ID.

Packed Array


One drawback with the approach sketched above is that since the index is exposed externally, the system cannot reorganize its objects in memory for maximum performance.

The holes are especially troubling. At some point the system probably wants to loop over all its objects and update them. If the object array is nearly full, that is no problem. But if the array is 50 % objects and 50 % holes, the loop will touch twice as much memory as necessary. That seems suboptimal.

We can get rid of that by introducing an extra level of indirection, where the IDs point to an array of indices that points to the objects themselves:


This means that we pay the cost of an extra array lookup whenever we resolve the ID. On the other hand, the system objects are packed tight in memory which means that they can be updated more efficiently. Note that the system update doesn’t have to touch or care about the index array. Whether this is a net win depends on how the system is used, but my guess is that in most cases more items are touched internally than are referenced externally.

To remove an object with this solution we use the standard trick of swapping it with the last item in the array. Then we update the index so that it points to the new location of the swapped object.

Here is an implementation. To keep things interesting, this time with a fixed array size, a 32 bit ID and a FIFO free list.

typedef unsigned ID;

#define MAX_OBJECTS 64*1024
#define INDEX_MASK 0xffff
#define NEW_OBJECT_ID_ADD 0x10000

struct Index {
	ID id;
	unsigned short index;
	unsigned short next;
};

struct System
{
	unsigned _num_objects;
	Object _objects[MAX_OBJECTS];
	Index _indices[MAX_OBJECTS];
	unsigned short _freelist_enqueue;
	unsigned short _freelist_dequeue;

	System() {
		_num_objects = 0;
		for (unsigned i=0; i<MAX_OBJECTS; ++i) {
			_indices[i].id = i;
			_indices[i].next = i+1;
		}
		_freelist_dequeue = 0;
		_freelist_enqueue = MAX_OBJECTS-1;
	}

	inline bool has(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		return in.id == id && in.index != USHRT_MAX;
	}
	
	inline Object &lookup(ID id) {
		return _objects[_indices[id & INDEX_MASK].index];
	}
	
	inline ID add() {
		Index &in = _indices[_freelist_dequeue];
		_freelist_dequeue = in.next;
		in.id += NEW_OBJECT_ID_ADD;
		in.index = _num_objects++;
		Object &o = _objects[in.index];
		o.id = in.id;
		return o.id;
	}
	
	inline void remove(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		
		Object &o = _objects[in.index];
		o = _objects[--_num_objects];
		_indices[o.id & INDEX_MASK].index = in.index;
		
		in.index = USHRT_MAX;
		_indices[_freelist_enqueue].next = id & INDEX_MASK;
		_freelist_enqueue = id & INDEX_MASK;
	}
};
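
Usage of the system is straightforward. A hypothetical example (assuming an Object type with an id field, as in the code above):

System sounds;

ID id = sounds.add();            // create an object and get its ID back
if (sounds.has(id)) {            // the ID can be validated at any time
	Object &o = sounds.lookup(id);
	// ... use o ...
}
sounds.remove(id);               // o's slot is swapped out and recycled
// sounds.has(id) is now false, even if the index slot gets reused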

Thursday, September 8, 2011

A Simple Roll-Your-Own Documentation System

I like to roll my own documentation systems. There, I’ve said it. Not for inline documentation, mind you. For that there is Doxygen and for that I am grateful. Because while I love coding, there is fun coding and not-so-fun coding, and writing C++ parsers tends to fall in the latter category.

So for inline documentation I use Doxygen, but for everything else, I roll my own. Why?

I don’t want to use Word or Pages or any other word processing program because I want my documents to be plain text that can be diffed and merged when necessary. And I want to be able to output it as clean HTML or in any other format I may like.

I don’t want to use HTML or LaTeX or any other presentation-oriented language, because I want to be able to massage the content in various ways before presenting it. Reordering it, adding an index or a glossary, removing deprecated parts, etc. Also, writing <p> gets boring very quickly.

I don’t want to use a Wiki, because I want to check in my documents together with the code, so that code versions and document versions match in the repository. I definitely don’t want to manage five different Wikis, corresponding to different engine release versions. Also, Wiki markup languages tend to be verbose and obtuse.

I could use an existing markup language, such as DocBook, Markdown or ReStructured Text. But all of them contain lots of stuff that I don’t need and lack some stuff that I do need. For example I want to include snippets of syntax highlighted Lua code, margin notes and math formulas. And I want to do it in a way that is easy to read and easy to write. Because I want there to be as few things as possible standing in the way of writing good documentation.

So I roll my own. But as you will see, it is not that much work.

I’ve written a fair number of markup systems over the years (perhaps one too many, but hey, that is how you learn) and I’ve settled on a pretty minimalistic structure that can be implemented in a few hundred lines of Ruby. In general, I tend to favor simple minimalistic systems over big frameworks that try to “cover everything”. Covering everything is usually impossible and when you discover that you need new functionality, the lightweight systems are a lot easier to extend than the behemoths.

There are two basic components to the system. Always two there are, a parser and a generator. The parser reads the source document and converts it to some kind of structured representation. The generator takes the structured representation and converts it to an output format. Here I’ll only consider HTML, because to me that is the only output format that really matters.

To have something concrete to talk about, let’s use this source document, written in a syntax that I just made up:

@h1 Flavors of ice cream

My favorite ice cream flavors are:

@li Strawberry
@li Seagull

The Parser


The most crucial point of the system is what the structured representation should look like. How should the parser communicate with the generator? My minimalistic solution is to just let the representation be a list of lines, with each line consisting of a type marker and some text.

(:h1, "Flavors of...")
(:empty, "")
(:text, "My favorite...")
(:empty, "")
(:li, "Strawberry")
(:li, "Seagull")

To some this will probably seem like complete heresy. Surely I need some kind of hierarchical representation. How can I otherwise represent things like a list-in-a-list-in-a-cat-in-a-hat?

No problem: to represent a list item nested in another list, I just use a @li_li tag and a corresponding :li_li type marker. If someone wants three or more levels of nesting I suggest that they rewrite their document. This is supposed to be readable documentation, not Tractatus Logico-Philosophicus. I simply don’t think that deep nesting is important enough to warrant a complicated hierarchical design. As I said, I prefer the simple things in life.

So, now that we know the output format, we can write the parser in under 20 lines:

class Parser
  attr_reader :lines
  
  def initialize()
    @lines = []
  end
  
  def parse(line)
    case line
    when /^$/
      @lines << {:type => :empty, :line => ""}
    when /^@(\S+)\s+(.*)$/
      @lines << {:type => $1.intern, :line => $2}
    when /^(.*)$/
      @lines << {:type => :text, :line => line}
    end
  end
end

Of course you can go a lot fancier with the parser than this. For example, you can make a more Markdown-like syntax where you create lists by just starting lines with bullet points. But this doesn’t really change the basic structure, you just need to add more whens in your case-statement.

One useful approach, as you make more advanced parsers, is to have markers that put the parser in a particular state. For example, you could have a marker @lua that made the parser consider all the lines following it to be of type :lua until the marker @endlua was reached.

The Generator


A useful trick when writing HTML generators is to always keep track of the HTML tags that you have currently opened. This lets you write a method context(tags) which takes a list of tags as arguments and closes and opens tags so that exactly the tags specified in the list are open.

With such a method available, it is simple to write the code for outputting tags:

class Generator
  def h1(line)
    context(%W(h1 #{"a name=\"#{line}\""}))
    print line
  end
  
  def text(line)
    context(%w(p))
    print line
  end

  def empty(line)
    context(%w())
    print line
  end
  
  def li(line)
    context(%w(ul li))
    print line
    context(%w(ul))
  end
end

Notice how this works. The li() method makes sure that we are in a <ul> <li> context, so it closes all other open tags and opens the right ones. Then, after printing its content, it says that the context should just be <ul> which forces the closure of the <li> tag. If we wanted to support the :li_li tag, mentioned above, we could write it simply as:

class Generator
  def li_li(line)
    context(%w(ul li ul li))
    print line
    context(%w(ul li ul))
  end
end

Notice also that this approach allows us to just step through the lines in the data structure and print them. We don’t have to look back and forward in the data structure to find out where a <ul> should begin and end.

The rest of the Generator class implements the context() function and handles indentation:

class Generator
  def initialize()
    @out = ""
    @context = []
    @indent = 0
  end
  
  def print(s)
    @out << ("  " * @indent) << s << "\n"
  end
  
  def open(ci)
    print "<#{ci}>"
    @indent += 1
  end
  
  def close(ci)
    @indent -= 1
    print "</#{ci[/^\S*/]}>"
  end
  
  def context(c)
    i = 0
    while @context[i] != nil && @context[i] == c[i]
      i += 1
    end
    while @context.size > i
      close(@context.last)
      @context.pop
    end
    while c.size > @context.size
      @context.push( c[@context.size] )
      open(@context.last)
    end
  end
  
  def format(lines)
    lines.each {|line| self.send(line[:type], line[:line])}
    context(%w())
    return @out
  end
end

Used as:

parser = Parser.new
text.each_line {|line| parser.parse(line)}
puts Generator.new.format(parser.lines)
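
For the sample document above, tracing through the generator gives output roughly like this:

<h1>
  <a name="Flavors of ice cream">
    Flavors of ice cream
  </a>
</h1>

<p>
  My favorite ice cream flavors are:
</p>

<ul>
  <li>
    Strawberry
  </li>
  <li>
    Seagull
  </li>
</ul>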

So there you have it, the start of a custom documentation system, easy to extend with new tags in under 100 lines of Ruby code.

There are some things I haven’t touched on here, like TOC generation or inline formatting (bold and emphasized text). But it is easy to write them as extensions of this basic system. For example, the TOC could be generated with an additional pass over the structured data. If there is enough interest I could show an example in a follow-up post.

Thursday, August 25, 2011

Code Snippet: Murmur hash inverse / pre-image

Today's caring by sharing. I needed this non-trivial code snippet today and couldn't find it anywhere on the internet, so here it is for future reference. It computes the inverse / pre-image of a murmur hash, i.e., given a 32 bit murmur hash value, it computes a 32 bit value that, when hashed, produces that hash value:

/// Inverts a (h ^= h >> s) operation with 8 <= s <= 16
unsigned int invert_shift_xor(unsigned int hs, unsigned int s)
{
	XENSURE(s >= 8 && s <= 16);
	unsigned hs0 = hs >> 24;
	unsigned hs1 = (hs >> 16) & 0xff;
	unsigned hs2 = (hs >> 8) & 0xff;
	unsigned hs3 = hs & 0xff;

	unsigned h0 = hs0;
	unsigned h1 = hs1 ^ (h0 >> (s-8));
	unsigned h2 = (hs2 ^ (h0 << (16-s)) ^ (h1 >> (s-8))) & 0xff;
	unsigned h3 = (hs3 ^ (h1 << (16-s)) ^ (h2 >> (s-8))) & 0xff;
	return (h0<<24) + (h1<<16) + (h2<<8) + h3;
}

unsigned int murmur_hash_inverse(unsigned int h, unsigned int seed)
{
	const unsigned int m = 0x5bd1e995;
	const unsigned int minv = 0xe59b19bd;	// Multiplicative inverse of m under % 2^32
	const int r = 24;

	h = invert_shift_xor(h,15);
	h *= minv;
	h = invert_shift_xor(h,13);

	unsigned int hforward = seed ^ 4;
	hforward *= m;
	unsigned int k = hforward ^ h;
	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *data = (char *)&k;
		k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
	#endif

	return k;
}

And for reference, here is the full code, with both the regular murmur hash and the inverses for 32- and 64-bit hashes:

unsigned int murmur_hash ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		#ifdef PLATFORM_BIG_ENDIAN
			unsigned int k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
		#else
			unsigned int k = *(unsigned int *)data;
		#endif

		k *= m;
		k ^= k >> r;
		k *= m;

		h *= m;
		h ^= k;

		data += 4;
		len -= 4;
	}

	// Handle the last few bytes of the input array

	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
		h *= m;
	};

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;

	return h;
}

/// Inverts a (h ^= h >> s) operation with 8 <= s <= 16
unsigned int invert_shift_xor(unsigned int hs, unsigned int s)
{
	XENSURE(s >= 8 && s <= 16);
	unsigned hs0 = hs >> 24;
	unsigned hs1 = (hs >> 16) & 0xff;
	unsigned hs2 = (hs >> 8) & 0xff;
	unsigned hs3 = hs & 0xff;

	unsigned h0 = hs0;
	unsigned h1 = hs1 ^ (h0 >> (s-8));
	unsigned h2 = (hs2 ^ (h0 << (16-s)) ^ (h1 >> (s-8))) & 0xff;
	unsigned h3 = (hs3 ^ (h1 << (16-s)) ^ (h2 >> (s-8))) & 0xff;
	return (h0<<24) + (h1<<16) + (h2<<8) + h3;
}

unsigned int murmur_hash_inverse(unsigned int h, unsigned int seed)
{
	const unsigned int m = 0x5bd1e995;
	const unsigned int minv = 0xe59b19bd;	// Multiplicative inverse of m under % 2^32
	const int r = 24;

	h = invert_shift_xor(h,15);
	h *= minv;
	h = invert_shift_xor(h,13);

	unsigned int hforward = seed ^ 4;
	hforward *= m;
	unsigned int k = hforward ^ h;
	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *data = (char *)&k;
		k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
	#endif

	return k;
}

uint64 murmur_hash_64(const void * key, int len, uint64 seed)
{
	const uint64 m = 0xc6a4a7935bd1e995ULL;
	const int r = 47;

	uint64 h = seed ^ (len * m);

	const uint64 * data = (const uint64 *)key;
	const uint64 * end = data + (len/8);

	while(data != end)
	{
		#ifdef PLATFORM_BIG_ENDIAN
			uint64 k = *data++;
			char *p = (char *)&k;
			char c;
			c = p[0]; p[0] = p[7]; p[7] = c;
			c = p[1]; p[1] = p[6]; p[6] = c;
			c = p[2]; p[2] = p[5]; p[5] = c;
			c = p[3]; p[3] = p[4]; p[4] = c;
		#else
			uint64 k = *data++;
		#endif

		k *= m;
		k ^= k >> r;
		k *= m;
		
		h ^= k;
		h *= m;
	}

	const unsigned char * data2 = (const unsigned char*)data;

	switch(len & 7)
	{
	case 7: h ^= uint64(data2[6]) << 48;
	case 6: h ^= uint64(data2[5]) << 40;
	case 5: h ^= uint64(data2[4]) << 32;
	case 4: h ^= uint64(data2[3]) << 24;
	case 3: h ^= uint64(data2[2]) << 16;
	case 2: h ^= uint64(data2[1]) << 8;
	case 1: h ^= uint64(data2[0]);
			h *= m;
	};
 
	h ^= h >> r;
	h *= m;
	h ^= h >> r;

	return h;
}

uint64 murmur_hash_64_inverse(uint64 h, uint64 seed)
{
	const uint64 m = 0xc6a4a7935bd1e995ULL;
	const uint64 minv = 0x5f7a0ea7e59b19bdULL; // Multiplicative inverse of m under % 2^64
	const int r = 47;

	h ^= h >> r;
	h *= minv;
	h ^= h >> r;
	h *= minv;

	uint64 hforward = seed ^ (8 * m);
	uint64 k = h ^ hforward;

	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *p = (char *)&k;
		char c;
		c = p[0]; p[0] = p[7]; p[7] = c;
		c = p[1]; p[1] = p[6]; p[6] = c;
		c = p[2]; p[2] = p[5]; p[5] = c;
		c = p[3]; p[3] = p[4]; p[4] = c;
	#endif
	
	return k;
}

Wednesday, August 24, 2011

An idea for better watch windows

Watch windows suck. I’ve spent a large part of my career looking at them (that’s how those bugs get fixed) and it’s often a frustrating experience.


Visual Studio’s watch window is one of the better ones, but it still has many issues that make the debugging experience a lot less pleasant than it could be.

  • Custom data types such as MyTree, MyHashSet and MyLinkedList are difficult to look at. To get to the content you have to understand the internal data layout and expand the links by hand.
  • I like to pack my resource data into tight static blobs -- file formats for memory. A simple such blob might have a header with a variable number of offsets into a buffer of tightly packed strings. Such memory layouts cannot be described with just C structs and the watch window can’t inspect them. You have to cast pointers by hand or use the Memory view.


I don’t even see the code. All I see is a Hermite-curve-fitted, time-key-sorted, zlib-compressed reload animation.

  • If I have an array with 10 000 floats and one of them is a #NaN, I have no way of finding out except to expand it and scroll through the numbers until I find the bad one.
  • The watch window can’t do reverse lookup of string hashes, so when I see a hash value in the data I have no idea what it refers to.

Yes, I know that some of these things can be fixed. I know that you can get the Visual Studio Debugger to understand your own data types by editing autoexp.dat. And since I’ve done that for all our major collection types (Vector, Deque, Map, SortMap, HashMap, Set, SortSet, HashSet, ConstConfigValue and DynamicConfigValue) I know what a pain it is, and I know I don’t want to do it any more. Also, it doesn’t help the debuggers for the other platforms.

I also know that you can do some tricks with Visual Studio extensions. At my previous company we had reverse hash lookup through a Visual Studio extension. That was also painful to write, and a single platform solution.

So yes, you can fix some things and will make your work environment a little better. But I think we should aim higher.

Consider this: The variable watcher has access to the entire game memory and plenty of time to analyze it. (Variable watching is not a time critical task.)

Imagine what a well written C program that knew the layout of all your data structures could do with that information. It could expand binary trees and display them in a nice view, reverse lookup your hashes, highlight uninitialized 0xdeadbeef variables, spell check your strings, etc.

The idea


So this is my idea: instead of writing plug-ins and extensions for all the IDEs and platforms in the world, we write the watcher as a separate external program. The user starts the program, connects to a process, enters a memory address and a variable type and gets presented with a nice view of the data:


The connection backend would be customizable so that we could use it both for local processes and remote devices (Xbox/PS3). The front end sends an (address, size) request and the backend replies with a bunch of data. So the platform doesn’t matter. As long as there is some way of accessing the memory of the device we can connect it to the watcher.
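
To make the transport concrete, here is a minimal sketch of what such a backend interface might look like (the names are hypothetical; nothing here is tied to a particular platform):

class MemoryBackend
{
public:
	virtual ~MemoryBackend() {}

	// Fill `out` with `size` bytes starting at `address` in the target's
	// address space -- process memory, devkit memory or a file offset.
	// Returns false if the range cannot be read.
	virtual bool read(uint64 address, void *out, unsigned size) = 0;
};

A local process backend would implement read() with the operating system's debugging API, a console backend with the devkit's remote access facilities, and a file backend with a plain seek-and-read.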

We can even use it to look at file contents. All we need is a backend that can return data from different offsets in the file. This works especially well for data blobs, where the file and memory formats are identical. The watcher would function as a general data viewer that could be used for both files and memory.

For this to work, we need a way to describe our data structures to the program. It should understand regular C structs, of course, but we also need some way of describing more complex data, such as variable length objects, offsets, choices, etc. Essentially, what we need is a generic way to describe blobs of structured data, no matter what the format and layout.

I’m not sure what such a description language might look like (or if one already exists), but it might be something loosely based on C structs and then extended to cover more cases. Perhaps something like:

struct Data
{
	zero_terminated char[] name;
	pad_to_4_bytes_alignment;
	platform_endian unsigned count;
	Entry entries[count];
};

The program also needs an extension mechanism so that we can write custom code for processing objects that can’t be described using even this more advanced syntax. This could be used for things like reverse hash lookups, or other queries that depend on external data.

Going further the program could be extended with more visualizers that could allow you to view and edit complex objects in lots of interesting ways:


I think this could be a really useful tool, both for debugging and for inspecting files (as a sort of beefed up hex editor). All I need is some time to write it.

What do you think?

Tuesday, August 9, 2011

Fixing memory issues in Lua

Garbage collection can be both a blessing and a curse. On the one hand, it frees you from manually managing memory. This saves development time, reduces bugs, and avoids tricky decisions about objects' ownerships and lifetimes.

On the other hand, when you do run into memory issues (and you most likely will), they can be a lot harder to diagnose and fix, because you don't have detailed control over how memory is allocated and freed.

In this post I'll show some techniques that you can use to address memory issues in Lua (and by extension, in other garbage collected languages).

All Lua memory issues essentially boil down to one of two things:

Lua uses too much memory
On consoles memory is a precious resource and sometimes Lua is just using too much of it. The root cause can either be memory leaks or badly constructed/bloated data structures.
Garbage collection is taking too long
Too much garbage collection is (not surprisingly) caused by having too much garbage. The code must be rewritten so that it generates less garbage.

Let's look at each issue in turn and see how we can address it.

1. Lua uses too much memory


The first step towards plugging leaks and reducing memory use is to find out where the memory is going. Once we know that, the problems are usually quite easy to fix.

So how do we find out where the memory is going? One way would be to add tracing code to the lua_Alloc() function, but actually there is a much simpler method that doesn't require any C code and is more in line with Lua's dynamic nature. We can just use Lua to count all the objects in the runtime image:

function count_all(f)
	local seen = {}
	local count_table
	count_table = function(t)
		if seen[t] then return end
		f(t)
		seen[t] = true
		for k,v in pairs(t) do
			if type(v) == "table" then
				count_table(v)
			elseif type(v) == "userdata" and not seen[v] then
				seen[v] = true
				f(v)
			end
		end
	end
	count_table(_G)
end

Here we just start with the global table _G and recursively enumerate all subtables and userdata. For each object that we haven't seen before, we call the enumeration function f. This will enumerate all the objects in the Lua runtime that can be reached from _G. Depending on how you use Lua you may also want to add some code for enumerating objects stored in the registry, and recurse over metatables and function upvalues to make sure that you really count all the objects in the runtime.

Once you have a function for enumerating all your Lua objects, there are lots of useful things you can do. When it comes to plugging leaks and reducing memory usage I find one of the most useful things is to count the number of objects of each type:

function type_count()
	local counts = {}
	local enumerate = function (o)
		local t = type_name(o)
		counts[t] = (counts[t] or 0) + 1
	end
	count_all(enumerate)
	return counts
end

Here type_name() is a function that returns the name of an object's type. This function will depend on what kind of class/object system you use in your Lua runtime. One common approach is to have global class objects that also act as metatables for objects:

-- A class
Car = {}
Car.__index = Car

-- A method
function Car.honk(self)
	print "toot"
end

-- An object
local my_car = {}
setmetatable(my_car, Car)

In this case, the type_name() function could look something like this:

global_type_table = nil
function type_name(o)
	if global_type_table == nil then
		global_type_table = {}
		for k,v in pairs(_G) do
			global_type_table[v] = k
		end
		global_type_table[0] = "table"
	end
	return global_type_table[getmetatable(o) or 0] or "Unknown"
end

The object count usually gives you a good idea of where your memory problems lie. For example, if the number of AiPathNode objects constantly rises, you can conclude that you are somehow leaking those objects. If you have 200 000 GridCell objects you should write a smarter grid implementation.

You can also use this enumeration technique to pinpoint problems further if necessary. For example, if you are hunting for leaks, you can rewrite the count_all() function so that it keeps track of the path of keys where each object was found. In this way, you might see that the AiPathNode objects can be accessed through paths like:

_G.managers.ai_managers.active_paths[2027]

Then you know that the source of the leak is that paths never get removed from the active_paths table.

2. Garbage collection is taking too long


Garbage collection is a very cache unfriendly task that can have a significant performance impact. This is especially frustrating since garbage collection doesn't really do anything. Well, it lets your gameplay programmers work faster and with fewer bugs, but when you have reached the optimization phase you tend to forget about that and just swear at the slow collector.

Lua's default garbage collection scheme is not adapted for realtime software and if you just run it straight up you will get lots of disturbing frame rate hitches. As has already been mentioned in previous #AltDevBlogADay articles, it is better to use a step size of 0 and just run the garbage collector for a certain number of milliseconds every frame:

OpaqueTimeValue start = time();
while (milliseconds_elapsed_since(start) < milliseconds_to_run)
	lua_gc(L, LUA_GCSTEP, 0);

Note that you can run this garbage collection on any thread, as long as Lua is not running at the same time, so you might be able to offset some of the cost by running the garbage collection on a background thread while your main thread is doing something non-Lua related.

How much time should you spend on garbage collection? A tricky question. If you spend too little, the garbage will grow and you will eventually run out of memory. If you spend too much, you are wasting precious milliseconds.

My preferred solution is to use a feedback mechanism. I dynamically adjust the garbage collection time so that the amount of garbage always stays below 10 % of the total Lua memory. If the garbage goes above that, I increase the collection time. If the garbage goes below, I decrease the collection time. As with all feedback mechanisms, it is a good idea to plot the curves for memory use and garbage collection time as you tweak the feedback parameters. That way you can verify that the system behaves nicely and that the curves settle down in a stable state rather than going into oscillation.

Choosing the figure 10 % is a balance between memory use and performance. If you choose a higher value, your program will use more memory (because of the increased amount of garbage). On the other hand, you can give the garbage collection a smaller time slice. I've chosen a pretty low number, because on consoles, memory is always precious. If you are targeting a platform with more memory, you can go higher.
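
As a concrete illustration, the feedback step might look something like this (a minimal sketch; the garbage and total byte counts are assumed to be tracked via lua_Alloc(), and the returned budget is fed to the stepping loop above):

// Nudge the per-frame GC budget so that garbage stays below 10 % of
// the total Lua memory. Called once per frame.
float update_gc_budget(float gc_ms, unsigned garbage_bytes, unsigned total_bytes)
{
	const float target = 0.10f;
	float fraction = (float)garbage_bytes / (float)total_bytes;
	gc_ms *= (fraction > target) ? 1.1f : 0.9f;

	// Clamp to a sane range so the budget can always recover.
	if (gc_ms < 0.1f) gc_ms = 0.1f;
	if (gc_ms > 5.0f) gc_ms = 5.0f;
	return gc_ms;
}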

Let's compute how much time we need to spend on garbage collection to stay below a certain fraction 0 <= a <= 1 of garbage. Assume that we complete a full garbage collection cycle (scan all Lua memory) in time t. The amount of garbage generated in that time will be:

t g

Where g is the garbage/s created by the program. To make sure that we stay below a fraction a we must have (where m is the total memory used by the program, including the garbage):

t g <= a m

Assume that we sweep s bytes/s. Then the time t required to sweep the entire memory m will be:

t = m / s

Combining the two equations we get:

s >= g / a

So the amount of garbage collection work we need to do per frame is directly proportional to the amount of garbage / s generated by the program and inversely proportional to the fraction of garbage we are willing to accept. (Note that interestingly, m cancels out of the equation.) For example, a program that generates 1 MB/s of garbage and accepts a = 0.1 must sweep at least 10 MB/s.

So, if we are willing to spend more memory, we can address garbage collection problems by increasing a. But since a can never be higher than 1, there are limits to what we can achieve in this way. A better option, that doesn't cost any memory, is to reduce g -- the amount of garbage generated.

In my experience, most garbage generation problems are "easy mistakes" from sloppy and thoughtless programming. Once you know where the problems are, it is usually not hard to rewrite the code so that garbage generation is avoided. Some useful refactoring techniques are:

  • Update the fields in an existing table instead of creating a new one.
  • Return a reference to an object member rather than a copy. Copy only when needed.
  • Write functions so that they take and return values rather than tables to avoid temporary tables. I.e., make_point(2,3) rather than make_point({2,3}).
  • If you need temporary objects, find a way of reusing them so you don't need to create so many of them.
  • Avoid excessive string concatenation.

Of course a key requirement for this to work is that your Lua-to-C bindings are written so that they don't generate garbage. Otherwise your poor gameplay programmer has no chance. In my opinion, it should be possible to call any C function in a "garbage free" way (though you may choose to also have a more convenient path that does generate garbage). For tips on how to write garbage free bindings, see my previous posts on Lightweight Lua Bindings.

To reduce garbage generation, you need to be able to pinpoint where in the program the garbage is being generated. Luckily, that is not difficult.

Once the game has reached a stable state (total Lua memory doesn't grow or shrink) any allocation made can be considered garbage, because it will soon be freed again (otherwise the Lua memory would keep growing). So to find the garbage all you have to do is to add some tracing code to lua_Alloc that you can trigger when you have reached a stable state.

You can use lua_getstack() to get the current Lua stack trace from inside lua_Alloc and use a HashMap to count the number of allocations associated with each stack trace. If you then sort this data by the number of allocations it is easy to identify the "hotspots" that are generating the most garbage. A gameplay programmer can go through this list and reduce the amount of garbage generation using the tips above.

The code may look something like this:

struct TraceEntry {
	TraceEntry() : alloc_count(0), alloc_bytes(0) {}
	String trace;
	unsigned alloc_count;
	unsigned alloc_bytes;
};
HashMap<uint64, TraceEntry> _traces;

// Inside the engine's lua_Alloc() callback:
if (_tracing_allocs) {
	lua_Debug stack[5] = {0};
	int count = lua_debugger::stack_dump(L, stack, 5);
	uint64 hash = murmur_hash_64(&stack[0], sizeof(lua_Debug)*count, 0);
	TraceEntry &te = _traces[hash];
	te.alloc_count += 1;
	te.alloc_bytes += (new_size - old_size);
	if (te.trace.empty())
		lua_debugger::stack_dump_to_string(L, te.trace);
}
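
Once you have the traces, a simple reporting pass can surface the worst offenders. A sketch (using STL containers rather than the engine's HashMap, and assuming the trace member exposes a c_str() accessor):

#include <algorithm>
#include <vector>
#include <cstdio>

static bool by_alloc_count(const TraceEntry &a, const TraceEntry &b)
{
	return a.alloc_count > b.alloc_count;
}

void dump_worst_traces(std::vector<TraceEntry> traces, unsigned top_n)
{
	// Sort the collected traces so the biggest garbage generators come first.
	std::sort(traces.begin(), traces.end(), by_alloc_count);
	for (unsigned i = 0; i < top_n && i < traces.size(); ++i)
		printf("%u allocs, %u bytes:\n%s\n", traces[i].alloc_count,
			traces[i].alloc_bytes, traces[i].trace.c_str());
}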

In my experience, spending a few hours on fixing the worst hot spots indicated by the trace can reduce the garbage collection time by an order of magnitude.

Sunday, June 26, 2011

Lightweight Lua Bindings

A scripting language, such as Lua, can bring huge productivity gains to a game project. Quick iterations, immediate code reloads and an in-game console with a read-eval-print-loop are invaluable tools. A less obvious benefit is that introducing a scripting language creates a clear dividing line between "engine" and "gameplay" code with a well defined API between them. This is often good for the structure of the engine, at least if you intend to use it for more than one game.

The main drawback is of course performance. It is a scary thing to discover late in a project that the game is slow because the script is doing too much. Especially since bad script performance cannot always be traced back to bugs or bad algorithms. Sure, you get those as well, but you can also get problems with "overall slowness" that cannot easily be traced back to specific bottlenecks or hot spots. There are two reasons for this. First, the slowness of script code compared to C, which means that everything just takes more time. And second, the fact that gameplay code tends to be "connection" rather than "compute" heavy which means there is less to gain from algorithmic improvements.

Part of this is a management issue. It is important to monitor the script performance (on the slowest target platform) throughout the production so that measures can be taken early if it looks like it will become a problem. But in this article I will focus on the technical aspects, specifically the C-to-Lua bindings.

It is important to note that when I am talking about performance in this article I mean performance on current generation consoles, because that is where performance problems occur. PC processors are much more powerful (especially when running virtual machines, which tend to be brutal to the cache). The extra cores on the consoles don't help much with script execution (since scripts are connection heavy, they are hard to multithread). And the PC can run LuaJIT which changes the game completely.

This may of course change in future generation consoles. If anyone from Sony or Microsoft is reading this, please add support for JITting to your next generation ventures.

Lua bindings


Apart from optimizing the Lua interpreter itself, optimizing the bindings between Lua and C is the best way of achieving a general performance improvement, since the bindings are used whenever Lua calls a function in the C code, which in a typical game happens constantly.

The standard way of binding an object on the C side to Lua is to use a full userdata object. This is a heap allocated data blob with an associated metatable that can be used to store the methods of the object. This allows the user to make a call like:

game_world:get_camera():set_position(Vector3(0,0,0))

In many ways, this is the easiest and most convenient way of using objects in Lua, but it comes with several performance problems:

  • Any time an object is passed from C to Lua, such as the camera in get_camera() or the vector created by Vector3(0,0,0), memory for the object must be allocated on the heap. This can be costly.
  • All the heap objects must be garbage collected by Lua. Calls such as get_camera() create temporary objects that must be collected at some later time. The more garbage we create, the more time we need to spend in garbage collection.
  • Making use of many heap allocated objects can lead to bad cache performance. When the C side wants to use an object from Lua, it must first fetch it from Lua's heap, then (in most cases) extract an object pointer from its data and look up the object in the game heap. So each time there is an extra cache miss.
  • The colon method call syntax world:get_camera() actually translates to something like (I've simplified this a bit, see the Lua documentation for details) world._meta_table["get_camera"](world). I.e., it creates an extra table lookup operation for every call.

We can get rid of the first two issues by caching the Lua objects. I.e. instead of creating a new Lua object every time get_camera() is called, we keep a reference to the object on the Lua side and just look it up and return it every time it is requested. But this has other disadvantages. Managing the cache can be tricky and it creates a lot more objects in the Lua heap, since the heap will now hold every object that has ever been touched by Lua. This makes garbage collection take longer and the heap can grow uncontrollably during the play of a level, depending on which objects the player interacts with. Also, this doesn't solve the issue with objects that are truly temporary, such as Vector3(0,0,0).

A better option is to use what Lua calls light userdata. A light userdata is essentially just a C pointer stored in Lua, with no additional information. It lives on the Lua stack (i.e. not the heap), does not require any memory allocations, does not participate in garbage collection and does not have an associated metatable. This addresses all our performance problems, but introduces new (not performance-related) issues:

  • Since the objects don't have metatables we cannot use the convenient colon syntax for calling their methods.
  • Light userdata objects do not carry any type information; they are just raw pointers. So on the C side we have no way of telling if we have been called with an object of the right type.
  • Lifetime management is trickier since objects do not have destructors and are not garbage collected. How do we manage dangling pointers in Lua?

Colon syntax


With light user data we cannot use the colon syntax to look up methods. Instead we must call global functions and pass in the objects as parameters. But we can still make sure to organize our methods nicely, i.e., put all the functions that operate on World objects in a table called World. It might then look something like this:

Camera.set_position(World.get_camera(game_world), Vector3(0,0,0))
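
On the C side, such a function might be bound roughly like this (a sketch; Camera and Vector3 are assumed engine types, and lua_touserdata() is the standard Lua C API call for reading light userdata):

static int lua_camera_set_position(lua_State *L)
{
	// Both arguments arrive as light userdata -- raw pointers into engine memory.
	Camera *camera = (Camera *)lua_touserdata(L, 1);
	const Vector3 *position = (const Vector3 *)lua_touserdata(L, 2);
	camera->set_position(*position);
	return 0;	// number of return values
}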

If you are used to the object-oriented style, this way of writing can feel awkward at first. But in my experience you get accustomed to it quite quickly. It does have some implications which are not purely syntactical, though. On the plus side, this style of writing makes it easy to cache the method lookups for better performance:

local camera_set_position = Camera.set_position
local world_get_camera = World.get_camera

camera_set_position(world_get_camera(game_world), Vector3(0,0,0))

This transformation is so simple that you can easily write a script that performs it on your entire code base.

The main drawback is that we are no longer doing dynamic method lookup, we are calling one specific C method. So we can't do virtual inheritance with method overrides. To me that is not a big problem because firstly, I think inheritance is vastly overrated as a design concept, and secondly, if you really need virtual calls you can always do the virtual method resolution on the C side and get the benefits while still having a static call in Lua.

Type checking


For full userdata we can check the type by looking at the metatable. The Lua library function luaL_checkudata provides this service. Since light userdata is just a raw pointer to Lua, no corresponding functionality is offered. So we need to provide the type checking ourselves. But how can we know the type of an arbitrary C pointer?

An important thing to notice is that type checking is only used for debugging. We only need to know if a function has been called with the right arguments or not. So we don't actually need to know the exact type of the pointer, we just need to know if it points to the thing we expect. And since this is only used for bug detection, it doesn't matter if we get a few false positives. And it is fine if the test takes a few cycles since we can strip it from our release builds.

Since we just need to know "is the object of this type?", we can make the test different for each type and just pick whatever test fits each type best. Some possibilities are:

  • Store a known four-byte type marker at the start of the object's memory. To verify the type, just dereference the pointer and check that the first four bytes match the expected marker. (This is the method I use most frequently; see the sketch after this list.)
  • Keep a hash table of all objects of the specified type and check if it is there.
  • For objects that are allocated from a pool, check that the pointer lies within the range of the pool.
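
A minimal sketch of the marker approach (Camera, the marker value and the DEVELOPMENT define are assumptions for illustration; the check can be stripped from release builds):

#define TYPE_MARKER_CAMERA 0x43414d45	// 'CAME', an arbitrary magic number

struct Camera
{
	unsigned marker;	// always the first member
	// ... actual camera data ...
};

Camera *check_camera(lua_State *L, int index)
{
	Camera *camera = (Camera *)lua_touserdata(L, index);
#if defined(DEVELOPMENT)
	// Dereference and compare -- a few cycles, debug builds only.
	XENSURE(camera && camera->marker == TYPE_MARKER_CAMERA);
#endif
	return camera;
}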

Object lifetimes


There are two approaches you can take to ownership of objects in the Lua interface. They can either be Lua owned and destroyed by the garbage collector or they can be owned by the C side and destroyed by explicit function calls. Both approaches have their advantages, but I usually lean towards the latter one. To me it feels more natural that Lua explicitly creates and destroys cameras with World.destroy_camera() rather than cameras just popping out of existence when the garbage collector feels they are no longer used. Also, since in our engine, Lua is an option, not a requirement, it makes more sense to have the ownership on the C side.

With this approach you have the problem that Lua can hold "dangling pointers" to C objects, which can lead to nasty bugs. (If you took the other approach, you would have the opposite problem, which is equally nasty.)

Again, for debugging purposes, we would want to do something similar to what we did with the type information. We would like to know, in debug builds, if the programmer has passed us a pointer to a dead object, so that we can display an error message rather than exhibit undefined behavior.

This is a trickier issue and I haven't found a clear cut solution, but here are some of the techniques I have used:

  • Clear out the marker field of the object when it is freed. That way if you attempt to use it later you will get a type error. Of course, checking this can cause an access violation if the memory has been returned to the system.
  • For objects that get created and destroyed a lot, such as particles or sound instances, let Lua manage them by IDs rather than by raw pointers.
  • Keep a hash table of all known live objects of the type.
  • Let Lua point to the object indirectly through a handle. Use some bits of the pointer to locate the handle and match the rest to a counter in the handle so that you can detect if the handle has been released and repurposed for something else. (See the sketch after this list.)
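
A minimal sketch of that last scheme (the bit layout and slot count are assumptions for illustration; the generation counter is bumped whenever a slot is freed, so stale handles resolve to null):

struct HandleSlot
{
	void *object;
	unsigned generation;
};

static const unsigned INDEX_BITS = 16;
static const unsigned INDEX_MASK = (1 << INDEX_BITS) - 1;
static HandleSlot _slots[1 << INDEX_BITS];

// The low bits of the handle index a slot, the high bits must match
// the slot's current generation.
void *resolve_handle(unsigned handle)
{
	unsigned index = handle & INDEX_MASK;
	unsigned generation = handle >> INDEX_BITS;
	if (_slots[index].generation != generation)
		return 0;	// slot has been freed (and possibly reused)
	return _slots[index].object;
}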

Conclusions


Using light instead of full userdata does make things more inconvenient. But as we have seen, there are tricks that help overcome many of these inconveniences.

We still haven’t looked at the truly temporary objects, such as Vector3(0,0,0). In my next article I will discuss what can be done about them.

(This has also been posted to the BitSquid blog.)