Sunday, October 23, 2011

Low Level Animation -- Part 2

Some time ago I wrote an article describing how animation compression is implemented in the BitSquid engine. In that article I made a vague promise that I would follow up with a description of how to pack the data in a cache-friendly way. Now, the time has come to deliver on that vague promise.

A quick recap: After curve fitting, each track of our animation consists of a number of curve points that describe the curve for each animation track:


By an animation track I mean the animation of a single parameter, typically the position or rotation of a bone.

The data for the track is a sequence of times and curve data:


Here t_i is the time of a curve point and A_i is the corresponding curve data.

To evaluate the curve at any particular point t we need the curve points both before and after the time t


Depending on what curve type you use (hermite, bezier, b-spline, etc) you might actually need more than two curve points to evaluate a segment, but that doesn’t really affect the discussion in this article, so for the sake of simplicity, let’s stick with two.

Note that the time points for the different tracks in the animation typically do not match up. For example, one curve may be completely flat and only require one sample at the start and one sample at the end. Another curve may be complicated and require lots of samples.

To simplify the discussion further, assume that the animation only contains two tracks (it is easy to generalize the solution to more tracks). We will call the curve points of one (t_i, A_i) and the curve points of the other (s_i, B_i):


How can we organize this data to be as cache friendly as possible?

The most natural approach is perhaps to sort the data first by track and then by time. Let’s see what this means for the cache. To evaluate the animation for some particular time t, we have to go into the data for each track at that time to look up the two neighboring curve points. Let’s assume that we have somehow cached our current position in each track, so that we don’t have to search for it, we will still have at least one cache miss for each track. A modern character can have over 100 bones, with two tracks per bone. That’s 200 cache misses for just a single frame of a single animation.

To do better, we need to organize the data by time somehow. But it is not immediately clear how. Just sorting the data by time won’t help, because then a flat curve with just two curve points, one at the beginning and one at the end, will have them at complete opposite ends of the data and no matter what we do we will get cache misses when touching them.

Let’s consider all the data we need to evaluate the tracks at time t. We need (t_i, A_i), (t_i+1, A_i+1) and (s_j, B_j), (s_j+1, B_j+1) where t_i <= t <= t_i+1 and s_j <= t <= s_j+1. This is our ”hot” data, because we will need to refer to it several times as we evaluate the curve at different points in time. In fact, we can keep using this same data until we reach whichever is smallest of t_i+1 and s_j+1. A general rule in memory access optimization is to keep the ”hot” data together, so let’s create an additional data structure, an array with the currently active curve points for a playing animation instance.


Now we’re getting somewhere. Not only have we significantly improved the cache behavior; as long as we don’t need to fetch new curve points we only need to refer to the active array, a single memory access. We have also decomposed our animation evaluation problem into two simpler tasks: evaluating curves and fetching new curve points. This makes our code both simpler and more flexible.

Let’s look at the second issue, fetching new curve points. In the example above, when we reach the time t_i+1 we will need to fetch the new curve point (t_i+2, A_i+2) and when we reach the time s_j+1 we will need to fetch (s_j+2, B_j+2).


Generalizing, we always need to fetch the point (t_i, A_i) at the time t_i-1, and we always need to fetch the point (s_i, B_i) at the time s_i-1. This is excellent, because since we now the time when each of our curve points will be needed we can put them all in a single stream of data which is sorted by the time when they will be needed.


This means that our animation player only needs to keep a single pointer into the animation stream. That pointer will always point to the next curve point that needs to be moved to the active list. As time is advanced, curve points are copied from the animation data into the active list and then the curve is evaluated.


Note the excellent cache behavior this gives us. To fetch new curve points, we just move a pointer forward in memory. And then, to evaluate the curves, we just need to access our active array, a single continuous memory block. This gives us a grand total of just two memory accesses.

Another nice property is that since we are now accessing the animation data as a stream (strictly linearly, from beginning to end) we can gzip it and get another factor two of compression. We can also easily stream it from disk.

One drawback of this system is that it only supports playing an animation forward, you cannot jump to a particular time in an animation without ”fast forwarding” through all intermediate curve points.

If you need support for jumping, the easiest way to achieve it is perhaps to add a separate index with jump frames. A jump frame consists of the state of the active array at some point in time, together with an offset into the data stream. In other words, all the state information that the animation player needs to jump to that time point and resume playing.

Using jump frames let’s you balance performance and memory use. If you add more jump frames you will use more memory but on the other hand, you will be able to find a jump frame closer to the time you actually want to go to which means less fast forwarding.

Saturday, October 8, 2011

Caring by Sharing: Header Hero


Compile times get worse over time, that is the second law of C++ programming dynamics. There are many small day-to-day changes that each exacerbate the problem slightly: The project grows. New header files get included. Clever templates get written. And so on. There are comparatively few forces that work in the other direction. Once an #include has been added, it stays.

The only exception is when some hero steps up, says Enough! and starts to crunch down on those header files. It is thankless menial work that offers few rewards, save the knowledge that you are contributing to the public good.

Today, I want to give something back to these unsung heroes, so I’ve made a small tool to make their drudgery a bit less... drudgery-ish? It is called Header Hero:


To run Header Hero you specify the directories where your .cpp files can be found as well as the directories to search for included headers. The program scans your .h and .cpp files to find all the include links. It presents the result in a summarized report that shows you what the worst headers are. You can think of it as a header file profiler.

You don’t need to specify all your include directories, but only the ones you have specified will be scanned.

I’ve focused on making the tool fast by caching as much information as possible and using a simple parser that just looks for #include patterns rather than running the real C preprocessor. The downside is that if you are using any fancy preprocessor tricks, they will most likely be missed. On the other hand, the tool can scan a huge project in seconds. And after the initial scan, new scans can be done in a fraction of that time.

The program produces a report that looks something like this:


At the top are some statistics, such as the total number of files and lines in the project. Total Parsed counts how many lines that would actually be parsed in a full recompile of the project. So, a header that is included by several .cpp files adds to that number every time. The Blowup Factor are the last two items divided. It specifies how many times, on average, each line gets parsed. A value of 35 means that on average, each line in our project is parsed 35 times. That seems quite a lot.

Below the summary are a list of the header files sorted by how many lines they contributed to the Total Parsed number. In other words, the size of that file multiplied by the number of times it was included.

Looking at the sample report above, it seems pretty reasonable. At the top we find big templated collection classes (map, set, string, vector) that have big header files and are used in a lot of places. Math (matrix4x4, vector3) and utility (critical_section, file_system) files also end up high on the list.

But when you dig into it, there are also things that seem a bit fishy. Set<T> is not a very popular collection class. Sets are used less than maps, and HashSet is usually preferable to Set. Why does it end up so high on the list? What is shader.h doing there? That seems too specialized to end up so high. And file_system.h? There shouldn’t be that much code that directly accesses the file system, only the resource loader needs to do that.

To answer those questions, you can click on any file in the report to get a detailed view of its relations:


In the middle we find the file we are looking at. To the left are the files that directly include it. The number after each file name specifies how many files that directly or indirectly include that file. To the right are the files included by the file. The numbers are all the files directly or indirectly included by those files. You can double click on any file name in the view to refocus on it.

Here we clearly see that the main culprit is data_compiler.h. It includes set.h and is in turn included by 316 other files. To fix the compile times we can make data_compiler.h not include set.h or we can try to reduce the number of files that include data_compiler.h (that number also seems high). If we also fix scene_graph.h we can really make a difference.

Breaking dependencies is a whole topic in itself, especially when it comes to templates and inlined code. Here are some quick tips though:

1) Predeclare the structs and classes that you use instead of including the header file. Don’t forget that you can predeclare templates and typedefs as well as regular classes:

class MyClass;
typedef int Id;
template <class T> class Vector;

2) Predeclared types can only be used as pointers and references. You can’t have a member variable of a type whose actual size is unknown. So you may have to change your member variables to pointers in order to get rid of the header dependency. You can also use the pimpl idiom, if you can live with the extra indirection and lack of inlining.

3) Switching from in-place variables to pointers can lead to bad memory access patterns. One way of fixing that is to placement new the object directly into a raw memory buffer.

// a.h

class B;

class A {
    A();
    B *_b;
    static const int SIZE_OF_B = 20;
    char _b_storage[SIZE_OF_B];
};

// a.cpp

#include ”b.h”

A::A()
{
    XASSERT(sizeof(B) == SIZE_OF_B);
    _b = new (_b_storage) B();
}

With this technique, you get the data for B stored inside A, without having to include the b.h header in a.h. But the code isn’t exactly easy to read, so you should only use this in desperate situations.

4) For files with small type definitions, but lots of inlined methods (e.g., matrix4x4.h), a good strategy is to split the file, so you have just the type in one file and all the methods in the other. Header files can then include just the type definition, while .cpp files pull in the whole shebang.

Using these techniques you can get rid of the header dependencies one by one, until you are back at reasonable compile times. Since a rescan takes just a fraction of a second it is easy to see how your changes affect the compile time. Just make sure you have your integration test running, it is easy to break build configurations when you are fiddling around with the headers.

Here is the result of about a day and a half of header optimization in our code base:


From 6 million to 4.3 million lines, that’s not too shabby. We can now do a complete rebuild in 37 seconds on a reasonably modern machine. With this tool we can hopefully keep that number.

You can download the C# source code here. Feel free to do whatever you like with it:

Friday, September 23, 2011

Managing Decoupling Part 4 -- The ID Lookup Table

Today I am going to dig deeper into an important and versatile data structure that pops up all the time in the BitSquid engine -- the ID lookup table.

I have already talked about the advantages of using IDs to refer to objects owned by other systems, but let me just quickly recap.

IDs are better than direct pointers because we don’t get dangling references if the other system decides that the object needs to be destroyed.

IDs are better than shared_ptr<> and weak_ptr<> because it allows the other system to reorganize its objects in memory, delete them at will and doesn’t require thread synchronization to maintain a reference count. They are also POD (plain old data) structures, so they can be copied and moved in memory freely, passed back and forth between C++ and Lua, etc.

By an ID I simply mean an opaque data structure of n bits. It has no particular meaning to us, we just use it to refer to an object. The system provides the mechanism for looking up an object based on it. Since we seldom create more than 4 billion objects, 32 bits is usually enough for the ID, so we can just use a standard integer. If a system needs a lot of objects, we can go to 64 bits.

In this post I’m going to look at what data structures a system might use to do the lookup from ID to system object. There are some requirements that such data structures need to fulfill:

  • There should be a 1-1 mapping between live objects and IDs.
  • If the system is supplied with an ID to an old object, it should be able to detect that the object is no longer alive.
  • Lookup from ID to object should be very fast (this is the most common operation).
  • Adding and removing objects should be fast.

Let’s look at three different ways of implementing this data structure, with increasing degrees of sophistication.

The STL Method


The by-the-book object oriented approach is to allocate objects on the heap and use a std::map to map from ID to object.

typedef unsigned ID;

struct System
{
	ID _next_id;
	std::map<ID, Object *> _objects;

	System() {_next_id = 0;}

	inline bool has(ID id) {
		return _objects.count(id) > 0;
	}
	
	inline Object &lookup(ID id) {
		return *_objects[id];
	}
	
	inline ID add() {
		ID id = _next_id++;
		Object *o = new Object();
		o->id = id;
		_objects[id] = o;
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		_objects.erase(id);
		delete &o;
	}
};

Note that if we create more than four billion objects, the _next_id counter will wrap around and we risk getting two objects with the same ID.

Apart from that, the only problem with this solution is that it is really inefficient. All objects are allocated individually on the heap, which gives bad cache behavior and the map lookup results in tree walking which is also bad for the cache. We can switch the map to a hash_map for slightly better performance, but that still leaves a lot of unnecessary pointer chasing.

Array With Holes


What we really want to do is to store our objects linearly in memory, because that will give us the best possible cache behavior. We can either use a fixed size array Object[MAX_SIZE] if we know the maximum number of objects that will ever be used, or we can be more flexible and use a std::vector.

Note: If you care about performance and use std::vector<T> you should make a variant of it (call it array<T> for example) that doesn’t call constructors or initializes memory. Use that for simple types, when you don’t care about initialization. A dynamic vector<T> buffer that grows and shrinks a lot can spend a huge amount of time doing completely unnecessary constructor calls.

To find an object in the array, we need to know its index. But just using the index as ID is not enough, because the object might have been destroyed and a new object might have been created at the same index. To check for that, we also need an id value, as before. So we make the ID type a combination of both:

struct ID {
	unsigned index;
	unsigned inner_id;
};

Now we can use the index to quickly look up the object and the inner_id to verify its identity.

Since the object index is stored in the ID which is exposed externally, once an object has been created it cannot move. When objects are deleted they will leave holes in the array.


When we create new objects we don’t just want to add them to the end of the array. We want to make sure that we fill the holes in the array first.

The standard way of doing that is with a free list. We store a pointer to the first hole in a variable. In each hole we store a pointer to the next hole. These pointers thus form a linked list that enumerates all the holes.


An interesting thing to note is that we usually don’t need to allocate any memory for these pointers. Since the pointers are only used for holes (i. e. dead objects) we can reuse the objects’ own memory for storing them. The objects don’t need that memory, since they are dead.

Here is an implementation. For clarity, I have used an explicit member next in the object for the free list rather than reusing the object’s memory:

struct System
{
	unsigned _next_inner_id;
	std::vector<Object> _objects;
	unsigned _freelist;

	System() {
		_next_inner_id = 0;
		_freelist = UINT_MAX;
	}

	inline bool has(ID id) {
		return _objects[id.index].id.inner_id == id.inner_id;
	}
	
	inline Object &lookup(ID id) {
		return _objects[id.index];
	}
	
	inline ID add() {
		ID id;
		id.inner_id = _next_inner_id++;
		if (_freelist == UINT_MAX) {
			Object o;
			id.index = _objects.size();
			o.id = id;
			_objects.push_back(o);
		} else {
			id.index = _freelist;
			_freelist = _objects[_freelist].next;
		}
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		o.id.inner_id = UINT_MAX;
		o.next = _freelist;
		_freelist = id.index;
	}
};

This is a lot better than the STL solution. Insertion and removal is O(1). Lookup is just array indexing, which means it is very fast. In a quick-and-dirty-don’t-take-it-too-seriously test this was 40 times faster than the STL solution. In real-life it all depends on the actual usage patterns, of course.

The only part of this solution that is not an improvement over the STL version is that our ID structs have increased from 32 to 64 bits.

There are things that can be done about that. For example, if you never have more than 64 K objects live at the same time, you can get by with 16 bits for the index, which leaves 16 bits for the inner_id. Note that the inner_id doesn’t have to be globally unique, it is enough if it is unique for that index slot. So a 16 bit inner_id is fine if we never create more than 64 K objects in the same index slot.

If you want to go down that road you probably want to change the implementation of the free list slightly. The code above uses a standard free list implementation that acts as a LIFO stack. This means that if you create and delete objects in quick succession they will all be assigned to the same index slot which means you quickly run out of inner_ids for that slot. To prevent that, you want to make sure that you always have a certain number of elements in the free list (allocate more if you run low) and rewrite it as a FIFO. If you always have N free objects and use a FIFO free list, then you are guaranteed that you won’t see an inner_id collision until you have created at least N * 64 K objects.

Of course you can slice and dice the 32 bits in other ways if you hare different limits on the maximum number of objects. You have to crunch the numbers for your particular case to see if you can get by with a 32 bit ID.

Packed Array


One drawback with the approach sketched above is that since the index is exposed externally, the system cannot reorganize its objects in memory for maximum performance.

The holes are especially troubling. At some point the system probably wants to loop over all its objects and update them. If the object array is nearly full, no problem, But if the array has 50 % objects and 50 % holes, the loop will touch twice as much memory as necessary. That seems suboptimal.

We can get rid of that by introducing an extra level of indirection, where the IDs point to an array of indices that points to the objects themselves:


This means that we pay the cost of an extra array lookup whenever we resolve the ID. On the other hand, the system objects are packed tight in memory which means that they can be updated more efficiently. Note that the system update doesn’t have to touch or care about the index array. Whether this is a net win depends on how the system is used, but my guess is that in most cases more items are touched internally than are referenced externally.

To remove an object with this solution we use the standard trick of swapping it with the last item in the array. Then we update the index so that it points to the new location of the swapped object.

Here is an implementation. To keep things interesting, this time with a fixed array size, a 32 bit ID and a FIFO free list.

typedef unsigned ID;

#define MAX_OBJECTS 64*1024
#define INDEX_MASK 0xffff
#define NEW_OBJECT_ID_ADD 0x10000

struct Index {
	ID id;
	unsigned short index;
	unsigned short next;
};

struct System
{
	unsigned _num_objects;
	Object _objects[MAX_OBJECTS];
	Index _indices[MAX_OBJECTS];
	unsigned short _freelist_enqueue;
	unsigned short _freelist_dequeue;

	System() {
		_num_objects = 0;
		for (unsigned i=0; i<MAX_OBJECTS; ++i) {
			_indices[i].id = i;
			_indices[i].next = i+1;
		}
		_freelist_dequeue = 0;
		_freelist_enqueue = MAX_OBJECTS-1;
	}

	inline bool has(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		return in.id == id && in.index != USHRT_MAX;
	}
	
	inline Object &lookup(ID id) {
		return _objects[_indices[id & INDEX_MASK].index];
	}
	
	inline ID add() {
		Index &in = _indices[_freelist_dequeue];
		_freelist_dequeue = in.next;
		in.id += NEW_OBJECT_ID_ADD;
		in.index = _num_objects++;
		Object &o = _objects[in.index];
		o.id = in.id;
		return o.id;
	}
	
	inline void remove(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		
		Object &o = _objects[in.index];
		o = _objects[--_num_objects];
		_indices[o.id & INDEX_MASK].index = in.index;
		
		in.index = USHRT_MAX;
		_indices[_freelist_enqueue].next = id & INDEX_MASK;
		_freelist_enqueue = id & INDEX_MASK;
	}
};

Thursday, September 8, 2011

A Simple Roll-Your-Own Documentation System

I like to roll my own documentation systems. There, I’ve said it. Not for inline documentation, mind you. For that there is Doxygen and for that I am grateful. Because while I love coding, there is fun coding and not-so-fun coding, and writing C++ parsers tends to fall in the latter category.

So for inline documentation I use Doxygen, but for everything else, I roll my own. Why?

I don’t want to use Word or Pages or any other word processing program because I want my documents to be plain text that can be diffed and merged when necessary. And I want to be able to output it as clean HTML or in any other format I may like.

I don’t want to use HTML or LaTeX or any other presentation-oriented language, because I want to be able to massage the content in various ways before presenting it. Reordering it, adding an index or a glossary, removing deprecated parts, etc. Also, writing <p> gets boring very quickly.

I don’t want to use a Wiki, because I want to check in my documents together with the code, so that code versions and document versions match in the repository. I definitely don’t want to manage five different Wikis, corresponding to different engine release versions. Also, Wiki markup languages tend to be verbose and obtuse.

I could use an existing markup language, such as DocBook, Markdown or ReStructured Text. But all of them contain lots of stuff that I don’t need and lack some stuff that I do need. For example I want to include snippets of syntax highlighted Lua code, margin notes and math formulas. And I want to do it in a way that is easy to read and easy to write. Because I want there to be as few things as possible standing in the way of writing good documentation.

So I roll my own. But as you will see, it is not that much work.

I’ve written a fair number of markup systems over the years (perhaps one too many, but hey, that is how you learn) and I’ve settled on a pretty minimalistic structure that can be implemented in a few hundred lines of Ruby. In general, I tend to favor simple minimalistic systems over big frameworks that try to ”cover everything”. Covering everything is usually impossible and when you discover that you need new functionality, the lightweight systems are a lot easier to extend than the behemoths.

There are two basic components to the system. Always two there are, a parser and a generator. The parser reads the source document and converts it to some kind of structured representation. The generator takes the structured representation and converts it to an output format. Here I’ll only consider HTML, because to me that is the only output format that really matters.

To have something concrete to talk about, let’s use this source document, written in a syntax that I just made up:

@h1 Flavors of ice cream

My favorite ice cream flavors are:

@li Strawberry
@li Seagull

The Parser


The most crucial point of the system is what the structured representation should look like. How should the parser communicate with the generator? My minimalistic solution is to just let the representation be a list of lines, with each line consisting of a type marker and some text.

(:h1, ”Flavors of...”)
(:empty, ””)
(:text, ”My favorite...”)
(:empty, ””)
(:li, ”Strawberry”)
(:li, ”Seagull”)

To some this will probably seem like complete heresy. Surely I need some kind of hierarchical representation. How can I otherwise represent things like a list-in-a-list-in-a-cat-in-a-hat?

No problem, to represent a list item nested in another list, I just use a @li_li tag and a corresponding :li_li type marker. If someone wants three or more levels of nesting I suggest that they rewrite their document. This is supposed to be readable documentation, not Tractatus Logico-Philosophicus. I simply don’t think that deep nesting is important enough to warrant a complicated hierarchical design. As I said, I prefer the simple things in life.

So, now that we know the output format, we can write the parser in under 20 lines:

class Parser
  attr_reader :lines
  
  def initialize()
    @lines = []
  end
  
  def parse(line)
    case line
    when /^$/
      @lines << {:type => :empty, :line => ""}
    when /@(\S+)\s+(.*)$/
      @lines << {:type => $1.intern, :line => $2}
    when /^(.*)$/
      @lines << {:type => :text, :line => line}
    end
  end
end

Of course you can go a lot fancier with the parser than this. For example, you can make a more Markdown-like syntax where you create lists by just starting lines with bullet points. But this doesn’t really change the basic structure, you just need to add more whens in your case-statement.

One useful approach, as you make more advanced parsers, is to have markers that put the parser in a particular state. For example, you could have a marker @lua that made the parser consider all the lines following it to be of type :lua until the marker @endlua was reached.

The Generator


A useful trick when writing HTML generators is to always keep track of the HTML tags that you have currently opened. This lets you write a method context(tags) which takes a list of tags as arguments and closes and opens tags so that exactly the tags specified in the list are open.

With such a method available, it is simple to write the code for outputting tags:

class Generator
  def h1(line)
    context(%W(h1 #{"a name=\"#{line}\""}))
    print line
  end
  
  def text(line)
    context(%w(p))
    print line
  end

  def empty(line)
    context(%w())
    print line
  end
  
  def li(line)
    context(%w(ul li))
    print line
    context(%w(ul))
  end
end

Notice how this works. The li() method makes sure that we are in a <ul> <li> context, so it closes all other open tags and opens the right ones. Then, after printing its content, it says that the context should just be <ul> which forces the closure of the <li> tag. If we wanted to support the :li_li tag, mentioned above, we could write it simply as:

class Generator
  def li_li(line)
    context(%w(ul li ul li))
    print line
    context(%w(ul li ul))
  end
end

Notice also that this approach allows us to just step through the lines in the data structure and print them. We don’t have to look back and forward in the data structure to find out where a <ul> should begin and end.

The rest of the Generator class implements the context() function and handles indentation:

class Generator
  def initialize()
    @out = ""
    @context = []
    @indent = 0
  end
  
  def print(s)
    @out << ("  " * @indent) << s << "\n"
  end
  
  def open(ci)
    print "<#{ci}>"
    @indent += 1
  end
  
  def close(ci)
    @indent -= 1
    print "</#{ci[/^\S*/]}>"
  end
  
  def context(c)
    i = 0
    while @context[i] != nil && @context[i] == c[i]
      i += 1
    end
    while @context.size > i
      close(@context.last)
      @context.pop
    end
    while c.size > @context.size
      @context.push( c[@context.size] )
      open(@context.last)
    end
  end
  
  def format(lines)
    lines.each {|line| self.send(line[:type], line[:line])
    context(%w())
    return @out
  end
end

Used as:

parser = Parser.new
text.each_line {|line| parser.parse(line)}
puts Generator.new.format(parser.lines)

So there you have it, the start of a custom documentation system, easy to extend with new tags in under 100 lines of Ruby code.

There are some things I haven’t touched on here, like TOC generation or inline formatting (bold and emphasized text). But it is easy to write them as extensions of this basic system. For example, the TOC could be generated with an additional pass over the structured data. If there is enough interest I could show an example in a follow-up post.

Thursday, August 25, 2011

Code Snippet: Murmur hash inverse / pre-image

Today's caring by sharing. I needed this non-trivial code snippet today and couldn't find it anywhere on the internet, so here it is for future reference. It computes the inverse / pre-image of a murmur hash. I. e., given a 32 bit murmur hash value, it computes a 32 bit value that when hashed produces that hash value:

/// Inverts a (h ^= h >> s) operation with 8 <= s <= 16
unsigned int invert_shift_xor(unsigned int hs, unsigned int s)
{
	XENSURE(s >= 8 && s <= 16);
	unsigned hs0 = hs >> 24;
	unsigned hs1 = (hs >> 16) & 0xff;
	unsigned hs2 = (hs >> 8) & 0xff;
	unsigned hs3 = hs & 0xff;

	unsigned h0 = hs0;
	unsigned h1 = hs1 ^ (h0 >> (s-8));
	unsigned h2 = (hs2 ^ (h0 << (16-s)) ^ (h1 >> (s-8))) & 0xff;
	unsigned h3 = (hs3 ^ (h1 << (16-s)) ^ (h2 >> (s-8))) & 0xff;
	return (h0<<24) + (h1<<16) + (h2<<8) + h3;
}

unsigned int murmur_hash_inverse(unsigned int h, unsigned int seed)
{
	const unsigned int m = 0x5bd1e995;
	const unsigned int minv = 0xe59b19bd;	// Multiplicative inverse of m under % 2^32
	const int r = 24;

	h = invert_shift_xor(h,15);
	h *= minv;
	h = invert_shift_xor(h,13);

	unsigned int hforward = seed ^ 4;
	hforward *= m;
	unsigned int k = hforward ^ h;
	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *data = (char *)&k;
		k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
	#endif

	return k;
}
And for reference, here is the full code, with both the regular murmur hash and the inverses for 32- and 64-bit hashes:
unsigned int murmur_hash ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		#ifdef PLATFORM_BIG_ENDIAN
			unsigned int k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
		#else
			unsigned int k = *(unsigned int *)data;
		#endif

		k *= m;
		k ^= k >> r;
		k *= m;

		h *= m;
		h ^= k;

		data += 4;
		len -= 4;
	}

	// Handle the last few bytes of the input array

	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
		h *= m;
	};

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;

	return h;
}

/// Inverts a (h ^= h >> s) operation with 8 <= s <= 16
unsigned int invert_shift_xor(unsigned int hs, unsigned int s)
{
	XENSURE(s >= 8 && s <= 16);
	unsigned hs0 = hs >> 24;
	unsigned hs1 = (hs >> 16) & 0xff;
	unsigned hs2 = (hs >> 8) & 0xff;
	unsigned hs3 = hs & 0xff;

	unsigned h0 = hs0;
	unsigned h1 = hs1 ^ (h0 >> (s-8));
	unsigned h2 = (hs2 ^ (h0 << (16-s)) ^ (h1 >> (s-8))) & 0xff;
	unsigned h3 = (hs3 ^ (h1 << (16-s)) ^ (h2 >> (s-8))) & 0xff;
	return (h0<<24) + (h1<<16) + (h2<<8) + h3;
}

unsigned int murmur_hash_inverse(unsigned int h, unsigned int seed)
{
	const unsigned int m = 0x5bd1e995;
	const unsigned int minv = 0xe59b19bd;	// Multiplicative inverse of m under % 2^32
	const int r = 24;

	h = invert_shift_xor(h,15);
	h *= minv;
	h = invert_shift_xor(h,13);

	unsigned int hforward = seed ^ 4;
	hforward *= m;
	unsigned int k = hforward ^ h;
	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *data = (char *)&k;
		k = (data[0]) + (data[1] << 8) + (data[2] << 16) + (data[3] << 24);
	#endif

	return k;
}

uint64 murmur_hash_64(const void * key, int len, uint64 seed)
{
	const uint64 m = 0xc6a4a7935bd1e995ULL;
	const int r = 47;

	uint64 h = seed ^ (len * m);

	const uint64 * data = (const uint64 *)key;
	const uint64 * end = data + (len/8);

	while(data != end)
	{
		#ifdef PLATFORM_BIG_ENDIAN
			uint64 k = *data++;
			char *p = (char *)&k;
			char c;
			c = p[0]; p[0] = p[7]; p[7] = c;
			c = p[1]; p[1] = p[6]; p[6] = c;
			c = p[2]; p[2] = p[5]; p[5] = c;
			c = p[3]; p[3] = p[4]; p[4] = c;
		#else
			uint64 k = *data++;
		#endif

		k *= m;
		k ^= k >> r;
		k *= m;
		
		h ^= k;
		h *= m;
	}

	const unsigned char * data2 = (const unsigned char*)data;

	switch(len & 7)
	{
	case 7: h ^= uint64(data2[6]) << 48;
	case 6: h ^= uint64(data2[5]) << 40;
	case 5: h ^= uint64(data2[4]) << 32;
	case 4: h ^= uint64(data2[3]) << 24;
	case 3: h ^= uint64(data2[2]) << 16;
	case 2: h ^= uint64(data2[1]) << 8;
	case 1: h ^= uint64(data2[0]);
			h *= m;
	};
 
	h ^= h >> r;
	h *= m;
	h ^= h >> r;

	return h;
}

uint64 murmur_hash_64_inverse(uint64 h, uint64 seed)
{
	const uint64 m = 0xc6a4a7935bd1e995ULL;
	const uint64 minv = 0x5f7a0ea7e59b19bdULL; // Multiplicative inverse of m under % 2^64
	const int r = 47;

	h ^= h >> r;
	h *= minv;
	h ^= h >> r;
	h *= minv;

	uint64 hforward = seed ^ (8 * m);
	uint64 k = h ^ hforward;

	k *= minv;
	k ^= k >> r;
	k *= minv;

	#ifdef PLATFORM_BIG_ENDIAN
		char *p = (char *)&k;
		char c;
		c = p[0]; p[0] = p[7]; p[7] = c;
		c = p[1]; p[1] = p[6]; p[6] = c;
		c = p[2]; p[2] = p[5]; p[5] = c;
		c = p[3]; p[3] = p[4]; p[4] = c;
	#endif
	
	return k;
}

Wednesday, August 24, 2011

An idea for better watch windows

Watch windows suck. I’ve spent a large part of my career looking at them (that’s how those bugs get fixed) and it’s often a frustrating experience.


Visual Studio’s watch window is one of the better ones, but it still has many issues that make the debugging experience a lot less pleasant than it could be.

  • Custom data types such as MyTree, MyHashSet and MyLinkedList are difficult to look at. To get to the content you have to understand the internal data layout and expand the links by hand.
  • I like to pack my resource data into tight static blobs -- file formats for memory. A simple such blob might have a header with a variable number of offsets into a buffer of tightly packed strings. Such memory layouts cannot be described with just C structs and the watch window can’t inspect them. You have to cast pointers by hand or use the Memory view.


I don’t even see the code. All I see is a hermite curve fitted, time key sorted, zlib compressed reload animation.

  • If I have an array with 10 000 floats and one of them is a #NaN, I have no way of finding out except to expand it and scroll through the numbers until I find the bad one.
  • The watch window can’t do reverse lookup of string hashes, so when I see a hash value in the data I have no idea what it refers to.

Yes, I know that some of these things can be fixed. I know that you can get the Visual Studio Debugger to understand your own data types by editing autoexp.dat. And since I’ve done that for all our major collection types (Vector, Deque, Map, SortMap, HashMap, Set, SortSet, HashSet, ConstConfigValue and DynamicConfigValue) I know what a pain it is, and I know I don’t want to do it any more. Also, it doesn’t help the debuggers for the other platforms.

I also know that you can do some tricks with Visual Studio extensions. At my previous company we had reverse hash lookup through a Visual Studio extension. That was also painful to write, and a single platform solution.

So yes, you can fix some things and will make your work environment a little better. But I think we should aim higher.

Consider this: The variable watcher has access to the entire game memory and plenty of time to analyze it. (Variable watching is not a time critical task.)

Imagine what a well written C program that knew the layout of all your data structures could do with that information. It could expand binary trees and display them in a nice view, reverse lookup your hashes, highlight uninitialized 0xdeadbeef variables, spell check your strings, etc.

The idea


So this is my idea: instead of writing plug-ins and extensions for all the IDEs and platforms in the world, we write the watcher as a separate external program. The user starts the program, connects to a process, enters a memory address and a variable type and gets presented with a nice view of the data:


The connection backend would be customizable so that we could use it both for local processes and remote devices (Xbox/PS3). The front end sends an (address, size) request and the backend replies with a bunch of data. So the platform doesn’t matter. As long as there is some way of accessing the memory of the device we can connect it to the watcher.

We can even use it to look at file contents. All we need is a backend that can return data from different offsets in the file. This works especially well for data blobs, where the file and memory formats are identical. The watcher would function as a general data viewer that could be used for both files and memory.

For this to work, we need a way to describe our data structures to the program. It should understand regular C structs, of course, but we also need some way of describing more complex data, such as variable length objects, offsets, choices, etc. Essentially, what we need is a generic way to describe blobs of structured data, no matter what the format and layout.

I’m not sure what such a description language might look like (or if one already exists), but it might be something loosely based on C structs and then extended to cover more cases. Perhaps something like:

struct Data
{
	zero_terminated char[] name;
	pad_to_4_bytes_alignment;
	platform_endian unsigned count;
	Entry entries[count];
};

The program also needs an extension mechanism so that we can write custom code for processing objects that can’t be described using even this more advanced syntax. This could be used for things like reverse hash lookups, or other queries that depend on external data.

Going further the program could be extended with more visualizers that could allow you to view and edit complex objects in lots of interesting ways:


I think this could be a really useful tool, both for debugging and for inspecting files (as a sort of beefed up hex editor). All I need is some time to write it.

What do you think?

Tuesday, August 9, 2011

Fixing memory issues in Lua

Garbage collection can be both a blessing and a curse. On the one hand, it frees you from manually managing memory. This saves development time, reduces bugs, and avoids tricky decisions about objects' ownerships and lifetimes.

On the other hand, when you do run into memory issues (and you most likely will), they can be a lot harder to diagnose and fix, because you don't have detailed control over how memory is allocated and freed.

In this post I'll show some techniques that you can use to address memory issues in Lua (and by extension, in other garbage collected languages).

All Lua memory issues essentially boil down to one of two things:

Lua uses too much memory
On consoles memory is a precious resource and sometimes Lua is just using too much of it. The root cause can either be memory leaks or badly constructed/bloated data structures.
Garbage collection is taking too long
Too much garbage collection is (not surprisingly) caused by having too much garbage. The code must be rewritten so that it generates less garbage.

Let's look at each issue in turn and see how we can address it.

1. Lua uses too much memory


The first step towards plugging leaks and reducing memory use is to find out where the memory is going. Once we know that, the problems are usually quite easy to fix.

So how do we find out where the memory is going? One way would be to add tracing code to the lua_Alloc() function, but actually there is a much simpler method that doesn't require any C code and is more in line with Lua's dynamic nature. We can just use Lua to count all the objects in the runtime image:

function count_all(f)
	local seen = {}
	local count_table
	count_table = function(t)
		if seen[t] then return end
		f(t)
		seen[t] = true
		for k,v in pairs(t) do
			if type(v) == "table" then
				count_table(v)
			elseif type(v) == "userdata" then
				f(v)
			end
		end
	end
	count_table(_G)
end

Here we just start with the global table _G and recursively enumerate all subtables and userdata. For each object that we haven't seen before, we call the enumeration function f. This will enumerate all the objects in the Lua runtime that can be reached from _G. Depending on how you use Lua you may also want to add some code for enumerating objects stored in the registry, and recurse over metatables and function upvalues to make sure that you really count all the objects in the runtime.

Once you have a function for enumerating all your Lua objects, there are lots of useful things you can do. When it comes to plugging leaks and reducing memory usage I find one of the most useful things is to count the number of objects of each type:

function type_count()
	local counts = {}
	local enumerate = function (o)
		local t = type_name(o)
		counts[t] = (counts[t] or 0) + 1
	end
	count_all(enumerate)
	return counts
end

Here type_name() is a function that returns the name of an object's type. This function will depend on what kind of class/object system you use in your Lua runtime. One common approach is to have global class objects that also act as metatables for objects:

-- A class
Car = {}
Car.__index = Car

-- A method
function Car.honk(self)
	print "toot"
end

-- An object
local my_car = {}
setmetatable(my_car, Car)

In this case, the type_name() function could look something like this:

global_type_table = nil
function type_name(o)
	if global_type_table == nil then
		global_type_table = {}
		for k,v in pairs(_G) do
			global_type_table[v] = k
		end
		global_type_table[0] = "table"
	end
	return global_type_table[getmetatable(o) or 0] or "Unknown"
end

The object count usually gives you a good idea of where your memory problems lie. For example, if the number of AiPathNode objects constantly rises, you can conclude that you are somehow leaking those objects. If you have 200 000 GridCell objects you should write a smarter grid implementation.

You can also use this enumeration technique to pinpoint problems further if necessary. For example, if you are hunting for leaks, you can rewrite the count_all() function so that it keeps track of the sub keys where an object were found. In this way, you might see that the AiPathNode objects can be accessed through paths like:

_G.managers.ai_managers.active_paths[2027]

Then you know that the source of the leak is that paths never get removed from the active_paths table.

2. Garbage collection is taking too long


Garbage collection is a very cache unfriendly task that can have a significant performance impact. This is especially frustrating since garbage collection doesn't really do anything. Well, it lets your gameplay programmers work faster and with fewer bugs, but when you have reached the optimization phase you tend to forget about that and just swear at the slow collector.

Lua's default garbage collection scheme is not adapted for realtime software and if you just run it straight up you will get lots of disturbing frame rate hitches. As has already been mentioned in previous #AltDevBlogADay articles, it is better to use a step size of 0 and just run the garbage collector for a certain number of milliseconds every frame:

OpaqueTimeValue start = time();
while (milliseconds_elapsed_since(start) < milliseconds_to_run)
	lua_gc(L, LUA_GCSTEP, 0);

Note that you can run this garbage collection on any thread, as long as Lua is not running at the same time, so you might be able to offset some of the cost by running the garbage collection on a background thread while your main thread is doing something non-Lua related.

How much time should you spend on garbage collection? A tricky question. If you spend too little, the garbage will grow and you will eventually run out of memory. If you spend too much, you are wasting precious milliseconds.

My preferred solution is to use a feedback mechanism. I dynamically adjust the garbage collection time so that the amount of garbage always stays below 10 % of the total Lua memory. If the garbage goes above that, I increase the collection time. If the garbage goes below, I decrease the collection time. As with all feedback mechanisms is a good idea to plot the curves for memory use and garbage collection time as you tweak the feedback parameters. That way you can verify that the system behaves nicely and that the curves settle down in a stable state rather than going into oscillation.

Choosing the figure 10 % is a balance between memory use and performance. If you choose a higher value, your program will use more memory (because of the increased amount of garbage). On the other hand, you can give the garbage collection a smaller time slice. I've chosen a pretty low number, because on consoles, memory is always precious. If you are targeting a platform with more memory, you can go higher.

Let's compute how much time we need to spend on garbage collection to stay below a certain fraction 0 <= a <= 1 of garbage. Assume that we complete a full garbage collection cycle (scan all Lua memory) in time t. The amount of garbage generated in that time will be:

t g

Where g is the garbage/s created by the program. To make sure that we stay below a fraction a we must have (where m is the total memory used by the program, including the garbage):

t g <= a m

Assume that we sweep s bytes/s. Then the time t required to sweep the entire memory m will be:

t = m / s

Combining the two equations we get:

s <= g / a

So the amount of garbage collection work we need to do per frame is directly proportional to the amount of garbage / s generated by the program and inversely proportional to the fraction of garbage we are willing to accept. (Note that interestingly, m cancels out of the equation.)

So, if we are willing to spend more memory, we can address garbage collection problems by increasing a. But since a can never be higher than 1, there are limits to what we can achieve in this way. A better option, that doesn't cost any memory, is to reduce g -- the amount of garbage generated.

In my experience, most garbage generation problems are "easy mistakes" from sloppy and thoughtless programming. Once you know where the problems are, it is usually not hard to rewrite the code so that garbage generation is avoided. Some useful refactoring techniques are:

  • Update the fields in an existing table instead of creating a new one.
  • Return a reference to an object member rather than a copy. Copy only when needed.
  • Write functions so that they take and return values rather than tables to avoid temporary tables. I. e., make_point(2,3) rather than make_point({2,3}).
  • If you need temporary objects, find a way of reusing them so you don't need to create so many of them.
  • Avoid excessive string concatenation.

Of course a key requirement for this to work is that your Lua-to-C bindings are written so that they don't generate garbage. Otherwise your poor gameplay programmer has no chance. In my opinion, it should be possible to call any C function in a "garbage free" way (though you may choose to also have a more convenient path that does generate garbage). For tips on how to write garbage free bindings, see my previous posts on Lightweight Lua Bindings.

To reduce garbage generation, you need to be able to pinpoint where in the program the garbage is being generated. Luckily, that is not difficult.

Once the game has reached a stable state (total Lua memory doesn't grow or shrink) any allocation made can be considered garbage, because it will soon be freed again (otherwise the Lua memory would keep growing). So to find the garbage all you have to do is to add some tracing code to lua_Alloc that you can trigger when you have reached a stable state.

You can use lua_getstack() to get the current Lua stack trace from inside lua_Alloc and use a HashMap to count the number of allocations associated with each stack trace. If you then sort this data by the number of allocations it is easy to identify the "hotspots" that are generating the most garbage. A gameplay programmer can go through this list and reduce the amount of garbage generation using the tips above.

The code may look something like this:

struct TraceEntry {
	TraceEntry() : alloc_count(0), alloc_bytes(0) {}
	String trace;
	unsigned alloc_count;
	unsigned alloc_bytes;
};
HashMap<uint64, TraceEntry> _traces;

if (_tracing_allocs) {
	lua_Debug stack[5] = {0};
	int count = lua_debugger::stack_dump(L, stack, 5);
	uint64 hash = murmur_hash_64(&stack[0], sizeof(lua_Debug)*count);
	TraceEntry &te = _traces[hash];
	te.alloc_count += 1;
	te.alloc_bytes += (new_size - old_size);
	if (te.trace.empty())
		lua_debugger::stack_dump_to_string(L, te.trace);
}

In my experience, spending a few hours on fixing the worst hot spots indicated by the trace can reduce the garbage collection time by an order of magnitude.