Monday, June 22, 2015

Allocation Adventures 2: Arrays of Arrays

Last week's post ended with a puzzle: How can we allocate an array of dynamically growing and shrinking things in an efficient and data-oriented way? I.e. using contiguous memory buffers and as few allocations as possible.

The example in that post was kind of complicated, and I don't want to get lost in the details, so let's look at a simpler version of the same fundamental problem.

Suppose we want to create a TagComponent that allows us to store a number of unsigned tags for an entity.

These tags will be hashes of strings such as "player", "enemy", "consumable", "container", etc and the TagComponent will have some sort of efficient lookup structure that allows us to quickly find all entities with a particular tag.

But to keep things simple, let's ignore that for now. For now we will just consider how to store these lists of tags for all our entities. I.e. we want to find an alternative to:

std::vector< std::vector<unsigned> > data;

that doesn't store every list in a separate memory allocation.

Fixed size

If we can get away with it, we can get rid of the "array of arrays" by setting a hard limit on the number of items we can store per entity. In that case, the data structure becomes simply:

enum {MAX_TAGS = 8};
struct Tags
{
    unsigned n;
    unsigned tags[MAX_TAGS];
};
Array<Tags> data;

Now all the data is contained in a single buffer, the data buffer for Array<Tags>.

Sometimes the hard limit is inherent in the problem itself. For example, in a 2D grid a cell can have at most four neighbors.

Sometimes the limit is a widely accepted compromise between cost and quality. For example, when skinning meshes it is usually considered ok to limit the number of bone influences per vertex to four.

Sometimes there is no sensible limit inherent to the problem itself, but for the particular project that we are working on we can agree to a limit and then design the game with that limit in mind. For example we may know that there will never be more than two players, never more than three lights affecting an object, never more than four tags needed for an entity, etc.

This of course requires that we are writing, or at least configuring, the engine for a particular project. If we are writing a general engine to be used for a lot of games it is hard to set such limits without artificially constraining what those games will be able to do.

Also, since the fixed size must be set to the maximum array size, every entity that uses fewer entries than the maximum will waste some space. If we need a high maximum this can be a significant problem and it might make sense to go with a dynamic solution even though there is an upper limit.

So while the fixed size approach can be good in some circumstances, it doesn't work in every situation.

Linked list

Instead of using arrays, we can put the tags for a particular entity in a linked list:

struct Tag
{
    unsigned tag;
    Tag *next;
};
Array<Tag *> data;

Using a linked list may seem like a very bad choice at first. A linked list can give us a cache miss for every next pointer we follow. This would give us even worse performance than we would get with vector < vector < unsigned > >.

But the nodes in the linked list do not necessarily have to be allocated individually on the heap. We can do something similar to what we did in the last post: allocate the nodes in a buffer and refer to them using offsets rather than pointers:

struct Node
{
    unsigned tag;
    unsigned next;
};
Array<Node> nodes;

With this approach we only have a single allocation -- the buffer for the array that contains all the tag nodes -- and we can follow the indexes in the next field to walk the list.

Side note: Previously I have always used UINT_MAX to mark a nil value for an unsigned. So in the struct above, I would have used UINT_MAX for the next value to indicate the end of the list. But recently, I've switched to using 0 instead. I think it is nice to be able to memset() a buffer to 0 to reset all values. I think it is nice that I can just use if (next) to check if the value is valid. It is also nice that the invalid value will continue to be 0 even if I later decide to change the type to int or uint16_t. It does mean that I can't use the nodes[0] entry, since that is reserved for the nil value, but I think the increased simplicity is worth it.
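To make this concrete, here is a minimal sketch of walking one entity's tag list with this layout, using the 0-as-nil convention. The per-entity first index and the process_tag() function are hypothetical:

// Walk the tag list for one entity. `first` is the index of the entity's
// first node (a hypothetical per-entity field) and 0 means "no more nodes".
void walk_tags(const Array<Node> &nodes, unsigned first)
{
    for (unsigned i = first; i; i = nodes[i].next)
        process_tag(nodes[i].tag);
}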

Using a single buffer rather than separate allocations gives us much better cache locality, but the next references can still jump around randomly in that buffer. So we can still get cache misses. If the buffer is large, this can be as bad as using freely allocated nodes.

Another thing to note is that we are wasting a significant amount of memory. Only half of the memory is used for storing tags, the rest of it is wasted on the next pointers.

We can try to address both these problems by making the nodes a little bigger:

enum {MAX_TAGS_PER_NODE = 8};
struct Node
{
    unsigned n;
    unsigned tags[MAX_TAGS_PER_NODE];
    unsigned next;
};
Array<Node> nodes;

This is just as before, except we have more than one tag per node. This gives better cache performance because we can now process eight tags at a time before we have to follow a next pointer and jump to a different memory location. Memory use can also be better. If the nodes are full, we are using 80 % of the memory for actual tags, rather than 50 % as we had before.

However, if the nodes are not full we could be wasting even more memory than before. If entities have three tags on average, then we are only using 30 % of the memory to store tags.

We can balance cache performance and memory use by changing MAX_TAGS_PER_NODE. Increasing it gives better cache coherence, because we can process more tags before we need to jump to a different memory location. However, increasing it also means more wasted memory. It is probably good to set the size so that "most entities" fit into a single node, but a few special ones (players and enemies maybe) need more.

One interesting thing to note about the cache misses is that we can get rid of them by sorting the nodes. If we sort them so that the nodes in the same next chain always appear directly after one another in the array, then walking the list will access the data linearly in memory, just as if we were accessing an array:

--------------------------------------------------
|  A1 --|--> A2 --|--> A3 |  B  |  C1 --|--> C2  |
--------------------------------------------------

Note that a complete ordering is not required, it is enough if the linked nodes end up together. Single nodes, such as the B node above, could go anywhere.

Since these are dynamic lists where items will be added and removed all the time, we can't really do a full O(n log n) sort every time something changes. That would be too expensive. But we could sort the list "incrementally". Every time the list is accessed, we do a little bit of sorting work. As long as the rate of mutation is low compared to the rate of access, which you would expect in most circumstances, our sorting should be able to keep up with the mutations and keep the list "mostly sorted".

You would need a sorting algorithm that can be run incrementally and that works well with already sorted data. Two-way bubble sort perhaps? I haven't thought too deeply about this, because I haven't implemented this method in practice.

Custom memory allocator

Another option is to write a custom memory allocator to divide the bigger buffer up into smaller parts for memory allocations.

You might think that this is a much too complex solution, but a custom memory allocator doesn't necessarily need to be a complex thing. In fact, both the fixed size and linked list approaches described above could be said to be using a very simple kind of custom memory allocator: one that just allocates fixed blocks from an array. Such an allocator does not need many lines of code.
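For illustration, here is a minimal sketch of such an allocator, with made-up sizes. It hands out fixed-size blocks from a static buffer and threads a free list through the blocks themselves, following the convention above of reserving index 0 as the nil value:

// Sketch of a fixed-block allocator. Freed blocks are linked together in a
// free list that reuses the blocks' own memory to store the next index.
enum {BLOCK_SIZE = 64, NUM_BLOCKS = 1024};
struct BlockAllocator
{
    char buffer[BLOCK_SIZE * NUM_BLOCKS];
    unsigned next_unused;   // first never-used block, starts at 1 (0 is nil)
    unsigned free_list;     // head of the free list, 0 means empty
};

void *allocate_block(BlockAllocator &a)
{
    if (a.free_list) {
        char *p = a.buffer + a.free_list * BLOCK_SIZE;
        a.free_list = *(unsigned *)p;       // pop the free list
        return p;
    }
    if (a.next_unused < NUM_BLOCKS)
        return a.buffer + a.next_unused++ * BLOCK_SIZE;
    return nullptr;                         // out of blocks
}

void free_block(BlockAllocator &a, void *p)
{
    unsigned i = (unsigned)(((char *)p - a.buffer) / BLOCK_SIZE);
    *(unsigned *)p = a.free_list;           // push onto the free list
    a.free_list = i;
}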

Another criticism against this approach is that if we are writing our own custom memory allocator, aren't we just duplicating the work that malloc() or new already does? What's the point of first complaining a lot about how problematic the use of malloc() can be and then going on to write our very own (and probably worse) implementation of malloc()?

The answer is that malloc() is a generic allocator that has to do well in a lot of different situations. If we have more detailed knowledge of how the allocator is used, we can write an allocator that is both simpler and performs better. For example, as seen above, when we know the allocations are fixed size we can make a very fast and simple allocator. System software typically uses such allocators (check out the slab allocator for instance) rather than relying on malloc().

In addition, we also get the benefit that I talked about in the previous post. Having all of a system's allocations in a single place (rather than mixed up with all other malloc() allocations) makes it much easier to reason about them and optimize them.

As I said above, the key to making something better than malloc() is to make use of the specific knowledge we have about the allocation patterns of our system. So what is special about our vector < vector < unsigned > > case?

1. There are no external pointers to the data.

All the pointers are managed by the TagComponent itself and never visible outside that component.

This means that we can "move around" memory blocks as we like, as long as the TagComponent keeps track of and updates its data structures with the new locations. So we don't have to worry (that much) about fragmentation, because when we need to, we can always move things around in order to defrag the memory.

I'm sure you can build something interesting based on that, but I actually want to explore another property:

2. Memory use always grows by a factor of two.

If you look at the implementation of std::vector or a similar class (since STL code tends to be pretty unreadable) you will see that the memory allocated always grows by a factor of two. (Some implementations may use 1.5 or something else, but usually it is 2. The exact figure doesn't matter that much.)

The vector class keeps track of two counters:

  • size which stores the number of items in the vector and
  • capacity which stores how many items the vector has room for, i.e. how much memory has been allocated.

If you try to push an item when size == capacity, more memory is needed. So what typically happens is that the vector allocates twice as much memory as was previously used (capacity *= 2) and then you can continue to push items.
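Sketched in code (this is the idea, not the actual STL implementation):

// Sketch of the doubling growth strategy. The realloc() (which moves all
// existing items) is the O(n) part; it only happens when size == capacity.
// `data` is assumed to start out as nullptr.
struct UnsignedVector
{
    int capacity;
    int size;
    unsigned *data;
};

void push_back(UnsignedVector &v, unsigned item)
{
    if (v.size == v.capacity) {
        v.capacity = v.capacity ? 2 * v.capacity : 1;
        v.data = (unsigned *)realloc(v.data, v.capacity * sizeof(unsigned));
    }
    v.data[v.size++] = item;
}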

This post is already getting pretty long, but if you haven't thought about it before you may wonder why the vector grows like this. Why doesn't it grow by one item at a time, or perhaps 16 items at a time?

The reason is that we want push_back() to be a cheap operation -- O(1) using computational complexity notation. When we reallocate the vector buffer, we have to move all the existing elements from the old place to the new place. This will take O(n) time. Here, n is the number of elements in the vector.

If we allocate one item at a time, then we need to allocate every time we push, and since reallocation takes O(n) that means push will also take O(n). Not good.

If we allocate 16 items at a time, then we need to allocate every 16th time we push, which means that push on average takes O(n)/16, which by the great laws of O(n) notation is still O(n). Oops!

But if we allocate 2*n items when we allocate, then we only need to reallocate after we have pushed n more items, which means that push on average takes O(n)/n. And O(n)/n is O(1), which is exactly what we wanted.

Note that it is just on average that push is O(1). Every n pushes, you will encounter a push that takes O(n) time. For this reason, push is said to run in amortized constant time. If you have really big vectors, that can cause an unacceptable hitch and in that case you may want to use something other than a vector to store the data.

Anyways, back to our regular programming.

The fact that our data (and indeed, any kind of dynamic data that uses the vector storage model) grows by powers of two is actually really interesting. Because it just so happens that there is an allocator that is very good at allocating blocks at sizes that are powers of two. It is called the buddy allocator and we will take a deeper look at it in the next post.

Friday, June 12, 2015

Allocation Adventures 1: The DataComponent

When sketching out a new low level system I always start with the data layout, because that's the most important part. And I do that with two main goals:

  1. Make sure that memory is laid out and accessed linearly.
  2. Minimize the number of individual memory allocations.

The importance of the first goal should be obvious to anybody who has been following the data-oriented design movement. Modern CPUs are often memory bound. Writing cache friendly code is paramount. Etc.

The advantages of the second goal are less obvious.

To some extent, it goes hand-in-hand with the first. For the cache to work efficiently you need to avoid pointer chasing, which naturally leads to fewer allocations. But there are other advantages as well.

Data with fewer individual allocations puts less pressure on the memory system. It gives you less fragmentation, more control over memory usage, allocation patterns that are easier to profile and optimize, etc.

It also makes the data easier to move and copy, since it requires less pointer patching, which lets you do lots of cool tricks with it. In general, it's just neater.

For static data, such as resources loaded from disk, both these goals can be easily achieved, by using the blob method.

For dynamic data, things get trickier. If you just use STL as you are told to, you end up with scary stuff like std::map< std::string, std::vector<std::string> > where the cache and memory usage is all over the place.

So let's try and see how we can escape that monster's lair by looking at a component and improving it step-by-step until we get to something nicer.

Introducing the DataComponent

The DataComponent is a part of our entity system that is used to store small amounts of arbitrary dynamic data for an entity. For example, it can be used to store a character sheet:

name = "The One"
stats = {
    health = 100
    mana = 200
}
status_effects = {
    drunk = true
    delirious = true
}

As data format we use a restricted form of JSON where the only allowed values are:

  • booleans
  • floats
  • strings
  • objects
  • arrays of numbers

Note that arbitrary arrays are not allowed. Arrays can only be lists of numbers, i.e. [0 0.5 0.5 0.7]. If you want to have sets of other items, you need to use an object with unique values (possibly GUIDs) as the keys.

The reason for these restrictions is that we want to make sure that all operations on the data merge in simple and well-defined ways. This makes it easier to use the data in collaborative workflows.

Strings and float arrays are regarded as monolithic with respect to merging. This means that all data operations can be reduced to setting object keys, which makes them easy to reason about.

We assume that the amount of data stored for each entity will be small, but that lots of entities will have data and that it is important to keep memory use low and performance high.

The naive approach

Representing a tree structure that can include both strings and arrays is pretty complicated and if we just start to sketch the structure using our regular STL tools it quickly becomes pretty messy:

enum class DataType {BOOL, FLOAT, STRING, OBJECT, ARRAY};
struct DataValue
{
    DataType type;
    union {
        bool b;
        float f;
        std::string *s;
        std::vector<float> *array;
        std::map<std::string, DataValue> *object;
    };
};

That doesn't look very fun at all. There are tons of allocations everywhere, both explicit (std::string *) and implicit (inside string, vector, and map). So let's start working.

1: Hash the keys

If we consider how we will use this data structure, we will always be setting and getting the values corresponding to certain keys. We don't have a pressing need to extract the key strings themselves. This means we can use hashed strings instead of the strings themselves for lookup.

Since we assume that the number of keys will be small, it is probably enough to use an unsigned rather than a uint64_t for the hash value:

std::map<unsigned, DataValue> *object;

Great start. That's a lot of allocations already gone and as a bonus we also got rid of a lot of string comparisons.
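To make the hashing concrete, here is an example 32-bit string hash. The post doesn't say which hash function the engine actually uses; FNV-1a is just for illustration:

// Example hash: 32-bit FNV-1a. We then look up values with
// object->find(hash_string("health")) instead of comparing strings.
unsigned hash_string(const char *s)
{
    unsigned h = 2166136261u;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}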

2: Flatten the structure

If we also sacrifice the ability to enumerate all the (hashed) keys in a particular object we can go even further and flatten the entire structure.

That is, instead of representing the data as:

name = "The One"
stats = {
    health = 100
    mana = 200
}
status_effects = {
    drunk = true
    delirious = true
}

We will use:

name = "The One"
stats.health = 100
stats.mana = 200
status_effects.drunk = true
status_effects.delirious = true

Just as before we represent these keys as hashed values, so "stats.health" is represented by an unsigned containing the hashed value of "stats.health".

(Note: we won't actually hash the string "stats.health" directly, because that could lead to problems if some of our keys contained a "." character. Instead we hash each sub key separately and then hash them all together.)
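A sketch of that combining step, reusing hash_string() from above. The exact mixing function is an assumption, any reasonable hash combiner works:

// Combine a parent key hash with a sub key hash, so that a "." inside a
// key cannot collide with a real sub key boundary.
unsigned combine(unsigned parent, unsigned child)
{
    return parent ^ (child + 0x9e3779b9u + (parent << 6) + (parent >> 2));
}

// "stats.health" is then represented as:
unsigned key = combine(hash_string("stats"), hash_string("health"));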

With this approach, the user will still be able to look up a particular piece of data, such as stats.health, but she won't be able to enumerate all the keys and values under stats, since the hierarchy information is lost. But that's ok.

Getting rid of the tree structure allows us to store our data in a flat array:

enum class DataType {BOOL, FLOAT, STRING, OBJECT, ARRAY};
struct Entry
{
    unsigned key;
    DataType type;
    union {
        bool b;
        float f;
        std::string *s;
        std::vector<float> *array;
    };
};
std::vector<Entry> data;

Note that we no longer have a lookup hierarchy for the entries. To find a particular key we have to linearly search std::vector<Entry> for a match. Luckily, linear search is the best thing caches know and for reasonable sizes, the search will run faster than a std::map lookup.

We could sort the entries and do a binary search, but since the number of entries will be small, it is probably not even worth bothering.
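For completeness, the lookup is just a linear scan (a sketch):

// Find the index of the entry with the given (hashed) key, or -1 if the
// key does not exist. For small entry counts this beats a std::map lookup.
int find(const std::vector<Entry> &data, unsigned key)
{
    for (int i = 0; i < (int)data.size(); ++i)
        if (data[i].key == key)
            return i;
    return -1;
}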

3: Rewrite using a structure-of-arrays approach

One thing that does slow down the search for a particular key is all the additional data we need to load into the cache to perform the search. As we scan the vector for a particular key, the cache will also be filled by the type and value data.

We can solve this with the standard technique of rewriting our "array of structures" as a "structure of arrays". I.e., we break out each field of the structure into its own array:

enum class DataType {BOOL, FLOAT, STRING, OBJECT, ARRAY};
struct Value
{
    union {
        bool b;
        float f;
        std::string *s;
        std::vector<float> *array;
    };
};
std::vector<unsigned> keys;
std::vector<DataType> types;
std::vector<Value> values;

Now when searching for a particular key, we only need to load the keys table which means the cache only contains useful data.

4: Co-allocate the arrays

Not too shabby, but we still have three separate allocations for the three vectors. To get rid of that we have to stop using these fancy std::vectors and go back to the basics, and in terms of basics, a std::vector<unsigned> is really just:

struct {
    int capacity;
    int size;
    unsigned *data;
};

So we can represent our three vectors as:

int capacity;
int size;
unsigned *keys;
DataType *types;
Value *values;

Note that as a bonus, we can now share the capacity and the size fields between the vectors, since they all have the same values and change together.

The arrays are also all reallocated at the same time. We can make use of this and instead of doing three separate allocations, we just allocate one big buffer and lay them out sequentially in that buffer:

       -----------------------------------
buffer |   keys   |   types   |  values  |
       -----------------------------------

Which in code would be something like:

char *buffer = allocate(capacity * (sizeof(unsigned) + sizeof(DataType) + sizeof(Value)));
keys = (unsigned *)buffer;
types = (DataType *)(keys + capacity);
values = (Value *)(types + capacity);

Presto, now we have all the header data stored in a single buffer.
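When the buffer fills up, we grow it by reallocating and copying each array into its new position. A sketch, where allocate() and deallocate() stand in for whatever allocator is used:

// Grow the co-allocated buffer. Each array must be copied separately,
// since the arrays' positions inside the buffer depend on the capacity.
void grow(int new_capacity)
{
    char *new_buffer = allocate(new_capacity *
        (sizeof(unsigned) + sizeof(DataType) + sizeof(Value)));
    unsigned *new_keys = (unsigned *)new_buffer;
    DataType *new_types = (DataType *)(new_keys + new_capacity);
    Value *new_values = (Value *)(new_types + new_capacity);
    memcpy(new_keys, keys, size * sizeof(unsigned));
    memcpy(new_types, types, size * sizeof(DataType));
    memcpy(new_values, values, size * sizeof(Value));
    deallocate(buffer);
    buffer = new_buffer;
    keys = new_keys;
    types = new_types;
    values = new_values;
    capacity = new_capacity;
}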

5: Get rid of STL

We still have these pointers to deal with in the values union:

std::string *s;
std::vector<float> *array;

Let's start by getting rid of those STL types. At this point they are no longer helping. STL is based around individual allocations which we don't want. We also have two levels of indirection there. First, the pointer to the std::vector, and then the pointer to the actual data inside the std::vector type.

So, we just replace them with their non-STL equivalents:

struct {
    char *data;
} s;
struct {
    int capacity;
    int size;
    float *data;
} array;

This actually bloats the size of the Value union from 64 to 128 bits, which is less than stellar, but have faith, we are going in the right direction.

6: Put the value data in a buffer

We can now repeat the same trick that we did with the header data and put all the strings and float arrays in a big single buffer:

-------------------------------------------------------------
| "hello" | [0 1 3] | "mana" | [0 2] | ... unused space ... |
-------------------------------------------------------------

To allocate some data we just keep track of this buffer and how much of it is currently in use:

struct value_buffer
{
    char *p;
    unsigned capacity;
    unsigned size;
};

void *allocate_memory(value_buffer &vb, unsigned size)
{
    if (vb.size + size > vb.capacity)
        return nullptr;
    auto res = vb.p + vb.size;
    vb.size += size;
    return res;
}

There are two things we need to take care of here. First, we may run out of space in this buffer as we add more data. Second, the individual values we store may change size and no longer fit in their designated place.

The first issue is no big problem. If we run out of space, we can just allocate a bigger buffer as std::vector does.

The second issue is no biggie either. If an item needs to grow, we just allocate a new bigger slot for it from the buffer. This will leave a hole where the item was originally located:

-------------------------------------------------------------
| ....... | [0 1 3] | "mana" | [0 2] | "hello more" | ..... |
-------------------------------------------------------------

These "holes" will cause fragmentation of the buffer and eventually make it run out of space. If this was a general purpose memory allocator that would be a serious issue that we would have to be seriously worried about.

But in our case it doesn't really matter. The number of items is small and we control the pointers to them, so we can easily move them around and defragment the buffer when necessary. But in fact, we don't even have to do that. If we get fragmentation we will eventually run out of space. That will reallocate the buffer, which will also defragment it. This is enough for our purposes.
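A sketch of that reallocation, using the struct-of-arrays layout from step 3 with the pointer-based values from step 5. Since the live values are copied over one by one, any holes are squeezed out as a side effect (allocate() and deallocate() are stand-ins again):

// Grow the value buffer and compact it in the process.
void grow_value_buffer(value_buffer &vb, unsigned new_capacity)
{
    value_buffer nvb = {(char *)allocate(new_capacity), new_capacity, 0};
    for (int i = 0; i < size; ++i) {
        if (types[i] == DataType::STRING) {
            unsigned n = (unsigned)strlen(values[i].s.data) + 1;
            char *p = (char *)allocate_memory(nvb, n);
            memcpy(p, values[i].s.data, n);
            values[i].s.data = p;
        } else if (types[i] == DataType::ARRAY) {
            unsigned n = (unsigned)(values[i].array.size * sizeof(float));
            float *p = (float *)allocate_memory(nvb, n);
            memcpy(p, values[i].array.data, n);
            values[i].array.data = p;
            values[i].array.capacity = values[i].array.size;
        }
    }
    deallocate(vb.p);
    vb = nvb;
}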

7: Slim down the Value struct

Since all our value data are now allocated in a single buffer we can use the buffer offset rather than the pointer to refer to the data. This saves us memory (from 64 bits to 32 bits) and as an added bonus, it also allows us to relocate the buffer without having to do pointer patching.

For these small snippets of data 64 K is plenty, which means we can fit both the offset and the size in 32 bits:

struct Data {
    uint16_t offset;
    uint16_t size;
};

(If you need more memory, you could use a full uint32_t for the offset and store the size in the buffer, together with the data.)

So now we have:

struct Value
{
    union {
        bool b;
        float f;
        Data data;
    };
};

where the data field is used for both strings and arrays. This means our Value structure is now just 32 bits, half the size of the STL version.
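Resolving a Data reference is then just an offset from the start of the value buffer (a sketch):

// Turn an offset-based Data reference back into a pointer.
const char *resolve(const value_buffer &vb, Data d)
{
    return vb.p + d.offset;
}

// E.g., reading a string value:
const char *s = resolve(vb, values[i].data);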

8: Merge the final two allocations

Now we are down to just two buffers: one for the header (keys and types), and one for the values (strings and arrays). Wouldn't it be great if we could merge them to use just a single allocation?

Let's try our old trick and put them together in the same buffer:

--------------------------------------------------------
| Header      | Values     | ....... free space ...... |
--------------------------------------------------------

This doesn't really work, because now the Header part can't grow unless we move the values out of the way.

We could give them each a little bit of free space:

----------------------------------------------------------
| Header      | ... free ... | Values     | ... free ... |
----------------------------------------------------------

Now they both have some room to grow, but we are not utilizing the free space to its maximum potential. The Values section may run out of free space before the Header, forcing us to reallocate the buffer even though we actually have some free space left in it. That's a bummer.

But what if we do this:

----------------------------------------------------------
| Header      | ........ free space ........ | Values     |
----------------------------------------------------------

Instead of allocating the values bottom up, we allocate them top down in the buffer. Now the headers and the values can share the same free space and we don't have to reallocate the buffer until they meet in the middle. Nice!
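Top-down allocation is a small twist on the allocate_memory() function from before. A sketch, where size now counts the bytes used at the end of the buffer and the hypothetical header_size parameter is how far the header data currently extends from the bottom:

// Allocate value data top down. We fail only when the values would
// collide with the headers growing from the bottom.
void *allocate_value(value_buffer &vb, unsigned header_size, unsigned size)
{
    if (header_size + vb.size + size > vb.capacity)
        return nullptr;   // headers and values have met in the middle
    vb.size += size;
    return vb.p + vb.capacity - vb.size;
}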

Benefits of single buffer structures

Stripping everything down to a single buffer gives us many of the same benefits that the blob approach gives us for static data.

Since the data is stored in a single buffer and doesn't contain any pointers it is fully relocatable. We can move it around in memory as we please and make copies of it with a simple memcpy().

If we want to save the data to disk, as a static resource or as part of a saved game, we don't need any special serialization code. We can just write and read the data directly.

The pointer arithmetic that is needed to manipulate data in this way may seem complex at first, if you are not used to thinking about pointer operations on raw memory. But I would argue that once you get used to it, this kind of representation is in many ways simpler than the multi-level std::vector, std::map based approach that we started with. And in that simplicity lies a lot of power.

But wait, I have been cheating!

Yes, to be honest I have been cheating all this time.

I have ignored the old adage of data-oriented programming:

Where there is one, there are many.

All this time I have been talking about how one DataComponent can store its data in a single buffer. But in our real system we will have many DataComponents. We might have 1000 entities, each with a DataComponent that stores a single health float. Even if each component uses a contiguous buffer, that's still a lot of individual memory allocations.

Wouldn't it be nice if we could somehow bundle them together too?

But that represents a trickier challenge. We need to find a way to allocate a lot of individual objects in a single buffer in such a way that each individual object can grow and shrink independently.

And that will be the topic for another post.

Thursday, June 11, 2015

Upgrading the DirectX SDK

When I joined Bitsquid a month ago, someone mentioned they wanted to upgrade the DirectX SDK to get some improvements, but that there was a dependency in the way. I was foolish, and volunteered to investigate. Over the past ten days or so I have untangled the whole mess, leading to a successful upgrade. I now want to share my findings so the next unfortunate soul can save some time.

Step 1: Explore

First stop: MSDN's article. I had heard that the DirectX SDK was now included in the Windows SDK, but I wasn't sure what that covered. This article sums it up. With a teammate, we went through the whole list, figuring out what we were and were not using. In the end, the only problematic components were XInput, XAudio2, and D3DX9Mesh. The bulk of the codebase had already been converted away from using D3DX, which was great!
However, another thing needed clearing up. Our minspec is still Windows 7. How was that going to work? Luckily, MSDN had the answer again. This article reveals that the Windows 8.X SDK is available on Windows 7. This is covered in more detail on this page and that page.

Step 2: Well let's just try then

I changed the paths in our project generation files to the Windows SDK. I also added the June 2010 SDK, but only for XAudio2 and D3DX9Mesh (more on XInput further down). After fixing only a few compile errors, things seemed mostly fine... until I got a runtime crash about ID3D11ShaderReflection. Huh?

Step 3: GUIDs and the magic #define

I had wrongly assumed that the link errors I had been seeing when changing the paths were caused by DX9, because I read too fast. Linking with the old dxguid.lib made the errors go away, so I didn't think about it more. However, a large part of DirectX relies on GUIDs, unique hardcoded identifiers. When debugging, I noticed that IID_ID3D11ShaderReflection had the wrong value compared to the Windows SDK header, which was causing the crash. I went on a wild goose chase for whatever was somehow changing this value, and wasted a day looking for a wrongly included file.
But by default, those GUIDs are extern variables, and will get their values from lib files. And I was linking with an old one. Mystery solved! I removed dxguid.lib from the linker, but that of course caused the GUIDs to be undefined. The solution for that is to #define INITGUID before including windows.h. Thanks to the Ogre3D forums for pointing me towards the relevant support page, since they encountered the same issue before. At this point everything was fine, except that it was failing on the build machines.
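For reference, the INITGUID fix mentioned above amounts to this, at the top of the file before any Windows headers are pulled in:

// Define INITGUID before <windows.h> so the DirectX GUIDs are defined in
// this translation unit instead of being pulled in from dxguid.lib.
#define INITGUID
#include <windows.h>
#include <d3d11.h>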

Step 4: d3dcompiler

The first error had been around for a long time. We had so far, unknowingly, relied on the d3dcompiler DLL being present in System32! Since System32 is part of the default DLL search path, this is easy to overlook, especially when the DirectX SDK is a required install anyway. We were now relying on a more recent version, supposed to be included in the Windows SDK. Yet still it was failing... because we did not have a proper installation step. I tweaked the project files again, adding a copy step for that DLL. CI, however, was still failing.

Step 5: XInput

XInput comes in several versions in the Windows SDK. 1.4 is the most recent one as I'm writing this, and is Windows 8-only. To use XInput on Windows 7, you need to use version 9.1.0. For that, ensure that the magic _WIN32_WINNT #define is set to the proper value (see further up on the page). You also need to explicitly link with XInput9_1_0.lib and not XInput.lib, or Windows 7 will get a runtime crash trying to fetch XInput1_4.dll, which doesn't exist on Windows 7. In my case this was breaking the automated tests on a Windows 7 machine, but was completely fine on my Windows 8 workstation.

Step 6: Profit?

As far as I can tell this should be the end of it, but the rendering team has yet to stress-test it. We'll see what breaks as they poke around :)
Hopefully this can save you some time if you're doing a similar upgrade, or convince you to give it a try if you've been holding back.

[Cross-posted from personal blog]

Tuesday, March 10, 2015

Multithreaded Gameplay

I've written before about multithreading gameplay, but since I didn't really come to any conclusion I think it is time to revisit the topic.

As the number of processors and cores in consumer hardware keeps increasing, Amdahl's law tells us that any single-threaded part of the code will have a bigger and bigger effect on the performance. (As long as you are not memory bound, so keep those caches happy, ok?)

So even if single-threaded gameplay isn't a problem today it soon will be. At some point the elephant will outgrow the living room.

There are (at least) three problems with multithreading gameplay code as we know it:

  1. Writing and reading multithreaded code is much harder than single threaded code, especially for "messy" stuff such as gameplay code.

  2. Gameplay code tends to be more sprawling than engine code, touching all kinds of systems, which means that the standard optimisation technique of finding the hotspots and multithreading them is less likely to work.

  3. Lua, which we use as our scripting language, does not have built-in multithreading support. This might not be a problem for you. On the other hand, if you expect your gameplay programmers to write multithreaded C++ code, you probably have other problems.

If we start with the first point, I don't think it is reasonable to expect gameplay programmers to write safe and efficient multithreaded code using the standard techniques of mutexes, locks, queues, semaphores, atomic operations, etc. Especially not when writing messy gameplay code where requirements often change and experimentation and quick iterations are important. If anyone has experience of this, I'd like to know.

So I think the primary goal is to find a multithreading model that is easy to work with.

To me, the best (easiest and safest) model seems to be the Actor model used for example by Erlang and Scala.

You can read up on the actor model if you are not familiar with it. The basic idea is that processing nodes only touch their own local memory and communicate with other nodes through asynchronous message passing. Since there is no shared memory, explicit synchronization primitives are not necessary.

Luckily for us, using this model also takes care of issue #3. If the nodes don't need to share memory, we can let them live in separate Lua VMs that communicate through message passing. Typically we would spawn a separate Lua VM for each processing core in our system.

As a completely contrived example, suppose we had a bunch of numbers that we needed to factor. We could then split them up among our VMs and send each VM a message ['factor', n]. Each VM would compute its factors in parallel and send the result back to the main thread.

All fine and dandy. This would work well and give us a good performance boost. But of course, this contrived example is absolutely nothing like real gameplay code.

In real gameplay code, we don't have pockets of completely isolated but computationally intensive code that lends itself to easy parallelization. (If we do, that code should probably be moved out of the gameplay code and into the engine.)

Instead, most gameplay code interacts with the world and does things like moving a unit a little bit, casting a physics ray, adjusting the position of another unit, etc. Unless we can parallelize that kind of messy code, we won't have gained very much.

The problem here is that your engine is probably a big ball of mutable state. If it isn't, congratulations to you I guess. You have figured out something we others haven't and I look forward to your next GDC talk. But for the sake of argument, let's assume that it is.

Any interaction with this mutable state (say a script calling PhysicsWorld.raycast()) is a potential for threading issues.

We could try to make the entire script API thread-safe. For example, we could put a critical section in each API call. But that is unlikely to make anyone happy. With so many critical sections, we will probably lose whatever performance we hoped to gain from multithreading.

So we seem to be at an impasse. Gameplay code will need to interact frequently with a lot of engine APIs and making those APIs thread-safe will likely kill performance.

I've been stuck here for a while. To be honest, a couple of years. (Hey, it's not like I haven't had other stuff to do.) But in the general creative atmosphere of GDC and a discussion with some colleagues and the nice people at Pixeldiet, something shook loose.

Instead of synchronizing at each function call, what if we did it at the level of the API:

Unit = LuaThreads.lock_api("Unit", player, LockType.WRITE)
...
Unit.set_position(0, Vector3(0,0,0))
# Do other stuff with the player
...
LuaThreads.unlock_api(Unit)

In this model, the Lua VM for the threads start with a blank slate. There are no public APIs (except for safe, functional APIs that don't touch mutable state). To do anything with the engine, you must obtain a lock for a particular API.

You could argue that this is nothing but another shade of the complicated explicit multithreading model that we wanted to get rid of to begin with, but I do think there is something different here.

First, since the Lua part of the code will use the Actor model, we have eliminated all the problems with synchronizing the Lua state.

Second, since you can't use an API before locking it, there is a safety mechanism that prevents you from accidentally using multithreading the wrong way.

In this model, the main Lua thread (yes, there would still be a main Lua thread) would spawn off a number of jobs for performing a computation intensive task, such as updating a number of units. The main Lua thread would be suspended while the task was performed. (We can only avoid suspension if the main thread also uses the locking mechanism to access the APIs, but that seems too cumbersome.)

The worker Lua threads lock the APIs they need to perform their tasks, and when they have completed, control returns to the main Lua thread, which can gather the results.

Since Lua supports coroutines (aka green threads), the lock_api() function does not have to block the thread if an API is locked by someone else; we can just switch to a different coroutine.

This model is certainly not perfect. It's a bit annoying that we still have to have a main Lua thread that is "special". It's also a pity that the main Lua thread can't continue to run while the jobs are being executed, since that could allow for greater parallelism.

And it is certainly possible for the gameplay programmer to mess up. For example, it is easy to create a deadlock by acquiring the same APIs in different orders in different threads.

But still, to me this seems like the best solution, and something that would actually be worthwhile for the gameplay programmers (unlike some of my previous ideas). So I think I will start tinkering with it and see if it will fly.

Friday, October 10, 2014

Building a Data-Oriented Entity System (Part 4: Entity Resources)

In the last post, I talked about the design of the TransformComponent. Today we will look at how we can store entities as resources.

Dynamic and Static Data

I’m a huge fan of compiling resources into big blobs of binary data that can be read directly into memory and used as-is without any need for “deserialization” or “reference patching”.

This requires two things:

  • First, the data must be swapped to the right endianness when the data is generated.

  • Second, internal references in the resource must use offsets instead of pointers, since we don’t know where the loaded resource will end up in memory and we want to avoid pointer patching.

Initially, this approach can seem a bit complicated. But it is actually a lot simpler than messing around with deserialization and reference patching.

Note though, that this approach only works for static (read-only) data, such as meshes, textures, etc. If data needs to change for each instance of a resource, we must store it somewhere else. If an instance needs to change color, we can’t store that color value in a memory area that is shared with other instances.

So what typically happens is that we split out the dynamic data as “instance data” and let that data refer to the static resource data. Many instances can make use of the same resource data, thus saving memory:

---------------                -----------------
|Instance of A| --------+----> |A resource data|
---------------         |      -----------------
                        |
---------------         |
|Instance of A| --------+
---------------

---------------                -----------------
|Instance of B| --------+----> |B resource data|
---------------         |      -----------------
                        |
---------------         |
|Instance of B|---------+
---------------

We typically hard-code what goes into the instance. For example, we know that the color is something that we want to modify so we add it to the instance data. The vertex and index buffers cannot be changed and thus go into the static data. When we create the instance we initialize the instance data with “default” values from an instance template in the resource data.

You could imagine doing this another way. Instead of hard-coding the data that can be modified per instance, you could say that everything can be modified per instance. In the instance data you would then use a flexible key-value store to store the delta between the instance and the resource data.

This is more flexible than hard-coding, because it allows you to override everything per instance, even texture or vertex data if you want that. It can also save memory, because the instance data will only contain the things you actually override, not everything that potentially could be overridden. So if you have many instances that use the default values, you don’t have to store any data for them.

On the other hand, there are drawbacks to this approach. Accessing a value becomes a lot more complicated and expensive, because we always need to perform an extra query to find out if the instance has overridden the default value or not.

We currently don’t use this approach anywhere in the engine. But I think there are circumstances where it would make sense.

Anyway, I’m getting sidetracked. Back to the entity system.

The thing is, the way I envision the entity system, it is very dynamic. Components can be added and removed at runtime, child entities linked in and properties changed. Components that handle static data, such as the MeshComponent, do it by referencing a separate MeshResource that contains the mesh data. There is no mesh data stored in the component itself.

Since everything in the entity system is dynamic, there is only instance data. The only thing we have in the resource is the template for the instance data. Essentially, just a set of “instructions” for setting up an instance. There is no need for the instance to refer back to the resource after those instructions have been followed.

Defining the Resource Format

So an entity resource should contain “instructions” for setting up an entity. What should it look like? Let’s start by just writing up what needs to go in there:

struct EntityResource
{
    unsigned num_components;
    ComponentData components[num_components];
    unsigned num_children;
    EntityResource children[num_children];
};

Note: Of course the above is not legal C++ code. I’m using some kind of C-like pseudo-code that allows things like dynamically sized structs in order to describe the data layout. I’ve written about the need for a language to describe data layouts before.

The exact binary layout of the ComponentData is defined by each component type, but let’s use a common wrapper format:

struct ComponentData
{
    unsigned component_identifier;
    unsigned size;
    char data[size];
};

Now we have a common way of identifying the component type, so we know if we should create a MeshComponent, a TransformComponent or something else. We also know the size of the component data, so if we should encounter a component type that we don’t understand, we can ignore it and skip over its data to get to the next component. (Another option would be to treat unknown component types as a fatal error.)
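Walking the components is then a simple pointer walk. A sketch against the layout above, where is_known() and create_component() are placeholders for the real dispatch logic:

// Walk the ComponentData entries of an EntityResource, skipping the data
// of unknown component types. `p` starts just past the num_components
// field at the beginning of the resource.
unsigned num_components = *(const unsigned *)resource;
const char *p = (const char *)resource + sizeof(unsigned);
for (unsigned i = 0; i < num_components; ++i) {
    unsigned identifier = ((const unsigned *)p)[0];
    unsigned size = ((const unsigned *)p)[1];
    const char *data = p + 2 * sizeof(unsigned);
    if (is_known(identifier))
        create_component(identifier, data, size);
    p = data + size;   // next component, known or not
}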

A quick fix to make this layout slightly better is to move all the fixed size fields to the start of the struct:

struct EntityResource
{
    unsigned num_components;
    unsigned num_children;
    ComponentData components[num_components];
    EntityResource children[num_children];
};

Now we can access the num_children parameter without having to look at all the components and their sizes to know how far we need to skip forward in the resource to get to the num_children field.

This may or may not matter in practice. Perhaps, we only need the value of num_children after we have processed all the component data, and at that point we already have a pointer into the resource that points to the right place. But I always put the fixed size data first as a force of habit, in case we might need it.

Sometimes, it makes sense to add offset tables to these kinds of resources, so that we can quickly lookup the offset of a particular component or child, without having to walk all of the memory and count up the sizes:

struct EntityResource
{
    unsigned num_components;
    unsigned num_children;
    unsigned offset_to_component_data[num_components];
    unsigned offset_to_child_data[num_children];
    ComponentData components[num_components];
    EntityResource children[num_children];
};

With this layout, we can get to the data for the i’th component and the j’th child as:

struct EntityResourceHeader
{
    unsigned num_components;
    unsigned num_children;
};

const EntityResourceHeader *resource;
const unsigned *offset_to_component_data = (const unsigned *)(resource + 1);
const ComponentData *data_i = (const ComponentData *)
    ((const char *)resource + offset_to_component_data[i]);

const unsigned *offset_to_child_data = (const unsigned *)
    (offset_to_component_data + resource->num_components);
const EntityResourceHeader *child_j = (const EntityResourceHeader *)
    ((const char *)resource + offset_to_child_data[j]);

The first time you encounter code like this it can seriously spin your head around with all the casting and pointer arithmetic. However, if you think about what happens and how the data is laid out in memory it is really pretty straightforward. Any mistakes you make will most likely cause huge crashes that are easy to find, not sneaky subtle bugs. And after a while you get used to these kinds of manipulations.

But, anyway, I’m drifting off on a tangent again, because actually for our purposes we don’t need these lookup tables. We will just walk the memory from beginning to end, creating one component at a time. Since we don’t need to jump around between different components, we don’t need the lookup tables.

What we do need though is some way of storing more than one resource. Storing one entity is fine if we are dealing with a "prefab" type of resource that contains a definition of a single entity. However, what about a level? It will probably contain a bunch of entities. So it would be nice to have a resource type that could store all those entities.

Ok, no biggie, we know how to do that:

struct EntitiesResource
{
    unsigned num_entities;
    EntityResource entities[num_entities];
};

Done, yes? No!

Pivot!

Working for a while in this business you get an intuitive feel for when performance matters and when it doesn’t. Of course, intuitions can be wrong, so don’t forget to measure, measure, measure. But level spawning tends to be one of these areas where performance does matter.

A level can easily have 10 000 objects or more and sometimes you want to spawn them really fast, such as when the player restarts the level. So it seems worth it to spend a little bit of time to think about how we can spawn levels fast.

Looking at the resource layout, our spawn algorithm seems pretty straightforward:

  • Create the first entity
    • Add its first component
    • Add its second component
    • Create its child entities
  • Create the second entity
    • Create its first component
    • Create its second component

This is so simple and straightforward that it might seem impossible to improve on. We are walking the resource memory linearly as we step through components, so we are being cache friendly, aren't we?

Well, not really. We are violating one of the fundamental principles of data-oriented design: Do similar things together.

If we write out the operations we actually perform linearly instead of in a hierarchy and make things a bit more concrete, it's easier to see:

  • Create entity A
  • Create a TransformComponent for A
  • Create a MeshComponent for A
  • Create an ActorComponent for A
  • Create entity B
  • Create a TransformComponent for B
  • Create a MeshComponent for B

Note how we are alternating between creating different kinds of components and entities. This not only messes with our instruction cache (because each component has its own code path), but with our data cache as well (because each component has its own data structures where the instances get inserted).

So let’s rewrite this so that we keep common operations together:

  • Create entity A
  • Create entity B
  • Create a TransformComponent for A
  • Create a TransformComponent for B
  • Create a MeshComponent for A
  • Create a MeshComponent for B
  • Create an ActorComponent for A

Much better. And we can go even further.

Instead of telling the EntityManager to “create an entity” one hundred times, let’s just tell it to “create 100 entities”. That way, if there is any economy of scale to creating more than one entity, the EntityManager can take advantage of that. And let’s do the same thing for the components:

  • Create entities (A, B)
  • Create TransformComponents for (A,B)
  • Create MeshComponents for (A,B)
  • Create an ActorComponent for A

Notice how we are encountering and making use of a whole bunch of data-oriented principles and guidelines here:

  • Access memory linearly.
  • Where there is one, there are many.
  • Group similar objects and operations together.
  • Perform operations on multiple objects at once, rather than one at a time.

Let’s rewrite the data format to reflect all our newly won insight:

struct EntityResource
{
    unsigned num_entities;
    unsigned num_component_types;
    ComponentTypeData component_types[num_component_types];
};

struct ComponentTypeData
{
    unsigned component_identifier;
    unsigned num_instances;
    unsigned size;
    unsigned entity_index[num_instances];
    char instance_data[size];
};

For each component, we store an identifier so we know if it’s a MeshComponent, TransformComponent, etc. Then we store the number of instances of that component we are going to create and the size of the data for all those instances.

Note that now when we are walking the format, we can skip all instances of an unknown component type with a single jump, instead of having to ignore them one by one. This doesn’t matter that much, but it is interesting to note that data-oriented reorganizations often make a lot of different kinds of operations more efficient, not just the one you initially targeted.

The entity_index is used to associate components with entities. Suppose we create five entities: A, B, C, D and E and two ActorComponents. We need to know which entity each ActorComponent should belong to. We do that by simply storing the index of the entity in the entity_index. So if the entity index contained {2,3} the components would belong to C and D.

There is one thing we haven’t handled in the new layout: child entities.

But child entities are not conceptually different from any other entities. We can just add them to num_entities and add their component instances to the ComponentTypeData just as we would do for any other entity.

The only additional thing we need is some way of storing the parent-child relationship. We could store that as part of the data for the TransformComponent, or we could just store an array that specifies the index of each entity's parent (or UINT_MAX for root entities):

struct EntityResource
{
    unsigned num_entities;
    unsigned num_component_types;
    unsigned parent_index[num_entities];
    ComponentTypeData component_types[num_component_types];
};

If parent_index was {UINT_MAX, 0, 1, 1, 2} in our A, B, C, D, E example, the hierarchy would be:

A --- B --- C --- E
      |
      + --- D

Implementation Details

This post is too long already, so I’ll just say something quickly about how the implementation of this is organized.

In the engine we have a class EntityCompiler for compiling entities and a similar class EntitySpawner for spawning entities.

A component that can compile data needs to register itself with the entity compiler, so that it can be called when component data of that kind is encountered by the compiler.

Ignoring some of the nitty-gritty details, like error handling, endian swapping and dependency tracking, this looks something like this:

typedef Buffer (*CompileFunction)(const JsonData &config, NittyGritty &ng);

void register_component_compiler(const char *name, CompileFunction f,
    int spawn_order);

The compile function takes some JSON configuration data that describes the component and returns a binary BLOB of resource data for insertion into the entity resource. Note that the compile function operates on a single component at a time, because we are not that concerned with compile time performance.

When registering the compiler we specify a name, such as "mesh_component". If that name is found in the JSON data, the entity compiler will redirect the compile of the component data to this function. The name is also hashed into the component_identifier for the component.

The spawn_order is used to specify the compile order of the different components, and by extension, their spawn order as well. Some components make use of other components. For example, the MeshComponent wants to know where the entity is, so it looks for a TransformComponent in the entity. Thus, the TransformComponent must be created before the MeshComponent.

A similar approach is used to register a component spawner:

typedef void (*SpawnFunction)(const Entity *entity_lookup,
    unsigned num_instances, const unsigned *entity_index, const char *data);

void register_component_spawner(const char *name, SpawnFunction f);

Here the entity_lookup allows us to map an entity index in the resource data to an actual Entity that is created in the first step of spawning the resource. num_instances is the number of component instances that should be created and entity_index is the entity index from the ComponentTypeData that lets us look up which entity should own the component.

So entity_lookup[entity_index[i]] gives the Entity that should own the ith component instance.

Finally, data is a pointer to the instance_data from the ComponentTypeData.
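Putting it together, spawning walks the ComponentTypeData blocks in order and hands each one to its registered spawner. A sketch, where lookup_spawner() is a hypothetical registry lookup and `p` points at the first ComponentTypeData block:

// Spawn all component instances in a compiled entity resource.
void spawn_components(const char *p, unsigned num_component_types,
    const Entity *entity_lookup)
{
    for (unsigned t = 0; t < num_component_types; ++t) {
        unsigned identifier = ((const unsigned *)p)[0];
        unsigned num_instances = ((const unsigned *)p)[1];
        unsigned size = ((const unsigned *)p)[2];
        const unsigned *entity_index = (const unsigned *)p + 3;
        const char *instance_data = (const char *)(entity_index + num_instances);
        if (SpawnFunction f = lookup_spawner(identifier))
            f(entity_lookup, num_instances, entity_index, instance_data);
        p = instance_data + size;   // on to the next component type
    }
}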

That’s certainly enough for today. Next time, we’ll look at a concrete example of this.

Friday, October 3, 2014

Building a Data-Oriented Entity System (Part 3: The Transform Component)

In the last post, I talked generally about the design of components. Today I will focus on a specific real-world component, the TransformComponent.

The Transform Component

The purpose of the TransformComponent is to allow entities to be positioned in the world.

It handles positioning of entities in the world and child-parent linking. For example, you may want to link a “wheel” entity to a “car” entity, so that the wheel follows the car around when it moves.

In that sense, the TransformComponent is what forms the scene graph of an engine world.

Design Decisions

Should every entity have a transform component?

In some engines, every entity has to have a transform component, even if it is just a purely “logical” entity that doesn’t really have a position in the world.

To me it seems strange to force an entity to have a position when that has no real meaning. I also want entities to be as cheap as possible. So it seems better to have the transform component optional, just as any other component. An entity that doesn’t have a transform component doesn’t have any position in the world.

Actually, talking about the world is a bit of a misnomer. The Bitsquid engine does not have a single World where everything has to live. Instead you can create multiple worlds, each populated with its own objects. So you might have one world for your "main game", one world for the "inventory screen", one world for the "loading screen", etc.

This is a lot better than having an “inventory room” at some secret hidden place in the main game world.

Each world has its own TransformComponent manager, and an entity can create transform components in several of these managers, if it so desires. So the same entity can exist and be positioned at different places in different game worlds. Since a MeshComponent manager also exists in each world, the entity can have different graphical representations in each world.

This is a bit esoteric, and I don’t expect many entities to make use of this, but there are situations when it could be interesting. For example, a player’s pickaxe could exist both in the “game world” and in the “inventory world” and still be managed as the same entity.

Entity scene graphs and model scene graphs

In the entity system there are really two kinds of “scene graphs” that we need to deal with.

The first is the one we have already talked about, the graph formed by entities and their linked child entities.

The second is the graph of nodes within an entity. For example, a character entity may have a model with hundreds of bones that can be individually animated.

What should the relationship be between these two graphs?

In previous engine code, I have always treated these two graphs as parts of the same system. The model scene graphs were linked to nodes in the entity scene graphs and computed their transforms in world space. This creates an update order dependency. We can’t compute the world positions in the model scene graph until we have computed the world position in the entity scene graph. This limits what kinds of things we can do in parallel.

For the entity system I’ve decided to decouple these two concepts. The model scene graph won’t compute world space poses, instead it will compute poses relative to the entity pose. This means that we can evaluate the animations and compute the model pose without knowing anything about the entity pose. (Ignoring world space constraints, of course, but they will be handled in a later pass.)

Of course it also requires us to multiply the model node transforms with the entity transform to get the actual world position of the model nodes.

I have not completed the design of the model scene graph component yet, but maybe I’ll get a chance to return to this in a future post.

Immediate or deferred updates

In previous engines I have always used deferred updates of the world transforms. I.e., changing the local transform of a node would not immediately update its world transform (or the world transforms of its children). Instead it would simply set a “dirty” flag in the entity. Later, I would compute the world transforms of all the dirty nodes (and their children) as a single step.

This has the advantage that we never have to compute the world transform of a node more than once.

Consider the worst case scenario, a long chain of nodes:

[ node_1 ] ---> [ node_2 ] ---> [ node_3 ] ---> ... ---> [ node_n ]

With a deferred update, changing the local pose of every node will still just require O(n) computations to compute all the world transforms. With an immediate update, where we compute the world transforms of all children as soon as the parent transform changes, we will need O(n^2) computations.

On the other hand, there is a drawback to using deferred updates. Whenever we ask for an object’s world position we won’t get its actual world position, but its world position from the last frame (unless we ask after the world transform update). This can lead to a lot of confusion and subtle bugs. Solving them often requires ugly hacks, such as forcing graph updates at different times.

So what should we choose?

I think that with the decision to decouple the model scene graphs from the entity scene graphs the performance problems of immediate updates are a lot less serious. Long chains of nodes that are all moving can certainly exist in the model scene graph. (Consider an animation of a character swinging a whip.) But I would guess that long chains of objects that are all moving at once are a lot less common in the entity scene graph.

Note that the performance problems do not appear if it is just the root entity that is moving. In that case, both the immediate and the deferred update will be O(n). It is only when the parent and the children are moving that the immediate update does worse.

I don’t expect there to be very long chains of entities (n <= 5 ???) and I don’t expect all of the objects in those chains to be moving simultaneously. So I have decided to go with immediate updates so that we always have accurate world transforms.

Note: If we run into performance problems as a result of this, we can always create an API function that allows us to set multiple local transforms at once while doing a single world transform update, thus getting back the O(n) performance.
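Sketched out, such a function could look like this, using the data layout and the transform() helper from the Implementation section below. (set_local_batch is a made-up name and HashSet stands in for any set container; take it as a sketch, not final code.)

// Sketch: set many local transforms, then do a single world update.
void TransformComponent::set_local_batch(const Instance *instances,
    const Matrix4x4 *local_tms, unsigned n)
{
    HashSet<int> in_batch;
    for (unsigned j = 0; j < n; ++j) {
        in_batch.insert(instances[j].i);
        _data.local[instances[j].i] = local_tms[j];
    }

    // Only start a world update at the "roots" of the batch. An instance
    // whose parent is also in the batch gets updated by the parent's
    // recursive transform() call, so each world matrix is computed once.
    for (unsigned j = 0; j < n; ++j) {
        Instance parent = _data.parent[instances[j].i];
        if (is_valid(parent) && in_batch.has(parent.i))
            continue;
        Matrix4x4 parent_tm = is_valid(parent) ? _data.world[parent.i] :
            matrix4x4_identity();
        transform(parent_tm, instances[j]);
    }
}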

A side note on deferred updates

Note that if you want to do deferred updates, you want to keep the entity array sorted so that parents always appear before children. That way you can just walk the array from beginning to end and compute the transforms and be sure that the world transform of a parent has been computed before you compute the world transform of its children.

Also, you don’t want to loop over the entire array to look for dirty objects:

for (int i=0; i<n; ++i) {
    if (dirty[i])
        transform(i);
}

Typically, in a scene, only a small percentage of the objects are moving at any one time (maybe as little as 1%). So looping over all objects, even just to check a flag, can waste a lot of time.

A better solution is to sort all the dirty objects to the end of the array, so we can loop over just them:

for (int i=first_dirty; i<n; ++i)
    transform(i);

Since we only need a partial sorting of the array, we don’t have to run an expensive O(n log n) sorting algorithm. (It would kind of defeat the purpose to run an O(n log n) sort to avoid an O(n) update.) Instead, we can achieve this by judicious swapping.

When a node becomes dirty we move it to the start of the dirty list by swapping it with the element before the dirty list and decreasing first_dirty:

                                 =============== dirty ==============
|   |   |   | D |   |   |   | X |   |   |   |   |   |   |   |   |   |

                             ================= dirty ================
|   |   |   | X |   |   |   | D |   |   |   |   |   |   |   |   |   |

We do the same for all children of the node and the children’s children, etc.
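In code, the marking step might look something like this. (_first_dirty and swap_instances are assumed names for things not shown here.)

// Sketch: mark instance i dirty by swapping it into the dirty range,
// which occupies the end of the array starting at _first_dirty.
void make_dirty(unsigned i)
{
    if (i >= _first_dirty)           // already in the dirty range
        return;
    unsigned last_clean = _first_dirty - 1;
    swap_instances(i, last_clean);   // swap data and patch references
    _first_dirty = last_clean;
}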

As we process the items in the dirty array, whenever we find a child that has its parent at a later position in the array, we swap the child and the parent.

                             ================= dirty ================
|   |   |   |   |   |   |   |   |   |   | C |   |   | P |   |   |   |
                                          ^

                             ================= dirty ================
|   |   |   |   |   |   |   |   |   |   | P |   |   | C |   |   |   |
                                          ^

This guarantees that parents are always processed before their children.

We also need a way to move items off the dirty list, or it will continue to grow indefinitely. We could clear the list every frame, but that might lead to a lot of swapping as items are moved in and out of the list. A better approach might be to check if an item hasn’t moved in five frames or so, and in that case we move it off the dirty list. This avoids swapping those items which are always moving.

When using the immediate update strategy, sorting the list is not as important, but we can employ similar swapping strategies to make sure that a parent node and its children are kept close together in the array, so that the immediate update is cache friendly.

Implementation

With the design thought through, there is really not that much to the implementation.

Just as in the last post, we store the transform component data for all instances in a single big memory block:

struct Instance {int i;};

/// Instance data.
struct InstanceData {
    unsigned size;              ///< Number of used entries in arrays
    unsigned capacity;          ///< Number of allocated entries in arrays
    void *buffer;               ///< Raw buffer for data.

    Entity *entity;             ///< The entity owning this instance.
    Matrix4x4 *local;           ///< Local transform with respect to parent.
    Matrix4x4 *world;           ///< World transform.
    Instance *parent;           ///< The parent instance of this instance.
    Instance *first_child;      ///< The first child of this instance.
    Instance *next_sibling;     ///< The next sibling of this instance.
    Instance *prev_sibling;     ///< The previous sibling of this instance.
};

The parent, first_child, next_sibling and prev_sibling arrays all store instance indexes. We can find all the children of a particular entity by following its first_child link and then chaining along the next_sibling links of the children.
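The code below uses an is_valid() helper that isn’t shown. One simple possibility, assuming a nil instance is represented by index -1:

// Assumption for the sketch: index -1 marks a nil instance.
bool is_valid(Instance i)
{
    return i.i != -1;
}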

We can use that to do the immediate transform update:

void TransformComponent::set_local(Instance i, const Matrix4x4 &m)
{
    _data.local[i.i] = m;
    Instance parent = _data.parent[i.i];
    Matrix4x4 parent_tm = is_valid(parent) ? _data.world[ parent.i ] :
        matrix4x4_identity();
    transform(parent_tm, i);
}

void TransformComponent::transform(const Matrix4x4 &parent, Instance i)
{
    _data.world[i.i] = _data.local[i.i] * parent;

    Instance child = _data.first_child[i.i];
    while (is_valid(child)) {
        transform(_data.world[i.i], child);
        child = _data.next_sibling[child.i];
    }
}

Note: I’ve written this as a recursive function for easier reading, but you might want to rewrite it as an iterative function for better performance.
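As a sketch, an iterative variant could use an explicit stack instead of the call stack (the Array method names are illustrative):

// Sketch of an iterative version: an explicit stack replaces recursion.
// In production you would reuse a scratch buffer instead of allocating
// the stack on every call.
void TransformComponent::transform(const Matrix4x4 &parent, Instance i)
{
    _data.world[i.i] = _data.local[i.i] * parent;

    Array<Instance> stack;
    stack.push_back(i);
    while (!stack.empty()) {
        Instance p = stack.back();
        stack.pop_back();
        Instance child = _data.first_child[p.i];
        while (is_valid(child)) {
            _data.world[child.i] = _data.local[child.i] * _data.world[p.i];
            stack.push_back(child);
            child = _data.next_sibling[child.i];
        }
    }
}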

Note that when you swap two instances in the array (to do swap-erase or to sort the array as described above), in addition to swapping the entries in the array you also need to take care to keep all the parent, first_child, next_sibling and prev_sibling references intact. This can get a little hairy, especially when you are changing references and trying to walk those lists of references at the same time. My suggestion when you want to swap two instances [A] and [B] is to use the element at the end of the array [size] as a temporary storage slot and instead of trying to do everything at once, use three steps:

// Move element at A (and references to it) to size.
[size] <--- [A]

// Now nothing refers to A, so we can safely move element at B (and references
// to it) to A.
[A] <--- [B]

// And finally move the element at size to B.
[B] <--- [size]
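Here is a sketch of what the “move” step could look like. (move_instance is a made-up name and this is not the engine’s actual code; note in particular that the children’s parent references also need patching.)

// Sketch: move the instance in slot `from` into the empty slot `to`,
// patching every reference that pointed at `from`.
void TransformComponent::move_instance(unsigned from, unsigned to)
{
    _data.entity[to] = _data.entity[from];
    _data.local[to] = _data.local[from];
    _data.world[to] = _data.world[from];
    _data.parent[to] = _data.parent[from];
    _data.first_child[to] = _data.first_child[from];
    _data.next_sibling[to] = _data.next_sibling[from];
    _data.prev_sibling[to] = _data.prev_sibling[from];

    // Patch references that pointed at `from`.
    Instance parent = _data.parent[from];
    if (is_valid(parent) && _data.first_child[parent.i].i == (int)from)
        _data.first_child[parent.i].i = (int)to;

    Instance prev = _data.prev_sibling[from];
    if (is_valid(prev))
        _data.next_sibling[prev.i].i = (int)to;

    Instance next = _data.next_sibling[from];
    if (is_valid(next))
        _data.prev_sibling[next.i].i = (int)to;

    for (Instance c = _data.first_child[from]; is_valid(c);
            c = _data.next_sibling[c.i])
        _data.parent[c.i].i = (int)to;

    // If you keep an Entity -> Instance lookup table, remember to
    // update its entry for _data.entity[to] here as well.
}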

In the next post I’ll look at compiling entities into resource files.

Monday, September 8, 2014

Building a Data-Oriented Entity System (Part 2: Components)

In the last post, I talked about the design of the Entity Manager and how we handle creation and destruction of game entities.

In this post we will look at how components can be implemented.

A quick recap: Components in our system are not individual objects, instead all components of a particular type are handled by a component manager for that type. The component manager has full control over how the component data is stored internally and how updates are applied.

A Component Example

To have something to talk about we will consider a fictitious component that handles point mass objects. For each component instance we want to store the following data:

Entity entity;          ///< Entity owner
float mass;             ///< Mass of object
Vector3 position;       ///< Object's position
Vector3 velocity;       ///< Object's velocity
Vector3 acceleration;   ///< Object's acceleration

The component needs functions for accessing this data and simulating physics.

It is perhaps not self-evident why we want to store the entity that owns the component, but it will come in handy later.

Note that this is not a real world example. We don’t actually have a component like this in the engine, and perhaps it’s not the best or most interesting design, but it gives us something to talk about.

Component Data Layout

When considering how we should layout the data in the component manager we have two goals:

  • Given an entity we want to be able to quickly look up the component data for that entity.
  • We want the component data to be packed tightly in memory for good cache performance.

Let’s tackle the second question first.

Actual cache performance depends on how your CPU works and what the data access patterns in the code are. You can spend a lot of time trying to bend your mind around those things, but I would recommend going with a simple rule of thumb instead:

Pack the data in arrays that you access sequentially.

Only get more fancy than that when you are trying to fix a diagnosed performance issue.

A generally good approach is to use a structure-of-arrays. I.e., each field is stored in an array in memory, with one entry for each component instance:

[entity_1]  [entity_2]  [entity_3] ...
[mass_1]    [mass_2]    [mass_3]   ...
[pos_1]     [pos_2]     [pos_3]    ...
[vel_1]     [vel_2]     [vel_3]    ...
[acc_1]     [acc_2]     [acc_3]    ...

The advantage of having each field stored separately is that code that only processes some of the fields doesn’t have to waste precious cache space on the others.

You could go even further and put each x, y and z component of a Vector3 into its own array. An advantage of that is that you can do more efficient SIMD calculations, if you want to go down that route. But for this example, let’s keep things a bit simpler and store the Vector3s together. Since the layout of the data is entirely encapsulated in the ComponentManager class we can always go back and redesign that later if we need some extra performance.
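For reference, the more SIMD-friendly layout would look something like this (a sketch of the alternative, not what we use in this post):

// Sketch: one array per float component, so e.g. four x-coordinates
// can be loaded straight into a SIMD register.
struct SimdInstanceData {
    unsigned n;            ///< Number of used instances.
    float *position_x;
    float *position_y;
    float *position_z;
    float *velocity_x;
    float *velocity_y;
    float *velocity_z;
    float *acceleration_x;
    float *acceleration_y;
    float *acceleration_z;
};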

The simplest way of implementing this data layout is to use an Array for each component:

class PointMassComponentManager {
    struct InstanceData {
        Array<Entity> entity;
        Array<float> mass;
        Array<Vector3> position;
        Array<Vector3> velocity;
        Array<Vector3> acceleration;
    };
    InstanceData _data;
};

That works well enough, but it does mean that the data gets stored in five separately allocated memory buffers. So I use a different approach. I allocate the entire memory buffer as a single allocation and then just let entity, mass, etc, point to different parts of that buffer:

struct InstanceData {
    unsigned n;          ///< Number of used instances.
    unsigned allocated;  ///< Number of allocated instances.
    void *buffer;        ///< Buffer with instance data.

    Entity *entity;
    float *mass;
    Vector3 *position;
    Vector3 *velocity;
    Vector3 *acceleration;
};
InstanceData _data;

void allocate(unsigned sz)
{
    assert(sz > _data.n);

    InstanceData new_data;
    const unsigned bytes = sz * (sizeof(Entity) + sizeof(float) +
        3 * sizeof(Vector3));
    new_data.buffer = _allocator.allocate(bytes);
    new_data.n = _data.n;
    new_data.allocated = sz;

    new_data.entity = (Entity *)(new_data.buffer);
    new_data.mass = (float *)(new_data.entity + sz);
    new_data.position = (Vector3 *)(new_data.mass + sz);
    new_data.velocity = new_data.position + sz;
    new_data.acceleration = new_data.velocity + sz;

    memcpy(new_data.entity, _data.entity, _data.n * sizeof(Entity));
memcpy(new_data.mass, _data.mass, _data.n * sizeof(float));
    memcpy(new_data.position, _data.position, _data.n * sizeof(Vector3));
    memcpy(new_data.velocity, _data.velocity, _data.n * sizeof(Vector3));
    memcpy(new_data.acceleration, _data.acceleration,
        _data.n * sizeof(Vector3));

    _allocator.deallocate(_data.buffer);
    _data = new_data;
}

This avoids any hidden overheads that might exist in the Array class and we only have a single allocation to keep track of. This is better both for the cache and the memory allocation system.
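For context, allocate() would be called from whatever function creates new instances, growing the buffer geometrically when it is full. A sketch (create() isn’t shown in this post, the default values are assumptions, and vector3() stands in for whatever Vector3 constructor you have):

// Sketch: add a new instance, growing the buffer when necessary.
unsigned create(Entity e)
{
    if (_data.n == _data.allocated)
        allocate(_data.allocated == 0 ? 16 : 2 * _data.allocated);

    unsigned i = _data.n++;
    _data.entity[i] = e;
    _data.mass[i] = 1.0f;
    _data.position[i] = vector3(0.0f, 0.0f, 0.0f);
    _data.velocity[i] = vector3(0.0f, 0.0f, 0.0f);
    _data.acceleration[i] = vector3(0.0f, 0.0f, 0.0f);
    return i;
}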

Side note: I’m tempted to write a memory system with a 4 K allocation granularity. I.e. there is no traditional heap allocator, just a page allocator and you have to design your systems so that they only work with large allocations.

Accessing Data

Let’s now consider the first issue: how we map from an entity to its component data. For the sake of simplicity, let’s assume for now that we don’t support multiple components per entity.

In the data layout, we refer to a particular component instance by its index in the mass, position, etc arrays. So what we need is a way to map from an entity to an index.

You may remember from the previous post, that Entity itself contains a unique index. So one alternative would be to just use this index.

This could be a good approach if almost every entity in the game had this component. But if that is not the case our arrays will contain a lot of “holes” corresponding to entities that lack the component. This will waste memory, but also performance, because we will fill our caches with unused data.

We can improve this somewhat by using a level of indirection:

Array<unsigned> _map;

Here, the _map allows us to look up a component index based on the entity index. This is a lot better, because now it is just the _map array that has holes, not the _data array, which means that the holes are fewer and smaller.

Still, I would only use this if I was certain that the component was almost universal and that lookups were performance critical. In most cases, I think a hash index is a better approach:

HashMap<Entity, unsigned> _map;

This uses less memory and lookups are still pretty fast.

Since the lookup from Entity to instance index involves an extra step we want to reflect that in the API and not force the user to do multiple lookups when she wants to access different fields of the same component. Something like this:

/// Handle to a component instance.
struct Instance {int i;};

/// Create an instance from an index to the data arrays.
Instance make_instance(int i) {Instance inst = {i}; return inst;}

/// Returns the component instance for the specified entity or a nil instance
/// if the entity doesn't have the component.
Instance lookup(Entity e) {return make_instance(_map.get(e, 0));}

float mass(Instance i) {return _data.mass[i.i];}
void set_mass(Instance i, float mass) {_data.mass[i.i] = mass;}
Vector3 position(Instance i) {return _data.position[i.i];}
...

To support multiple component instances per entity, you can add a next_instance field to the component data that allows you to traverse a linked list of component instances belonging to the same entity. This is left as an exercise to the reader.
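Though as a hint, one possible shape of that solution, sketched (do_something is a stand-in):

// Sketch: a next_instance array added to InstanceData chains together
// all instances owned by the same entity.
Instance *next_instance;    ///< Next instance owned by the same entity.

// Visit every instance belonging to entity e. is_valid() tests against
// whatever nil convention lookup() uses.
void for_each_instance(Entity e)
{
    for (Instance i = lookup(e); is_valid(i); i = _data.next_instance[i.i])
        do_something(i);
}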

Component Updates

Since the component data is laid out sequentially in memory, writing a function that simulates physics for all entities is simple:

void simulate(float dt)
{
    for (unsigned i=0; i<_data.n; ++i) {
        _data.velocity[i] += _data.acceleration[i] * dt;
        _data.position[i] += _data.velocity[i] * dt;
    }
}

This function traverses memory in-order which gives us good cache performance. It’s also easy to profile, vectorize and parallelize, should the need arise.

Side rant: I’m somewhat allergic to methods being called update(). That is a bad remnant of inheritance-based designs. If you take a second to think about it you can almost always come up with better, more informative names than update().

Destroying Components

When destroying components, we want to make sure that we keep the _data array tightly packed. We can achieve that by moving the last element to the position of the component we want to remove. We must also update the _map entry for the corresponding entity.

void destroy(unsigned i)
{
    unsigned last = _data.n - 1;
    Entity e = _data.entity[i];
    Entity last_e = _data.entity[last];

    _data.entity[i] = _data.entity[last];
    _data.mass[i] = _data.mass[last];
    _data.position[i] = _data.position[last];
    _data.velocity[i] = _data.velocity[last];
    _data.acceleration[i] = _data.acceleration[last];

    _map[last_e] =  i;
    _map.erase(e);

    --_data.n;
}

Another question is how we handle destruction of components when an entity is destroyed. As you may recall, the entity does not have an explicit list of the components that it owns. Also, it seems onerous to require the user of the API to manually destroy the right components when the entity dies.

Instead, we use one of two approaches.

Components that need to be destroyed immediately (perhaps because they hold external resources) can register a destruction callback with the EntityManager and that callback will be called when the entity is destroyed.

However, for simpler components, like the point mass component, there is nothing that requires components to be destroyed at exactly the same time as the entity. We can take advantage of that and use garbage collection to lazily destroy components instead of spending memory and effort on storing callback lists:

void gc(const EntityManager &em)
{
    unsigned alive_in_row = 0;
    while (_data.n > 0 && alive_in_row < 4) {
        unsigned i = random_in_range(0, _data.n - 1);
        if (em.alive(_data.entity[i])) {
            ++alive_in_row;
            continue;
        }
        alive_in_row = 0;
        destroy(i);
    }
}

Here, we pick random component indices and destroy them if the corresponding entity has been destroyed. We do this until we hit four living entities in a row.

The nice thing about this code is that it costs almost nothing if there are no destroyed entities (just four passes of the loop). But when there are a lot of destroyed entities, the components get destroyed quickly.

In the next post, we will look at the Transform Component that handles links between parent and child entities.