Friday, September 23, 2011

Managing Decoupling Part 4 -- The ID Lookup Table

Today I am going to dig deeper into an important and versatile data structure that pops up all the time in the BitSquid engine -- the ID lookup table.

I have already talked about the advantages of using IDs to refer to objects owned by other systems, but let me just quickly recap.

IDs are better than direct pointers because we don’t get dangling references if the other system decides that the object needs to be destroyed.

IDs are better than shared_ptr<> and weak_ptr<> because they allow the other system to reorganize its objects in memory and delete them at will, and they don’t require thread synchronization to maintain a reference count. They are also POD (plain old data) structures, so they can be copied and moved in memory freely, passed back and forth between C++ and Lua, etc.

By an ID I simply mean an opaque data structure of n bits. It has no particular meaning to us, we just use it to refer to an object. The system provides the mechanism for looking up an object based on it. Since we seldom create more than 4 billion objects, 32 bits is usually enough for the ID, so we can just use a standard integer. If a system needs a lot of objects, we can go to 64 bits.

In this post I’m going to look at what data structures a system might use to do the lookup from ID to system object. There are some requirements that such data structures need to fulfill:

  • There should be a 1-1 mapping between live objects and IDs.
  • If the system is supplied with an ID to an old object, it should be able to detect that the object is no longer alive.
  • Lookup from ID to object should be very fast (this is the most common operation).
  • Adding and removing objects should be fast.

Let’s look at three different ways of implementing this data structure, with increasing degrees of sophistication.

The STL Method


The by-the-book object-oriented approach is to allocate objects on the heap and use a std::map to map from ID to object.

typedef unsigned ID;

struct System
{
	ID _next_id;
	std::map<ID, Object *> _objects;

	System() {_next_id = 0;}

	inline bool has(ID id) {
		return _objects.count(id) > 0;
	}
	
	inline Object &lookup(ID id) {
		return *_objects[id];
	}
	
	inline ID add() {
		ID id = _next_id++;
		Object *o = new Object();
		o->id = id;
		_objects[id] = o;
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		_objects.erase(id);
		delete &o;
	}
};

Note that if we create more than four billion objects, the _next_id counter will wrap around and we risk getting two objects with the same ID.

Apart from that, the only problem with this solution is that it is really inefficient. All objects are allocated individually on the heap, which gives bad cache behavior, and the map lookup results in tree walking, which is also bad for the cache. We can switch the map to a hash_map for slightly better performance, but that still leaves a lot of unnecessary pointer chasing.

Array With Holes


What we really want to do is to store our objects linearly in memory, because that will give us the best possible cache behavior. We can either use a fixed size array Object[MAX_SIZE] if we know the maximum number of objects that will ever be used, or we can be more flexible and use a std::vector.

Note: If you care about performance and use std::vector<T> you should make a variant of it (call it array<T> for example) that doesn’t call constructors or initialize memory. Use that for simple types, when you don’t care about initialization. A dynamic vector<T> buffer that grows and shrinks a lot can spend a huge amount of time doing completely unnecessary constructor calls.

To find an object in the array, we need to know its index. But just using the index as ID is not enough, because the object might have been destroyed and a new object might have been created at the same index. To check for that, we also need an id value, as before. So we make the ID type a combination of both:

struct ID {
	unsigned index;
	unsigned inner_id;
};

Now we can use the index to quickly look up the object and the inner_id to verify its identity.

Since the object index is stored in the ID which is exposed externally, once an object has been created it cannot move. When objects are deleted they will leave holes in the array.


When we create new objects we don’t just want to add them to the end of the array. We want to make sure that we fill the holes in the array first.

The standard way of doing that is with a free list. We store a pointer to the first hole in a variable. In each hole we store a pointer to the next hole. These pointers thus form a linked list that enumerates all the holes.


An interesting thing to note is that we usually don’t need to allocate any memory for these pointers. Since the pointers are only used for holes (i.e. dead objects) we can reuse the objects’ own memory for storing them. The objects don’t need that memory, since they are dead.

Here is an implementation. For clarity, I have used an explicit member next in the object for the free list rather than reusing the object’s memory:

struct System
{
	unsigned _next_inner_id;
	std::vector<Object> _objects;
	unsigned _freelist;

	System() {
		_next_inner_id = 0;
		_freelist = UINT_MAX;
	}

	inline bool has(ID id) {
		return _objects[id.index].id.inner_id == id.inner_id;
	}
	
	inline Object &lookup(ID id) {
		return _objects[id.index];
	}
	
	inline ID add() {
		ID id;
		id.inner_id = _next_inner_id++;
		if (_freelist == UINT_MAX) {
			Object o;
			id.index = _objects.size();
			o.id = id;
			_objects.push_back(o);
		} else {
			id.index = _freelist;
			_freelist = _objects[_freelist].next;
			// Update the recycled slot's stored id, so has() and lookup()
			// recognize the new object.
			_objects[id.index].id = id;
		}
		return id;
	}
	
	inline void remove(ID id) {
		Object &o = lookup(id);
		o.id.inner_id = UINT_MAX;
		o.next = _freelist;
		_freelist = id.index;
	}
};

This is a lot better than the STL solution. Insertion and removal are O(1). Lookup is just array indexing, which means it is very fast. In a quick-and-dirty-don’t-take-it-too-seriously test this was 40 times faster than the STL solution. In real life it all depends on the actual usage patterns, of course.

The only part of this solution that is not an improvement over the STL version is that our ID structs have increased from 32 to 64 bits.

There are things that can be done about that. For example, if you never have more than 64 K objects live at the same time, you can get by with 16 bits for the index, which leaves 16 bits for the inner_id. Note that the inner_id doesn’t have to be globally unique, it is enough if it is unique for that index slot. So a 16 bit inner_id is fine if we never create more than 64 K objects in the same index slot.

If you want to go down that road you probably want to change the implementation of the free list slightly. The code above uses a standard free list implementation that acts as a LIFO stack. This means that if you create and delete objects in quick succession they will all be assigned to the same index slot which means you quickly run out of inner_ids for that slot. To prevent that, you want to make sure that you always have a certain number of elements in the free list (allocate more if you run low) and rewrite it as a FIFO. If you always have N free objects and use a FIFO free list, then you are guaranteed that you won’t see an inner_id collision until you have created at least N * 64 K objects.

Of course you can slice and dice the 32 bits in other ways if you have different limits on the maximum number of objects. You have to crunch the numbers for your particular case to see if you can get by with a 32 bit ID.

Packed Array


One drawback with the approach sketched above is that since the index is exposed externally, the system cannot reorganize its objects in memory for maximum performance.

The holes are especially troubling. At some point the system probably wants to loop over all its objects and update them. If the object array is nearly full, no problem. But if the array has 50 % objects and 50 % holes, the loop will touch twice as much memory as necessary. That seems suboptimal.

We can get rid of that by introducing an extra level of indirection, where the IDs point to an array of indices that point to the objects themselves:


This means that we pay the cost of an extra array lookup whenever we resolve the ID. On the other hand, the system objects are packed tight in memory which means that they can be updated more efficiently. Note that the system update doesn’t have to touch or care about the index array. Whether this is a net win depends on how the system is used, but my guess is that in most cases more items are touched internally than are referenced externally.

To remove an object with this solution we use the standard trick of swapping it with the last item in the array. Then we update the index so that it points to the new location of the swapped object.

Here is an implementation. To keep things interesting, this time with a fixed array size, a 32 bit ID and a FIFO free list.

typedef unsigned ID;

#define MAX_OBJECTS 64*1024
#define INDEX_MASK 0xffff
#define NEW_OBJECT_ID_ADD 0x10000

struct Index {
	ID id;
	unsigned short index;
	unsigned short next;
};

struct System
{
	unsigned _num_objects;
	Object _objects[MAX_OBJECTS];
	Index _indices[MAX_OBJECTS];
	unsigned short _freelist_enqueue;
	unsigned short _freelist_dequeue;

	System() {
		_num_objects = 0;
		for (unsigned i=0; i<MAX_OBJECTS; ++i) {
			_indices[i].id = i;
			_indices[i].next = i+1;
		}
		_freelist_dequeue = 0;
		_freelist_enqueue = MAX_OBJECTS-1;
	}

	inline bool has(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		return in.id == id && in.index != USHRT_MAX;
	}
	
	inline Object &lookup(ID id) {
		return _objects[_indices[id & INDEX_MASK].index];
	}
	
	inline ID add() {
		Index &in = _indices[_freelist_dequeue];
		_freelist_dequeue = in.next;
		in.id += NEW_OBJECT_ID_ADD;
		in.index = _num_objects++;
		Object &o = _objects[in.index];
		o.id = in.id;
		return o.id;
	}
	
	inline void remove(ID id) {
		Index &in = _indices[id & INDEX_MASK];
		
		Object &o = _objects[in.index];
		o = _objects[--_num_objects];
		_indices[o.id & INDEX_MASK].index = in.index;
		
		in.index = USHRT_MAX;
		_indices[_freelist_enqueue].next = id & INDEX_MASK;
		_freelist_enqueue = id & INDEX_MASK;
	}
};

24 comments:

  1. This comment has been removed by the author.

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. I don't really see the point of your suggestion. What you are returning externally is a pointer to an object member. For that to remain valid, the object cannot be moved or deleted. But it seems in that case, you might just as well return a pointer to the object itself.

  5. This comment has been removed by the author.

  6. This comment has been removed by the author.

  7. By using the described method in this article:

    1) How are the systems only coupled on a higher level if they for example share the same transforms? I am not sure if in your examples you are coupling from inside the systems (lower level) or from a higher level. Currently I do this by feeding the systems batches of arrays (transforms, bodies, render properties) from an higher level. Multiple systems may share the same arrays but internally keep them in a different order. Is that what you are doing?

    2) How do the systems share the same transforms but access them in a different order while keeping all access patterns sequential? For ex. the renderer needs to access the same transforms in texture order, the physics engine needs to access the same transforms in proximity order. Currently I do this by sorted lists for the renderer (initial radix sort, then merge sort every several frames) and keeping separate lists for physics (for each hashgrid cells).

    3) When an object/entity gets deleted or a component from an object/entity is getting removed: How are you finally removing the holes? Are you swapping out one element every frame? How are you skipping the holes when sequentially processing the arrays. Are you using the index array to only process what is needed? I am not sure if this is what you are already saying in the article.

    Sorry for the many questions and thank you for this great article.

  8. Forgot to ask one last question:

    4) Are you only keeping a ID lookup array per system or do you also have a high level lookup array of some sort?

    Actually, asking all those questions is helping me to answer them myself :P. Of course I am still curious about your answers.

  9. 1) Usually by feeding: the higher level system extracts a list of transforms from one low level system and sends them to another low level system. Both systems keep their own local lists of transforms.

    2) Each system has its own list of data that it needs to "crunch" on. So the animation system, the scene graph and the renderer all have separate lists of object transforms, each sorted according to that system's preferences.

    3) With the last solution, sequential processing is only done on the "object array", not on the "index array", which means there are no holes.

    4) Each system keeps its own lookup table... there is no master lookup table.

  10. @Niklas. Thank you for the answers. I was a bit confused by the terminology used, so I understand the part about the lookup arrays now. Sorry for that ;)

    About the answer on the 1st question. How exactly is the higher level system feeding the transforms from one low level system to the other? Are you doing this every frame? As I can imagine that could potentially cause each low level system to constantly (each frame) re-sort its list of transforms to its preferred order (for example when every transform was changed by the physics system and is fed into another system which requires another order).

    What I am currently doing is keeping the list of transforms the same for most systems so I can pass them around without re-sorting. The only system that will re-sort the list of transforms, is the render system. Can you advice me a better way?

  11. An example...the animation system has a list of bone transforms

    B_1 B_2 B_3 ...

    The scenegraph has a list of node transforms

    N_1 N_2 N_3 ...

    Some of the nodes correspond to bones, but there are also some nodes that don't have corresponding bones.

    A higher level system AnimationToSceneGraph knows about both systems and the connection between nodes and bones. Each update it does (in pseudocode):

    for (i, tm) in Bones:
        Node[bone_to_node[i]].tm = tm

    As you see this does not cause any list resorting. There is some extra copying going on, that you wouldn't get if you shared the data, but I think that is outweighed by the fact that each system can organize its own data so that it can process it as fast as possible.

  12. Hi Niklas. How do you manage one-to-many relations between the index array and the object array? In the case of your bone transform example: an entity having multiple bones?

  13. There are no one-to-many relationships here. An ID is a way for an external system to refer to a single object. If the external system needs to refer to more than one object, it can keep a list of IDs.

  14. Niklas, correct me please if I'm wrong, but there seems to be a small bug in your packed array implementation. Steps to reproduce:

    1) Add MAX_OBJECTS objects without any removals. After adding the last one, _freelist_dequeue will be equal to MAX_OBJECTS.
    2) Remove any object.
    3) Add a new object. Since _freelist_dequeue still points to MAX_OBJECTS, memory corruption will occur.

    I guess the fix is to add the following to the end of the remove() method:

    if (_freelist_dequeue == MAX_OBJECTS)
        _freelist_dequeue = _freelist_enqueue;

  15. You are right. The code as written only supports a maximum of MAX_OBJECTS-1 objects.

  16. This comment has been removed by the author.

  17. This comment has been removed by the author.

  18. How do you handle the case where a lower system resorts a list, so that the higher system still has the right indices?

  19. The ID doesn't have to be an index directly into the lower system's list. There can be a lookup array that offers a level of indirection. That allows for resorting.

  20. Hi, there is a part in your packed array code I don't understand. What is the point of doing this :
    in.id += NEW_OBJECT_ID_ADD;
    ?

  21. To ensure that each object gets a new unique ID, so that you can call has() on an old object ID to check whether that object still exists or not.

  22. Currently trying to grasp those design ideas of DoD and back-to-C (which I find great), I still have trouble wrapping my head around something.

    When you do (in your example of October 10, 2011 in the comments here):

    for (i, tm) in Bones:
        Node[bone_to_node[i]].tm = tm

    Doesn't that random access through the Node array completely defeat the cache? Granted, you will have to do it at some point, and here it's the bones that update the nodes. But what about when it's the other way around? A renderer, audio source or script needing the node transform? Wouldn't it be faster to have each renderer store the id of the corresponding node and grab the transform during the update of the RendererManager, instead of copying it and then iterating over the array? Because in most cases you will only need it once per object update.

    I just can't manage to find a design where I won't have to do those random array accesses during the update anyway, losing the point of having my data contiguous in memory, because I always need to do random access for each of them every update, whether it is in a pre-fetch pass like your example or in the loop iterating over each of them to update them...

  23. It is not completely random access, because you could take steps to ensure that your bone list is sorted in the same order as the node list, and then it would still be in-order access (just skipping the nodes that don't have associated bones).

    Even if you don't do that, random access in a small contiguous memory area (the node matrix list) is still a lot better than random access all over memory, because you will get fewer L2 cache misses.

  24. For fun and new ideas: Interface implementation in Go
    http://research.swtch.com/interfaces
