Locking content files in a CVS is annoying, doesn't scale well and prevents multiple people from working on different parts of the same level (unless you split the level in many small files which have to be locked individually -- which is even more annoying).
But having content conflicts is no fun either. A level designer wants to work in the level editor, not manage strange content conflicts in barely understandable XML-files. The level designer should never have to mess with WinMerging the engine's file formats.
And conflicts shouldn't be necessary. Most content conflicts are not actual conflicts. It is not that often that two people have moved the exact same object or changed the exact same settings parameter. Rather, the conflicts occur because a line-based merge tool tries to merge hierarchical data (XML or JSON) and messes up the structure.
In those rare cases when there is an actual conflict, the content people don't want to resolve it in WinMerge. If two level designers have moved the same object, we don't really help them address the issue by bringing up a dialog box with a ton of XML mumbo-jumbo. Instead, it is much better to just pick one of the two locations and go ahead with merging the file. Then, the level designers can fix any problems that might have occurred in the level editor -- the right tool for the job.
At BitSquid we use JSON for all our content files (actually, a slightly simplified version of JSON that we call SJSON). So to get rid of our conflict issues, I have written a 3-way merger that understands the structure of JSON files and resolves any remaining actual conflicts by always picking the right-hand branch.
If we disregard arrays for the moment, merging JSON files is quite simple. A diff between two JSON files can be expressed as a list of object[key] = value operations. Deleting a key is represented by changing its value to null. Adding a key is represented by changing a null value to something else. Merging these operations is simple. We only have trouble when the same key in the same object is changed to two different values, but then we just pick one of the values, as explained above.
Arrays are trickier because without context, it is impossible to tell what a change to an array means semantically. If the array [1, 2, 3] is changed to [1, 2, 4] is that a single operation that changed the last value from 3 to 4. Or is it two operations, deleting the 3 from the array and inserting 4. How we interpret it will affect the result of our 3-way merges. For example, the 3-way merge of [1, 2, 3], [1, 2, 4] and [1, 2, 5] can give either the result [1, 2, 5] or [1, 2, 4, 5].
I have resolved this by adding extra information to the arrays in our source files. Most of our arrays are arrays of objects. For such arrays, I require that the objects have an "id"-field with a GUID that uniquely identifies the object. With such an id in our array [ {x = 1, id = a}, {x = 2, id = b}, {x = 3, id = c} ] it becomes possible to distinguish between updating an existing value [ {x = 1, id = a}, {x = 2, id = b}, {x = 4, id = c} ] and removing + adding a value [ {x = 1, id = a}, {x = 2, id = b}, {x = 4, id = d} ].
The 3-way merge algorithm I'm using applies some heuristics to guess array transformations even when no id-field is present, but the recommendation is to always add id-fields to array elements to get perfect merges.
You can download my 3-way Json merger here. I just wrote it today, so it haven't received much testing yet. But it is public domain software, so free free to fix the bugs and do whatever else you like with it.
Thursday, June 3, 2010
Friday, May 28, 2010
Practical Examples in Data Oriented Design
Here are the slides from my talk at the Sthlm Game Developer Forum yesterday:
Friday, April 23, 2010
Our Tool Architecture
The BitSquid tool architecture is based on two main design principles:
By decoupling the tools from the engine we achieve freedom and flexibility, both in the design of the tools and in the design of the engine. The tools can be written in any language (C#, Ruby, Java, Lisp, Lua, Python, C++, etc), using any methodology and design philosophy. The engine can be optimized and the runtime data formats changed without affecting the tools.
What we envision is a Unix-like environment with a plethora of special purpose tools (particle editor, animation editor, level editor, material editor, profiler, lua debugger, etc) rather than a single monolithic Mega-Editor. We want it to be easy for our licensees to supplement our standard tool set with their own in-house tools, custom written to fit the requirements of their particular games. For example, a top-down 2D game may have a custom written tile editor. Another programmer may want to hack together a simple batch script that drops a MIP-step from all vegetation textures.
At first glance, our two design goals may appear conflicting. How can we make our tools use the engine for all visualization without strongly coupling the tools to the engine? Our solution is shown in the image below:
Note that there is no direct linkage between the tool and the engine. The tool only talks to the engine through the network. All messages on the network connection are simple JSON structs, such as:
{
"type" : "message",
"level" : "info",
"system" : "D3DRenderDevice",
"message" : "Resizing swap chain: 1626 1051"
}
This applies for all tools. When the lua debugger wants to set a breakpoint, it sends a message to the engine with the lua file and line number. When the breakpoint is hit, the engine sends a message back. (So you can easily swap in your own lua debugger integrated with your favorite editor, by simply receiving and sending these messages.) When the engine has gathered a bunch of profiling data, it sends a profiler message. Et cetera.
For visualization, the tool creates a window where it wants the engine to render and sends the window handle to the engine. The engine then creates a swap chain for that window and renders into it.
(In the future we may also add support for a VNC-like mode where we instead let the engine send the content of the frame buffer over the network. This would allow the tools to work directly against consoles, letting the artists see, directly in their editors, how everything will look on the lead platform.)
A tool typically boots the engine in a special mode where it runs a custom lua script designed to collaborate with that particular tool. For example, the particle editor boots the engine with particle_editor_slave.lua which sets up a default scene for viewing particle effects with a camera, skydome, lights, etc. The tool then sends script commands over the network connection that tells the engine what to do, for example to display a particular effect:
{
type = "script",
script = "ParticleEditorSlave:test_effect('fx/grenade/explosion')"
}
These commands are handled by the slave script. The slave script can also send messages back if the tool is requesting information.
The slave scripts are usually quite simple. The particle editor slave script is just 120 lines of lua code.
To make the tools independent of the engine data formats we have separated the data into human-readable, extensible and backwards compatible generic data and fast, efficient, platform specific runtime data. The tools always work with the generic data, which is pretty much all in JSON (exceptions are textures and WAVs). Thus, they never need to care about how the engine represents its runtime data and the engine is free to change and optimize the runtime format however it likes.
When the tool has changed some data and wants to see the change in-engine, it launches the data compiler to generate the runtime data. (The data compiler is in fact just the regular Win32 engine started with a -compile flag, so the engine and the data compiler are always in sync. Any change of the runtime formats triggers a recompile.) The data compiler is clever about just compiling the data that has actually changed.
When the compile is done, the tool sends a network message to the engine, telling it to reload the changed data file at which point you will see the changes in-game. All this happens nearly instantaneously allowing very quick tweaking of content and gameplay (by reloading lua files).
This system has worked out really well for us. The decoupling has allowed for fast development of both the tools and the engine. Today we have about ten different tools that use this system and we have been able to make many optimizations to the engine and the runtime formats without affecting the tools or the generic data.
- Tools should use the "real" engine for visualization.
- Tools should not be directly linked or otherwise strongly coupled to the engine.
Using the real engine for visualization means that everything will look and behave exactly the same in the tools as it does in-game. It also saves us the work of having to write a completely separate "tool visualizer" as well as the nightmare of trying to keep it in sync with changes to the engine.
By decoupling the tools from the engine we achieve freedom and flexibility, both in the design of the tools and in the design of the engine. The tools can be written in any language (C#, Ruby, Java, Lisp, Lua, Python, C++, etc), using any methodology and design philosophy. The engine can be optimized and the runtime data formats changed without affecting the tools.
What we envision is a Unix-like environment with a plethora of special purpose tools (particle editor, animation editor, level editor, material editor, profiler, lua debugger, etc) rather than a single monolithic Mega-Editor. We want it to be easy for our licensees to supplement our standard tool set with their own in-house tools, custom written to fit the requirements of their particular games. For example, a top-down 2D game may have a custom written tile editor. Another programmer may want to hack together a simple batch script that drops a MIP-step from all vegetation textures.
At first glance, our two design goals may appear conflicting. How can we make our tools use the engine for all visualization without strongly coupling the tools to the engine? Our solution is shown in the image below:
Note that there is no direct linkage between the tool and the engine. The tool only talks to the engine through the network. All messages on the network connection are simple JSON structs, such as:
{
"type" : "message",
"level" : "info",
"system" : "D3DRenderDevice",
"message" : "Resizing swap chain: 1626 1051"
}
This applies for all tools. When the lua debugger wants to set a breakpoint, it sends a message to the engine with the lua file and line number. When the breakpoint is hit, the engine sends a message back. (So you can easily swap in your own lua debugger integrated with your favorite editor, by simply receiving and sending these messages.) When the engine has gathered a bunch of profiling data, it sends a profiler message. Et cetera.
For visualization, the tool creates a window where it wants the engine to render and sends the window handle to the engine. The engine then creates a swap chain for that window and renders into it.
(In the future we may also add support for a VNC-like mode where we instead let the engine send the content of the frame buffer over the network. This would allow the tools to work directly against consoles, letting the artists see, directly in their editors, how everything will look on the lead platform.)
A tool typically boots the engine in a special mode where it runs a custom lua script designed to collaborate with that particular tool. For example, the particle editor boots the engine with particle_editor_slave.lua which sets up a default scene for viewing particle effects with a camera, skydome, lights, etc. The tool then sends script commands over the network connection that tells the engine what to do, for example to display a particular effect:
{
type = "script",
script = "ParticleEditorSlave:test_effect('fx/grenade/explosion')"
}
These commands are handled by the slave script. The slave script can also send messages back if the tool is requesting information.
The slave scripts are usually quite simple. The particle editor slave script is just 120 lines of lua code.
To make the tools independent of the engine data formats we have separated the data into human-readable, extensible and backwards compatible generic data and fast, efficient, platform specific runtime data. The tools always work with the generic data, which is pretty much all in JSON (exceptions are textures and WAVs). Thus, they never need to care about how the engine represents its runtime data and the engine is free to change and optimize the runtime format however it likes.
When the tool has changed some data and wants to see the change in-engine, it launches the data compiler to generate the runtime data. (The data compiler is in fact just the regular Win32 engine started with a -compile flag, so the engine and the data compiler are always in sync. Any change of the runtime formats triggers a recompile.) The data compiler is clever about just compiling the data that has actually changed.
When the compile is done, the tool sends a network message to the engine, telling it to reload the changed data file at which point you will see the changes in-game. All this happens nearly instantaneously allowing very quick tweaking of content and gameplay (by reloading lua files).
This system has worked out really well for us. The decoupling has allowed for fast development of both the tools and the engine. Today we have about ten different tools that use this system and we have been able to make many optimizations to the engine and the runtime formats without affecting the tools or the generic data.
Friday, April 9, 2010
Distance Field Based Rendering of AngelCode Fonts
This morning, we added support for distance field based font rendering to the BitSquid engine (from Valve's paper http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf). An example is shown below:
The top row shows the original font, below that is the original font rendered with alpha test. The third row is a distance field representation of the font and the final row shows the distance field representation rendered with alpha test. Note that the distance field version gives better quality in the diagonal lines.
(Note: The last row looks thicker than the second row, because it was generated from a large font size and scaled down, while the second row was generated from a small font size. Because of true type font hinting at small sizes, the result is different. The last row gives a truer representation of the "actual" thickness of the font.)
A quick Google search didn't show any good tools for generating distance field font maps, so I decided to write my own. We use the excellent AngelCode Bitmap Font Generator (http://www.angelcode.com/products/bmfont/) to generate our font maps, so I decided to make a tool that works with the files generated by AngelCode:
The tool takes a high resolution AngelCode .fnt file as input. It scales it down by the specified scale factor and converts it to a distance field. The spread specifies how many pixels the distance field should extend outside the character outline before it clamps to zero. (It is useful if you want to add things such as glow effects to the font rendering.) After the conversion, the tool outputs new scaled down .tga images of the fonts and a new .fnt file with all measurements converted to work with the scaled down textures.
So to use it, you first generate a font bitmap and .fnt file using AngelCode at 8 x the font size and 8 x the texture size you want in the final image. (Make sure to add 8 x spread pixels of padding around the characters or else the distance fields will bleed into each other.) Then you run the tool to convert it to a distance field texture.
The tool is a bit limited -- it only works with monochrome uncompressed .tga files. It only reads and writes the XML version of the AngelCode font format. The distance field generation isn't particularly clever or fast. But I thought I should share it anyway since I couldn't find any other tools for generating distance field based font maps. Modifying it to support more formats shouldn't be much work.
Grab a binary version here:
Or the C# project files here:
Feel free to do whatever you want with it!
(Note: The last row looks thicker than the second row, because it was generated from a large font size and scaled down, while the second row was generated from a small font size. Because of true type font hinting at small sizes, the result is different. The last row gives a truer representation of the "actual" thickness of the font.)
A quick Google search didn't show any good tools for generating distance field font maps, so I decided to write my own. We use the excellent AngelCode Bitmap Font Generator (http://www.angelcode.com/products/bmfont/) to generate our font maps, so I decided to make a tool that works with the files generated by AngelCode:
So to use it, you first generate a font bitmap and .fnt file using AngelCode at 8 x the font size and 8 x the texture size you want in the final image. (Make sure to add 8 x spread pixels of padding around the characters or else the distance fields will bleed into each other.) Then you run the tool to convert it to a distance field texture.
The tool is a bit limited -- it only works with monochrome uncompressed .tga files. It only reads and writes the XML version of the AngelCode font format. The distance field generation isn't particularly clever or fast. But I thought I should share it anyway since I couldn't find any other tools for generating distance field based font maps. Modifying it to support more formats shouldn't be much work.
Grab a binary version here:
Or the C# project files here:
Feel free to do whatever you want with it!
Thursday, March 25, 2010
Task Management -- A Practical Example
I've spent the last couple of days rewriting the task manager in the BitSquid engine. Task management is an important topic in our glorious multicore future, but it is hard to find good practical information about it. GDC was also a bit of a disappointment in this regard. So I thought I should share some of my thoughts and experiences.
The previous iteration of our task scheduler was based on Vista ThreadPools and mainly supported data parallelism. (Though we still had a degree of task parallelism from running two main threads -- an update thread and a render thread -- which both posted batches of jobs to the task manager.)
For the rewrite, I had a number of goals:
The previous iteration of our task scheduler was based on Vista ThreadPools and mainly supported data parallelism. (Though we still had a degree of task parallelism from running two main threads -- an update thread and a render thread -- which both posted batches of jobs to the task manager.)
For the rewrite, I had a number of goals:
- Move away from Vista Thread Pools. We want complete control over our job threads.
- Minimize context switching. This is a guessing game on Windows, since the OS will do what the OS will do, but minimizing oversubscription of threads should help.
- Make a system that can run completely task based. I. e., everything in the system is run as a task and there are no explicit wait() calls. Instead the entire code flow is controlled by task dependencies. Such a design allows us to exploit all possibilities for parallelism in the code which leads to maximum core utilization.
- Still be "backwards compatible" with a system that uses one or more "main threads" that wait() for data parallel jobs to complete, so that we can move incrementally to a more and more task based code flow.
- Support tasks that run on external processors, such as SPUs or GPUs.
- Support hierarchical decomposition of tasks.
By hierarchical decomposition I mean that it should be possible to analyze the system in terms of tasks and subtasks. So that, at a higher level, we can regard the animation system as a single task that runs in parallel to other system tasks:
But then we can zoom in on the animation task and see that in fact is composed of a number of subtasks which in turn parallelize:
Hierarchical decomposition makes it possible to analyze systems and subsystems at different levels of abstraction rather than having to keep the entire task dependency graph in our heads. This is good because my head just isn't big enough.
A task in the new implementation is a simple data structure:
Here work is a work item to be performed on an SPU, CPU or GPU. affinity can be set for items that must be performed on particular threads.
parent specifies child/parent relationships between tasks. A task can have any number of children/subtasks. A task is considered completed when its work has been executed and all its children has completed. In practice this is implemented by the open_work_items counter. The counter is initially set to the number of child tasks + 1 (for the task's own work item). When a task completes, it reduces the open_work_items count of its parent and when that figure reaches zero, the parent work is completed.
I do not explicitly track completed task. Instead I keep a list of all open (i.e. not completed) tasks. Any task that is not in the open list is considered completed. Note that the open list is separate from the queue of work items that need to be performed. Items are removed from the queue when they are scheduled to a worker thread and removed from the open list when they have completed.
The dependency field specifies a task that the task depends on. The task is not allowed to start until its dependency task has completed. Note that a task can only have a single dependency. The reason for this is that I wanted the task structure to be a simple POD type and not include any arrays or other external memory references.
Having a single dependency is not a limitation, because if we want to depend on more than one task we can just introduce an anonymous task with no work item that has all the tasks we want to depend on as children. That task will complete when all its children has completed, so depending on that task gives us the wanted dependencies.
The priority field specfies the importance of the task. When several tasks are available, we will pick the one with the highest priority. I will discuss this a bit more in a minute.
The Task Manager has a number of threads for processing tasks. Some of these are "main threads" that are created by other parts of the system and registered with the thread manager (in our case, an update thread and a render thread). The rest are worker threads created internally by the task manager. The number of worker threads is:
The total number of threads managed by the task manager thus equals the number of cores in the system, so we have no over- or undersubscription.
The worker threads are in a constant loop where they check the task manager for work items to perform. If a work item is available, they perform it and then notify the task manager of its completion. If no work items are available, they sleep and are woken by the task manager when new work items become available.
The main threads run their normal serial code path. As part of that code path, they can create tasks and subtasks that get queued with the task manager. They can also wait() for tasks to complete. When a thread waits for a task it doesn't go idle. Instead it loops and helps the task manager with completing tasks. Only when there are no more tasks in the queue does the thread sleep. It wakes up again when there are more tasks to perform or when the task it originally waited for has completed.
The main threads can also process tasks while waiting for other events by calling a special function in the task manager do_work_while_waiting_for(Event &). For example, the update thread calls this to wait for the frame synchronization event from the render thread.
This means that all task manager threads are either running their serial code paths or processing jobs -- as long as there are jobs to perform and they don't get preempted by the OS. This means that as long as we have lots of jobs and few sync points we will achieve 100 % core utilization.
This approach also allows us to freely mix serial code with a completely task based approach. We can start out with a serial main loop (with data parallelization in the update() functions):
void World::update()
{
_animation->update()
_scene_graph->update();
_gui->update();
render();
_sound->update();
}
And gradually convert it to fully braided parallelism (this code corresponds to the task graph shown above):
void World::update()
{
TaskId animation = _tasks->add( animation_task(_animation) );
TaskId scene_graph = _tasks->add( scene_graph_task(_scene_graph) );
_tasks->depends_on(scene_graph, animation);
TaskId gui = _tasks->add( gui_task(_gui) );
TaskId gui_scene = _tasks->add_empty();
_tasks->add_child(gui_scene, scene_graph);
_tasks->add_child(gui_scene, gui);
TaskId render = _tasks->add( render_task(this) );
_tasks->depends_on(render, gui_scene);
TaskId sound = _tasks->add( sound_update_task(_sound) );
TaskId done = _tasks->add_empty();
_tasks->add_child(done, render);
_tasks->add_child(done, sound);
_tasks->wait(done);
}
Note that tasks, subtasks and dependencies are created dynamically as part of the execution of serial code or other tasks. I believe this "immediate mode" approach is more flexible and easier to work with than some sort of "retained" or "static" task graph building.
A screenshot from our profiler shows this in action for a scene with 1000 animated characters with state machines:
Notice how the main and render threads help with processing tasks while they are waiting for tasks to be completed.
Once we have a task graph we want to make sure that our scheduler runs it as fast possible. Theoretically, we would do this by finding the critical path of the graph and making sure that tasks along the critical path are prioritized over other tasks. It's the classical task scheduling problem.
In a game, the critical path can vary a lot over different scenes. Some scenes are render bound, others are CPU bound. Of the CPU bound scenes, some may be bounded by script, others by animation, etc.
To achieve maximum performance in all situations we would have to dynamically determine the critical path and prioritize the tasks accordingly. This is certainly feasible, but I am a bit vary of dynamically reconfiguring the priorities in this way, because it makes the engine harder to profile, debug and reason about. Instead I have chosen a simpler solution for now. Each job is given a priority and the highest priority jobs are performed first. The priorities are not fixed by the engine but configured per-game to match its typical performance loads.
This seems like a resonable first approach. When we have more actual game performance data it would be interesting to compare this with the performance of a completely dynamic scheduler.
In the current implementation, all tasks are posted to and fetched from a global task queue. There are no per thread task queues and thus no task stealing. At our current level of task granularity (heavy jobs are split into a maximum of 5 * thread_count tasks) the global task queue should not be a bottleneck. And a finer task granularity won't improve core utilization. When we start to have >32 cores the impact of the global queue may start to become significant, but until then I'd rather keep the system as simple as possible.
OS context switching still hits us occasionally in this system. For example one of the animation blending tasks in the profiler screenshot takes longer than it should:
I have an idea for minimizing the impact of such context switches that I may try out in the future. If a task is purely functional (idempotent) then it doesn't matter how many times we run the task. So if we detect a situation where a large part of the system is waiting for a task on the critical path (that has been switched out by the OS) we can allocate other threads to run the same task. As soon as any of the threads has completed the task we can continue.
I haven't implemented this because it complicates the model by introducing two different completion states for tasks. One where some thread has completed the task (and dependent jobs can run) and another where all threads that took on the task have completed it (and buffers allocated for the task can be freed). Also, context switching is mainly a problem on PC which isn't our most CPU constrained platform anyway.
A task in the new implementation is a simple data structure:
Here work is a work item to be performed on an SPU, CPU or GPU. affinity can be set for items that must be performed on particular threads.
parent specifies child/parent relationships between tasks. A task can have any number of children/subtasks. A task is considered completed when its work has been executed and all its children has completed. In practice this is implemented by the open_work_items counter. The counter is initially set to the number of child tasks + 1 (for the task's own work item). When a task completes, it reduces the open_work_items count of its parent and when that figure reaches zero, the parent work is completed.
I do not explicitly track completed task. Instead I keep a list of all open (i.e. not completed) tasks. Any task that is not in the open list is considered completed. Note that the open list is separate from the queue of work items that need to be performed. Items are removed from the queue when they are scheduled to a worker thread and removed from the open list when they have completed.
The dependency field specifies a task that the task depends on. The task is not allowed to start until its dependency task has completed. Note that a task can only have a single dependency. The reason for this is that I wanted the task structure to be a simple POD type and not include any arrays or other external memory references.
Having a single dependency is not a limitation, because if we want to depend on more than one task we can just introduce an anonymous task with no work item that has all the tasks we want to depend on as children. That task will complete when all its children has completed, so depending on that task gives us the wanted dependencies.
The priority field specfies the importance of the task. When several tasks are available, we will pick the one with the highest priority. I will discuss this a bit more in a minute.
The Task Manager has a number of threads for processing tasks. Some of these are "main threads" that are created by other parts of the system and registered with the thread manager (in our case, an update thread and a render thread). The rest are worker threads created internally by the task manager. The number of worker threads is:
worker_thread_count = number_of_cores - main_thread_count
The total number of threads managed by the task manager thus equals the number of cores in the system, so we have no over- or undersubscription.
The worker threads are in a constant loop where they check the task manager for work items to perform. If a work item is available, they perform it and then notify the task manager of its completion. If no work items are available, they sleep and are woken by the task manager when new work items become available.
The main threads run their normal serial code path. As part of that code path, they can create tasks and subtasks that get queued with the task manager. They can also wait() for tasks to complete. When a thread waits for a task it doesn't go idle. Instead it loops and helps the task manager with completing tasks. Only when there are no more tasks in the queue does the thread sleep. It wakes up again when there are more tasks to perform or when the task it originally waited for has completed.
The main threads can also process tasks while waiting for other events by calling a special function in the task manager do_work_while_waiting_for(Event &). For example, the update thread calls this to wait for the frame synchronization event from the render thread.
This means that all task manager threads are either running their serial code paths or processing jobs -- as long as there are jobs to perform and they don't get preempted by the OS. This means that as long as we have lots of jobs and few sync points we will achieve 100 % core utilization.
This approach also allows us to freely mix serial code with a completely task based approach. We can start out with a serial main loop (with data parallelization in the update() functions):
void World::update()
{
_animation->update()
_scene_graph->update();
_gui->update();
render();
_sound->update();
}
And gradually convert it to fully braided parallelism (this code corresponds to the task graph shown above):
void World::update()
{
TaskId animation = _tasks->add( animation_task(_animation) );
TaskId scene_graph = _tasks->add( scene_graph_task(_scene_graph) );
_tasks->depends_on(scene_graph, animation);
TaskId gui = _tasks->add( gui_task(_gui) );
TaskId gui_scene = _tasks->add_empty();
_tasks->add_child(gui_scene, scene_graph);
_tasks->add_child(gui_scene, gui);
TaskId render = _tasks->add( render_task(this) );
_tasks->depends_on(render, gui_scene);
TaskId sound = _tasks->add( sound_update_task(_sound) );
TaskId done = _tasks->add_empty();
_tasks->add_child(done, render);
_tasks->add_child(done, sound);
_tasks->wait(done);
}
Note that tasks, subtasks and dependencies are created dynamically as part of the execution of serial code or other tasks. I believe this "immediate mode" approach is more flexible and easier to work with than some sort of "retained" or "static" task graph building.
A screenshot from our profiler shows this in action for a scene with 1000 animated characters with state machines:
Notice how the main and render threads help with processing tasks while they are waiting for tasks to be completed.
Once we have a task graph we want to make sure that our scheduler runs it as fast possible. Theoretically, we would do this by finding the critical path of the graph and making sure that tasks along the critical path are prioritized over other tasks. It's the classical task scheduling problem.
In a game, the critical path can vary a lot over different scenes. Some scenes are render bound, others are CPU bound. Of the CPU bound scenes, some may be bounded by script, others by animation, etc.
To achieve maximum performance in all situations we would have to dynamically determine the critical path and prioritize the tasks accordingly. This is certainly feasible, but I am a bit vary of dynamically reconfiguring the priorities in this way, because it makes the engine harder to profile, debug and reason about. Instead I have chosen a simpler solution for now. Each job is given a priority and the highest priority jobs are performed first. The priorities are not fixed by the engine but configured per-game to match its typical performance loads.
This seems like a resonable first approach. When we have more actual game performance data it would be interesting to compare this with the performance of a completely dynamic scheduler.
In the current implementation, all tasks are posted to and fetched from a global task queue. There are no per thread task queues and thus no task stealing. At our current level of task granularity (heavy jobs are split into a maximum of 5 * thread_count tasks) the global task queue should not be a bottleneck. And a finer task granularity won't improve core utilization. When we start to have >32 cores the impact of the global queue may start to become significant, but until then I'd rather keep the system as simple as possible.
OS context switching still hits us occasionally in this system. For example one of the animation blending tasks in the profiler screenshot takes longer than it should:
I have an idea for minimizing the impact of such context switches that I may try out in the future. If a task is purely functional (idempotent) then it doesn't matter how many times we run the task. So if we detect a situation where a large part of the system is waiting for a task on the critical path (that has been switched out by the OS) we can allocate other threads to run the same task. As soon as any of the threads has completed the task we can continue.
I haven't implemented this because it complicates the model by introducing two different completion states for tasks. One where some thread has completed the task (and dependent jobs can run) and another where all threads that took on the task have completed it (and buffers allocated for the task can be freed). Also, context switching is mainly a problem on PC which isn't our most CPU constrained platform anyway.
Friday, February 12, 2010
The Blob and I
Having resource data in a single binary blob has many advantages over keeping it in a collection of scattered objects:
The problem with pointer patching is solved in the simplest way possible -- I don't use pointers in the resource data. Instead, I just use offsets all the time, both in memory and on disk. For example, the resource data for our particle systems looks something like this (simplified):
- Shorter load times. We can just stream the entire blob from disk to memory.
- Cache friendly. Related objects are at close locations in memory.
- DMA friendly. An entire blob can easily be transferred to a co-processor.
In past engines I've used placement new and pointer patching to initialize C++ objects from a loaded blob. To save a resource with this system all the objects are allocated after each other in memory, then their pointers are converted to local pointers (offsets from the start of the blob). Finally all the allocated data is written raw to disk.
When loading, first the raw data blob is loaded from disk. Then placement new is used with a special constructor to create the root object at the start of the blob. The constructor takes care of pointer-patching, converting the offsets back to pointers. Let's look at an example:
class A
{
int _x;
B *_b;
public:
A(int x, B *b) : _x(x), _b(b) {}
A(char* base) {
_b = (B*)( (char *)_b + (base - (char *)0) );
new (_b) B(base);
}
...
};
...
A *a = new (blob) A(blob);
Note that the constructor does not initialize _x. a is placement new:ed into an area that already contains an A object with the right value for _x (the saved value). By not initializing _x we make sure that it keeps its saved value. The constructor does three things:
- Initializes the vtable pointer of a. This is done "behind the scenes" by C++ when we call new. It is necessary for us to be able to use a as an A object, since the vtable pointer of A saved in the file during data compilation will typically not match the vtable pointer of A in the runtime.
- Pointer patches _b, converting it from an offset from the blob base to its actual memory location.
- Placement new:s B into place so that B also gets the correct vtable, patched pointers, etc. Of course B's constructor may in turn create other objects.
Like many "clever" C++ constructs this solution gives a smug sense of satisfaction. Imagine that we are able to do this using our knowledge of vtables, placement new, etc. Truly, we are Gods that walk the earth!
Of course it doesn't stay this simple. For the solution to be complete it must also be able to handle base class pointers (call a different new based on the "real" derived class of the object, which must be stored somewhere), arrays and collection classes (we can't use std::vector, etc because they don't fit into our clever little scheme).
Lately, I've really come to dislike these kinds of C++ "framework" solutions that require that every single class in a project conform to a particular world view (implement a special constructor, a special save() function, etc). It tends to make the code very coupled and rigid. God forbid you ever had to change anything in the serialization system, because now the entire WORLD depends on it. The special little placement constructors creep in everywhere and pollute a lot of classes that don't really want to care about serialization. This makes the entire code base complicated and ugly.
Also, it should be noted that naively "blobbing" a collection of scattered objects by just concatenating them in memory does not necessarily lead to optimal memory access patterns. If the memory access order does not match the serialization order there can still be a lot of jumping around in memory. The serialization order with this kind of solution tends to be depth-first and can be tricky to change. (Since the entire WORLD depends on the serialization system!)
In the BitSquid engine I use a much simpler approach to resource blobs. The BitSquid engine is data-centric rather than class-centric. The data design is done first -- laid out in simple structs, optimized for the typical access patterns and DMA transfers. Then functions are defined that operate on the data. Classes are used to organize higher level systems, not in the low level processing intensive systems or resource definitions. Inheritance is very rarely used. (Virtual function calls are always cache unfriendly since they resolve to different code locations for each object. It is better to keep objects sorted by type and then you don't really need virtual calls.)
I believe this "old-school" C-like approach not only gives better performance, but also in many cases a better design. A looser coupling between data and processing makes it easier to modify things and move them around. And deep, bad inheritance structures are the main source of unnecessary coupling in C++ programs.
Since the resource data is just simple structs, not classes with virtual functions, we can just write it to disk and read it back as we please. We don't need to initialize any vtable pointers, so we don't need to call new on the data.
The problem with pointer patching is solved in the simplest way possible -- I don't use pointers in the resource data. Instead, I just use offsets all the time, both in memory and on disk. For example, the resource data for our particle systems looks something like this (simplified):
Yes, having offsets in the resource data instead of pointers means that I occasionally need to do a pointer add to find the memory location of an object. I'm sure someone will balk at this "unnecessary" computation, but I can't see it having any significant performance impact whatsoever. (If you have to do it a lot, then you are jumping around in memory a lot and then that is the main source of your performance problem.)
The advantage is that since I'm only storing offsets I don't need to do any pointer patching at all. I can move the data around in memory as I like, make copies of it, concatenate it to other blobs to make bigger blobs, save it to disk and read it back with a single operation and no need for pre- or post-processing. There is no complicated "serialization framework". No system in the engine needs to care about how any other system stores or reads it data.
As in many other cases the data-centric approach gives a solution that is simpler, faster, more flexible and more modular.
As in many other cases the data-centric approach gives a solution that is simpler, faster, more flexible and more modular.
Tuesday, January 19, 2010
Content Repositories and Databases
I've been toying with the idea of replacing game content repositories (Perforce, Subversion) with something else. After all, nobody really likes content repositories -- they are slow, non-intuitive, give rise to merge problems, etc. Version control systems were primarily designed for code, not for content, and that shows. So what could replace them? One option is to use a central database. There are a number of superficial advantages to that approach:
- Simpler -- no need to update or check-in.
- Changes are immediately visible to everyone.
- No merge issues.
- Collaborative editing (several designers working on the same level) is possible.
But we would loose all the nice features of version control:
- Accountability, history tracking and reversion.
- Branching and tagging.
- Having local, uncommitted changes in a working copy.
How necessary are those features? I would say that they are essential. But I also have a small nagging doubt that maybe this opinion is just the result of my own prejudices as a programmer. After all, people in many industries do lots of serious collaborative work using databases without branching, reversion or working copies. Still, I'm not ready to take the plunge and give up on version control features. (Though if anyone has tried it, I would certainly like to hear about it.)
Having those features by necessity implies some of the complexities associated with version control. For example, if we want a local working copy we need some explicit check-in/update mechanism. If we don't need a local copy we can just make the editor do svn update, svn commit on each change and the repository will be as "immediate" as a database.
Collaborative editing depends more on how the editor is implemented than on the storage backend. Regardless of whether we are using a database or a repository the editor will at some point have to fetch and display the changes made by other users as well as submit the changes made by the local user. With a repository backend, svn update and svn commit could be used for that purpose.
The only issue then is to avoid merge conflicts as much as possible, since they force the user to interact with the svn update command and ruin the collaborative editing experience. Fortunately, that should be relatively easy. At BitSquid, we store most of our data in JSON-like structures. With a JSON-aware 3-way-merger, conflicts will only arise if the same field in the same JSON-object is changed, which should happen rarely.
So, no great new way of storing content. Instead I just have to write a 3-way JSON-merger to protect the content people from merge conflicts. And then start working on the collaborative level editor...
Subscribe to:
Posts (Atom)








