Friday, April 20, 2012

Inheriting Velocity in Ragdolls

After a slew of abstract articles about C++ and code structuring I'd like to get back to some more meaty game engine stuff. So today I'll talk about ragdolls. In particular, how to preserve the momentum of animated objects, so that when you switch over to the ragdoll it continues to stumble forward in the same direction that the animation was moving, before crashing to a gruesome death.

So this is a small, but important problem. We want to somehow get the velocities of the animated objects and then apply them to the bodies in the ragdoll. The only snag is that animated objects typically don't know anything about velocities. Also, we need some way of matching up the physics bodies with the animated objects.

First, some background information. In the Bitsquid engine, physics, scene graph and animation are completely separate systems. We strongly believe in minimizing the couplings between different systems since that makes the engine easier to understand, reason about, modify, optimize and rewrite.

  • The physics system simulates a number of bodies, possibly connected by joints.

  • The scene graph handles local-to-world transforms for a collection of nodes in a hierarchy.

  • The animation system evaluates and blends animation curves for bones.

Bones and bodies hold references (just integer indices, really) to nodes in the scene graph and this how the systems communicate. After the animation has been evaluated, the resulting local transforms are written to the bones' nodes in the scene graph.

For keyframed physics (animated hit bodies), the animation drives the physics, which means the physics' bodies will read their world transforms from the corresponding nodes in the scene graph. For ragdolled physics, the world transforms of the bodies are written to the scene graph after the simulation has completed.

For partial ragdolls (such as a non-functioning, but still attached limb) or powered ragdolls (ragdolls driven by motors to achieve animation poses) it gets a little more involved (perhaps a topic for a future post), but the basic setup is the same.

Given this setup there are two ways of calculating the animation velocities:

  • We can calculate the velocities directly by differentiating the animation curves.

  • We can record a node's transform at two different time steps and compute the velocity from the difference.

The first approach is doable, but not very practical. Not only do we have to differentiate all the animation curves, we must also take into account how those velocities are affected by the blend tree and local-to-world transforms. And even if we do all that, we still don't account for movements from other sources than animation, such as scripted movements, IK or interactions with the character controller.

The second option is the more reasonable one. Now all we need is a way of obtaining the transforms from two different time steps. There are a number of possible options:

  • We could add an array of Matrix4x4:s to our scene graph's last_world where we store the last world transform of every object. So whenever we want to go to ragdoll we always have a last_world transform to calculate velocities from.

  • We could simulate the character backwards in time when we want to go to ragdoll and obtain a last_world transform that way.

  • We could delay the transition to ragdoll one frame, so that we have enough time to gather two world transforms for computing the velocity.

The first approach is conceptually simple, but costly. We are increasing the size of all our scene graphs by about 50 % (previously they contained local and world transforms, now they will also need last_world). In addition we must memcpy(last_world, world) before we compute new world transforms. That's a significant cost to pay all the time for something that happens very seldom (transition to ragdoll).

The second appraoch sounds a bit crazy, but some games actually already have this functionality. Servers in competetive multi-player fps games often need to rewind players in time in order to accurately determine if they were able to hit each other. Still, I find the approach to be a bit too complicated and invovled just to get a velocity.

The third aproach seems simple and cheap, but it violates one of our Bitsquid principles: Thou Shalt Not Have Any Frame Delays. Delaying something a frame can be a quick fix to many hairy problems, but it puts your game in a very weird transitional state where it at the same time both is and isn't (yet) something. The character isn't really a ragdoll yet, but it will be the next frame, whether I want to or not.

This new slightly self-contradictory state invites a host of bugs and before you know it, little logic pieces will start to seep into the code base "do this unless you are in the special transition-to-ragdoll state". Congratulations, you have just made your codebase a lot more complicated and bug prone.

If this is not enough, consider the poor sucker who just wants to write a routine that does A, B, C and D, when A, B and C requires frame delays. Suddenly what was supposed to be simple function got turned into a state machine that needs to run for four frames to produce it result.

The simple rule that actions should take place immediately protects against such insanity.

So three options, none of them especially palpable.

I actually went with the first one, to always compute and store last_world in the scene graph, but with a flag so that this is only used for units that actually need it (characters that can go to ragdoll). When there is no clear winner, I always pick the simplest solution, because it is a lot easier to optimize later if the need should arise. (We could for example track last_world only for the nodes which have a corresponding ragdoll actor. Also we could store last_world as (p,q) instead of as a matrix.)

For completion, given the two transforms, the code for compting the velocities will look something like this:

Vector3 p0 = translation(tm_0);
Vector3 p1 = translation(tm_1);
Vector3 velocity = (p1 - p0) / dt

Quaternion q0 = rotation(tm_0);
Quaternion q1 = rotation(tm_1);
Quaternion q = q1 * inverse(q0);
AxisAngle aa = q.decompose();
Vector3 angular_velocity = aa.axis * aa.angle / dt;

Wednesday, March 21, 2012

PIMPL vs Pure Virtual Interfaces

In C++, separating the interface (public declarations) of a class from its implementation (private methods, members and definitions) serves several useful purposes:

  • Implementation details are hidden, making interfaces easier to read and understand.

  • Smaller header files with fewer dependencies means faster compile times.

  • A weaker coupling between the interface and the implementation gives greater freedom in reorganizing and refactoring the implementation internals.

In pure C, we can achieve this separation by using a pointer to a forward declared struct:

struct SoundWorld;
typedef unsigned SoundInstanceId;

SoundWorld *make_sound_world();
void destroy_sound_world(SoundWorld *world);
SoundInstanceId play(SoundWorld *world, SoundResource *sound);
void stop(SoundWorld *world, SoundInstanceId id);

The struct is opaque to the users of the API. The actual content is defined in the .cpp file:

struct SoundWorld {
    SoundInstance playing_instances[MAX_PLAYING_INSTANCES];
    Matrix4x4 listener_pose;
    ...
};

C++ programmers are often recommended to use the PIMPL idiom (pointer to implementation) to achieve the same thing:

class SoundWorldImplementation;

class SoundWorld
{
public:
    typedef unsigned InstanceId;

    SoundWorld();
    ~SoundWorld();

    InstanceId play(SoundResource *sound);
    void stop(InstanceId id);

private:
    SoundWorldImplementation *_impl;
};

Here, SoundWorld is the external interface of the class. All the messy stuff: instance variables, private methods, etc is found in the SoundWorldImplementation class, which is in the .cpp file.

The _impl pointer is created in the constructor and calls to the methods in SoundWorld are forwarded to the implementation object via method stubs:

SoundWorld::SoundWorld()
{
    _impl = new SoundWorldImplementation();
}

InstanceId SoundWorld::play(SoundResource *sound)
{
    return _impl->play(sound);
}

Another solution to the same problem is to write the interface as an abstract, pure virtual class in the .h file and then create the implementation as a subclass in the .cpp file.

You don't see this solution recommended as much (at least not as a solution to this particular problem), but I actually like it better. With this approach, the header file will look something like this:

class SoundWorld
{
public:
    typedef unsigned InstanceId;

    virtual ~SoundWorld() {}
    virtual InstanceId play(SoundResource *sound) = 0;
    virtual void stop(InstanceId id) = 0;

    static SoundWorld *make(Allocator &a);
    static void destroy(Allocator &a, SoundWorld *sw);
};

Note that since the class is now abstract, we cannot create actual instances of it, to do that we need the factory functions make() and destroy(). I've added an allocator parameter for good measure, because I always want to specify explicit allocators for all memory operations.

The corresponding .cpp file looks something like:

class SoundWorldImplementation : public SoundWorld
{
    friend class SoundWorld;

    SoundInstance _playing_instances[MAX_PLAYING_INSTANCES];
    Matrix4x4 _listener_pose;

    SoundWorldImplementation()
    {
        ...
    }

    virtual InstanceId play(SoundResource *sound) 
    {
        ...
    }

    virtual void stop(InstanceId) 
    {
        ...
    }
};

SoundWorld *SoundWorld::make(Allocator &a)
{
    return a.make<SoundWorldImplementation>();
}

SoundWorld *SoundWorld::destroy(Allocator &a, SoundWorld *sw)
{
    return a.destroy<SoundWorldImplementation>(sw);
}

The reason why most people recommend the PIMPL approach is that it has some distinct advantages:

  • Factory functions are not needed, you can use new(), delete() or create objects on the stack.

  • The SoundWorld class can be subclassed.

  • The interface methods are not virtual, so calling them might be faster. (On the other hand, we need an extra memory fetch to get to the implementation object.)

  • PIMPL can be introduced in an existing class without changing its external interface or its relation to other classes.

For my use cases, none of these advantages matter that much. Since I want to supply my own allocators, I'm not interested in new and delete. And I only use this for "big" objects, that are always heap (rather than stack) allocated.

I don't make much use of implementation inheritance. In my opinion, it is almost always a bad design decision that leads to strongly coupled code and hard to follow code paths. Inheritance should be limited to interface inheritance.

The performance issue of virtual calls is not a huge issue, since I only use this for "big" objects (Systems and Managers). Also, I design the API so that the number of API calls is minimized. I.e., instead of a function:

void set_sound_position(InstanceId id, const Vector3 &pos);

I have:

void set_sound_positions(unsigned count, const InstanceId *ids, const Vector3 *positions);

This reduces the virtual call overhead, but also has additional benefits, such as being DMA friendly and allowing for parallelization and batch optimizations.

In the words of Mike Acton: Where there's one, there's more than one.

The abstract class method has some advantages of its own:

  • Cleaner code and a lot less typing, since we don't have to write forwarding stubs for the methods in the public interface.

  • Multiple classes can implement the same interface. We can statically or dynamically select which particular implementation we want to use, which gives us more flexibility.

To me, not having to write a ton of stupid boilerplate cruft is actually kind of a big deal. I know some people don't mind boilerplate. It's just a little extra typing, they say. Since there is nothing complicated or difficult in the boilerplate code, it doesn't pose a problem. Programmers are not limited by typing speed, so how much you have to type doesn't matter.

I don't agree at all. In my view, every line of code is a burden. It comes with a cost that you pay again and again as you write, read, debug, optimize, improve, extend and refactor your code. For me, the main benefit of "higher-level" languages is that they let me do more with less code. So I'm happy to pay the overhead of a virtual call if it saves me from having 150 lines of idiotic boilerplate.

A nice thing about the interface and implementation separation is that it gets rid of another piece of hateful C++ boilerplate: method declarations (hands up everybody who enjoys keeping their .h and .cpp files synchronized).

Methods defined inside a C++ class do not have to be declared and can be written in any order. So if we want to add helper methods to our implementation class, that are not part of the public interface, we can just write them anywhere in the class:

class SoundWorldImplementation : public SoundWorld
{
    virtual InstanceId play(SoundResource *resource) {
        InstanceId id = allocate_id();
        ...
    }

    // A private method - no declaration necessary.
    InstanceId allocate_id() {
        ...
    }
};

It's interesting that this small, purely syntactical change -- getting rid of method declarations -- makes a significant different in how the language "feels". At least to me.

With this approach, adding a helper method feels like "less work" and so I'm more inclined to do it. This favors better structured code that is decomposed into a larger number of functions. More like Smalltalk than traditional C (home of the mega-method). The Sapir-Worf hypothesis appears to hold some merit, at least in the realm of programming languages.

Another interesting thing to note is that the pure C implementation of opaque pointers stacks up pretty well against the C++ variants. It is simple, terse and fast (no virtual calls, no forwarding functions).

Every year I'm a little more impressed by C and a little more depressed by C++.

Sunday, March 4, 2012

Caring by Sharing: The Bitsquid Documentation System

In a previous article I talked a bit about our documentation system. It has now solidified into something interesting enough to be worth sharing.

The system consists of a collection of Ruby files that read input files (with extension .bsdoc) written in a simple markup language:

# Header

Some text.

* And
* A list

and converts them to HTML:

<h1>Header</h1>

<p>Some text.</p>

<ul>
 <li><p>And</p></li>
 <li><p>A list</p></li>
</ul>

We then use the HTML Help Compiler to convert the help files to .chm.

You can find the repository at:

Motivation

Why have we created our own markup system instead of just using an existing one? (Markdown, Textile, RDoc, POD, Restructured Text, Doxygen, BBDoc, Wikimedia, Docbook, etc.)

For two reasons. First, none of these existing systems work exactly the way that we want.

An example. A large part of our documentation consists of Lua interface documentation. To make that as easy to possible to write, we use a special tag @api to enter an API documentation mode. In that mode, each unindented line documents a new Lua function. The indented lines that follow contain the documentation for the function.

## Application (singleton)

Interface to access global application functionality. Note that since the
application is a singleton (there is only one application), you don’t need
to pass any %r Application object to the application functions. All the
functions operate on the application singleton.

@api

resolution() : width, height
 Returns the screen resolution.
 
argv() : arg1, arg2, arg3, ...
 Returns the command line arguments supplied to the application.

The documentation system recognizes the Lua function definitions and formats them appropriately. It also creates index entries for the functions in the .chm file. In addition, it can create cross-references between classes and functions (with the %r marker).

No out-of-the-box system can provide the same level of convenience.

In any documentation system, the documentation files are the most valuable resource. What really matters is that documentation is easy to write and easy to modify. In particular, my main concerns are:

  • Preserving semantic information.

  • Avoiding unnecessary markup and clutter.

By preserving semantic information I mean that we should be able to say, for example, that something is a Lua function definition, or a piece of sample C++ code, rather than just saying that something is italic or preformatted. If we have enough semantic information, we can do all kinds of things to the data in post-processing. We can parse the function definition using a Lua parser, or run the C++ code through a syntax highlighter. We can convert the files to some other format if we ever decide to switch documentation system.

If the documentation format doesn't preserve semantic data, there is no way of getting that data back, except by going through all the documentation and adding it manually. That's painful.

Avoiding markup and clutter is all about making the documents easy to write and easy to modify. That's the whole point of using a markup language (instead of plain HTML) in the first place.

Our custom markup language lets us achieve both these goals in a way that no off-the-shelf solution could.

The second reason for writing our own system is that there is no fundamentally hard problem that the existing systems solve. If they did something really advanced that would take us months to duplicate, then it might be better to use an existing system even if it wasn't perfectly adapted to our needs. But parsing some text and converting it to HTML isn't hard. The entire documentation system is just a few hundred lines of Ruby code.

(In contrast, Doxygen actually does solve a hard problem. Parsing general C++ code is tricky. That's why we use Doxygen to document our C++ code, but our own system for stand-alone documentation.)

The System Design

If I've done my job and convinced you that the best thing to do is to write your own documentation system, then what's the point of sharing my code with you?

Well, the system we use consists of two parts. One part (the bulk of it) is generic and can be used to implement any markup language. The rules that are specific to our markup language are all kept in a single file (bsdoc.rb). To write your own documentation system, you could re-use the generic parts and just write your own markup definition.

The generic part of the system consists of four files:

paragraph_parser.rb

Parses the paragraphs of a document into block-level HTML code.

span_parser.rb

Does span-level parsing inside a HTML block.

generator.rb

Generates the output HTML.

toc.rb

Adds section numbering and a table of contents to an HTML file.

Most of the code is pretty straight forward. A rule set is a collection of regular expressions. The expressions are tested in turn against the content and the first one that matches is applied. There are separate rules for parsing the document on the block level (the ParagraphParser) and inside each line (the SpanParser).

There are some ideas in the system that I think are interesting enough to mention though:

Line-by-line parsing

On the paragraph level, the document is parsed line-by-line. Each rule regex is tested in turn and the first one that matches is applied. This ensures that the process is speedy for all kinds of input (O(N) in the number of lines). It also makes the system simpler to reason about.

No intermediate representation

The system does not build any intermediate representation of the document. It is converted directly from the .bsdoc source format to HTML. This again simplifies the system, because we don't have to device an intermediate representation for all kinds of data that we want to handle.

HTML "contexts" for lines

When a rule is applied, it doesn't write raw HTML code to the output. Instead, it gives the generator a piece of text and a list of tags that should be applied to it. I call this the "context" of the text.

env.write(%w(ul li p), "Hi!")

The generator will add tags as appropriate to ensure that the line is printed in the right context:

<ul><li><p>Hi!</p></li></ul>

When several lines are printed, the generator only opens and closes the minimum number of tags that are necessary to give each line the right context. It does this by matching the list of contexts for neighboring lines:

This:

env.write(%w(ul li p), "First item!")
env.write(%w(ul li p), "First paragraph!")
env.write(%w(ul li), nil)
env.write(%w(ul li p), "First item, second paragraph!")
env.write(%w(ul), nil)
env.write(%w(ul li p), "Second item!")

ends up as:

<ul>
 <li>
  <p>
   First item!
   First paragraph!
  <p>
  <p>First item, second paragraph!</p>
 </li>
 <li><p>Second item!</p></li>
</ul>

Note the trick of writing nil to explicitly close a scope.

Since I really, really hate badly formatted HTML documents, I've made sure that the output from the generator looks (almost) as good as hand-written HTML.

Using contexts in this way gets rid of a lot of the complexities of HTML generation. When we write our rules we don't have to think about opening and closing tags, we just have to make sure that we use an appropriate context for each line.

Nested scopes

The final idea is to automatically handle nested markup by applying the rules recursively. Consider this input document:

* Caught in the jungle
 * By a bear 
 * By a lion
 * By something else
* Caught in the woods

I don't have any special parsing rules for dealing with nested lists. Instead, the first line of this document creates a scope with the context %w(ul li). That scope is applied to all indented lines that follow it. The system strips the indentation from the line, processes it using the normal rule set, and then prepends %w(ul li) to its context. When it reaches a line without indentation, it drops the scope. Scopes can be stacked for multiple levels of nesting.

This way we can deal with arbitrarily complex nested structures (a code sample in a list in a blockquote) without any special processing rules.

A Bonus for AltDevBlogADay Writers

As a bonus for my fellow AltDevBlogADay writers I've added a syntax module for writing AltDevBlogADay articles. It converts source documents to a format suitable for publishing on AltDevBlogADay. (This includes taking care of the tricky <pre> tags.)

There is also a package for Sublime Text 2 (my favorite text editor) that gives you syntax highlighting and a build command for converting a document to HTML and previewing it in a browser. I'm currently writing all my AltDevBlogADay articles in this way.

(This article has also been posted to The Bitsquid blog.)

Monday, February 20, 2012

Sensible Error Handling — Part 3

In my epic trilogy on sensible error handling I've arrived at the third and final category of errors -- warnings.

Warnings happen when the user does something that is kinda sorta bad, but not exactly wrong per se. It can be things like having two nodes with the same name in an entity's scene graph, a particle effect with 1 000 000 particles or a 4096 x 4096 texture mapped to a 3 x 3 cm flower.

Not necessarily wrong -- perhaps there will be a sequence where the player is miniaturized and has to walk on the surface of the flower, fighting off hostile pollen -- but definitely fishy.

The problem with warnings is that they are so easy to ignore. When a project has hundreds of warnings that scroll by every time you start it, no one will pay any particular attention to them and no one will notice a new one.

But then of course, if warnings are not easy to ignore, everyone on the project will have to spend a lot of their valuable time ignoring them.

So the real problem, as in so many cases, is that we don't really know what we want. We want warnings to be both hard to ignore and easy to ignore. We can't get good warnings in our tools without resolving this conflict in our minds.

Types of warnings

To progress we need more information. We need to think about what kind of warnings there are and how we want to handle them.

In the Bitsquid engine, our warnings can be classified into three basic types:

  • Performance warnings

  • Suspicion warnings

  • Deprecation warnings

Performance warnings occur when the user does something that is potentially bad for performance, such as using a texture without a MIP chain or loading a 300 MB sound file into memory.

Suspicion warnings occur when we detect other kinds of suspicious behavior and want to ask the user "did you really mean to do X?". An example might be defining a font without any glyphs. It is not exactly an error, but it is not very useful either, and most likely, not what the user wanted.

Deprecation warnings, finally, are warnings that really should be errors. We want all our data to follow a particular rule, but we have too much legacy data to be able to strictly enforce it.

A typical example might be naming conventions. We may want to force all nodes in a scene graph to have unique names or all mesh names to start with mesh_, but unless we laid down that rule at the start of the project it might be too much work to fix all the old data.

Another example of deprecation is when a script function is deprecated. We may want to get rid of the function AudioWorld.set_listener(pos) (because it assumes that there is only one listener in the world) and replace it with AudioWorld.set_listeners(table), but there is a lot of script code that already uses set_listener and no time to rewrite it.

As for when warnings should be shown, I think there are only two times when you really care about warnings:

  • When you are working with a particular object (unit, mesh, level, sound, etc), you want to see all the warnings pertaining to that object.

  • When you are doing a review of the game (e.g., a performance review), you want to look through all warnings pertaining to the aspect that you are reviewing (in this case, all performance warnings).

Armed with this information, we can come up with some useful strategies for dealing with warnings.

Treat warnings as errors

Errors are a lot easier to deal with than warnings, at least if you adhere to the philosophy of "asserting on errors" that was outlined in the first part of this series. An error is always an error, it doesn't require a judgement call to determine whether it is right or wrong. And since the engine doesn't proceed until the error has been fixed, errors get fixed as soon as possible (usually before the content is checked in, and in the rare occasions when some one checks in bad data without test running it -- as soon as someone else pulls). Once the error is fixed it will never bother us again.

In contrast, warnings linger and accumulate into an ever expanding morass of hopelessness.

So, one of the best strategies for dealing with warnings is to make them errors. If there is any way you can convert the warning into an error, do that. Instead of warning if two nodes in a scene graph have the same name, make it an error. Instead of warning when an object is set to be driven by both animation and physics, make it an error.

Of course, when we want to make an error of something that was previously just a warning, we run into the deprecation problem.

Ideas for deprecation warnings

The strategy for deprecation warnings is clear. We want to get rid of them and treat them as "real errors" instead. This gives us cleaner data, better long term maintainability and cleaner engine code (since we can get rid of legacy code paths for backward compatibility).

Here are some approaches for dealing with deprecation, in falling order of niceness:

1. Write a conversion script

Write a conversion script that converts all the old/deprecated data into the new format. (An argument for keeping your source data in a nice, readable, script-friendly format, such as JSON.)

This is by far the nicest solution, because it means you can just run the script on the content to patch it up, and then immediately turn the warning into an error. But it does require some programming effort. (And we programmers are so overworked, couldn't an artist/slave spend three weeks renaming the 12 000 objects by hand instead?)

Of course, sometimes this approach isn't possible. I.e., when there is no nice 1-1 mapping from the current (bad) state to the desired (good) state.

One thing I've noticed though, is that we programmers can have a tendency to get caught up in binary thinking. If a problem can't be solved for every possible edge case we might declare it "theoretically unsolvable" and move on to other things. When building stable systems with multiple levels of abstractions, that is a very sound instinct (a sort function that works 98 % of the time is worse than useless -- it's dangerous). But when it comes to improving artist workflows it can lead us astray.

For example, if our script manages to rename 98 % of our resources automatically and leaves 2 % tricky cases to be done by hand, that means we've reduced the workload on the artist from three weeks to 2.5 hours. Quite significant.

So even if you can't write a perfect conversion script, a pretty good one can still be very helpful.

2. Implement a script override

This is something I've found quite useful for dealing with deprecated script functions. The idea is that when we want to remove a function from the engine API, we replace it with a scripted implementation.

So when we replace AudioWorld.set_listener() with AudioWorld.set_listeners(), we implement AudioWorld.set_listener() as a pure Lua function, using the new engine API:

function AudioWorld.set_listener(pos)
 local t = {pos}
 AudioWorld.set_listeners(t)
end

This leaves it up to the gameplay programmers to decide if they want to replace all calls to set_listener() with set_listeners() or if they want to continue to use the script implementation of set_listener().

This technique can be used whenever the old, deprecated interface can be implemented in terms of the new one.

3. Use a doomsday clock

Sometimes you are out of luck and there simply is no other way of converting the data than to fix it by hand. You need the data to be fixed so that you can change the warnings to errors, but it is a fair amount of work and unless you put some pressure on the artists, it just never happens. That's when you bring out the doomsday clock.

The doomsday clock is a visible warning message that says something like:

Inconsistent naming. The unit 'larch_03' uses the same name 'branch' for two different scene graph nodes. This warning will become a hard on error on the 1st of May, 2012. Fix your errors before then.

This gives the team ample time to address the issue, but also sets a hard deadline for when it needs to be fixed.

For the doomsday clock to work you need a producer that is behind the idea and sees the value of turning warnings into errors. If you have that, it can be a good way of gradually cleaning up a project. If not, the warnings will never get fixed and instead you'll just be asked again and again to move the doomsday deadline forward.

4. Surrender

Sometimes you just have to surrender to practicality. There might be too much bad data and just not enough time to fix it. Which means you just can't turn that warning into an error.

But even if you can't do anything about the old data, you can at least prevent any new bad data from entering the project and polluting it further.

One way of doing that is to patch up your tools so that they add a new field to the source data (another argument for using an easily extensible source data format, such as JSON):

bad_name_is_error = true

In the data compiler, you check the bad_name_is_error flag. If it is set, a bad name generates a hard error, if not a warning. This means that for all new data (created with the latest version of the tool) you get the hard error check that you want, but the old data continues to work as before.

Design the tools to avoid warnings

Warnings are generated when the users do stuff they did not intend to. The warnings we see thus tell us something of the mistakes that users typically make, using our tools.

One way of reducing the amount of warnings is to use this information to guide the design of the tools. When we see a warning get triggered we should ask ourselves why the user wasn't able to express her intents and how we could improve our tools to make that easier.

For example, if there are a lot of warnings about particle system overdraw, perhaps our particle system editor could have on screen indicators that showed the amount of overdraw.

There are lot of other ways in which we can improve our tools so that they help users to do the right thing, instead of blaming them for doing wrong.

Put the warnings in the tools

The most useful time to get a warning is when you are working on an object. At that time, you know exactly what you want to achieve, and it is easy to make changes.

It follows then that the best place to show warnings is in the tools, rather than during game play. You may have that as well, to catch any strays that don't get vetted by the tools, but it should not be the first line of defense.

For every tool where it makes sense, there should be a visible warning icon displaying the number of warnings for the currently edited object. For added protection, you could also require the user to check off these warnings before saving/exporting the object to indicate: "yes I really want to do this".

Make a review tool for warnings

Apart from when a particular object is edited, the other time when displaying warnings is really useful is when doing a project review in order to improve performance or quality.

I haven't yet implemented it, but the way I see it, such a tool would analyze all the content in the project and organize the warnings by type. One category might be "Potentially expensive particle systems" -- it would list all particle systems with, say, > 2000 particles, ordered by size. Another category could be: "Possibly invisible units" -- a list of all the units placed below the ground in the levels.

The tool would allow a producer to "tick off" warnings for things that are really OK. Perhaps, the super duper effect really needs to have 50 000 particles. The producer can mark that as valid which means the warning is hidden in all future reviews.

Hiding could be implemented real simply. We could just hash the object name together with the warning message and make sure we don't show that particular message for that particular object again.

Sunday, February 5, 2012

Sensible Error Handling - Part 2

In my last post I wrote that there are three kinds of errors that we game programmers need to deal with:

  • Unexpected errors
  • Expected errors
  • Warnings

An unexpected error is an error that is unlikely happen and that the caller of our API has no sensible way of handling, such as a corrupted internal state, a failed memory allocation, a bad parameter supplied to a function or a file missing from a game disc. I also argued that the best way of dealing with such errors was to crash fast and hard with an assert, to expose the error and avoid "exporting" it in the API.

In this post I'm going to look at the expected errors.

Expected errors


An expected error is an error that we expect to happen and that the caller must have a plan for dealing with. A typical example is an error when fetching a web page or saving data to a memory card (which can be yanked at any moment).

If you are familiar with Java, the distinction between "expected" and "unexpected" errors matches quite closely Java's concept of "checked" and "unchecked" errors. Checked errors are errors that the caller must deal with (or explicitly rethrow). Unchecked errors are errors that the caller is not expected to deal with. They will typically cause a crash or a long jump out to the main loop, for the applications where that makes sense.

My main rule for dealing with expected errors is:

    Minimize the points and types of failures

In other words, just as our APIs abstract functionality -- replacing low-level calls with high-level concepts -- they should also abstract dysfunctionality and replace a large number of low-level failure states with a few high-level ones.

Minimizing the points of failure means that instead of having every function (enumerate(), open(), read(), close, etc) return an error code, we design the API so that errors occur in as few places as possible. This reduces the checks that the caller needs to do and the number of different possible paths through her code.

Minimizing the types of failure means that when we fail we only do it in one of a very small number of well-defined ways. We don't return an int error code that can take on 4 billion different values with vaguely defined, ambiguous and overlapping meanings (quick: what is the difference between EWOULDBLOCK and EAGAIN?).

In most cases true/false is enough (together with a log entry with more details). If the caller needs more information, we can use an enum for that specific function, with a very specific small range of values.

Again, the idea behind all this is to reduce the burden on the caller. If there is only a small number of errors that can happen, it is easy for her to verify that she has all the bases covered.

As an example, a (partial) save game interface may look like:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 enum LoadResult {IN_PROGRESS, COMPLETED, FAILED};

 unsigned num_saved_games();
 LoadId start_loading_game(unsigned i);
 LoadResult load_result(LoadId id);
 Data loaded_data(LoadId id);
 void free_data(LoadId id);
};

Note that there is only a single place where the caller needs to check for errors (in the reply to load_result()). And there is only one possible fail state, either the load completes successfully or it fails.

To except or not to except


Exceptions are often touted as the latest and greatest in error handling, but as you know from my previous post I am not too found of them.

Exceptions can work for unexpected errors. I still prefer to use asserts, but if you are writing a program that cannot crash, an exception can be a reasonable way to get back to the main loop if you reach an unexpected failure state. (It's not the only option though. Lua's pcall() mechanism is an elegant and minimalistic alternative.)

But for the expected errors, the errors that are a part of the API, exceptions have a number of serious problems.

The first is that exceptions do not have to be declared in the API, so if you encounter an API that looks like this:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 class LoadException : public Exception {};

 unsigned num_saved_games();
 LoadId start_loading_game(unsigned i);
 bool load_completed(LoadId id);
 Data loaded_data(LoadId id);
 void free_data(LoadId id);
};

you are immediately faced with a number of questions. Which functions in the API can throw a LoadException? All of them or just some? Do I need to check for it everywhere? Are there any other exceptions that can be thrown, like FileNotFoundException or IJustMadeUpThisException. Should I just catch everything everywhere to be safe?

In my view, this is unacceptable. The errors are an important part of the API. If you don't know what errors can occur and where, you have an incomplete picture of the API. Fine, we can address that with throw-declarations:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 class LoadException : public Exception {};

 unsigned num_saved_games() throw();
 LoadId start_loading_game(unsigned i) throw();
 bool load_completed(LoadId id) throw(LoadException);
 Data loaded_data(LoadId id) throw();
 void free_data(LoadId id) throw();
};

Now the interface is at least well-defined, if a bit cluttered. Note that if you go down this route every single function in your code base should have a throw declaration. Otherwise you are back in no man's land, without any clue about which functions throw exceptions and which don't.

But declaring exceptions can have its drawbacks too. If you require all functions to declare exceptions, a function that just wants to "pass along" some exceptions up the call stack must declare them. This gives the exceptions an infectious tendency. Unless you are careful with your design the high level functions will gather longer and longer lists of exceptions that become harder and harder to maintain. Templates cause additional problems, because you can't know what exceptions a templated object might throw.

These issues have sparked a heated debate in the Java-community about whether checked (declared) exceptions are a good idea or not. C# has chosen not to support exception declarations.

At the heart of the debate is (I think) a confusion about what exceptions are for. Are they for diagnosing and recovering from unforeseen errors, or are they a convenient control structure for dealing with expected errors? By explicitly distinguishing "unexpected errors" from "expected errors" we make these two roles clearer and can thus avoid a lot of the confusion.

Anyways, the declarations are not my only gripe with exceptions. My second issue is that they introduce additional "hidden" code paths, which makes the code harder to read, understand and reason about.

Consider the following piece of code:

if (ss->load_completed(id)) {
 Data data = ss->loaded_data(id);
 ...
}

By just glancing at this code, it is pretty hard to tell that an error in load_completed() will cause it to leave the current function and jump to some other location higher up in the call stack.

When exceptions are used you can't just read the code straight up. You have to consider that at every single line you are looking at, an exception might be raised and the code flow changed.

This leads me to the concept of exception safety. Is your code "exception safe"? I'll go out on a limb and say: probably not. Writing "exception safe" code requires having a mindset where you view every single function in your code base as a "transaction" that can be fully or partially rolled back in the case of an exception. That is a lot of extra effort, especially if you need to do it in every single line in your code base.

It might still be worth it, of course, if exceptions had many other advantages. But as a method for dealing with expected errors, I just don't see those advantages, so I'd rather use my brain cycles for something else.

So what do I propose instead? Error codes!

Yes, yes I know, we all hate error codes, but why do we hate them? As I see it, there are three main problems with using error codes for error reporting:

  1. The code gets littered with error checks, making it hard to read.
  2. Undescriptive error codes lead to confusion about what errors a function can return and what they mean.
  3. Since C functions cannot return multiple values, we cannot both return an error code and a result. If we use error codes, the result must be returned in a parameter, which is inelegant.

I have already addressed the first two points. By designing our API so that errors only happen in a few places, we minimize the checks that are needed. And instead of returning an undescriptive generic error code, we should return a function-specific enum that exactly describes the errors that the function can generate:

enum LoadResult {IN_PROGRESS, COMPLETED, FILE_NOT_FOUND, FILE_COULD_NOT_BE_READ, FILE_CORRUPTED};
LoadResult load_result(LoadId id);

As for the third problem, I don't know why C programmers are so adverse to just putting two values in a struct and returning that. In my opinion, this:

struct Data {const char *p; unsigned len;};
Data loaded_data();

Is a lot nicer than this:

const char *loaded_data(unsigned &len);

Maybe in them olden days, returning 8 bytes on the stack was such a horrible inefficient operation that it caused your vacuum tubes to explode. But clearly, it is time to move on. If you want to return multiple value -- just do it! The "return in parameter" idiom should only be used for types where returning on the stack would cause memory allocation, such as strings or vectors.

This is how you return an error code in 2012:

struct SaveResult {
 enum {NO_ERROR, DISK_FULL, WRITE_ERROR} error;
 unsigned saved_bytes;
};
SaveResult save_result(SaveId id);

In the next and final part of this series I'll look at warnings.