Monday, February 20, 2012

Sensible Error Handling — Part 3

In my epic trilogy on sensible error handling I've arrived at the third and final category of errors -- warnings.

Warnings happen when the user does something that is kinda sorta bad, but not exactly wrong per se. It can be things like having two nodes with the same name in an entity's scene graph, a particle effect with 1 000 000 particles or a 4096 x 4096 texture mapped to a 3 x 3 cm flower.

Not necessarily wrong -- perhaps there will be a sequence where the player is miniaturized and has to walk on the surface of the flower, fighting off hostile pollen -- but definitely fishy.

The problem with warnings is that they are so easy to ignore. When a project has hundreds of warnings that scroll by every time you start it, no one will pay any particular attention to them and no one will notice a new one.

But then of course, if warnings are not easy to ignore, everyone on the project will have to spend a lot of their valuable time ignoring them.

So the real problem, as in so many cases, is that we don't really know what we want. We want warnings to be both hard to ignore and easy to ignore. We can't get good warnings in our tools without resolving this conflict in our minds.

Types of warnings

To progress we need more information. We need to think about what kind of warnings there are and how we want to handle them.

In the Bitsquid engine, our warnings can be classified into three basic types:

  • Performance warnings

  • Suspicion warnings

  • Deprecation warnings

Performance warnings occur when the user does something that is potentially bad for performance, such as using a texture without a MIP chain or loading a 300 MB sound file into memory.

Suspicion warnings occur when we detect other kinds of suspicious behavior and want to ask the user "did you really mean to do X?". An example might be defining a font without any glyphs. It is not exactly an error, but it is not very useful either, and most likely, not what the user wanted.

Deprecation warnings, finally, are warnings that really should be errors. We want all our data to follow a particular rule, but we have too much legacy data to be able to strictly enforce it.

A typical example might be naming conventions. We may want to force all nodes in a scene graph to have unique names or all mesh names to start with mesh_, but unless we laid down that rule at the start of the project it might be too much work to fix all the old data.

Another example of deprecation is when a script function is deprecated. We may want to get rid of the function AudioWorld.set_listener(pos) (because it assumes that there is only one listener in the world) and replace it with AudioWorld.set_listeners(table), but there is a lot of script code that already uses set_listener and no time to rewrite it.

As for when warnings should be shown, I think there are only two times when you really care about warnings:

  • When you are working with a particular object (unit, mesh, level, sound, etc), you want to see all the warnings pertaining to that object.

  • When you are doing a review of the game (e.g., a performance review), you want to look through all warnings pertaining to the aspect that you are reviewing (in this case, all performance warnings).

Armed with this information, we can come up with some useful strategies for dealing with warnings.

Treat warnings as errors

Errors are a lot easier to deal with than warnings, at least if you adhere to the philosophy of "asserting on errors" that was outlined in the first part of this series. An error is always an error, it doesn't require a judgement call to determine whether it is right or wrong. And since the engine doesn't proceed until the error has been fixed, errors get fixed as soon as possible (usually before the content is checked in, and in the rare occasions when some one checks in bad data without test running it -- as soon as someone else pulls). Once the error is fixed it will never bother us again.

In contrast, warnings linger and accumulate into an ever expanding morass of hopelessness.

So, one of the best strategies for dealing with warnings is to make them errors. If there is any way you can convert the warning into an error, do that. Instead of warning if two nodes in a scene graph have the same name, make it an error. Instead of warning when an object is set to be driven by both animation and physics, make it an error.

Of course, when we want to make an error of something that was previously just a warning, we run into the deprecation problem.

Ideas for deprecation warnings

The strategy for deprecation warnings is clear. We want to get rid of them and treat them as "real errors" instead. This gives us cleaner data, better long term maintainability and cleaner engine code (since we can get rid of legacy code paths for backward compatibility).

Here are some approaches for dealing with deprecation, in falling order of niceness:

1. Write a conversion script

Write a conversion script that converts all the old/deprecated data into the new format. (An argument for keeping your source data in a nice, readable, script-friendly format, such as JSON.)

This is by far the nicest solution, because it means you can just run the script on the content to patch it up, and then immediately turn the warning into an error. But it does require some programming effort. (And we programmers are so overworked, couldn't an artist/slave spend three weeks renaming the 12 000 objects by hand instead?)

Of course, sometimes this approach isn't possible. I.e., when there is no nice 1-1 mapping from the current (bad) state to the desired (good) state.

One thing I've noticed though, is that we programmers can have a tendency to get caught up in binary thinking. If a problem can't be solved for every possible edge case we might declare it "theoretically unsolvable" and move on to other things. When building stable systems with multiple levels of abstractions, that is a very sound instinct (a sort function that works 98 % of the time is worse than useless -- it's dangerous). But when it comes to improving artist workflows it can lead us astray.

For example, if our script manages to rename 98 % of our resources automatically and leaves 2 % tricky cases to be done by hand, that means we've reduced the workload on the artist from three weeks to 2.5 hours. Quite significant.

So even if you can't write a perfect conversion script, a pretty good one can still be very helpful.

2. Implement a script override

This is something I've found quite useful for dealing with deprecated script functions. The idea is that when we want to remove a function from the engine API, we replace it with a scripted implementation.

So when we replace AudioWorld.set_listener() with AudioWorld.set_listeners(), we implement AudioWorld.set_listener() as a pure Lua function, using the new engine API:

function AudioWorld.set_listener(pos)
 local t = {pos}
 AudioWorld.set_listeners(t)
end

This leaves it up to the gameplay programmers to decide if they want to replace all calls to set_listener() with set_listeners() or if they want to continue to use the script implementation of set_listener().

This technique can be used whenever the old, deprecated interface can be implemented in terms of the new one.

3. Use a doomsday clock

Sometimes you are out of luck and there simply is no other way of converting the data than to fix it by hand. You need the data to be fixed so that you can change the warnings to errors, but it is a fair amount of work and unless you put some pressure on the artists, it just never happens. That's when you bring out the doomsday clock.

The doomsday clock is a visible warning message that says something like:

Inconsistent naming. The unit 'larch_03' uses the same name 'branch' for two different scene graph nodes. This warning will become a hard on error on the 1st of May, 2012. Fix your errors before then.

This gives the team ample time to address the issue, but also sets a hard deadline for when it needs to be fixed.

For the doomsday clock to work you need a producer that is behind the idea and sees the value of turning warnings into errors. If you have that, it can be a good way of gradually cleaning up a project. If not, the warnings will never get fixed and instead you'll just be asked again and again to move the doomsday deadline forward.

4. Surrender

Sometimes you just have to surrender to practicality. There might be too much bad data and just not enough time to fix it. Which means you just can't turn that warning into an error.

But even if you can't do anything about the old data, you can at least prevent any new bad data from entering the project and polluting it further.

One way of doing that is to patch up your tools so that they add a new field to the source data (another argument for using an easily extensible source data format, such as JSON):

bad_name_is_error = true

In the data compiler, you check the bad_name_is_error flag. If it is set, a bad name generates a hard error, if not a warning. This means that for all new data (created with the latest version of the tool) you get the hard error check that you want, but the old data continues to work as before.

Design the tools to avoid warnings

Warnings are generated when the users do stuff they did not intend to. The warnings we see thus tell us something of the mistakes that users typically make, using our tools.

One way of reducing the amount of warnings is to use this information to guide the design of the tools. When we see a warning get triggered we should ask ourselves why the user wasn't able to express her intents and how we could improve our tools to make that easier.

For example, if there are a lot of warnings about particle system overdraw, perhaps our particle system editor could have on screen indicators that showed the amount of overdraw.

There are lot of other ways in which we can improve our tools so that they help users to do the right thing, instead of blaming them for doing wrong.

Put the warnings in the tools

The most useful time to get a warning is when you are working on an object. At that time, you know exactly what you want to achieve, and it is easy to make changes.

It follows then that the best place to show warnings is in the tools, rather than during game play. You may have that as well, to catch any strays that don't get vetted by the tools, but it should not be the first line of defense.

For every tool where it makes sense, there should be a visible warning icon displaying the number of warnings for the currently edited object. For added protection, you could also require the user to check off these warnings before saving/exporting the object to indicate: "yes I really want to do this".

Make a review tool for warnings

Apart from when a particular object is edited, the other time when displaying warnings is really useful is when doing a project review in order to improve performance or quality.

I haven't yet implemented it, but the way I see it, such a tool would analyze all the content in the project and organize the warnings by type. One category might be "Potentially expensive particle systems" -- it would list all particle systems with, say, > 2000 particles, ordered by size. Another category could be: "Possibly invisible units" -- a list of all the units placed below the ground in the levels.

The tool would allow a producer to "tick off" warnings for things that are really OK. Perhaps, the super duper effect really needs to have 50 000 particles. The producer can mark that as valid which means the warning is hidden in all future reviews.

Hiding could be implemented real simply. We could just hash the object name together with the warning message and make sure we don't show that particular message for that particular object again.

Sunday, February 5, 2012

Sensible Error Handling - Part 2

In my last post I wrote that there are three kinds of errors that we game programmers need to deal with:

  • Unexpected errors
  • Expected errors
  • Warnings

An unexpected error is an error that is unlikely happen and that the caller of our API has no sensible way of handling, such as a corrupted internal state, a failed memory allocation, a bad parameter supplied to a function or a file missing from a game disc. I also argued that the best way of dealing with such errors was to crash fast and hard with an assert, to expose the error and avoid "exporting" it in the API.

In this post I'm going to look at the expected errors.

Expected errors


An expected error is an error that we expect to happen and that the caller must have a plan for dealing with. A typical example is an error when fetching a web page or saving data to a memory card (which can be yanked at any moment).

If you are familiar with Java, the distinction between "expected" and "unexpected" errors matches quite closely Java's concept of "checked" and "unchecked" errors. Checked errors are errors that the caller must deal with (or explicitly rethrow). Unchecked errors are errors that the caller is not expected to deal with. They will typically cause a crash or a long jump out to the main loop, for the applications where that makes sense.

My main rule for dealing with expected errors is:

    Minimize the points and types of failures

In other words, just as our APIs abstract functionality -- replacing low-level calls with high-level concepts -- they should also abstract dysfunctionality and replace a large number of low-level failure states with a few high-level ones.

Minimizing the points of failure means that instead of having every function (enumerate(), open(), read(), close, etc) return an error code, we design the API so that errors occur in as few places as possible. This reduces the checks that the caller needs to do and the number of different possible paths through her code.

Minimizing the types of failure means that when we fail we only do it in one of a very small number of well-defined ways. We don't return an int error code that can take on 4 billion different values with vaguely defined, ambiguous and overlapping meanings (quick: what is the difference between EWOULDBLOCK and EAGAIN?).

In most cases true/false is enough (together with a log entry with more details). If the caller needs more information, we can use an enum for that specific function, with a very specific small range of values.

Again, the idea behind all this is to reduce the burden on the caller. If there is only a small number of errors that can happen, it is easy for her to verify that she has all the bases covered.

As an example, a (partial) save game interface may look like:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 enum LoadResult {IN_PROGRESS, COMPLETED, FAILED};

 unsigned num_saved_games();
 LoadId start_loading_game(unsigned i);
 LoadResult load_result(LoadId id);
 Data loaded_data(LoadId id);
 void free_data(LoadId id);
};

Note that there is only a single place where the caller needs to check for errors (in the reply to load_result()). And there is only one possible fail state, either the load completes successfully or it fails.

To except or not to except


Exceptions are often touted as the latest and greatest in error handling, but as you know from my previous post I am not too found of them.

Exceptions can work for unexpected errors. I still prefer to use asserts, but if you are writing a program that cannot crash, an exception can be a reasonable way to get back to the main loop if you reach an unexpected failure state. (It's not the only option though. Lua's pcall() mechanism is an elegant and minimalistic alternative.)

But for the expected errors, the errors that are a part of the API, exceptions have a number of serious problems.

The first is that exceptions do not have to be declared in the API, so if you encounter an API that looks like this:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 class LoadException : public Exception {};

 unsigned num_saved_games();
 LoadId start_loading_game(unsigned i);
 bool load_completed(LoadId id);
 Data loaded_data(LoadId id);
 void free_data(LoadId id);
};

you are immediately faced with a number of questions. Which functions in the API can throw a LoadException? All of them or just some? Do I need to check for it everywhere? Are there any other exceptions that can be thrown, like FileNotFoundException or IJustMadeUpThisException. Should I just catch everything everywhere to be safe?

In my view, this is unacceptable. The errors are an important part of the API. If you don't know what errors can occur and where, you have an incomplete picture of the API. Fine, we can address that with throw-declarations:

class SaveSystem
{
 struct Data {const char *p; unsigned len;};
 class LoadException : public Exception {};

 unsigned num_saved_games() throw();
 LoadId start_loading_game(unsigned i) throw();
 bool load_completed(LoadId id) throw(LoadException);
 Data loaded_data(LoadId id) throw();
 void free_data(LoadId id) throw();
};

Now the interface is at least well-defined, if a bit cluttered. Note that if you go down this route every single function in your code base should have a throw declaration. Otherwise you are back in no man's land, without any clue about which functions throw exceptions and which don't.

But declaring exceptions can have its drawbacks too. If you require all functions to declare exceptions, a function that just wants to "pass along" some exceptions up the call stack must declare them. This gives the exceptions an infectious tendency. Unless you are careful with your design the high level functions will gather longer and longer lists of exceptions that become harder and harder to maintain. Templates cause additional problems, because you can't know what exceptions a templated object might throw.

These issues have sparked a heated debate in the Java-community about whether checked (declared) exceptions are a good idea or not. C# has chosen not to support exception declarations.

At the heart of the debate is (I think) a confusion about what exceptions are for. Are they for diagnosing and recovering from unforeseen errors, or are they a convenient control structure for dealing with expected errors? By explicitly distinguishing "unexpected errors" from "expected errors" we make these two roles clearer and can thus avoid a lot of the confusion.

Anyways, the declarations are not my only gripe with exceptions. My second issue is that they introduce additional "hidden" code paths, which makes the code harder to read, understand and reason about.

Consider the following piece of code:

if (ss->load_completed(id)) {
 Data data = ss->loaded_data(id);
 ...
}

By just glancing at this code, it is pretty hard to tell that an error in load_completed() will cause it to leave the current function and jump to some other location higher up in the call stack.

When exceptions are used you can't just read the code straight up. You have to consider that at every single line you are looking at, an exception might be raised and the code flow changed.

This leads me to the concept of exception safety. Is your code "exception safe"? I'll go out on a limb and say: probably not. Writing "exception safe" code requires having a mindset where you view every single function in your code base as a "transaction" that can be fully or partially rolled back in the case of an exception. That is a lot of extra effort, especially if you need to do it in every single line in your code base.

It might still be worth it, of course, if exceptions had many other advantages. But as a method for dealing with expected errors, I just don't see those advantages, so I'd rather use my brain cycles for something else.

So what do I propose instead? Error codes!

Yes, yes I know, we all hate error codes, but why do we hate them? As I see it, there are three main problems with using error codes for error reporting:

  1. The code gets littered with error checks, making it hard to read.
  2. Undescriptive error codes lead to confusion about what errors a function can return and what they mean.
  3. Since C functions cannot return multiple values, we cannot both return an error code and a result. If we use error codes, the result must be returned in a parameter, which is inelegant.

I have already addressed the first two points. By designing our API so that errors only happen in a few places, we minimize the checks that are needed. And instead of returning an undescriptive generic error code, we should return a function-specific enum that exactly describes the errors that the function can generate:

enum LoadResult {IN_PROGRESS, COMPLETED, FILE_NOT_FOUND, FILE_COULD_NOT_BE_READ, FILE_CORRUPTED};
LoadResult load_result(LoadId id);

As for the third problem, I don't know why C programmers are so adverse to just putting two values in a struct and returning that. In my opinion, this:

struct Data {const char *p; unsigned len;};
Data loaded_data();

Is a lot nicer than this:

const char *loaded_data(unsigned &len);

Maybe in them olden days, returning 8 bytes on the stack was such a horrible inefficient operation that it caused your vacuum tubes to explode. But clearly, it is time to move on. If you want to return multiple value -- just do it! The "return in parameter" idiom should only be used for types where returning on the stack would cause memory allocation, such as strings or vectors.

This is how you return an error code in 2012:

struct SaveResult {
 enum {NO_ERROR, DISK_FULL, WRITE_ERROR} error;
 unsigned saved_bytes;
};
SaveResult save_result(SaveId id);

In the next and final part of this series I'll look at warnings.

Sunday, January 22, 2012

Sensible Error Handling: Part 1

To err is human. But it is also quite computery. Unfortunately, error handling tends to bring out the worst in APIs.

Error handling is what makes your program go from something nice, clear and readable such as:

Stuff s = open_something(x);
int len = get_size(s);
...

To some horrible monstrosity such as:

Stuff s;
int err = open_something(x, &s);
if (err == E_STUFFNOTFOUND) {
    fprintf(stderr, ”Something was not found”);
    goto exit;
} else if (err == E_INVAL) {
    fprintf(stderr, ”Something was invalid”);
    goto exit;
} else if (err == E_RETRY || err = E_COMPUTERNOTINTHEMOOD) 
    goto exit;
int len = 0;
err = get_size(s, &len);
if (err == E_HULLNOTPOLARIZED)
    goto close_and_exit;
...

In this article (and the follow-up) I’m going to discuss how I think you should design systems so that the error handling is as sensible as possible and the burden on the callers is minimized.

Note that I’m discussing this from the perspective of game development where errors will never cause serious damage to humans or property (I’m disregarding the keyboards smashed in frustration when a game crashes during the final minutes of a three hour boss fight).

Types of Errors


There are three main types of errors that we need to deal with:

  • Expected errors
  • Unexpected errors
  • Warnings

By an expected error I mean any kind of error that happens in a situation where the caller can reasonably expect that something might go wrong and has a plan for dealing with that. The most typical example is network code. Since the network may die at anytime, the caller cannot just call fetch_web_page() and assume that she will get a valid result. She must always check for and be prepared to handle errors.

An unexpected error is an error that happens when the caller has no reason to assume that something might go wrong. A typical example might be a NULL pointer returned by an allocator that is out of memory or a corrupted internal state caused by a buffer overflow problem.

What errors can be considered ”expected” depends on context. When opening a saved game or a user config file, File Not Found might be an expected error, because we can expect the user to muck around with those files. When opening our main .pak bundles, File Not Found is an unexpected error, because we don’t expect the user to partially delete an installed game. And besides, there is not much we can do beyond displaying an error message if our data isn’t there.

A warning happens when someone has done something that is kind-of sort-of bad, probably, but we are able to continue running without any ill effects. An example might be a call to a deprecated function.

Unexpected Errors


The unexpected errors are the most common ones. Expected errors only happen in a few well-defined places, such as network code. Unexpected errors can happen everywhere. It is always safe to assume that you program contains lots of bugs that you have no idea about.

My policy for handling unexpected errors is simple:

Crash the engine as soon as possible with an informative error message.

This may seem like a totally irresponsible thing to do. Crashing is... bad, right?

Actually it is exactly the opposite.

If we didn’t crash it would be up to the caller to handle the error. So the programmer writing that code wouldn’t only have to think about what she wanted to achieve with our API, but also in what ways our code might fail and how she would have to handle that. That is more work and leads to cluttered code, as in the example above. It is also nearly impossible to do in a good way. Remember, these are unexpected errors. Anything might happen.

By crashing, the API is taking full responsibility for performing what the caller asks of it. We are saying: either we will do what you wanted or, if there is a problem with that, we will deal with that too. In either case, you don’t have to worry about it.

Crashing makes APIs simpler and reduces the mental burden of the caller. Here is what a file API might look like if designed with the ”crash”-philosophy in mind.

bool exists(const char *path);
Archive open(const char *path);

Note the curious absence of any error codes. If the caller passes a malformed path, we crash, we do not return an E_INVALIDARGUMENT error. If the file doesn’t exist, we crash. The caller is responsible for using exist() to check for files that might not exist. There are no errors for the caller to handle and the code will be clean and readable.

Since life is so much simpler for the caller when she doesn’t have to think about errors, we write our code with that in mind. Instead of functions returning error codes, such as:

/// Returns E_PARSE_ERROR on badly formatted Json, E_NULL if
/// passed a null pointer, E_OVERFLOW if too big, etc.
int parse_json_number(const char *s, double &number);

we have functions that crash on errors:

double parse_json_number(const char *s);

In most cases this is all we need, because we expect the Json to be well formed. If it isn't, some other part of our tech has made an error that needs to be fixed. If we had any situations where we could expect bad Json (perhaps hand-entered through the in-game console), we would add a validating function:

bool is_valid_json_number(const char *s);

Now we can have some code that deals with bad data without forcing error handling into all our code.

But do we really need to crash?


At this point, some people will probably agree with most of the things I say, but still feel uneasy about crashing. Because crashing is... bad, right? Nobody wants to be the programmer that crashed the engine. Surely it is better to write a really serious, really super-stern error message that can’t be ignored but then try to patch things up and solider on so that we don’t crash. If a file doesn’t exist perhaps we can pretend that it did exist but was empty. If the Json we tried to parse was malformed, perhaps we can just return the part of it that we managed to parse. If the caller wants to access data beyond the end of an array, perhaps we can just return the last element.

No thanks.

I have two problems with this.

First, this makes programmers expend a lot of mental energy thinking about how to patch up an erroneous state. Most likely, this work is completely futile. They won’t be able to think about all the errors that might possibly occur. The attempts of patching things up will probably just cause a cascade of other errors and a more serious (and confusing) crash later on. And the ”error fixing” code will be strange and ugly. More code is always a burden, a cost. Let’s not spend it on magically patching up errors in ways that won’t work. Let’s focus on fixing the errors instead.

Second, I don’t care how stern your error message is, I promise you it will be ignored. If it happens infrequently, if it is just on one machine, if it is in a new system, if we just need to send these screen shots off to day, if a deadline is coming up, if we’re past the deadline, if there’s another deadline. It will be ignored. Your code will gather more and more errors that don’t get fixed, until it is a glitchy, horrible mess.

That’s why I love crashing. It is an error that can’t be ignored. Of course it is unacceptable for an engine to crash. And that’s why the error will be fixed. Which will make everybody happier in the long run. Crashes improve the production process and lead to better quality code.

Nobody wants the game to crash for the end user, but the way to achieve that is with testing and bug fixing, not by finding ways of ignoring the errors that you detect.

Exceptions


Rather than crashing isn’t it better to throw an exception? If the exception isn’t caught we get a crash, just as before. But we also have the option, if we really want to, to catch the exception and handle the error. It would seem that by using exceptions we can have our cake and eat it too.

Low-level programmers tend to abhor exceptions because they come with some performance overheads, even when they aren’t thrown. I’m not actually sure what the current status is, whether this is something that you still have to worry about or if exceptions are ”fast enough” on all current compilers and platforms.

I haven’t needed to care about that, because I dislike exceptions for the complexity they add. The crash model is dead simple, the code either works or not. The caller knows that she is not responsible for any error handling.

With exceptions, this clear and useful distinction between expected and unexpected errors is muddled and the caller is faced with a number of questions:

This function throws exceptions. Do I need to handle those? What kind of exceptions might it throw? Even if I don’t catch the exception, might someone higher up in the call hierarchy do it? Does this mean that I need to write all my code so that the state is valid if an exception is thrown somewhere (might be anywhere, really) by one of the functions I call? What if I’m in a constructor? What if I’m in a destructor.

By using exceptions instead of just crashing we are creating a more complicated API (the API now includes all the different exceptions that the different functions might call) and significantly increasing the mental burden on the caller for very little gain.

Good error reports


When we crash, we try to create an error message and a log report that is as informative as possible to facilitate debugging of the problem. Our reports always include:

  • A description of the error
  • The call stack
  • The error context

We use printf-formatting to create an the error message. Note that the C preprocessor supports variadic macros, so you can create macros that work like printf:

#if defined(DEVELOPMENT)
    #define XASSERT(test, msg, ...) do {if (!(test)) error(__LINE__, __FILE__, \
        "Assertion failed: %s\n\n" msg, #test,  __VA_ARGS__);} while (0)
#else
    #define XASSERT(test, msg, ...) ((void)0)
#endif

XASSERT(exists(file), ”File %s does not exist”, %s);

Call stack generation and translation from raw addresses to file names and line numbers is platform specific and a lot more cumbersome than it ought to be. But it is still well worth doing. Call stacks let you diagnose many errors with a glance. It is a lot faster than loading up crash dumps in the debugger.

On Windows, use StalkWalk64 to generate the call stack and the Sym* functions to translate it.

The error context is our way of providing contexts for error messages. The problem is that sometimes crashes happen in deeply nested code that doesn’t have all the information we would like to give to the user. For example:

double parse_json_number(const char *s);

If there is a parse error, it would be very helpful for the user to know in which file the error occurred. But the parse_json_number function doesn’t know that. It doesn’t even know if there is a file. It might have been asked to parse data from network or memory.

If we were using exceptions we could handle this by catching the exception at a higher level, adding some information to it (such as the file name) and rethrowing it. But that is rather tedious and also tricky to do in a good way. If we want to add the information to the original exception, then it must already have members for all the possible information that all higher level functions might want to add. That’s a bit strange. Should we throw a new exception? Then the exception gets thrown from the ”wrong place”. The result of all this is that people seldom bother ”decorating” their exceptions in this way. At least I’ve never seen a code base that does it systematically.

What we do instead, is to allow the programmer to define error contexts using scope variables:

void init(const char *file)
{
    ErrorContext ec("Parsing JSON:", file);
    JsonDoc *doc = parse_json(file);
}

The error contexts get stored on a stack:

__THREAD Array<const char *> *_error_context_name;
__THREAD Array<const char *> *_error_context_data;

class ErrorContext
{
public:
    ErrorContext(const char *name, const char *data) {
        _error_context_name->push_back(name);
        _error_context_data->push_back(data);
    }
    ~ErrorContext() {
        _error_context_name->pop_back();
        _error_context_data->pop_back();
    }
};

Note that we only store string pointers, not the full string data. We assume that whatever string the user gives us lives in the same scope as the error context and is valid as long as the error context is. This means that setting the error context just requires pushing 8 bytes to a stack, so the performance overhead is very small.

Note also that the stack uses thread local storage, so we have separate error context stacks for our different execution threads.

When an error occurs, we print all the contexts in the stack, giving the user a good idea of where the error occurred:

When spawning level: big_world
When spawning unit: big_bird
When applying material: feathers
Assertion failed: texture != NULL
    Texture not loaded: yellow_feathers
    In material_manager.cpp:1337

Next time


Next time, I’ll look at the other kinds of errors: expected errors and warnings.

Friday, January 6, 2012

5 Tips for Programmer Productivity

1. Embrace the now-principle


If something takes less than five minutes to do, do it immediately.

It seems like the lazy option, but postponing something actually takes a lot of effort. The task needs to be written down somewhere. Then you need to track it and prioritize it with respect to other tasks. You will probably think about doing it lots of times, before you actually get down to doing it. And then you have to understand what you meant when you wrote it down and try to get back in that same mindset.

For small tasks it is just not worth it.

Instead, just do it. Fix the issue right now while you are already thinking about it. It is faster, simpler and saves you the agony of an ever growing todo-list.

2. Fix the cause, not just the symptom


Don’t just fix problems. Fix the processes that allowed the problems to occur, so that the same problems never occur again. See bugs not as nuisances, but as chances to improve your processes and increase the quality of your code.

If an artist tells you that she gets an ”Error when compiling unit” error, don’t just diagnose it and tell her: ”It’s because you have two nodes with the same name, that is not allowed.” At the very least fix the error message so that it says ”Error when compiling unit ‘bed’. Two nodes have the same name ‘pillow’. One of them must be renamed so that names are unique.” Even better, fix the exporter or the tool, so that it is impossible for the artist to create two nodes with the same name.

If you find an error that could have been caught by an assert, then add that assert so that it will find the error next time.

If someone asks you ”How can I configure the animation compression?”, don’t just answer them. Also write a short text that explains how it is done, and add that text to the documentation.

In this way, you are not just patching holes and fixing leaks, you are actively making things better. This not only pleases the people who come to you with problems, it also makes your work feel more meaningful.

3. Try not to break concentration while ”the computer is working”


The job of programming is filled with a lot of weird little micro pauses. The code is compiling. The console is rebooting. The level is loading. The client is connecting. Etc.

In the best of worlds, these pauses would not exist, and by all means, do all you can to get rid of them. Make your code build faster. Hot reload data and scripts. Make a tool for quickly setting up a bunch of PS3s for a network test. Etc.

But even with the best of efforts, some pauses will likely remain. The question is what to do with them. The temptation is to take a short break from programming and do something else: check mail, answer a Skype, read two paragraphs of an interesting article, update the twitter feed, etc.

For me, these constant mental context switches can put a real damper on productivity, since they make it impossible to maintain concentration and flow.

Nor are these micro-excursions particularly relaxing. Reading two paragraphs of a web page while constantly glancing at a progress bar on the other monitor is not something that soothes my mind. Quite contrary, it is much more stressful than remaining in the zen-like state of concentrated work. It is much better to take one real break than a hundred micro breaks.

So for both productivity and peace of mind I now make a conscious effort to stay focused on the problem at hand while ”the computer is working”. I have a separate text editor, unaffected by IDE freezes, where I can work on related tasks, such as:

  • Adding documentation
  • Refactoring and code review
  • Planning the next stage of implementation
  • Writing script code that tests the system

It is still an ongoing battle against the Lure of the Internet, but I find that when I manage to stay focused I am both more productive and more relaxed.

4. Use source control even more than you think you should


Source control is not just for source code. With modern distributed source control systems such as Mercurial and Git it is dead simple to create a source repository anywhere and then later (if needed) push it to a server for backup/sharing.

Do you have configuration and settings files for your text editor, IDE, etc? Put them in source control so that you can easily share them between your different machines. Do they need to be installed in special locations. Put them in source control together with a script that installs them in the right place.

Do you use any third party libraries such as zlib, LuaJIT or stb_vorbis? Check them into source control. That way, if you have to do any modifications (bug fixes, fixes for compiler warnings, platform fixes, your own personal optimizations, etc) you will know exactly what you have changed. If a new version of the library is released you can use the source control diff to see what has changed upstream and merge it with your own local changes.

Does an API come with sample code? Before you start playing with that sample code, check it into source control. That way, you can always revert the samples back to their pristine state, without having to reinstall the API. And if you find a bug in the APIs and manage to reproduce it in one of the samples, you can use the source control tool to produce a .patch file for the sample that you can send to the API manufacturers as part of your bug report. That will keep both you and them happy.

5. Monitor your builds


Set up a build server that continuously builds all your executables (engine, tools, exporters, ...) in all configurations (debug, development, release) for all platforms, so you know as soon as possible if something breaks. Fixing a problem right away is much easier than doing it two months down the line.

The build server doesn’t have to be a complicated thing. It is more important that it exists than that it has all the bells and whistles. If you don’t have time to do something advanced just write a script that compiles everything and reports the result. You can expand on that later.

Do the same for content. Write a script that loads all levels and spawns all units.

Use the report system that works best for you. We use Skype for internal real-time communication, so it makes sense to report broken builds over Skype. If e-mail or IRC works better for you, use that instead.

Wednesday, December 28, 2011

Code Share: Patch link.exe to ignore LNK4099

By default, Visual Studio's link.exe does not let you ignore the linker warning LNK4099 (PDB file was not found).

This can be a real nuisance when you have to link with third party libraries that reference (but do not come with) PDBs. You can get hundreds of linker warnings that you have no way of getting rid of.

The only way I've found of fixing the problem is to patch link.exe so that it allows warning 4099 to be ignored. Luckily, it is not as scary as it sounds. You only need to patch a single location to remove 4099 from a list of warnings that cannot be ignored. An outline of the procedure can be found here.

Following my general philosophy to write-a-script-for-it I wrote a short ruby script that does the patching. I'm sharing it here for everybody that want to do voodoo on their link.exe and get rid of the warning.

(Click here for pastebin version.)

# This ruby program will patch the linker executable (link.exe)
# so that linker warning LNK4099 is ignorable.
#
# Reference: http://www.bottledlight.com/docs/lnk4099.html

require "fileutils"

def link_exes()
 res = []
 res << File.join(ENV["VS90COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS90COMNTOOLS"]
 res << File.join(ENV["VS100COMNTOOLS"], "../../VC/bin/link.exe") if ENV["VS100COMNTOOLS"]
 res << File.join(ENV["XEDK"], "bin/win32/link.exe") if ENV["XEDK"]
 return res
end

def patch_link_exe(exe)
 data = nil
 File.open(exe, "rb") {|f| data = f.read}
 unpatched = [4088, 4099, 4105].pack("III")
 patched = [4088, 65535, 4105].pack("III")

 if data.scan(patched).size > 0
  puts "* Already patched #{exe}"
  return
 end

 num_unpatched = data.scan(unpatched).size
 raise "Multiple patch locations in #{exe}" if num_unpatched > 1
 raise "Patch location not found in #{exe}" if num_unpatched == 0

 offset = data.index(unpatched)
 puts "* Found patch location #{exe}:#{offset}"
 bak = exe + "-" + Time.now.strftime("%y%m%d-%H%M%S") + ".bak"
 puts "  Creating backup #{bak}"
 FileUtils.cp(exe, bak)
 puts "  Patching exe"
 data[offset,unpatched.size] = patched
 File.open(exe, "wb") {|f| f.write(data)}
 return true
end

link_exes.each do |exe|
 patch_link_exe(exe)
end

Thursday, December 22, 2011

Platform Specific Resources

I recently added a new feature to the BitSquid tool chain – support for source and destination platforms in the data compiler. What it means is that you can take the data for one platform (the source) and compile it to run on a different platform (the destination). So you can take the data for the mobile version of a game (with all its content optimizations) and compile it so that it runs on your development PC.

This is nice for two reasons. First, access to target hardware can be limited. In a perfect world, every artist would have a dev kit for every target platform. In practice, this might not be economically possible. It might not even be electrically possible (those main fuses can only take so much). Being able to preview and play console/handheld content on PC is better than nothing, in this less-than-perfect world.

Second, since all our editors use the engine for visualization, if we have specified a handheld device as our source platform, all the editors will automatically show the resources as they will appear on that device.

This new feature gives me a chance to talk a little bit about how we have implemented support for platform specific resources, something I haven’t touched on before in this blog.

The BitSquid Tech uses the regular file system for its source data. A resource is identified by its name and type, both of which are determined from the path to the source file:


Note that even though the name is a path, it is not treated as one, but as a unique identifier. It is hashed to a 64-bit integer by the engine and to refer to a resource you must always specify its full name (and get the same hash result). In the compiled data, the raw names don’t even exist anymore, the files are stored in flat directories indexed by the hash values.

In addition to name and type a resource can also have a number of properties. Properties are dot-separated strings that appear before the type in the file name:


Properties are used to indicate different variants of the same resource. So all these files represent variants of the same resource:

buttons.texture
buttons.ps3.texture
buttons.en.x360.texture
buttons.fr.x360.texture

The two most important forms of properties are platforms and languages.

Platform properties (x360, ps3, android, win32, etc) are used to provide platform specific versions of resources. This can be used for platform optimized versions of units and levels. Another use is for controller and button images that differ from platform to platform. Since BitSquid is scripted in Lua and Lua files are just a resource like any other, this can also be used for platform specific gameplay code:

PlayerController.android.lua

Language properties (en, fr, jp, it, sv, etc) are used for localization. Since all resources have properties, all resources can be localized.

But the property system is not limited to platforms and languages. A developer can make up whatever properties she needs and use them to provide different variants of resources:

bullet_hit.noblood.particle_effect
foilage.withkittens.texture

Properties can be resolved either at data compile time or at runtime.

Platform properties are resolved at compile time. When we compile for PS3 and a resource has ps3 specific variants, only those variants are included in the compiled data. (If the resource doesn’t have any ps3 variants, we include all variants that do not have a specified platform.)

Language properties and other custom properties are resolved at runtime. All variants are compiled to the runtime data. When running, the game can specify what resource variants it wants with a property preference order. The property preference order specifies the variants it wants to use, in order of preference.

Application.set_property_preference_order {”withkittens”, ”noblood”, ”fr”}

This means that the game would prefer to get a resource that has lots of kittens, no blood and is in French. But if it can’t get all that, it will rather have something that is kitten-full than blood-free. And it prefers a bloodless English resource to a bloody French one.

In other words, if we requested the resource buttons.texture with these settings, the engine would look for variants in the order:

buttons.withkittens.noblood.fr.texture
buttons.withkittens.noblood.texture
buttons.withkittens.fr.texture
buttons.withkittens.texture
buttons.noblood.fr.texture
buttons.noblood.texture
buttons.fr.texture
buttons.texture

To add support for different source and destination platforms to this system all I had to do was to add a feature that lets the data compiler use one platform for resolving properties and a different platform as the format for the runtime files it produces.

Thursday, December 8, 2011

A Pragmatic Approach to Performance

Is premature optimization the root of all evil? Or is the fix-it-later attitude to performance turning programmers from proud ”computer scientists” to despicable ”script kiddies”?

These are questions without definite answers, but in this article I’ll try to describe my own approach to performance. How I go about to ensure that my systems run decently, without compromising other goals, such as modularity, maintainability and flexibility.

§1 Programmer time is a finite resource

If you are writing a big program, some parts of the code will not be as fast as theoretically possible. Sorry, let me rephrase. If you are writing a big program, no part of the code will be as fast as theoretically possible. Yes, I think it is reasonable to assume that every single line of your code could be made to run a little tiny bit faster.

Writing fast software is not about maximum performance all the time. It is about good performance where it matters. If you spend three weeks optimizing a small piece of code that only gets called once a frame, then that’s three weeks of work you could have spent doing something more meaningful. If you had spent it on optimizing code that actually mattered, you could even have made a significant improvement to the game’s frame rate.

There is never enough time to add all the features, fix all the bugs and optimize all the code, so the goal should always be maximum performance for minimum effort.

§2 Don’t underestimate the power of simplicity

Simple solutions are easier to implement than complex solution. But that’s only the tip of the iceberg. The real benefits of simple solutions come in the long run. Simple solutions are easier to understand, easier to debug, easier to maintain, easier to port, easier to profile, easier to optimize, easier to parallelize and easier to replace. Over time, all these savings add up.

Using a simple solution can save so much time that even if it is slower than a more complex solution, as a whole your program will run faster, because you can use the time you saved to optimize other parts of the code. The parts that really matter.

I only use complex solutions when it is really justified. I.e. when the complex solution is significantly faster than the simple one (a factor 2 or so) and when it is in a system that matters (that consumes a significant percentage of the frame time).

Of course simplicity is in the eyes of the beholder. I think arrays are simple. I think POD data types are simple. I think blobs are simple. I don’t think class structures with 12 levels of inheritance are simple. I don’t think classes templated on 8 policy class parameters are simple. I don’t think geometric algebra is simple.

§3 Take advantage of the system design opportunity

Some people seem to think that to avoid ”premature optimization” you should design your systems without any regard to performance whatsoever. You should just slap something together and fix it later when you ”optimize” the code.

I wholeheartedly disagree. Not because I love performance for its own sake, but for purely pragmatic reasons.

When you design a system you have a clear picture in your head of how the different pieces fit together, what the requirements are and how often different functions get called. At that point, it is not much extra effort to take a few moments to think about how the system will perform and how you can setup the data structures so that it runs at fast as possible.

In contrast, if you build your system without considering performance and have to come in and ”fix it” at some later point, that will be much harder. If you have to rearrange the fundamental data structures or add multithreading support, you may have to rewrite the entire system almost from scratch. Only now the system is in production, so you may be restricted by the published API and dependencies to other systems. Also, you cannot break any of the projects that are using the system. And since it was several months since you (or someone else) wrote the code, you have to start by understanding all the thoughts that went into it. And all the little bug fixes and feature tweaks that have been added over time will most likely be lost in the rewrite. You will start again with a fresh batch of bugs.

So by just following our general guideline ”maximum efficiency with minimum effort”, we see that it is better to consider performance up front. Simply since that requires a lot less effort than fixing it later.

Within reason of course. The performance improvements we do up front are easier, but we are less sure that they matter in the big picture. Later, profile-guided fixes require more effort, but we know better where to focus our attention. As in whole life, balance is important.

When I design a system, I do a rough estimate of how many times each piece of code will be executed per frame and use that to guide the design:

  • 1-10 Performance doesn’t matter. Do whatever you want.
  • 100 Make sure it is O(n), data-oriented and cache friendly
  • 1000 Make sure it is multithreaded
  • 10000 Think really hard about what you are doing

I also have a few general guidelines that I try to follow when writing new systems:

  • Put static data in immutable, single-allocation memory blobs
  • Allocate dynamic data in big contiguous chunks
  • Use as little memory as possible
  • Prefer arrays to complex data structures
  • Access memory linearly (in a cache friendly way)
  • Make sure procedures run in O(n) time
  • Avoid ”do nothing” updates -- instead, keep track of active objects
  • If the system handles many objects, support data parallelism

By now I have written so many systems in this ”style” that it doesn’t require much effort to follow these guidelines. And I know that by doing so I get a decent baseline performance. The guidelines focus on the most important low-hanging fruit: algorithmic complexity, memory access and parallelization and thus give good performance for a relatively small effort.

Of course it is not always possible to follow all guidelines. For example, some algorithms really require more than O(n) time. But I know that when I go outside the guidelines I need to stop and think things through, to make sure I don’t trash the performance.

§4 Use top-down profiling to find bottlenecks

No matter how good your up front design is, your code will be spending time in unexpected places. The content people will use your system in crazy ways and expose bottlenecks that you’ve never thought about. There will be bugs in your code. Some of these bugs will not result in outright crashes, just bad performance. There will be things you haven’t really thought through.

To understand where your program is actually spending its time, a top down profiler is an invaluable tool. We use explicit profiler scopes in our code and pipe the data live over the network to an external tool that can visualize it in various ways:


An (old) screenshot of the BitSquid Profiler.


The top-down profiler tells you where your optimization efforts need to be focused. Do you spend 60 % of the frame time in the animation system and 0.5 % in the Gui. Then any optimizations you can make to the animations will really pay off, but what you do with the Gui won’t matter one iota.

With a top-down profiler you can insert narrower and narrower profiler scopes in the code to get to the root of a performance problem -- where the time is actually being spent.

I use the general design guidelines to get a good baseline performance for all systems and then drill down with the top-down profiler to find those systems that need a little bit of extra optimization attention.

§5 Use bottom-up profiling to find low-level optimization targets

I find that as a general tool, interactive top-down profiling with explicit scopes is more useful than a bottom-up sampling profiler.

But sampling profilers still have their uses. They are good at finding hotspot functions that are called from many different places and thus don’t necessary show up in a top-down profiler. Such hotspots can be a target for low-level, instruction-by-instruction optimizations. Or they can be an indication that you are doing something bad.

For example if strcmp() is showing up as a hotspot, then your program is being very very naughty and should be sent straight to bed without any cocoa.

A hotspot that often shows up in our code is lua_Vexecute(). This is not surprising. That is the main Lua VM function, a big switch statement that executes most of Lua’s opcodes. But it does tell us that some low level, platform specific optimizations of that function might actually result in real measurable performance benefits.

§6 Beware of synthetic benchmarks

I don’t do much synthetic benchmarking, i.e., looping the code 10 000 times over some made-up piece of data and measuring the execution time.

If I’m at a point where I don’t know whether a change will make the code faster or not, then I want to verify that with data from an actual game. Otherwise, how can I be sure that I’m not just optimizing the benchmark in ways that won’t carry over to real world cases.

A benchmark with 500 instances of the same entity, all playing the same animation is quite different from the same scene with 50 different unit types, all playing different animations. The data access patterns are completely different. Optimizations that improve one case may not matter at all in the other.

§7 Optimization is gardening

Programmers optimize the engine. Artists put in more stuff. It has always been thus. And it is good.

Optimization is not an isolated activity that happens at a specific time. It is a part of the whole life cycle: design, maintenance and evolution. Optimization is an ongoing dialog between artists and programmers about what the capabilities of the engine should be.

Managing performance is like tending a garden, checking that everything is ok, rooting out the weeds and finding ways for the plants to grow better.

It is the job of the artists to push the engine to its knees. And it is the job of the programmers’ job to bring it back up again, only stronger. In the process, a middle ground will be found where the games can shine as bright as possible.