Wednesday, September 25, 2013

Scripted Network Debugging

Debugging network problems is horrible. Everything is asynchronous. Messages can get lost, scrambled or delivered out-of-order. The system is full of third-party black boxes: external transport layers (PSN, Steam, etc), routers, firewalls and ineptly written packet intercepting anti-virus programs. (I've got my eye on you!)

Reproducing a problem requires setting up multiple machines that are all kept in sync with any changes you make to the code and the data in order to try to fix the problem. It might also require roping in multiple players to actually sit down and play the game on all those machines. This can make a simple bug turn into a multi-day or even a multi-week problem.

Here are some quick tips for making network debugging easier:

  • Have a single place for disabling timeouts. Few things are as frustrating as looking at a problem in the debugger, almost finding the solution and then having the entire game shutdown because the server flagged your machine as unresponsive while you where broken in the debugger. Having a single place where you can disable all such timeouts makes the debugger a lot more useful for solving network problems.

  • Attach Visual Studio to multiple processes. Not everybody is aware of this, but you can actually attach the Visual Studio debugger to multiple processes simultaneously. So you can start a network session with eight players and then attach your debugger to all of them. This can be used to follow messages and code flow between different network nodes.

  • Make sure you can start multiple nodes on the same machine (using different ports). This allows you to debug many network issues locally, without running between different machines in the office or gather a stack of laptops on your desk. Of course this doesn't work if you are targeting consoles or Steam, since you can't use multiple Steam accounts simultaneously on the same machine. (Please fix, k thx bye!)

  • Have a way to log network traffic. We have a switch that allows us to log all network traffic (both incoming and outgoing) to a file. That file can be parsed by a custom GUI program that understands our network protocol. This allows us to see all the messages to and from each node, when they were sent and when they were received. This allows us to diagnose many network issues post-fact. We also have a replay functionality, where we can replay such a saved log to the network layer and get it to behave exactly as it did in the recording session.

But today I'm going to focus on a different part of network debugging: scripted tests.

The idea is that instead of running around manually to a lot of different machines, copying executables and data, booting the game, jumping into menus, etc, etc, we write a little Ruby script that does all that for us:

  • Distribute executables

  • Distribute project data

  • Start the game

  • Set up a multi-player session

  • Play the game

I recently had to debug a network issue with a low reproduction rate. With the script I was able to set up and run 500 sample matches in just a few hours and reproduce the bug. Doing that by hand wouldn't even have been possible.

Let's look at each of the tasks above and see how we can accomplish them:

Distribute executables

This could be done by the script, but to simplify things as much as possible, I just use a Bittorrent Sync folder to this. I've shared the tool-chain directory on my development machine (the tool-chain contains the tools and executables for all platforms) and connected all the other machines to that directory. That way, whenever I build a new executable it will automatically be distributed to all the nodes.

I have a nodes-config.rb file for defining the nodes, where I specify the tool-chain directory used by each node:

 :toolchain => 'c:\work\toolchain')

 :ip => '', 
 :toolchain => 'd:\toolchain', 
 :exec => => 'bitsquid-msi', :user => 'bitsquid', :password => ask_password('bitsquid-msi')))

 :ip => '', 
 :toolchain => 'c:\toolchain', 
 :exec => => 'bitsquidmacbook', :user => 'bitsquid', :password => ask_password('bitsquidmacbook')))


Distribute project data

Since the Bitsquid engine can be run in file server mode I don't actually need to distribute the project data. All I have to do is start a file server on my development machine and then tell all the network nodes to pull their data from that file server. I do that by starting the engine with the arguments:

-host -project samples/network

The nodes will pull the data for the project samples/network from the file server at IP and all get the latest data.

Start the game

On the local machine I can start the game directly with a system() call. To start the game on the remote machines I use PsExec. The relevant source code in the script looks like this:

require_relative 'console'

# Class that can launch executables on the local machine.
class LocalExec
 def launch(arg)
  system("start #{arg}")

# Class used for executables launched by other means.
class ExternalExec
 def launch(arg)

# Class used for executables launched on remote machines with psexec.
class PsExec
 def initialize(args)
  @name = args[:name]
  @user = args[:user]
  @password = args[:password]

 def launch(arg)
  system("psexec \\\\#{@name} -i -d -u #{@user} -p #{@password} #{arg}")

# Class that represents a node in the network test.
class Node
 # Initializes the node from hash data
 # :ip => ''
 #   The IP address of the node.
 # :toolchain
 #   Path to the toolchain folder on the node machine.
 # :exec =>
 #   Class for executing programs (LocalExec, ExternalExec, PsExec)
 # :port => 64000
 #   Port that the node should use.
 def initialize(args)
  @ip = args[:ip] || ''
  @toolchain = args[:toolchain]
  @exec = args[:exec] ||
  @port = args[:port] || 64000

 # Starts the project on the remote node and returns a console connection for talking to it.
 def start_project(arg)
  @exec.launch "#{exe_path} -port #{@port} #{arg}"
  return, @port)

 def exe_path
  return @toolchain + '\engine\win32\bitsquid_win32_dev.exe'

Each node specifies its own method for launching the game with the :exec parameter, and that method is used by start_project() to launch the game.

Additional execution methods could be added. For example for launching the game on PS3s and X360s.

Setup a multi-player session

To get the game to do what we want once it has started we use the in-game console.

All Bitsquid games act as TCP/IP servers when running in development mode. By connecting to the server port of a running game we can send Lua script commands to that game. The Ruby code for doing that is mercifully short:

require 'socket'

# Class that handles console communication with a running bitsquid executable.
class Console
 JSON = 0

 # Create a new console connection to specified host and port.
 def initialize(host, port)
  @socket =, port)

 # Send the specified JSON-encoded string to the target.
 def send(json)
  msg = [JSON, json.length].pack("NN") + json

 # Send the specified lua script to be executed on the target.
 def send_script(lua)
  lua = lua.gsub('"', '\\"')
  send("{type: \"script\", script: \"#{lua}\"}")

# Handles multiple console connections
class Consoles
 attr_reader :consoles

 def initialize(arg)
  @consoles = arg.respond_to?(:each) ? arg : [arg]

 def send_script(lua)
  @consoles.each do |c| c.send_script(lua) end

Node.start_project() returns a Console object that can be used to talk with the newly created network node. Since all the gameplay code for Bitsquid games is written in Lua, setting up a multi-player game is just a matter of sending the right Lua commands over that connection.

Those commands will be game specific. In the network sample where I implemented this, there is a global variable called force_menu_choice which when set will force a selection in the in-game menus. We can use this to set up a network game:

require_relative 'nodes-config'

consoles = NODES.collect do |n| n.start_project("-host -project samples/network") end
server = consoles[0]
clients =[1..-1])
all =

puts "Waiting for exes to launch..."
puts "Launching steam..."
all.send_script %q{force_menu_choice = "Steam Game"}
server.send_script %q{force_menu_choice = "Create Lobby"}
clients.send_script %q{force_menu_choice = "Find Lobby"}
clients.send_script %q{force_menu_choice = "Niklas Test Lobby"}
server.send_script %q{force_menu_choice = "Start Game"}

This will start a Steam Lobby on the server, all the clients will search for and join this lobby and then the server will start the game.

Play the game

Playing the game is again just a question of sending the right script commands to expose the bugs you are interested in. In my case, I just tested spawning some network synchronized boxes:

server.send_script %q{
 local self = Sample.screen
 local camera_pos = Unit.world_position(self.world_screen.camera_unit, 0)
 local camera_forward = Quaternion.forward(Unit.world_rotation(self.world_screen.camera_unit, 0))
 local box_unit = World.spawn_unit(, "units/box/box", camera_pos)
 local box_id = GameSession.create_game_object(, "box", {position=camera_pos})
 self.my_boxes[box_id] = box_unit
 Actor.set_velocity(, 0), camera_forward*20)
clients.send_script %q{
 local self = Sample.screen
 local camera_pos = Unit.world_position(self.world_screen.camera_unit, 0)
 local camera_forward = Quaternion.forward(Unit.world_rotation(self.world_screen.camera_unit, 0))
 local box_unit = World.spawn_unit(, "units/box/box", camera_pos)
 local box_id = GameSession.create_game_object(, "box", {position=camera_pos})
 self.my_boxes[box_id] = box_unit
 Actor.set_velocity(, 0), camera_forward*20)

And that is really all. I also added some similar code for shutting down the gameplay session and returning to the main menu so that I could loop the test.

And 500 iterations later, running on the three machines on my desk, the bug was reproduced.

Friday, August 16, 2013

Finding nearby stuff

A problem that crops up quite often in game programming is the need to find stuff that is "close to" other stuff. For example, you may want to find all enemies close to the player, all treasures close to a goblin, etc.

Most recently, I ran into this problem when I added support for merging navigation meshes to the Bitsquid engine.

To merge meshes, I need to find all the vertices that are "sufficiently close" (within some tolerance limit) in the meshes that I'm merging. These vertices should be considered "the same" in the merged mesh and need to be treated specially.

The naive algorithm for finding these coinciding vertices is to just do a double loop:

foreach (v1 in vertices)
   foreach (v2 in vertices)
      if (distance(v1, v2) < tolerance)

But since this algorithm is O(n^2) in the number of vertices, it is often prohibitively expensive.

To improve the performance you need some kind of data structure for accelerating spatial queries. There are lots of different possibilities. Real-Time Collision Detection by Christer Ericsson has a good coverage of them.

One of the simplest approaches is to use some kind of grid bucket system. You divide the world into a grid and for each grid cell you store a list of the vertices that fall in that cell. To check for "close" vertices you look through the list of vertices in your cell:

Depending on what your definition of "close" is (how big the search radius is compared to the grid size), you may need to search more cells. Note that if the search starts close enough to a grid corner you always have to search at least four cells, no matter how small the search radius is. (For a two-dimensional grid, a three-dimensional grid requires you to search eight cells.)

If you know what the search radius is going to be (for example, in my vertex merging case I always use the same merge distance) you can adapt the grid so that you never have to search more than four cells, by setting the grid size to be equal to the search diameter.

With a larger grid size you can sometimes get by with searching just one or two cells, depending on how close to a grid line the search starts, but there will always be some positions where you will have to look at four cells.

The fact that you at most have to look at four cells can be used to make an interesting variant of the algorithm. Instead of checking the four neighboring cells in the lookup, you can store the item in all four neighboring cells. That way you will only have to check a single cell when you do the lookup. This approach will make lookups four times faster, insertions four times slower and use four times as much memory. It can be a good optimization if you have a high ratio of lookups to insertions.

For my particular case (vertex merging) I only have one lookup per insertion, so this would not be a net win.

Designing the grid can be tricky.

How large should it be? You may need to do a pre-pass over your data to find the range, which isn't possible if you don't have all the data up front. (Letting the grid coordinates wrap around can solve this in some cases.)

How big should the cells be? If you make them too small, the grid will consume too much memory. If you make them too big, you may end up with lots of points that you need to check in each cell, which will make the algorithm slow.

What if the data is laid out on a diagonal? In this case most grid cells will be empty, wasting memory:

Most of these concerns can be fixed by not storing the grid in a traditional matrix, but instead use a hash that maps from grid coordinates to cell data:

struct GridCoordinate {
   int x;
   int y;

HashMap<GridCoordinate, CellData> grid;

// To insert an item
GridCoordinate gc;
gc.x = (int)floor(p.x / cell_size);
gc.y = (int)floor(p.y / cell_size);
grid[gc] = cell_data;

This way, you will only use memory for the cells that actually contain data and lookups are still O(1). You also don't have to care about how big the grid is or what happens if data ends up outside the grid. In fact, you only have one parameter left to worry about, the cell_size.

As mentioned above, a good heuristic is to use:

float cell_size = 1.0 * search_diameter;

That way you have to investigate exactly four cells in each lookup.

If the data is sparse compared to the search radius you can use a bigger cell size, which means that you don't always have to check all four neighbors. But note that the average number of cells you need to search in each lookup goes down slowly, while the cell area (and correspondingly the average number of items in each cell) grows quickly. So only use this optimization when you know that the data is sparse.

Multiplier Avg. cells to search Cell area
1.0 4.00 1.00
1.5 2.78 2.25
2.0 2.25 4.00
3.0 1.78 9.00
5.0 1.44 25.00
10.0 1.21 100.00

The final piece of this puzzle is what CellData should look like. It might be tempting to do something like:

typedef Vector<VertexId> CellData;

However, this would be highly inefficient. In many cases cells will only contain a few items and allocating a Vector for them is total overkill. Using a Vector will mean tons of pointer chasing, cache misses and data copying.

For me, a general rule of thumb when writing high performance C++ code is to avoid collections stored inside other collections. If you store collections in collections it is very easy to create innocuous looking code that does tons of data copying behind your back.

So what can you do instead. If you have a good MultiHashMap implementation, you can use that. Otherwise, you can do something like this:

struct CellData {
   VertexId id;
   unsigned next; 
Array<CellData> items;

Here, items contains linked lists of the items in each cell, stored in a continuous array (which is the only sane way to store linked lists). The HashMap gives the first cell item. Then you follow the next references in the items list to find subsequent items in the same cell until next == UINT_MAX, which marks the end of the list.

This basic pattern: grid coordinates -> hash map -> linked list in array is my standard go-to solution for making spatial queries. It is straightforward, easy to understand and uses memory reasonably while providing O(1) lookups.

Tuesday, April 30, 2013

Code Share: Source Censoring, Part 2

A while ago I shared the tool we use for censoring source code in the Bitsquid engine.

Quick recap: We need to censor the code in our source distributions because there are parts of the code that are covered by NDAs to third parties and cannot be publicly disclosed. We do this with a tool that strips out the secret code and replaces it with blank lines, based on preprocessor definitions.

The stripping tool is only part of the solution, though. It works well if you only distribute code drops. You take a snapshot of the code, run the stripping tool to strip out secrets, zip it up. Done!

But frankly this is a terrible way of distributing source code. There is no history, no indication of what has changed from version to version and keeping local changes merged with the mainline is a constant pain in the ass.

The only sane way of distributing source code is to expose a mercurial (or git) source repository that you can pull changes from. This lets customers examine the history, find out which version introduced a particular bug, maintain their own branches that they merge with the mainline at their convenience, etc.

But of course, we cannot just share our own internal repository (because it contains secrets).


We handle this with another tool, that we have inventively decided to call hg-clone.rb.

What hg-clone.rb does is pretty straight forward. Given two repositories as argument, a SOURCE and a DESTINATION, it checks out each revision in the SOURCE repository, runs a filter program (to strip out any secrets) and checks the result into the destination repository.

SRC:    0  --> 1  --> 2  --> 3  --> 4  --> 5  --> ...
     |      |      |      |      |      |
     F      F      F      F      F      F
     |      |      |      |      |      |
        v      v      v      v      v      v
DST:    0' --> 1' --> 2' --> 3' --> 4' --> 5' --> ...

You call the program as

hg-clone SOURCE DESTINATION --filter FILTER --target TARGET-REV --cutoff CUTOFF-REV

SOURCE and DESTINATION are the source and destination repositories. DESTINATION does not need to exist, if it doesn't it will be created. FILTER is the filter program, it will be run once in the destination directory before each revision is committed.

TARGET-REV is the target revision that should be copied from the source to the destination. hg-clone will first transfer the parent(s) of the target revision to the destination repository (if they haven't already been transfered), then it will transfer the target revision. This process is applied recursively, so if the parents' parents haven't been transferred, they will be transferred first, etc. Only the revisions that are ancestors of TARGET-REV will be transferred, so you can have secret development branches that won't be copied to the destination until they have been merged with your release branch.

If you don't specify a TARGET-REV, the last revision in the source repository will be used.

CUTOFF-REV can be used to cutoff the recursive parent transfer at a specific revision. If you set the cutoff to revision 1000, then any revision that has a parent before revision 1000 will be reparented to revision 1000 in the destination repository. Essentially, in the destination repository, history will start at revision 1000. This can be used to hide a shady past.

hg-clone tries its best to preserve authors, dates, messages, branches, etc between the source and destination repositories. It cannot however preserve version numbers, since those are based on a content hash, which changes when the filter is applied. What it does instead is to insert a marker [clonedfrom:116:91fe33c1a569] in the commit message that specifies which revision in the source repository the current revision in the destination repository was cloned from. This commitment marker is also used to determine the mapping between revisions in the source and the destination and whether a particular revision has already been copied over or not.

To use this in practice, you would typically set up one external repository for each customer with a corresponding filter program for stripping out the things that customer is not allowed to see. Then you would set up a cron job to run hg-clone and copy revisions from the main repository to the customer's.

Instead of having one repository per customer, you could alternatively have one repository for each possible NDA combination (e.g., +PS3 +PS4 -X360). However, this can be problematic, because if a customer becomes disclosed for a platform you will have to switch them over to a new repository, which might be painful. If you have one repository per customer you can just change the filter function.

The hg-clone program is available from our bitbucket repository.

Friday, March 15, 2013

What is gimbal lock and why do we still have to worry about it?

If you have ever worked with rotations and Euler angles you are probably at least somewhat familiar with the phrase "gimbal lock". But like many things concerning rotations, angles and spaces it can be tricky to visualize and get a good grasp of.

Sometimes it feels like every time I need to think about gimbal lock I have forgotten everything about it and have to go back the beginning and ask myself: OK, but what is it really that is happening?

Hopefully, this article will take care of that problem.

The Wikipedia page shows how gimbal lock can happen in a mechanical system. But it isn't necessarily self-evident how this translates to the computer game world. In the computer there are no mechanical limitations, we can rotate an object however we like. How can anything be "locked"?

Euler angles

When we are using Euler angles, we represent an object's orientation as three consecutive rotations around the object's axes. We can choose the axes and the order in which we apply the rotations arbitrarily, and depending on what we choose we get different Euler representations. So XYZ is the Euler representation where the first angle rotates the object around its X-axis, the second around its (new) Y-axis and the third around its (new) Z-axis. YZX gives us a different representation. We can even have representations with repeated axes, such as XZX.

So if we want to talk about the "Euler angles" of an object, we really must also talk about what axes we are rotating around and in what order. Otherwise we have no idea at all what we are talking about. Unfortunately, many articles about Euler angles are pretty sloppy with this and throw around terms like yaw, pitch and roll as if they had completely well-defined and unambiguous meanings. I prefer to use more wordy, but descriptive names, such as euler_xyz[0] that unambiguously state the axis rotation order and the index of the angle we are talking about.

An object has three rotational degrees of freedom and it is quite easy to see that the three Euler angles for a particular axis order (XYZ) are enough to define any possible orientation of an object. Note though that the representation is not unique. There are many possible Euler angles that represent the same orientation. For example, adding 360 degrees to any of the three angles will give us a different representation that results in the same object orientation.

So "gimbal lock" doesn't mean that there are rotations that can't be expressed as Euler angles. We can express any rotation in Euler angle form. Given an object, we can convert its orientation to Euler angles, and from that orientation we can rotate the object however we like and convert the new orientation to other Euler angles.

So what exactly is it that is "locked"? It seems we can do whatever we like.

The "lock" in gimbal lock

The term "gimbal lock" comes from the mechanical world. If the problem had originated in the world of computers, it would probably have been called something less confusing, such as "Euler angle flip" or "coordinate singularity".

Because, in the computer world, there is really nothing that gets "locked". Instead, the problem is this: when the Euler angles have particular values, there are orientations that are very similar to the current orientation which can't be achieved by just making small changes to the Euler angles. In particular, this happens when one of the angles is at 90 degrees, so that two rotation axes coincide.

So even though the orientations are "close" in the real world, they are not close in the Euler representation. In fact, at least one of the Euler angles will have to flip 180 degrees in order for us to represent the new orientation.

So one of the angles have to flip? What is the big deal? Can't we just flip it and get on with our stuff?

We can, as long as the angles only represent instantaneous "snapshots" of the object's orientation. However, if the angles represent key frames in an animation and we want to interpolate between those key frames we run into trouble. If one of the angles flips 180 degrees between two key frames and we interpolate between those values, we will see the the object animating through all those 180 degrees. In the viewport, we will see the object doing a "flip" or "roll" that shouldn't be there.

Note that it is only the interpolation that creates this unwanted behavior. If we just displayed the actual key frames and didn't interpolate between them -- everything would look right. We could work in Euler angles as much as we liked and be as close to the gimbal lock position as we wanted and no-one would ever know.

So the only thing we need to fix to get rid of gimbal lock is the interpolation. If you have done any work with 3D graphics you probably already know the answer -- to use quaternions instead of Euler angles to represent angles. Quaternions don't have the weird singularity points that Euler angles have and we can interpolate between any keyframes by just lerping the quaternions.

It doesn't matter if the animation package is using Euler angles internally, as long as we convert everything to quaternions before we do the interpolation. Note that interpolation in quaternion space is not the same as interpolation in Euler space though, so to get as close as possible to what the animator intended, we probably want to sample the animation at our target frame rate and generate our quaternion key frames from those samples, rather than directly converting the animator's key frames (which may be further apart).

Well, there is one caveat actually. If we have more than 180 degrees of rotation in a single frame we can't represent that nicely with quaternions. Quaternions always lerp the shortest path between two orientations and you can't represent several "laps" of rotation with quaternions as you can do with Euler angles (by setting one of the angles to 9000 degrees, for example). But you can fix that by sampling at a higher frame rate if you need to represent really fast rotations with quaternions.

So with that we can say good bye and good riddance to Euler angles and never have to worry about their sorry gimbal locking asses ever again.

Or so you may think...

The return of gimbal lock

I certainly thought so, until I started working on the new cutscene animation system for our level editor.

You see, animators really like to work with curves. They like to see a visual representation of what the animation will do to an object over time with key points that can be moved and handles that can be adjusted to change the slope of the curve.

Curves! Animators love them!

Quaternions are great for interpolation, but they are no good for curve editing.

Sure, you could probably draw some curves that represented a quaternion (the laziest thing would be to just draw the x, y and z components of the quaternion), but those curves wouldn't mean anything to an animator, the way the Euler angle curves do. They wouldn't be able to do anything with them.

So, animators want curves with keyframe interpolation. Curves need Euler angles. But what happens when we mix Euler angles with keyframe interpolation? Presto! Our old friend the gimbal lock is back again! Haven't we missed him.

That's it. We're stuck. Gimbal lock is here to stay and the animators will just have to work around it.

And we have to add support for all the usual tricks and workarounds that animators use to get around gimbal lock, such as changing the axis order (from XYZ, to XYZ, XZX or another of the twelve possible permutations), converting to quaternion and back again, applying an "Euler filter", etc.

But who said this game engine gig should be easy?

Friday, March 1, 2013

A Bug in Object Replication and Message Reordering

The Bitsquid network system supports a peer-to-peer model with migration of network objects -- i.e., changing the owner of a network object from one peer to another. This recently lead to an rare race condition.

To understand this bug you must first understand a little bit about how our network system works.


The entire network model is based on an packet delivery system (on top of UDP) that provides ACKs for unreliable packets as well as a reliable (and ordered) packet stream between any two network endpoints. At the next layer we have implemented a remote-procedure-call service for Lua as well as an object replication system.

Games can use these services however they like, but our recommendation is to do as much as possible with the object replication system and as little as possible with RPC calls, since using explicit RPC messages tends to require more bandwidth and be more error prone.

The network be run in both client-server and peer-to-peer mode. The only difference is that in client-server mode, the server relays all messages (clients never talk directly to each other) and owns most of the network objects. (Clients can own objects in client-server mode, in that case the changes to the objects are relayed by the server.)

Objects are replicated with a message stream that looks something like this:


Here, A (the owner of the object) first sends a reliable message that creates the object. When it has received an ACK for that message, it starts sending updates, informing the other players about changes to the object. (By monitoring ACKs, it knows which changes the other players have received, so it only sends updates when necessary and it will resend updates if the messages are lost.) Finally, at some future point, the object is destroyed, through another reliable message.

The UPDATE messages are sent on the unreliable stream (for maximum performance), so they can potentially arrive before CREATE or after DELETE. But this is not a problem, because we simply ignore UPDATE messages that arrive out of order.

This approach ensures that everybody that participates in the game session will see the same set of objects with the same properties (at least eventually, keeping in mind that messages can be delayed).


Migration complicates this picture somewhat.

Migrating a network object means changing the owner of the object from one peer to another. There are a number of reasons why you might want to do that. First, if a player drops out of the game, the objects owned by that player may need to be taken over by somebody else. Second, in a peer-to-peer game we may want to load balance, so that each peer is managing about the same amount of objects. Finally, sometimes a particular player is interacting directly with a particular object (picking up a rock, etc). It can then be beneficial to make that player owner of the object, so that the interaction is not affected by network latency.

In our network, migration is implemented with a reliable MIGRATION message that tells everybody in the session about the object's new owner. The migration message is always sent by a special peer, the HOST of the game session. (To ensure that peers do not compete for the ownership of an object.)

So if we look at a message stream with migration involved, it looks something like this:

   A:  C Ua Ua Ua Ua Ua
HOST:                    M_ab
   B:                          Ub Ub Ub Ub Ub Ub Ub Ub D

If you are an experienced network programmer you should start to smell trouble at this point.

The problem is that while the message system provides an ordered stream of messages between any two endpoints, there is no ordering of messages between different endpoints.

Consider an additional network peer X. There is an ordered stream of messages A → X. There is also an ordered stream of messages B → X. But there is no guaranteed ordering between the messages sent from A and the messages sent from B and HOST. So, suppose the messages from A → X are delayed. Then X could see the following message stream:

M_ab Ub Ub Ub D C Ua Ua Ua

So X gets a request to migrate the object before it has been created. And the creation message arrives after DELETE. In other words, a complete mess.

To be sure, this only happens if the object gets migrated really close to being created or deleted and if there are asymmetric network delays on top of that. But of course, it always happens to someone.

The Fix

There are many possible ways of fixing this. Here are some:

  • We could impose a global message ordering. We could make sure that the reliable message streams are globally ordered to prevent "paradoxes" of this kind. I.e., if HOST sends M_ab after receiving C, no peer should receive M_ab before C. Unfortunately, this is not as easy as it sounds. For example, what if A dies before it has sent C to X? In that case, that failed delivery will also block the channels HOST → X and B → X, since they are not allowed to deliver any messages before X has received C.

  • We could use a migration handshake. We could do some kind of handshake procedure to make sure that everybody has received M_ab, before B takes over ownership. But this would require a lot of extra messages and temporarily put the object in limbo.

  • We could fix the ACKs. We could make it so that X doesn't ACK M_ab until C has arrived, thus forcing HOST to keep resending it, until we are ready to receive it. This would work, but would require us to implement ACKing of individual messages. Currently, we just ACK an entire UDP packet (containing many messages) on reception, which is simpler and more performant.

  • We could create an internal message queue. We could queue up migration, create and delete messages in some sort of internal queue if they arrive out of order and try to fix things up later. This is a truly horrible "solution" that increases code complexity and is likely to cause lots of confusing bugs in the future.

All these solutions are probably workable, but they all have the drawback of increasing complexity. And I really don't like to increase the complexity of network code. Reasoning about network code is hard enough as it is, we should always strive for the simplest solution possible.

So, instead, the first thing I did was to simplify the problem by eliminating the host from the equation. I simply let the new owner send out the migration message instead of the host:

   A:  C Ua Ua Ua Ua Ua
   B:                    M_ab Ub Ub Ub Ub Ub Ub Ub Ub D

This is already a lot better. Now we only have two parties to worry about (apart from X), instead of three.

We still want the host to be in charge of migration. Otherwise we run into tricky problems of what should happen if several peers try to assume ownership of an object at the same time. So we let the host initiate the migration by sending a message to the new owner (B). Then, B is responsible for notifying everybody else about this.

With this approach, we can use the same "wait for ack" trick that we used during creation to make sure that B doesn't send any updates to peers that haven't acked the migration:

   A:  C [wait] Ua Ua Ua Ua Ua
   B:                            M_ab [wait] Ub Ub Ub Ub Ub Ub Ub Ub D

We still haven't completely solved the problem, X can still see weird message orderings such as:

M_ab   C   D
M_ab   D   C

But this won't be a problem as long as we do two things:

  • We let MIGRATE act as a CREATE message, if we get MIGRATE for an object that doesn't exist.

  • We ignore "old" CREATE messages. (The C that arrives after M.)

To be able to distinguish old messages I introduced a migration counter. This is just a number that starts at zero when the object is created and is increased (by HOST) every time the object is migrated.

We tag all CREATE, DESTROY and MIGRATE messages with the migration counter and simply ignore "old" messages. With this approach, the message streams will look like this:

   A:  C_0 [wait] Ua Ua Ua Ua Ua
   B:                             M_ab_1 [wait] Ub Ub Ub Ub Ub Ub Ub Ub D_1

We can now verify that all possible message orderings that X can see work correctly:

C_0      M_ab_1  D_1  -- ok, the expected order
M_ab_1   C_0     D_1  -- ok, M_ab_1 creates the object with migration counter 1 and C_0 is ignored
M_ab_1   D_1     C_0  -- ok, M_ab_1 creates the object with migration counter 1 and C_0 is ignored

The system works equally well if there are multiple migration steps:

   A:  C_0 [wait] Ua Ua 
   B:                   M_ab_1 [wait] Ub Ub Ub
   C:                                            M_bc_2 [wait] Uc Uc Uc D_2

No matter in which order the messages arrive we will end up in the correct state.

Tuesday, February 19, 2013

Why Lua?

A question that I get asked regularly is why we have chosen Lua as our engine scripting language. I guess as opposed to more well-known languages, such as JavaScript or C#. The short answer is that Lua is lighter and more elegant than both those languages. It is also faster than JavaScript and more dynamic than C#.

When we started Bitsquid, we set out four key design principles for the engine:

  • Simplicity. (A small, manageable codebase with a minimalistic, modular design.)

  • Flexibility. (A completely data-driven engine that is not tied to any particular game type.)

  • Dynamism. (Fast iteration times, with hot reload of everything on real target platforms.)

  • Speed. (Excellent multicore performance and cache-friendly data-oriented layouts.)

Whenever we design new systems for the engine, we always keep these four goals in mind. As we shall see below, Lua does very well on all four counts, which makes it a good fit for our engine.

Simplicity in Lua

As I grow older (and hopefully more experienced) I find myself appreciating simplicity more and more. My favorite scripting language has gone from "Swiss army chainsaw" Perl (I claim youthful ignorance!) to "kitchen-drawer-esque" Ruby, to minimalistic Lua.

Lua is really small for a programming language. The entire Lua syntax fits on a single page. In fact, here it is:

chunk ::= {stat [`;´]} [laststat [`;´]]
block ::= chunk
stat ::=  varlist `=´ explist | 
     functioncall | 
     do block end | 
     while exp do block end | 
     repeat block until exp | 
     if exp then block {elseif exp then block} [else block] end | 
     for Name `=´ exp `,´ exp [`,´ exp] do block end | 
     for namelist in explist do block end | 
     function funcname funcbody | 
     local function Name funcbody | 
     local namelist [`=´ explist] 
laststat ::= return [explist] | break
funcname ::= Name {`.´ Name} [`:´ Name]
varlist ::= var {`,´ var}
var ::=  Name | prefixexp `[´ exp `]´ | prefixexp `.´ Name 
namelist ::= Name {`,´ Name}
explist ::= {exp `,´} exp
exp ::=  nil | false | true | Number | String | `...´ | function | 
     prefixexp | tableconstructor | exp binop exp | unop exp 
prefixexp ::= var | functioncall | `(´ exp `)´
functioncall ::=  prefixexp args | prefixexp `:´ Name args 
args ::=  `(´ [explist] `)´ | tableconstructor | String 
function ::= function funcbody
funcbody ::= `(´ [parlist] `)´ block end
parlist ::= namelist [`,´ `...´] | `...´
tableconstructor ::= `{´ [fieldlist] `}´
fieldlist ::= field {fieldsep field} [fieldsep]
field ::= `[´ exp `]´ `=´ exp | Name `=´ exp | exp
fieldsep ::= `,´ | `;´
binop ::= `+´ | `-´ | `*´ | `/´ | `^´ | `%´ | `..´ | 
     `<´ | `<=´ | `>´ | `>=´ | `==´ | `~=´ | 
     and | or
unop ::= `-´ | not | `#´

The same minimalistic philosophy is applied across the entire language. From the standard libraries to the C interface to the actual language implementation. You can understand all of Lua by just understanding a few key concepts.

Lua's simplicity and size does not mean that it lacks features. Rather it is just really well designed. It comes with a small set of orthogonal features that can be combined in lots of interesting ways. This gives the language a feeling of elegance, which is quite rare in the programming world. It is not a perfect language (perfect languages don't exist), but it is a little gem that fits very well into its particular niche. In that way, Lua is similar to C (the original, not the C++ monstrosity) -- it has a nice small set of features that fit very well together. (I suspect that Smalltalk and LISP also have this feeling of minimalistic elegance, but I haven't done enough real-world programming in those languages to really be able to tell.)

As an example of how powerful Lua's minimalism can be, consider this: Lua does not have a class or object system, but that doesn't matter, because you can implement a class system in about 20 lines or so of Lua code. In fact, here is one:

function class(klass, super)
    if not klass then
        klass = {}
        local meta = {}
        meta.__call = function(self, ...)
            local object = {}
            setmetatable(object, klass)
            if object.init then object:init(...) end
            return object
        setmetatable(klass, meta)
    if super then
        for k,v in pairs(super) do
            klass[k] = v
    klass.__index = klass
    return klass

If you prefer prototype based languages -- no problem -- you can make a prototype object system in Lua too.

Smallness and simplicity makes everything easier. It makes Lua easier to learn, read, understand, port, master and optimize. A project such as LuaJIT -- created by a single developer -- would not have been possible in a more complicated language.

Flexibility in Lua

Lua is a fully featured language, and in the Bitsquid engine, Lua is not just used as an extension language, rather it has direct control over the gameplay loop. This means that you have complete control over the engine from Lua. You can create completely different games by just changing the Lua code. (Examples: First person medieval combat War of the Roses, top-down RTS Krater, beat-em-up platformer Showdown and hand-held puzzler Hamilton.)

Dynamism in Lua

Unlike C#, which only has limited support for Edit and Continue, Lua makes it possible to reload everything -- the entire program -- on all target platforms, including consoles, mobiles and tablets.

This means that gameplay programmers can work on the code, tweak constants, fix bugs and add features without having to restart the game. And they can do this while running on the real target hardware, so that they know exactly what performance they get, how the controls feel and how much memory they are using. This enables fast iterations which is the key to increasing productivity and improving quality in game development.

Speed of Lua

Measuring the performance of a language is always tricky, but by most accounts, LuaJIT 2 is one of the fastest dynamic language implementations in the world. It outperforms other dynamic languages on many benchmarks, often by a substantial margin.

On the platforms where JITting isn't allowed, LuaJIT can be run in interpreter mode. The interpreter mode of LuaJIT is very competitive with other non-JITed language implementations.

Furthermore, Lua has a very simple C interoperability interface (simplified further by LuaJIT FFI). This means that in performance critical parts of the code it is really easy to drop into C and get maximum performance.

Lua's weak points

As I said above, no language is perfect. The things I miss most when programming in Lua don't have that much to do with the actual language, but rather with the ecosystem around it. C# has spoiled me with things like an integrated debugger, Intellisense, a very active StackOverflow community and the wonderfully helpful ReSharper. Lua has no "official" debugger, and not much in the way of autocompletion or refactoring tools.

Some people would argue that this shouldn't be counted as an argument against Lua, since it doesn't really concern the language Lua. I disagree. A language is not a singular, isolated thing. It is part of a bigger programming experience. When we judge a language we must take that entire experience into account: Can you find help in online forums? Are there any good free-to-use development tools? Is the user base fragmented? Can you easily create GUIs with native look-and-feel? Etc.

The lack of an official debugger is not a huge issue. Lua has an excellent debugging API that can be used to communicate with external debuggers. Using that API you can quite easily write your own debugger (we have) or integrate a debugger into your favorite text editor. Also, quite recently, the Decoda IDE was open sourced, which means there is now a good open source debugger available.

Getting autocompletion and refactoring to work well with Lua is trickier. Since Lua is dynamically typed the IDE doesn't know the type of variables, parameters or return values. So it doesn't know what methods to suggest. And when doing refactoring operations, it can't distinguish between methods that have the same name, but operate on different types.

But I don't think it necessarily has to be this way. An IDE could do type inference and try to guess the type of variables. For example, if a programmer started to write something like this:

local car = Car()

the IDE could infer that the variable car was of type Car. It could then display suitable autocompletion information for the Car class.

Lua's dynamic nature makes it tricky to write type inference code that is guaranteed to be 100 % correct. For example, a piece of Lua code could dynamically access the global _G table and change the math.sin() function so that returned a string instead of a number. But such examples are probably not that common in regular Lua code. Also, autocompletion backed by type inference could still be very useful to the end user even if it wasn't always 100 % correct.

Type inference could be combined with explicit type hinting to cover the cases where the IDE was not able to make a correct guess (such as for functions exposed through the C API). Hinting could be implemented with a specially formatted comment that specified the type of a variable or a function:

-- @type Car -> number
function top_speed(car)

In the example above, the comment would indicate that top_speed is a function that takes a Car argument and returns a number.

Type hinting and type inference could also be used to detect "type errors" in Lua code. For example, if the IDE saw something like this:

local bike = Bicycle()
local s = top_speed(bike)

it could conclude that since bike is probably a Bicycle object and since top_speed expects a Car object, this call will probably result in a runtime error. It could indicate this with a squiggly red line in the source code.

I don't know of any Lua IDE that really explores this possibility. I might try it for my next hack day.

Thursday, January 31, 2013

Garbage Collection and Memory Allocation Sizes

As a performance conscious programmer in a soft-realtime environment I've never been too fond of garbage collection.

Incremental garbage collectors (like the one in Lua) make it tolerable (you get rid of the horrible garbage collection stalls), but there is still something unsettling about it. I keep looking at the garbage collection time in the profiler, and I can't shake the feeling that all that time is wasted, because it doesn't really do anything.

Of course that isn't true. Garbage collection frees the programmers from a lot of busywork. And the time they gain can go into optimizing other systems, which leads to a net performance win.

It also simplifies some of the hairy ownership questions that arise when data is transferred between systems. Without garbage collection, those questions must be solved in some other way. Either by reference counting (error-prone) or by making local copies of the data to assume ownership (ugly and costly).

But still, there is that annoying performance hit.

I was pretty surprised to see that the developers Go, a language that looks well-designed and targets low-level programmers, decided to go with garbage collection rather than manual memory management. It seemed like a strange choice.

But recently I've started to see things differently.

One thing I've noticed as I delve deeper and deeper into data-oriented design is that I tend to allocate memory in much larger chunks than before. It's a natural consequence of trying to keep things continuous in memory, treating resources as large memory blobs and managing arrays of similar objects together.

This has interesting consequences for garbage collection, because when the garbage collector only has to keep track of a small number of large chunks, rather than a large number of small chunks, it can perform a lot better.

Let's look at a simple example in Lua. Say we want to write a class for managing bullets. In the non-data-oriented solution, we allocate each bullet as a separate object:

function Bullet:update(dt)
    self.position = self.position + self.velocity * dt

function Bullets:update(dt)
    for i,bullet in ipairs(self.bullets) do

In the data-oriented solution, we instead use two big arrays to store the position and velocity of all the bullets:

function Bullets:update(dt)
    for i=1,#self.pos do
        self.pos[i] = self.pos[i] + dt * self.vel[i]

I tested these two solutions with a large number of bullets and got two interesting results:

  • The data-oriented solution runs 50 times faster.

  • The data-oriented solution only needs half as much time for garbage collection.

That the data-oriented solution runs so much faster shows what cache coherence can do for you. It is also a testament to how awesome LuaJIT is when you give it tight inner loops to work with.

Note that in this test, the Bullet code itself did not create any garbage. The speed-up comes from being faster at collecting the garbage created by other systems. And the reason for this is simply that with fewer, larger memory allocations, there is less stuff that the garbage collector has to trawl through. If we add in the benefit that the data-oriented solution will create fewer objects and generate less garbage, the benefits will be even greater.

So maybe the real culprit in isn't garbage collection, but rather having many small memory allocations. And having many small memory allocations does not just hurt the garbage collector, it is bad for other reasons as well. It leads to bad cache usage, high overhead in the memory allocator, fragmentation and bad allocator performance. It also makes all kinds of memory problems harder to deal with: memory leaks, dangling pointers, tracking how much memory is used by each system, etc.

So it is not just garbage-collected languages like Lua that would benefit from allocating memory in larger chunks, but manually managed languages like C++ as well.

Recently, I've come to think that the best solution to memory management issues in C++ is to avoid the kitchen-sink global memory allocator as much as possible and instead let each subsystem take a much more hands-on approach to managing its own memory.

What I mean by this is that instead of having the sound system (for example) send lots of memory requests to the kitchen-sink memory manager, it would only request a few large memory blocks. Then, it would be the responsibility of the system to divide that up into smaller, more manageable pieces that it can make practical use of.

This approach has a number of advantages:

  • Since the system knows the usage patterns for its data, it can arrange the memory efficiently. A global memory allocator has no such knowledge.

  • It becomes much easier to track memory use by system. There will be a relatively small number of global memory allocations, each tagged by system. It becomes obvious how much memory each system is consuming.

  • Memory inside a system can be easily tracked, since the system knows what the memory means and can thus give useful information about it (such as the name of the object that owns it).

  • When a system shuts down it can quickly and efficiently free all of its memory.

  • Fragmentation problems are reduced.

  • It actively encourages good memory behavior. It makes it easier to do good things (achieve cache locality, etc) and harder to do bad things (lots of small memory allocations).

  • Buffer overflows will tend to overwrite data within the same system or cause page faults, which will make them easier to find.

  • Dangling pointer access will tend to cause page faults, which will make them easier to find.

I'm tempted to go so far as to only allow whole page allocations on the global level. I.e., a system would only be allowed to request memory from the global manager in chunks of whole system pages. Then it would be up to the system to divide that up into smaller pieces. For example, if we did the bullet example in C++, we might use one such chunk to hold our array of Bullet structs.

This has the advantage of completely eliminating external fragmentation. (Since everything is allocated in chunks of whole pages and they can be remapped by the memory manager.) We can still get address space fragmentation, but using a 64-bit address space should take care of that. And with this approach using 64-bit pointers is less expensive, because we have fewer individually allocated memory blocks and thus fewer pointers.

Instead we get internal fragmentation. If we allocate the bullet array as a multiple of the page size (say 4 K), we will on average have 2 K of wasted space at the end of the array (assuming the number of bullets is random).

But internal fragmentation is a much nicer problem to deal with than external fragmentation. When we have internal fragmentation, it is one particular system that is having trouble. We can go into that system and do all kinds of things to optimize how its handling memory and solve the problem. With external fragmentation, the problem is global. There is no particular system that owns it and no clear way to fix it other than to try lots of things that we hope might "improve" the fragmentation situation.

The same goes for out-of-memory problems. With this approach, it is very clear which system is using too much memory and easy to fix that by reducing the content or doing optimizations to that system.

Dealing with bugs and optimizations on a system-by-system simplifies things enormously. It is quite easy to get a good grasp of everything that happens in a particular system. Grasping everything happens in the entire engine is a superhuman task.

Another nice thing about this approach is that it is quite easy to introduce it on a system-by-system basis. All we have to do is to change one system at a time so that it allocates its memory using the page allocator, rather than the kitchen-sink allocator.

And if we have some messy systems left that are too hard to convert to this approach we can just let them keep using the kitchen-sink allocator. Or, even better, we can give them their own private heaps in memory that they allocate from the page allocator. Then they can make whatever mess they want there, without disturbing the other systems.