Tuesday, December 11, 2012

Four meditations on bad design decisions

I've recently been doing a major rewrite of one of our core engine systems, the graph that we use for our visual scripting language Flow. Taking it from something that looks like this:

To something that looks like this:

A major rewrite like this is always a humbling experience. When you have to rewrite your own code, every bad decision you made comes back to haunt you. And you don't have anybody else to blame them on.

As if facing your own inadequacy wasn't enough -- rewriting an existing system is always harder than writing one from scratch. When you write a new system you start with a blank slate and can do whatever you want. When you rewrite, you are constrained by what the old system did -- at least if you want to maintain any kind of backwards compatibility.

In addition, a new system can be written iteratively. You can start with a very small, simple system, release early and get feedback. Based on that feedback you can tweak the system. You don't have to think about adding features until you have a good stable base.

When you are doing a rewrite you can't release the new system until it is at least as good as the old one. Otherwise, your users will question why you have spent all that time working on a system that is worse than what you had before. And they will be right.

So a rewrite forces you away from the comfortable land of early releases and quick iterations and into the ugly old waterfall model.

With the power of hindsight, I'd like to reflect a bit on four design mistakes I made when I wrote the first version of the system that made this rewrite a lot harder than it could have been.

Don't use strings for non-text things

Strings have one really good use -- to hold pieces of text that either gets displayed to or inputted by the user. All other use of strings should be regarded as suspicious.

Strings are scary because they are both ambiguous and powerful. Does "a/b.txt" and "A//b.txt" represent the same path? Hard to tell. But maybe you can use case conversion, search and replace and some regular expression monstrosity to figure that out.

If you are doing that kind of string manipulation in any part of the code that is not directly related to user input or output, it is a clear warning sign that your code might be too "stringified".

The most obvious example stringified code is the use of "stringly typed" data, for example, storing a date as the string "2012-12-09". But the problem with strings can also manifest more subtle ways.

The original version of Flow used strings to identify connectors, both internally (as a representation of the connection) and visually (to show the name of the connector):

As a consequence, a Flow node couldn't have two connectors with the same name, and a connector couldn't be renamed (even visually) without breaking all existing connections.

In retrospect, rather than having a single Name property, it would be much better to have separate Id and DisplayName properties. The Id would be a GUID that uniquely identified the property, and the DisplayName would be a (localizable) name, suitable for displaying to the end user.

Using names/strings as identifiers has bitten me in other ways as well. In one system I knew that the names had to be unique (because that is how the script would refer to the objects) so I thought it would be safe to use them as identifiers. What I didn't consider was that there could be situations when there temporarily were two objects that had the same name. For example, if the user had created a rock object, and wanted to create a rock_small object -- as she was half-way through typing that name, there would suddenly be two objects named rock. This created problems for the system.

Lesson learned, I now avoid using strings as identifiers.

When in doubt, you should opt-out

Every system acquires features over time. That is good of course. Those features make the system more powerful and easier to work with.

But among the good features there are usually a few that don't feel quite right. That don't really fit into the design of the system. You can do them of course. You can do anything.

But usually it is best not to. Most of the time when I have added a feature that didn't quite feel right, I have regretted it later. In retrospect it would have been better to try to find a different way of doing what the users wanted that was more natural to the ideas behind the system.

An example: Users of Flow wanted some way to specify the order in which events were triggered, when multiple connections are connected to the same Out connector. This is needed in some situations, for example you may want to make sure that a unit is spawned before it is used.

In the old version of Flow, this was implemented with a context menu on the connection where you could select if it should be a "Do First", "Do Last" or "Do Normal" connection.

This solution never felt 100 % right to me. It was hard to find a good intuitive way to visually represent the "Do First" and "Do Last" connections, and as a result the Flow graphs became harder to understand.

In retrospect, it would have been much better to avoid this feature and wait until I had come up with the more elegant alternative: a sequence node that triggers each of its outputs sequentially:

Be explicit or you'll miss it

Writing code where a lot of things happen implicitly feels great -- to begin with. It is amazing how much you are able to do with just a few lines of code.

But in my experience, implicit code almost always ends up more costly in the long run. It is harder to understand, harder to debug and harder to change. It tends to lock you down in a "local minimum" that can be tricky to come out of.

In Flow, a lot of things are done implicitly. The definition of a Flow node is just a simple C# class:

[Category("Animation")]
public class AnimationEvent : Node
{
    public InVariableUnit Unit;
    public StringVariable Event;
    public InEvent In;
    public OutEvent Out;
}

Through reflection, Flow finds out the members in the class and their types and automatically generates Flow nodes for them. This process involves some ugly string processing (bad decision #1), such as stripping In and Variable from the type name to find the underlying type of members. Reflection is also used to serialize the graphs.

While it is nice to be able to express a node so concisely, there are also a lot of problematic consequences. For example, since the class names get serialized, we can't change the names of classes or properties without breaking the ability to load old files. Also, we have to use some really ugly C# hacks to make sure that the reflection system always returns the members of a class in the order they are declared in the file (so that we can control the order of the connectors).

In retrospect, it would been much better to avoid all this clever reflection stuff and instead just define the node types in configuration files.

Avoid the road of complex code

There is some code that needs to be complex, because it is dealing with fundamentally tricky stuff (like computational geometry) or because it needs to run really, really fast. But in all other cases, complexity is just a cost.

If your code starts to feel complex and hard to keep track of, it is a sign that you are probably doing something wrong. And if you are not careful, you may lock yourself in, so that when you write the next version of the system, you have to recreate all that complex behavior in your new, simpler system. You have to deliberately make your code uglier.

The old version of Flow had a way of "folding" nodes. You could select a number of nodes, group them, and then "fold" the group, collapse it to a single node.

The system had a lot of really hairy code for dealing with this. The code takes a bunch of nodes and creates a new node from them, with connectors matching only the external connectors of the collapsed nodes. it also keeps track of the internal nodes and their connections so they can be recreated if the node is later "expanded".

As you might imagine, this was complicated further by the need for connector names to be unique (see bad decision #1), which meant that some of the external connectors in the new node had to be renamed (since they could come from different internal nodes that had connectors with the same name). So a mapping table was needed to keep track of these renames. Obviously a bad idea, but once you have started down the path of wrongness, it can be hard to turn around.

The new version handles this a lot better. Collapse and expansion is just a visual feature. There are no new nodes created and no other strange things happening to the data, the visualizer just chooses to draw the data in a different way when it is collapsed. In retrospect, that is a much better choice.

That is all, four simple lessons
to guide your future coding sessions
now let your code be light and merry
until its time for Charon's ferry

10 comments:

  1. I don't think it was a bad idea to use reflection, as a user I'm not fond of using configuration files when everything I need was in C#. Couldn't you achieve the same effect with some custom attributes on fields?

    Also, once I had to use the fields in the order they were defined and I believe if you sort by the MetadataToken you could achieve that.

    ReplyDelete
  2. You can, and maybe it is "the C# way"... but I'm not very fond of that approach. I think it muddles responsibilities.

    The end result if you go down this road is that simple classes, such as Vector3 get a lot of attributes related to serialization, how the class will be displayed by the GUI etc. It doesn't feel like they belong there. What if some project that doesn't even have a GUI wants to make use of the Vector3 class? What if we have different GUIs that want to display the Vector3 class in different ways?

    ReplyDelete
  3. Maybe you can separate these situation specific attributes as another class, example of a "view" class for Vector3:

    public class View
    {
    public object Target { get; set; }
    public virtual Draw();
    }

    [CustomView(typeof(Vector3))]
    public class Vector3View : View
    {
    public override Draw() {
    Vector3 v = (Vector3)Target;
    Display(Vector3.x);
    Display(Vector3.y);
    Display(Vector3.z);
    }
    }

    You can scan the assemblies and look for these CustomView atributes, effectively decoupling rendering information from the Vector3 class itself.

    This is a common pattern used in Unity 3D, take a look at http://blogs.unity3d.com/2012/09/07/property-drawers-in-unity-4/

    Does that makes sense?

    ReplyDelete
  4. - Don't use strings for non-text things
    Having a DisplayName separate from the Id field makes total sense. Making the Id field a Guid is only one solution though. In many cases it's still okay and much more readable to use a string for the Id field as long as it doesn't change.

    - When in doubt, you should opt-out
    Sometimes features are not done right the first time. But I think a rewrite is the perfect time to re-do those features properly. Opting out probably wasn't the right thing to do a the time because it's a feature users wanted. Doing it better may have been worth the time.

    - Be explicit or you'll miss it
    I agree that being explicit is often better. There's a use for every tool though, and sometimes implicit is the right tool. Since it resulted in hacky code, in there probably is a better design.

    - Avoid the road of complex code
    I remember reading a quote once that said something like:
    "There are no complex problems, only complex solutions"

    ReplyDelete
    Replies
    1. For Ids there are two cases I think. One is when you have to uniquely identify user created objects. In that case I think that GUIDs are the easiest way and also the way that most clearly states the intended purpose.

      The other case is when the Id's essentially acts as enums -- specifies one of a limited set of predefined options. In that case I agree that it is better to use a predefined static string (e.g. "bold") to identify the option when you save the file to disk (otherwise your save files will be as incomprehensible as RegEdit).

      But I still don't think it is a very good idea to use the string "bold" internally to represent bold text. It is neither clear or readable and likely to cause a lot of confusion. I would much rather read the string "bold" from the file, convert it to an enum (Styles.Bold) and use that in the code.

      Delete
  5. Nice post, Niklas! It would be really interesting to know more about real world uses of Flow.

    What seems inconvenient to me in Flow's approach to implementing game logic is it's event driven basis. As I understand Flow's program might have several entry points - events which connects the program to the engine (or lua scripting layer). Also user is able to make really complex networks as some nodes may have a lot of connections and making cycles in graph is also acceptable. It seems that it isn't so easy to debug such a tangle of nodes. User must be aware of event driven programming pitfalls to make relatively complex programs. But the language is targeted to non-programmers.

    That is why I'm so interested in real use cases of Flow. What is it used for in real projects (AI, game specific logic, camera's behavior and so on)? How do you (or engine users) manage Flow's programs in large projects? Is it useful for non-programmers?

    ReplyDelete
    Replies
    1. Flow is heavily used by artists, animators, level-designers and other non-programmers. It is not intended for use by programmers (they work in Lua).

      I'm not sure why you think the event model is complicated or what you would like to see instead. Note that Flow nodes are not "active". They don't do any processing unless triggered by an event. So the flow is:

      input event (physics collision, animation, etc) ---> some logic --> result (play animation, effect, sound, etc)

      That is pretty straightforward.

      Flow is not intended for "programming in the large" or for solving really tricky problems (for that you want real programmers and a real programming language). It is for connecting things up, with some simple logic.

      Delete
    2. I thought it's intended to be used in more complex cases which programmers solve with FSMs or Behavior Trees usually. But they are more suited for active objects. It definitely makes sense to use this approach in such simple cases like triggers, I agree with you.

      Thank you for the explanation

      Delete
  6. This is somewhat off-topic, but I'm curious how you evaluate the nodes with respect to multiple output connections. If a content creator doesn't use a sequence node to specify the order, does that imply they don't care or don't know which outputs will be executed in which order? Is there a reason to allow this ambiguity i.e. should you enforce that every output can only have 1 output link and if they want more, they should output to a sequence node?

    Additionally, do you evaluate nodes in a depth or breadth first way? That is, if you have a sequence node with 3 outputs, do you evaluate the nodes connected to output 1 then 2 then 3? Or do you evaluate the node connected to output 1 then evaluate its outputs, etc.?

    Some actions may also have logic that occurs over time and cannot finish when initially evaluated. Have you run into content creators finding themselves in situations where race conditions occur with their logic? Do you have best practices or standards for dealing with this?

    Sorry for the amount of questions all in one post. I'm interested if you encounter these types of problems and how you handle them. Thanks!

    ReplyDelete
    Replies
    1. If you have multiple nodes connected to the same output event they are triggered in unspecified/random order. This works well as a default, because usually the order doesn't matter. If you trigger a sound and a particle effect as the response to some action, it doesn't matter which happens first, since they both happen in the same frame. In the few cases where order DOES matter, you can use a sequence node.

      I use depth first evaluation.

      Nodes in a flow network are not "active". I. e., they have no update() action. They only react to impulses. But they can be connected to "active" external system. For example, the Delay node queues an event with a time system and gets an impulse when the specified time has expired.

      Since everything happens with "impulses", it is pretty easy to reason about, which reduces the risk of confusing race situations.

      Delete