Wednesday, October 14, 2009

Parallel rendering

I've spent the last week designing and implementing the low-level parts of the renderer used in our new engine. One of the key design principles of the engine is to go as wide (parallel) as possible wherever possible. To do that in a clean and efficient way, a good data streaming model with minimal pointer chasing is key.


With the rendering I've tackled that by splitting the batch processing into three passes: batch gathering, merge-n-sort and display list building.


In the batch gathering pass we walk over the visible objects (objects that have survived visibility culling) and let them queue their draw calls to a RenderContext. A RenderContext is a platform-independent package stream that holds all data needed for draw calls (and other render jobs, events, state changes etc.). This step is easily divided into any number of jobs by letting each job have its own RenderContext.
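
As a rough sketch of that idea (not the engine's actual code), each job below writes into its own context, so no locking is needed. `DrawCall`, `gather_job` and `gather_batches` are hypothetical names, and a real RenderContext would hold far more than an object id. The jobs run in a plain loop here to keep the example free of threading dependencies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the post does not show the real types.
struct DrawCall {
    std::uint32_t object_id;  // which visible object queued this call
};

struct RenderContext {
    std::vector<DrawCall> packages;  // flat stream of queued work
    void queue(const DrawCall& dc) { packages.push_back(dc); }
};

// One gathering job: walks visible[begin, end) and queues into its own context.
// Jobs touch disjoint contexts, so they need no locks when run in parallel.
void gather_job(const std::vector<std::uint32_t>& visible,
                std::size_t begin, std::size_t end, RenderContext& ctx) {
    assert(begin <= end && end <= visible.size());
    for (std::size_t i = begin; i < end; ++i)
        ctx.queue(DrawCall{visible[i]});
}

// Splits the visible set into job_count ranges. The jobs run in a plain loop
// here; a job system would hand each gather_job call to a worker thread.
std::vector<RenderContext> gather_batches(const std::vector<std::uint32_t>& visible,
                                          std::size_t job_count) {
    assert(job_count > 0);
    std::vector<RenderContext> contexts(job_count);
    const std::size_t per_job = (visible.size() + job_count - 1) / job_count;
    for (std::size_t j = 0; j < job_count; ++j) {
        const std::size_t begin = std::min(j * per_job, visible.size());
        const std::size_t end = std::min(begin + per_job, visible.size());
        gather_job(visible, begin, end, contexts[j]);
    }
    return contexts;
}
```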


After the batch gathering is done we have all the data needed to draw the scene spread across n RenderContexts. The purpose of the merge-n-sort step is to take those RenderContexts and merge them into one, while at the same time sorting all batches into the desired order (with respect to "layers", minimizing state changes, depth sorting etc.).
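
One naive way to picture this step is to concatenate the batches and sort them by a single integer key. This is only a sketch under assumptions: the `Batch` layout, the `sort_key` field and `merge_and_sort` are hypothetical, and how the key encodes layers, state and depth is left open.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batch: a key that encodes the desired order plus a payload
// standing in for the real draw call data.
struct Batch {
    std::uint64_t sort_key;
    std::uint32_t payload;
};

struct RenderContext {
    std::vector<Batch> batches;
};

// Concatenates the batches of all contexts and sorts them by key.
// stable_sort keeps the queue order of batches that share a key.
std::vector<Batch> merge_and_sort(const std::vector<RenderContext>& contexts) {
    std::size_t total = 0;
    for (const RenderContext& rc : contexts) total += rc.batches.size();

    std::vector<Batch> merged;
    merged.reserve(total);
    for (const RenderContext& rc : contexts)
        merged.insert(merged.end(), rc.batches.begin(), rc.batches.end());
    assert(merged.size() == total);

    std::stable_sort(merged.begin(), merged.end(),
                     [](const Batch& a, const Batch& b) {
                         return a.sort_key < b.sort_key;
                     });
    return merged;
}
```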


We now have one sorted package stream containing all the draw calls that we can send off to the rendering back-end. At this point we can again go wide and build the display list in parallel. Here's a small sketch illustrating the data flow:





Red sections belong to the platform-independent renderer. Blue sections belong to the rendering back-end (in this illustration, D3D11).

5 comments:

  1. Hi.

    I couldn't find an email address, so I figured I'd just post a reply here and you'll get notified of it:

    A few days before the first post on this blog I registered a company here in Holland called Bitsquid. Before I went to get it registered, I did a Google search and didn't find anything other than bitsquid.com, which doesn't seem to hold much. Now I just did a search and found your blog.

    Now, this is probably just a silly coincidence, but I figured I'd send you a message so you won't be surprised or angry if you find out about my company.

    Hopefully we can coexist without any hassle, and if not, we'll have to figure something out.

    Best regards,

    Paul Veer
    http://www.bitsquid.net

  2. Hi, just found your blog. I wish you all the best and good luck with your new company!

  3. An alternative is for the batch gathering to just put all its draw calls into the same shared array (each job would "allocate" around one cache line of this array at a time to avoid contention on any "count" value, leaving empty elements at the end if they run out of things to put into it - the sorting will move empty elements to the end anyway).

    Each draw call would have a 64 bit "key" associated with it, where most significant bits would represent viewports, then passes, then shaders, then textures, then geometry index, etc. so that sorting by the key using a fast integer sorter is enough to order your draw calls.
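
    A key like that could be packed roughly as follows. This is a sketch only: the field widths (8/8/16/16/16 bits) and the names `KeyFields` and `make_sort_key` are assumptions made for illustration, not something specified above.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical field split; the widths are assumed, not taken from the text.
    struct KeyFields {
        std::uint32_t viewport;  // 8 bits, most significant
        std::uint32_t pass;      // 8 bits
        std::uint32_t shader;    // 16 bits
        std::uint32_t texture;   // 16 bits
        std::uint32_t geometry;  // 16 bits, least significant
    };

    // Packs the fields so that comparing two keys as plain integers orders
    // draw calls by viewport first, then pass, shader, texture and geometry.
    std::uint64_t make_sort_key(const KeyFields& f) {
        assert(f.viewport < (1u << 8) && f.pass < (1u << 8));
        assert(f.shader < (1u << 16) && f.texture < (1u << 16) &&
               f.geometry < (1u << 16));
        return (std::uint64_t{f.viewport} << 56) |
               (std::uint64_t{f.pass} << 48) |
               (std::uint64_t{f.shader} << 32) |
               (std::uint64_t{f.texture} << 16) |
               std::uint64_t{f.geometry};
    }
    ```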

    The cool thing about organizing things like this is that you could, potentially, do JIT instancing. Once you've sorted, you'll have an array of draw calls, and any draw calls that could be instanced will be next to each other in this array (and only visible ones too!). So you could just detect that two or more adjacent draw calls differ only by instanceable properties and draw them in a single call! Next frame a different set of objects may be visible and that instancing call could look slightly different. Obviously that requires that all shaders have an instancing variant etc.
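
    The detection step might look something like the sketch below. It assumes, purely for illustration, that two draw calls with equal keys differ only in per-instance data (here a hypothetical `transform_index`); `DrawCall`, `InstancedDraw` and `build_instanced_draws` are made-up names.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical draw call: equal keys are assumed to mean "same state and
    // geometry", so only the per-instance transform differs.
    struct DrawCall {
        std::uint64_t key;
        std::uint32_t transform_index;
    };

    // A run of adjacent draw calls that can be issued as one instanced call.
    struct InstancedDraw {
        std::uint64_t key;
        std::size_t first;  // index of the first draw call in the run
        std::size_t count;  // number of instances
    };

    // Collapses each run of equal keys in a sorted array into one entry.
    std::vector<InstancedDraw> build_instanced_draws(const std::vector<DrawCall>& sorted) {
        assert(std::is_sorted(sorted.begin(), sorted.end(),
                              [](const DrawCall& a, const DrawCall& b) {
                                  return a.key < b.key;
                              }));
        std::vector<InstancedDraw> out;
        for (std::size_t i = 0; i < sorted.size();) {
            std::size_t j = i + 1;
            while (j < sorted.size() && sorted[j].key == sorted[i].key) ++j;
            out.push_back(InstancedDraw{sorted[i].key, i, j - i});
            i = j;
        }
        return out;
    }
    ```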

  4. When I wrote the post I hadn't really come up with a practical solution for the sorting, and what you describe is actually very close to what I ended up doing with some exceptions.

    Every RenderContext still stores batch data in its own array. In addition to the batch array, each RenderContext also has a separate array holding SortCommand structs. The SortCommand struct contains a 64-bit sort key (similar to what you describe), a pointer to a RenderContext, and an offset and size into the batch array determining which batches fall within the scope of each SortCommand. When it's time to sort I only have to merge the SortCommand arrays into one and can keep the batch data arrays separate.

    One sweet thing with this decoupling between batch data and sort key is that the sort pass becomes very cheap, since a SortCommand tends to be a lot smaller in memory than the actual batch data (which can be rather large depending on the amount of shader constants, resource handles etc. needed to draw it), resulting in much less memory to walk over.
    Another nice thing is that it's very easy to group batches together within the same SortCommand. This can be very useful especially when you start storing other stuff than just batch data in the RenderContexts (such as render target switches, platform specific render back-end data etc).

    At some point you will of course have to walk over all the data in the RenderContexts to build the "display lists/command lists/push buffers" to send off to the render back-end API, but with this approach you only need to visit it once.
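
    A minimal sketch of that layout, under assumptions: only the SortCommand fields named above come from the description, while `add`, `merge_sort_commands` and the byte-array batch storage are hypothetical. Only the small SortCommand structs are copied and sorted; the batch bytes stay where they were written.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    struct RenderContext;

    // Small sortable handle that points at a range of batch data.
    struct SortCommand {
        std::uint64_t sort_key;
        const RenderContext* context;  // owner of the batch data
        std::uint32_t offset;          // byte offset into context->batch_data
        std::uint32_t size;            // byte size of the batches in scope
    };

    struct RenderContext {
        std::vector<std::uint8_t> batch_data;    // raw batch stream
        std::vector<SortCommand> sort_commands;  // one entry per sortable range

        // Appends batch bytes and records a SortCommand that refers to them.
        // The context must not be moved afterwards, since commands store `this`.
        void add(std::uint64_t key, const std::uint8_t* data, std::uint32_t size) {
            assert(data != nullptr || size == 0);
            const std::uint32_t offset = static_cast<std::uint32_t>(batch_data.size());
            batch_data.insert(batch_data.end(), data, data + size);
            sort_commands.push_back(SortCommand{key, this, offset, size});
        }
    };

    // Merges only the SortCommand arrays and sorts them by key.
    std::vector<SortCommand> merge_sort_commands(const std::vector<RenderContext>& contexts) {
        std::vector<SortCommand> merged;
        for (const RenderContext& rc : contexts)
            merged.insert(merged.end(), rc.sort_commands.begin(), rc.sort_commands.end());
        std::stable_sort(merged.begin(), merged.end(),
                         [](const SortCommand& a, const SortCommand& b) {
                             return a.sort_key < b.sort_key;
                         });
        return merged;
    }
    ```

    Building the display list then means walking the merged commands in order and reading `context->batch_data` from `offset` to `offset + size` for each one.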

  5. How do you handle GPU resource updates?
    Do you write the entire resource data into the stream, or just pass a shared pointer to it?
    Imo in a multi-threaded environment the former would be preferable, but how does it impact performance?
    It could be a lot of data to copy every frame.
