Wednesday, February 1, 2017

Stingray Renderer Walkthrough #2: Resources & Resource Contexts

Render Resources

Before any rendering can happen we need a way to reason about GPU resources. Since we want all graphics API specific code to stay isolated, we need some kind of abstraction on the engine side; for that we have an interface called RenderDevice. All calls to graphics APIs like D3D, OGL, GNM, Metal, etc. stay behind this interface. We will be covering the RenderDevice in a later post so for now just know that it is there.

We want to have a graphics API agnostic representation for a bunch of different types of resources and we need to link these representations to their counterparts on the RenderDevice side. This linking is handled through a POD-struct called RenderResource:

struct RenderResource
{
    enum {
        TEXTURE, RENDER_TARGET, DEPENDENT_RENDER_TARGET, BACK_BUFFER_WRAPPER,
        CONSTANT_BUFFER, VERTEX_STREAM, INDEX_STREAM, RAW_BUFFER,
        BATCH_INFO, VERTEX_DECLARATION, SHADER,
        NOT_INITIALIZED = 0xFFFFFFFF
    };

    uint32_t render_resource_handle;
};

Any engine resource that also needs a representation on the RenderDevice side inherits from this struct. It contains a single member, render_resource_handle, which is used to look up the correct graphics API specific representation in the RenderDevice.

The most significant 8 bits of render_resource_handle hold the type enum; the lower 24 bits are simply an index into an array for that specific resource type inside the RenderDevice.
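To make the bit layout concrete, here is a minimal sketch of how such a handle could be packed and unpacked. The helper names (make_handle, handle_type, handle_index) are made up for illustration and are not Stingray's actual API; only the 8/24 bit split comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not Stingray's API): the type lives in the top
// 8 bits of the handle, the per-type array index in the low 24 bits.
constexpr uint32_t INDEX_BITS = 24;
constexpr uint32_t INDEX_MASK = (1u << INDEX_BITS) - 1; // 0x00FFFFFF

inline uint32_t make_handle(uint32_t type, uint32_t index)
{
    assert(type <= 0xFFu);       // type must fit in 8 bits
    assert(index <= INDEX_MASK); // index must fit in 24 bits
    return (type << INDEX_BITS) | index;
}

// Extract the resource type (e.g. TEXTURE, RENDER_TARGET, ...).
inline uint32_t handle_type(uint32_t handle) { return handle >> INDEX_BITS; }

// Extract the index into the per-type array inside the RenderDevice.
inline uint32_t handle_index(uint32_t handle) { return handle & INDEX_MASK; }
```

With this layout a lookup is a shift, a mask and an array access, and NOT_INITIALIZED (0xFFFFFFFF) decodes to a type value of 0xFF, which no real resource type uses.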

Various Render Resources

Let’s take a look at the different render resources that can be found in Stingray:

  • Texture - A regular texture, this object wraps all various types of different texture layouts such as 2D, Cube, 3D.
  • RenderTarget - Basically the same as Texture but writable from the GPU.
  • DependentRenderTarget - Similar to RenderTarget but with logic for inheriting properties from another RenderTarget. This is used for creating render targets that need to be reallocated when the output window (swap chain) is resized.
  • BackBufferWrapper - Special type of RenderTarget created inside the RenderDevice as part of the swap chain creation. Almost all render targets are explicitly created by the user, this is the only exception as the back buffer associated with the swap chain is typically created together with the swap chain.
  • ShaderConstantBuffer - Shader constant buffers designed for explicit update and sharing between multiple shaders, mainly used for “view-global” state.
  • VertexStream - A regular Vertex Buffer.
  • VertexDeclaration - Describes the contents of one or many VertexStreams.
  • IndexStream - A regular Index Buffer.
  • RawBuffer - A linear memory buffer, can be set up for GPU writing through a UAV (Unordered Access View).
  • Shader - For now just think of this as something containing everything needed to build a full pipeline state object (PSO). Basically a wrapper over a number of shaders, render states, sampler states etc. I will cover the shader system in a later post.

Most of the above resources have a few things in common:

  • They describe a buffer either populated by the CPU or by the GPU
  • CPU populated buffers have a validity field describing their update frequency:
    • STATIC - The buffer is immutable and won’t change after creation, typically most buffers coming from DCC assets are STATIC.
    • UPDATABLE - The buffer can be updated but changes less than once per frame, e.g: UI elements, post processing geometry and similar.
    • DYNAMIC - The buffer frequently changes, at least once per frame but potentially many times in a single frame e.g: particle systems.
  • They have enough data for creating a graphics API specific representation inside the RenderDevice, i.e. they know about strides, sizes, view requirements (e.g. should a UAV be created or not), etc.

Render Resource Context

With the RenderResource concept sorted, we’ll go through the interface for creating and destroying the RenderDevice representation of the resources. That interface is called RenderResourceContext (RRC).

We want resource creation to be thread safe. While the RenderResourceContext in itself isn’t, we can achieve free threading by allowing the user to create any number of RRCs they want; as long as they don’t touch the same RRC from multiple threads everything will be fine.

Similar to many other rendering systems in Stingray the RRC is basically just a small helper class wrapping an abstract “command buffer”. On this command buffer we put what we call “packages” describing everything that is needed for creating/destroying RenderResource objects. These packages have variable length depending on what kind of object they represent. In addition to that the RRC can also hold platform specific allocators that allow allocating/deallocating GPU mapped memory directly, avoiding any additional memory shuffling in the RenderDevice. This kind of mechanism allows for streaming e.g. textures and other immutable buffers directly into GPU memory on platforms that provide that kind of low-level control.
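As a rough illustration of the “variable length packages on a command buffer” idea, here is a minimal sketch. The PackageHeader layout, the command constants and the CommandBuffer class are all assumptions made for this example; the post does not show the real package format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical package header; the real Stingray layout is not shown in the post.
struct PackageHeader {
    uint32_t command; // CMD_ALLOC or CMD_DEALLOC
    uint32_t size;    // total package size in bytes, header included
};

enum : uint32_t { CMD_ALLOC = 0, CMD_DEALLOC = 1 };

class CommandBuffer {
public:
    // Append one package: a header followed by `payload_size` bytes of payload.
    void push(uint32_t command, const void *payload, uint32_t payload_size)
    {
        assert(payload != nullptr || payload_size == 0);
        PackageHeader h = { command, uint32_t(sizeof(PackageHeader)) + payload_size };
        const size_t offset = _bytes.size();
        _bytes.resize(offset + h.size);
        std::memcpy(_bytes.data() + offset, &h, sizeof(h));
        if (payload_size)
            std::memcpy(_bytes.data() + offset + sizeof(h), payload, payload_size);
    }

    // Walk the packages in recording order; each header says how far to jump.
    template <typename F>
    void for_each(F &&visit) const
    {
        size_t offset = 0;
        while (offset < _bytes.size()) {
            PackageHeader h;
            std::memcpy(&h, _bytes.data() + offset, sizeof(h));
            visit(h, _bytes.data() + offset + sizeof(h));
            offset += h.size;
        }
    }

private:
    std::vector<uint8_t> _bytes;
};
```

Because recording only appends bytes to a buffer owned by one context, a thread can fill its own buffer without any locking, and the consumer can replay the packages later at a point in the frame of its choosing.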

Typically the only two functions the user needs to care about are:

class RenderResourceContext
{
public:
  void alloc(RenderResource *resource);
  void dealloc(RenderResource *resource);
};

When the user is done allocating/deallocating resources they hand over the RRC either directly to the RenderDevice or to the RenderInterface.

class RenderDevice
{
public:
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc, uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

Handing it over directly to the RenderDevice requires the caller to be on the controller thread for rendering as RenderDevice::dispatch() isn’t thread safe. If the caller is on any other thread (like e.g. one of the worker threads or the resource streaming thread) RenderInterface::dispatch() should be used instead. We will cover the RenderInterface in a later post so for now just think of it as a way of piping data into the renderer from an arbitrary thread.

Wrap up

The main reason for having the RenderResourceContext concept instead of exposing allocate()/deallocate() functions directly in the RenderDevice/RenderInterface interfaces is efficiency. We have a need for allocating and deallocating lots of resources, sometimes in parallel from multiple threads. Decoupling the interface for doing so makes it easy to schedule when in the frame the actual RenderDevice representations get created. It also makes the code easier to maintain as we don’t have to worry about thread-safety of the RenderResourceContext.

In the next post we will discuss the RenderJobs and RenderContexts which are the two main building blocks for creating and scheduling draw calls and state changes.

Stay tuned.

Stingray Renderer Walkthrough #1: Overview

Introduction

When we started writing Bitsquid back in mid 2009 all platforms we intended to run on were already multi-core architectures. This, and the fact that we had some prior experience trying to get our last engine to run efficiently on the PS3, answered the question of how not to architect an efficient renderer that scales to many cores. We knew we needed more than functional parallelism; we wanted data parallelism.

To solve that we divide the CPU view of a rendered frame into three stages:

  1. Culling - Filter out visible renderable objects with respect to a camera from a potentially huge set of different types of objects (meshes, particle systems, lights, etc).
  2. Render - Iterate over the filtered result from Culling and “record” an intermediate representation of draw calls/state switches to a command buffer.
  3. Dispatch - Take result from Render and translate that into actual render API calls (D3D, OGL, Metal, GNM, etc).

As you can see each stage pipes its result into the next. Rendering is typically very simple in that sense; we tend to have a one way flow of our data: [[user input or time affects state, state propagates into changes of the renderable objects (transforms, shader constants, etc), figure out what needs to be rendered, iterate over that and finally generate render API calls. Rinse & Repeat :]]

If we ignore the problem of ordering the final API calls in the rendering backend it’s fairly easy to see how we can achieve data parallelism in this scenario. Just fork at each stage, splitting the workload into n chunks (where n is however many worker threads you can throw at it). When all workers are done with a stage, take the result and pipe it into the next stage.

In essence this is how all rendering in Stingray works. Obviously I’ve glossed over some rather important and challenging details but as you will see they are not too hard to solve if you have good control over your data flows and are picky about when mutation of the data happens.
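The fork-and-join step for a single stage can be sketched as below. This is only an illustration under simplifying assumptions: std::thread stands in for a real job system, and parallel_for_chunks is a made-up name, not an engine function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Fork: split [0, count) into at most n_chunks contiguous ranges and run
// work(begin, end, chunk_index) for each range on its own thread.
// Join: return only when every chunk has finished.
template <typename Work>
void parallel_for_chunks(uint32_t count, uint32_t n_chunks, Work work)
{
    assert(n_chunks > 0);
    const uint32_t chunk_size = (count + n_chunks - 1) / n_chunks; // round up
    std::vector<std::thread> workers;
    for (uint32_t c = 0; c < n_chunks; ++c) {
        const uint32_t begin = std::min(count, c * chunk_size);
        const uint32_t end = std::min(count, begin + chunk_size);
        if (begin == end)
            break; // no work left for the remaining chunks
        workers.emplace_back([=] { work(begin, end, c); });
    }
    for (auto &w : workers)
        w.join();
}
```

Each chunk should write only to its own output slot (indexed by chunk_index) so that no synchronization is needed until the join; the joined per-chunk results then become the input of the next stage.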

Design Philosophies & Concepts

The rendering code in Stingray tends to be heavily influenced by Data Oriented Programming principles. When designing new systems our biggest efforts usually go into structuring our data efficiently and thinking about its flow through the systems, more so than writing the actual code that transforms the data from one form to another.

To achieve data-parallelism throughout the rendering code the first thing to realize is that we have to be very picky about when mutation of the renderable objects happens. Multiple worker threads will run over our objects and it’s not unlikely that more than one thread visits the same object at the same time, hence we must not mutate the state of an object in its render function. Therefore all of our render() functions are const.

To further guard ourselves from the outer world (i.e gameplay, physics, etc) the renderer operates in complete isolation from the game logics. It has its own representation of the data it needs, and only the data relevant for rendering. While the gameplay logics usually wants to reason about high-level concepts such as game entities (which basically groups a number of meshes, particle systems, lights, etc together), we on the rendering side don’t really care about that. We are much more interested in just having an array of all renderable objects in a game world, in a memory layout that makes it efficient to access.

Another nice thing with decoupling the representation of the renderable objects from the game objects is that it allows us to run simulation in parallel with rendering (functional parallelism). So while simulation is updating frame n the renderer is processing frame n-1. Some of you might argue that overlaying rendering on top of simulation doesn’t give any performance improvements if the work in all systems is nicely parallelized. In reality though this isn’t really the case. We still have systems that don’t go wide, or have certain sections where they need to do synchronous processing (last generation graphics APIs: e.g DX11, OpenGL are good examples). This creates bubbles in the frame slowing us down.

By overlaying simulation and rendering we get a form of bubble filling among the worker threads which in most cases gives a big enough speed improvement to justify the added complexity that comes from this architecture. More specifically:

  1. Double buffering of state - since the simulation might mutate the state of an object for frame n at the same time as the renderer is processing frame n-1 any mutable state needs to be double buffered.
  2. Life scope tracking of immutable data - while immutable/read only state such as static vertex and index buffers is safe to read by both simulation and renderer, we still need to be careful not to pull the rug out from under the renderer’s feet by freeing anything still in use by the renderer.
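To make the first point concrete, here is a minimal sketch of double buffered state. DoubleBuffered is a made-up helper for illustration only and is not necessarily how Stingray stores its state; it simply shows why two slots are enough when the simulation is at most one frame ahead of the renderer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: two copies of one piece of mutable state.
// The simulation writes frame n into one slot while the renderer
// reads frame n-1 from the other.
template <typename T>
class DoubleBuffered {
public:
    // Slot the simulation writes to while producing `frame`.
    T &write(uint32_t frame) { return _slots[frame & 1]; }

    // Slot the renderer reads from while consuming `frame`.
    const T &read(uint32_t frame) const { return _slots[frame & 1]; }

private:
    T _slots[2] = {};
};
```

Frames n and n-1 always map to different slots, so the two threads never touch the same copy as long as the simulation is never more than one frame ahead.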

Here’s a conceptual graph showing the benefits of overlaying simulation and rendering:

So basically what we have here is two “controller threads”, simulation and render, both offloading work to the worker threads. If a controller thread is blocked waiting for some work to finish it will assist the worker threads, striving to never sit idle. One thing to note is that to prevent frames from stacking up, we never allow the simulation thread to run more than one frame ahead of the render thread.

As a comparison here’s the same workload with simulation and rendering running in sequence.

As you can see we get significantly more idle time (bubbles) on the worker threads due to certain parts of both the simulation and rendering not being able to go wide.

Next up

I think this pretty much covers the high level view of the core rendering architecture in Stingray. Now let’s go into some more detail.

Since Andreas Asplund recently covered both how we handle propagation of state from simulation to the renderer (we call this “State reflection” in Stingray): http://bitsquid.blogspot.se/2016/09/state-reflection.html as well as how our view frustum culling system(s) works: http://bitsquid.blogspot.se/2016/10/the-implementation-of-frustum-culling.html I won’t be covering that in this series.

Instead I will jump straight into how creating and destroying GPU resources works, and from there go through all the building blocks needed to implement the second stage Render mentioned above.

Stingray Renderer Walkthrough

Welcome

To simplify knowledge transferring inside the Autodesk development teams and in an attempt to improve my writing skills I’ve decided to do a walkthrough of the Stingray rendering architecture. The idea is to do this as a series of blog posts over the coming weeks starting from the low-level aspects of the renderer chewing my way up to more high-level concepts as I go.

I’ve covered some of these topics before in various presentations over the years but those have been more focused on how our data driven aspects of the renderer works and less on the core architecture behind it. This is an attempt to do a more complete walk-through of the entire rendering architecture.

When I started thinking about this it felt like an almost impossible undertaking considering how much slower I am at expressing myself in text than in code, but after spending a couple of days going through the entire Stingray code base doing some spring cleaning it felt a bit more manageable, so I’ve now decided to give it a try.

(Note: this has nothing at all to do with me feeling the pressure from Niklas Frykholm who’s currently doing a complete walk-through of the entire Stingray engine code base (well everything except rendering) as a series of youtube videos [1]. Not at all… I feel no pressure, no guilt, nothing… I promise… Thanks Niklas for pushing me!)

Outline

Below is some kind of outline of what I intend to cover and in what order. I might swap things around as I go if I discover it makes more sense. This post will work as an index and I will link to the posts as they come online.

  1. Overview
  2. Resources & Resource Contexts
  3. Render Contexts
  4. Sorting
  5. RenderDevice
  6. RenderInterface
  7. Data-driven rendering
  8. Stingray-renderer & Mini-renderer
  9. Shaders & Materials

Tuesday, October 4, 2016

The Implementation of Frustum Culling in Stingray

Overview

Frustum culling can be an expensive operation. Stingray accelerates it by making heavy use of SIMD and distributing the workload over several threads. The basic workflow is:

  • Kick jobs to do frustum vs sphere culling
    • For each frustum plane, test plane vs sphere
  • Wait for sphere culling to finish
  • For objects that pass sphere test, kick jobs to do frustum vs object-oriented bounding box (OOBB) culling
    • For each frustum plane, test plane vs OOBB
  • Wait for OOBB culling to finish

Frustum vs sphere tests are significantly faster than frustum vs OOBB. By rejecting objects that fail sphere culling first, we have fewer objects to process in the more expensive OOBB pass.

Why go over all objects brute force instead of using some sort of spatial partition data structure? We like to keep things simple and with the current setup we have yet to encounter a case where we've been bound by the culling. Brute force sphere culling followed by OOBB culling is fast enough for all cases we've encountered so far. That might of course change in the future, but we'll take care of that when it's an actual problem.

The brute force culling is pretty fast, because:

  1. The sphere and the OOBB culling use SIMD and only load the minimum amount of needed data.
  2. The workload is distributed over several threads.

In this post, we will first look at the single threaded SIMD code and then at how the culling is distributed over multiple threads.

I'll use a lot of code to show how it's all done. It's mostly actual code from the engine, but it has been cleaned up to a certain extent. Some stuff has been renamed and/or removed to make it easier to understand what's going on.

Data structures used

If you go back to my previous post about state reflection, http://bitsquid.blogspot.ca/2016/09/state-reflection.html you can read that each object on the main thread is associated with a render thread representation via a render_handle. The render_handle is used to get the object_index which is the index of an object in the _objects array.

Take a look at the following code for a refresher:

void RenderWorld::create_object(WorldRenderInterface::ObjectManagementPackage *omp)
{
    // Acquire an `object_index`.
    uint32_t object_index = _objects.size();

    // Same recycling mechanism as seen for render handles.
    if (_free_object_indices.any()) {
        object_index = _free_object_indices.back();
        _free_object_indices.pop_back();
    } else {
        _objects.resize(object_index + 1);
        _object_types.resize(object_index + 1);
    }

    void *render_object = omp->user_data;
    if (omp->type == RenderMeshObject::TYPE) {
        // Cast the `render_object` to a `MeshObject`.
        RenderMeshObject *rmo = (RenderMeshObject*)render_object;

        // If needed, do more stuff with `rmo`.
    }

    // Store the `render_object` and `type`.
    _objects[object_index] = render_object;
    _object_types[object_index] = omp->type;

    if (omp->render_handle >= _object_lut.size())
        _object_lut.resize(omp->render_handle + 1);
    // The `render_handle` is used to look up the `object_index`.
    _object_lut[omp->render_handle] = object_index;
}

The _objects array stores objects of all kinds of different types. It is defined as:

Array<void*> _objects;

The types of the objects are stored in a corresponding _object_types array, defined as:

Array<uint32_t> _object_types;

From _object_types, we know the actual type of the objects and we can use that to cast the void * into the proper type (mesh, terrain, gui, particle_system, etc).

The culling happens in the // If needed, do more stuff with rmo section above. It looks like this:

void *render_object = omp->user_data;
if (omp->type == RenderMeshObject::TYPE) {
    // Cast the `render_object` to a `MeshObject`.
    RenderMeshObject *rmo = (RenderMeshObject*)render_object;

    // If needed, do more stuff with `rmo`.
    if (!(rmo->flags() & renderable::CULLING_DISABLED)) {
        culling::Object o;
        // Extract necessary information to do culling.

        // The index of the object.
        o.id = object_index;

        // The type of the object.
        o.type = rmo->type;

        // Get the minimum and maximum corner positions of a bounding box in object space.
        o.min = float4(rmo->bounding_volume().min, 1.f);
        o.max = float4(rmo->bounding_volume().max, 1.f);

        // World transform matrix.
        o.m = float4x4(rmo->world());

        // Depending on the value of `flags` add the culling representation to different culling sets.
        if (rmo->flags() & renderable::VIEWPORT_VISIBLE)
            _cullable_objects.add(o, rmo->node());
        if (rmo->flags() & renderable::SHADOW_CASTER)
            _cullable_shadow_casters.add(o, rmo->node());
        if (rmo->flags() & renderable::OCCLUDER)
            _occluders.add(o, rmo->node());
    }
}

For culling, MeshObjects and other cullable types are represented by culling::Objects that are used to populate the culling data structures. As can be seen in the code, these are _cullable_objects, _cullable_shadow_casters and _occluders, and each of them is an ObjectSet:

struct ObjectSet
{
    // Minimum bounding box corner position.
    Array<float> min_x;
    Array<float> min_y;
    Array<float> min_z;

    // Maximum bounding box corner position.
    Array<float> max_x;
    Array<float> max_y;
    Array<float> max_z;

    // Object->world matrix.
    Array<float> world_xx;
    Array<float> world_xy;
    Array<float> world_xz;
    Array<float> world_xw;
    Array<float> world_yx;
    Array<float> world_yy;
    Array<float> world_yz;
    Array<float> world_yw;
    Array<float> world_zx;
    Array<float> world_zy;
    Array<float> world_zz;
    Array<float> world_zw;
    Array<float> world_tx;
    Array<float> world_ty;
    Array<float> world_tz;
    Array<float> world_tw;

    // World space center position of bounding sphere.
    Array<float> ws_pos_x;
    Array<float> ws_pos_y;
    Array<float> ws_pos_z;

    // Radius of bounding sphere.
    Array<float> radius;

    // Flag to indicate if an object is culled or not.
    Array<uint32_t> visibility_flag;

    // The type and id of an object.
    Array<uint32_t> type;
    Array<uint32_t> id;

    uint32_t n_objects;
};

When an object is added to, e.g. _cullable_objects the culling::Object data is added to the ObjectSet. The ObjectSet flattens the data into a structure-of-arrays representation. The arrays are padded to the SIMD lane count to make sure there's valid data to read.
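The padding could look roughly like the sketch below. align_to_simd_lane_count is used later in this post but its definition is not shown, so the implementation here is an assumption (4 lanes, round up to a multiple of 4). MiniObjectSet is a made-up two-array stand-in for the full ObjectSet.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t SIMD_LANES = 4; // assumes a 4-wide float4

// Assumed implementation: round n up to the next multiple of the lane count.
inline uint32_t align_to_simd_lane_count(uint32_t n)
{
    return (n + SIMD_LANES - 1) & ~(SIMD_LANES - 1);
}

// Hypothetical, cut-down structure-of-arrays set with only two of the arrays.
struct MiniObjectSet {
    std::vector<float> ws_pos_x;
    std::vector<float> radius;
    uint32_t n_objects = 0;

    void add(float x, float r)
    {
        // Grow every array to a whole number of SIMD lanes so that a 4-wide
        // load starting at any multiple of 4 below n_objects reads valid memory.
        const uint32_t padded = align_to_simd_lane_count(n_objects + 1);
        ws_pos_x.resize(padded, 0.0f);
        radius.resize(padded, 0.0f);
        ws_pos_x[n_objects] = x;
        radius[n_objects] = r;
        ++n_objects;
    }
};
```

Note that std::vector does not guarantee the 16 byte alignment that aligned SIMD loads require; a real implementation would need an aligned allocator, which is left out of this sketch.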

Frustum-sphere culling

The world space positions and sphere radii of objects are represented by the following members of the ObjectSet:

Array<float> ws_pos_x;
Array<float> ws_pos_y;
Array<float> ws_pos_z;
Array<float> radius;

This is all we need to do frustum-sphere culling.

The frustum-sphere culling needs the planes of the frustum defined in world space. Information on how to find that can be found in: http://gamedevs.org/uploads/fast-extraction-viewing-frustum-planes-from-world-view-projection-matrix.pdf.

The frustum-sphere intersection code tests one plane against several spheres using SIMD instructions. The ObjectSet data is already laid out in a SIMD friendly way. To test one plane against several spheres, the plane's data is splatted out in the following way:

// `float4` is our cross platform abstraction of SSE, NEON etc.
struct SIMDPlane
{
    float4 normal_x; // the normal's x value replicated 4 times.
    float4 normal_y; // the normal's y value replicated 4 times.
    float4 normal_z; // etc.
    float4 d;
};

The single threaded code needed to do frustum-sphere culling is:

void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set)
{
    const auto all_true = bool4_all_true();
    const uint32_t n_objects = object_set.n_objects;

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    // Test each plane of the frustum against each sphere.
    for (uint32_t i = 0; i < n_objects; i += 4)
    {
        const auto ws_pos_x = float4_load_aligned(&object_set.ws_pos_x[i]);
        const auto ws_pos_y = float4_load_aligned(&object_set.ws_pos_y[i]);
        const auto ws_pos_z = float4_load_aligned(&object_set.ws_pos_z[i]);
        const auto radius = float4_load_aligned(&object_set.radius[i]);

        auto inside = all_true;
        for (unsigned p = 0; p < 6; ++p) {
            auto &n_x = planes[p].normal_x;
            auto &n_y = planes[p].normal_y;
            auto &n_z = planes[p].normal_z;
            auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
            auto plane_test_point = n_dot_pos + radius;
            auto plane_test = plane_test_point >= planes[p].d;
            inside = vector_and(plane_test, inside);
        }

        // Store 0 for spheres that didn't intersect or ended up on the positive side of the
        // frustum planes. Store 0xffffffff for spheres that are visible.
        store_aligned(inside, &visibility_flag[i]);
    }
}
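For reference, the same test written for one sphere at a time looks roughly like this. It is a scalar sketch of the logic above, not engine code; the Plane and Sphere structs and the function name are made up for the example.

```cpp
#include <cassert>

struct Plane { float nx, ny, nz, d; };     // plane normal and distance
struct Sphere { float x, y, z, radius; };  // world space center and radius

// Scalar version of the per-lane test in simd_sphere_culling:
// a sphere passes if dot(n, center) + radius >= d for all six planes.
inline bool sphere_inside_frustum(const Plane planes[6], const Sphere &s)
{
    for (int p = 0; p < 6; ++p) {
        const float n_dot_pos = planes[p].nx * s.x + planes[p].ny * s.y + planes[p].nz * s.z;
        if (n_dot_pos + s.radius < planes[p].d)
            return false; // completely behind this plane
    }
    return true;
}
```

The scalar version can exit early on the first failing plane. The SIMD version cannot do that per lane, so it ANDs the result of all six planes into the inside mask instead.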

After the simd_sphere_culling call, the visibility_flag array contains 0 for all objects that failed the test and 0xffffffff for all objects that passed. We chain this together with the OOBB culling by doing a compaction pass over the visibility_flag array and populating an indirection array:

{
    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        float4_splat(frustum.planes[0].n.x),
        float4_splat(frustum.planes[0].n.y),
        float4_splat(frustum.planes[0].n.z),
        float4_splat(frustum.planes[0].d),

        float4_splat(frustum.planes[1].n.x),
        float4_splat(frustum.planes[1].n.y),
        float4_splat(frustum.planes[1].n.z),
        float4_splat(frustum.planes[1].d),

        float4_splat(frustum.planes[2].n.x),
        float4_splat(frustum.planes[2].n.y),
        float4_splat(frustum.planes[2].n.z),
        float4_splat(frustum.planes[2].d),

        float4_splat(frustum.planes[3].n.x),
        float4_splat(frustum.planes[3].n.y),
        float4_splat(frustum.planes[3].n.z),
        float4_splat(frustum.planes[3].d),

        float4_splat(frustum.planes[4].n.x),
        float4_splat(frustum.planes[4].n.y),
        float4_splat(frustum.planes[4].n.z),
        float4_splat(frustum.planes[4].d),

        float4_splat(frustum.planes[5].n.x),
        float4_splat(frustum.planes[5].n.y),
        float4_splat(frustum.planes[5].n.z),
        float4_splat(frustum.planes[5].d),
    };


    // Do frustum-sphere culling.
    simd_sphere_culling(planes, object_set);

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    // Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
    Array<uint32_t> indirection(n_aligned_objects);

    const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin());
}

Where remove_not_visible is:

uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection)
{
    const uint32_t *visibility_flag = object_set.visibility_flag.begin();
    uint32_t n_visible = 0U;
    for (uint32_t i = 0; i < count; ++i) {
        if (visibility_flag[i]) {
            output_indirection[n_visible] = i;
            ++n_visible;
        }
    }

    const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
    const uint32_t last_visible = n_visible ? output_indirection[n_visible - 1] : 0;

    // Pad out to the simd alignment.
    for (unsigned i = n_visible; i < n_aligned_visible; ++i)
        output_indirection[i] = last_visible;

    return n_visible;
}

n_visible together with indirection provides the input for doing the frustum-OOBB culling on the objects that survived the frustum-sphere culling.

Frustum-OOBB culling

The frustum-OOBB culling takes ideas from Fabian Giesen's https://fgiesen.wordpress.com/2010/10/17/view-frustum-culling/ and Arseny Kapoulkine's http://zeuxcg.org/2009/01/31/view-frustum-culling-optimization-introduction/.

More specifically we use the Method 2: Transform box vertices to clip space, test against clip-space planes that both Fabian and Arseny write about. But we also go with Method 2b: Saving arithmetic ops that Fabian mentions. I won't delve into how the culling actually works; to understand that, please read their posts.

The code is SIMDified to process several OOBBs at the same time. The same corner of four OOBBs is tested against one frustum plane as a single SIMD operation.

To be able to write the SIMD code in a more intuitive form a few data structures and functions are used:

struct SIMDVector
{
    float4 x; // stores x0, x1, x2, x3
    float4 y; // stores y0, y1, y2, y3
    float4 z; // etc.
    float4 w;
};

A SIMDVector stores x, y, z & w for four objects. To store a matrix for four objects a SIMDMatrix is used:

struct SIMDMatrix
{
    SIMDVector x;
    SIMDVector y;
    SIMDVector z;
    SIMDVector w;
};

A SIMDMatrix-SIMDVector multiplication can then be written as a regular matrix-vector multiplication:

SIMDVector simd_multiply(const SIMDVector &v, const SIMDMatrix &m)
{
    float4 x = v.x * m.x.x;     x = v.y * m.y.x + x;    x = v.z * m.z.x + x;    x = v.w * m.w.x + x;
    float4 y = v.x * m.x.y;     y = v.y * m.y.y + y;    y = v.z * m.z.y + y;    y = v.w * m.w.y + y;
    float4 z = v.x * m.x.z;     z = v.y * m.y.z + z;    z = v.z * m.z.z + z;    z = v.w * m.w.z + z;
    float4 w = v.x * m.x.w;     w = v.y * m.y.w + w;    w = v.z * m.z.w + w;    w = v.w * m.w.w + w;
    SIMDVector res = { x, y, z, w };
    return res;
}

A SIMDMatrix-SIMDMatrix multiplication is:

SIMDMatrix simd_multiply(const SIMDMatrix &lhs, const SIMDMatrix &rhs)
{
    SIMDVector x = simd_multiply(lhs.x, rhs);
    SIMDVector y = simd_multiply(lhs.y, rhs);
    SIMDVector z = simd_multiply(lhs.z, rhs);
    SIMDVector w = simd_multiply(lhs.w, rhs);
    SIMDMatrix res = { x, y, z, w };
    return res;
}

The code needed to do the actual frustum-OOBB culling is:

void simd_oobb_culling(const SIMDMatrix &view_proj, culling::ObjectSet &object_set, uint32_t n_objects, const uint32_t *indirection)
{
    // Get pointers to the necessary members of the object set.
    const float *min_x = object_set.min_x.begin();
    const float *min_y = object_set.min_y.begin();
    const float *min_z = object_set.min_z.begin();

    const float *max_x = object_set.max_x.begin();
    const float *max_y = object_set.max_y.begin();
    const float *max_z = object_set.max_z.begin();

    const float *world_xx = object_set.world_xx.begin();
    const float *world_xy = object_set.world_xy.begin();
    const float *world_xz = object_set.world_xz.begin();
    const float *world_xw = object_set.world_xw.begin();
    const float *world_yx = object_set.world_yx.begin();
    const float *world_yy = object_set.world_yy.begin();
    const float *world_yz = object_set.world_yz.begin();
    const float *world_yw = object_set.world_yw.begin();
    const float *world_zx = object_set.world_zx.begin();
    const float *world_zy = object_set.world_zy.begin();
    const float *world_zz = object_set.world_zz.begin();
    const float *world_zw = object_set.world_zw.begin();
    const float *world_tx = object_set.world_tx.begin();
    const float *world_ty = object_set.world_ty.begin();
    const float *world_tz = object_set.world_tz.begin();
    const float *world_tw = object_set.world_tw.begin();

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    for (uint32_t i = 0; i < n_objects; i += 4) {
        SIMDMatrix world;

        // Load the world transform matrix for four objects via the indirection table.

        const uint32_t i0 = indirection[i];
        const uint32_t i1 = indirection[i + 1];
        const uint32_t i2 = indirection[i + 2];
        const uint32_t i3 = indirection[i + 3];

        world.x.x = float4(world_xx[i0], world_xx[i1], world_xx[i2], world_xx[i3]);
        world.x.y = float4(world_xy[i0], world_xy[i1], world_xy[i2], world_xy[i3]);
        world.x.z = float4(world_xz[i0], world_xz[i1], world_xz[i2], world_xz[i3]);
        world.x.w = float4(world_xw[i0], world_xw[i1], world_xw[i2], world_xw[i3]);

        world.y.x = float4(world_yx[i0], world_yx[i1], world_yx[i2], world_yx[i3]);
        world.y.y = float4(world_yy[i0], world_yy[i1], world_yy[i2], world_yy[i3]);
        world.y.z = float4(world_yz[i0], world_yz[i1], world_yz[i2], world_yz[i3]);
        world.y.w = float4(world_yw[i0], world_yw[i1], world_yw[i2], world_yw[i3]);

        world.z.x = float4(world_zx[i0], world_zx[i1], world_zx[i2], world_zx[i3]);
        world.z.y = float4(world_zy[i0], world_zy[i1], world_zy[i2], world_zy[i3]);
        world.z.z = float4(world_zz[i0], world_zz[i1], world_zz[i2], world_zz[i3]);
        world.z.w = float4(world_zw[i0], world_zw[i1], world_zw[i2], world_zw[i3]);

        world.w.x = float4(world_tx[i0], world_tx[i1], world_tx[i2], world_tx[i3]);
        world.w.y = float4(world_ty[i0], world_ty[i1], world_ty[i2], world_ty[i3]);
        world.w.z = float4(world_tz[i0], world_tz[i1], world_tz[i2], world_tz[i3]);
        world.w.w = float4(world_tw[i0], world_tw[i1], world_tw[i2], world_tw[i3]);

        // Create the matrix to go from object->world->view->clip space.
        const auto clip = simd_multiply(world, view_proj);

        SIMDVector min_pos;
        SIMDVector max_pos;

        // Load the minimum and maximum corner positions of the bounding box in object space.
        min_pos.x = float4(min_x[i0], min_x[i1], min_x[i2], min_x[i3]);
        min_pos.y = float4(min_y[i0], min_y[i1], min_y[i2], min_y[i3]);
        min_pos.z = float4(min_z[i0], min_z[i1], min_z[i2], min_z[i3]);
        min_pos.w = float4_splat(1.0f);

        max_pos.x = float4(max_x[i0], max_x[i1], max_x[i2], max_x[i3]);
        max_pos.y = float4(max_y[i0], max_y[i1], max_y[i2], max_y[i3]);
        max_pos.z = float4(max_z[i0], max_z[i1], max_z[i2], max_z[i3]);
        max_pos.w = float4_splat(1.0f);

        SIMDVector clip_pos[8];

        // Transform each bounding box corner from object to clip space by sharing calculations.
        simd_min_max_transform(clip, min_pos, max_pos, clip_pos);

        const auto zero = float4_zero();
        const auto all_true = bool4_all_true();

        // Initialize test conditions.
        auto all_x_less = all_true;
        auto all_x_greater = all_true;
        auto all_y_less = all_true;
        auto all_y_greater = all_true;
        auto all_z_less = all_true;
        auto any_z_less = bool4_all_false();
        auto all_z_greater = all_true;

        // Test each corner of the OOBB. The object is culled only if all eight corners
        // lie outside the same frustum plane; otherwise it is treated as visible.
        for (unsigned cs = 0; cs < 8; ++cs) {
            const auto neg_cs_w = negate(clip_pos[cs].w);

            auto x_le = clip_pos[cs].x <= neg_cs_w;
            auto x_ge = clip_pos[cs].x >= clip_pos[cs].w;
            all_x_less = vector_and(x_le, all_x_less);
            all_x_greater = vector_and(x_ge, all_x_greater);

            auto y_le = clip_pos[cs].y <= neg_cs_w;
            auto y_ge = clip_pos[cs].y >= clip_pos[cs].w;
            all_y_less = vector_and(y_le, all_y_less);
            all_y_greater = vector_and(y_ge, all_y_greater);

            auto z_le = clip_pos[cs].z <= zero;
            auto z_ge = clip_pos[cs].z >= clip_pos[cs].w;
            all_z_less = vector_and(z_le, all_z_less);
            all_z_greater = vector_and(z_ge, all_z_greater);
            any_z_less = vector_or(z_le, any_z_less);
        }

        const auto any_x_outside = vector_or(all_x_less, all_x_greater);
        const auto any_y_outside = vector_or(all_y_less, all_y_greater);
        const auto any_z_outside = vector_or(all_z_less, all_z_greater);
        auto outside = vector_or(any_x_outside, any_y_outside);
        outside = vector_or(outside, any_z_outside);

        auto inside = vector_xor(outside, all_true);

        // Store the result in the `visibility_flag` array in a compacted way.
        store_aligned(inside, &visibility_flag[i]);
    }
}
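
To make the logic of the test easier to follow, here is a scalar sketch of what happens in a single SIMD lane. It is not the engine code: Vec4, box_visible and make_clip_box are names made up for this illustration, and it assumes the same clip volume as the listing above (-w <= x <= w, -w <= y <= w, 0 <= z <= w).

```cpp
#include <array>
#include <cassert>

// Hypothetical scalar stand-in for one lane of the SIMD code.
struct Vec4 { float x, y, z, w; };

// Returns true if the box is potentially visible, i.e. its eight clip-space
// corners are not all outside the same frustum plane.
inline bool box_visible(const std::array<Vec4, 8> &clip_pos)
{
    bool all_x_less = true, all_x_greater = true;
    bool all_y_less = true, all_y_greater = true;
    bool all_z_less = true, all_z_greater = true;

    for (const Vec4 &c : clip_pos) {
        all_x_less    = all_x_less    && (c.x <= -c.w);
        all_x_greater = all_x_greater && (c.x >=  c.w);
        all_y_less    = all_y_less    && (c.y <= -c.w);
        all_y_greater = all_y_greater && (c.y >=  c.w);
        all_z_less    = all_z_less    && (c.z <= 0.0f);
        all_z_greater = all_z_greater && (c.z >=  c.w);
    }

    const bool outside = all_x_less || all_x_greater ||
                         all_y_less || all_y_greater ||
                         all_z_less || all_z_greater;
    return !outside;
}

// Illustration helper: builds the eight corners of a box directly in clip space.
inline std::array<Vec4, 8> make_clip_box(float x0, float x1, float y0, float y1,
                                         float z0, float z1, float w)
{
    assert(x0 <= x1 && y0 <= y1 && z0 <= z1);
    std::array<Vec4, 8> corners{};
    for (unsigned i = 0; i < 8; ++i)
        corners[i] = { (i & 1) ? x1 : x0, (i & 2) ? y1 : y0, (i & 4) ? z1 : z0, w };
    return corners;
}
```

Note that this is a conservative test: a box that straddles a frustum corner can pass even though no part of it is on screen. That is usually an acceptable trade for the cheap per-corner comparisons.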

The function simd_min_max_transform used above transforms each OOBB corner from object space to clip space while sharing some of the calculations. For completeness, here it is:

void simd_min_max_transform(const SIMDMatrix &m, const SIMDVector &min, const SIMDVector &max, SIMDVector result[])
{
    auto m_xx_x = m.x.x * min.x;    m_xx_x = m_xx_x + m.w.x;
    auto m_xy_x = m.x.y * min.x;    m_xy_x = m_xy_x + m.w.y;
    auto m_xz_x = m.x.z * min.x;    m_xz_x = m_xz_x + m.w.z;
    auto m_xw_x = m.x.w * min.x;    m_xw_x = m_xw_x + m.w.w;

    auto m_xx_X = m.x.x * max.x;    m_xx_X = m_xx_X + m.w.x;
    auto m_xy_X = m.x.y * max.x;    m_xy_X = m_xy_X + m.w.y;
    auto m_xz_X = m.x.z * max.x;    m_xz_X = m_xz_X + m.w.z;
    auto m_xw_X = m.x.w * max.x;    m_xw_X = m_xw_X + m.w.w;

    auto m_yx_y = m.y.x * min.y;
    auto m_yy_y = m.y.y * min.y;
    auto m_yz_y = m.y.z * min.y;
    auto m_yw_y = m.y.w * min.y;

    auto m_yx_Y = m.y.x * max.y;
    auto m_yy_Y = m.y.y * max.y;
    auto m_yz_Y = m.y.z * max.y;
    auto m_yw_Y = m.y.w * max.y;

    auto m_zx_z = m.z.x * min.z;
    auto m_zy_z = m.z.y * min.z;
    auto m_zz_z = m.z.z * min.z;
    auto m_zw_z = m.z.w * min.z;

    auto m_zx_Z = m.z.x * max.z;
    auto m_zy_Z = m.z.y * max.z;
    auto m_zz_Z = m.z.z * max.z;
    auto m_zw_Z = m.z.w * max.z;

    {
        auto xyz_x = m_xx_x + m_yx_y;   xyz_x = xyz_x + m_zx_z;
        auto xyz_y = m_xy_x + m_yy_y;   xyz_y = xyz_y + m_zy_z;
        auto xyz_z = m_xz_x + m_yz_y;   xyz_z = xyz_z + m_zz_z;
        auto xyz_w = m_xw_x + m_yw_y;   xyz_w = xyz_w + m_zw_z;
        result[0].x = xyz_x;
        result[0].y = xyz_y;
        result[0].z = xyz_z;
        result[0].w = xyz_w;
    }

    {
        auto Xyz_x = m_xx_X + m_yx_y;   Xyz_x = Xyz_x + m_zx_z;
        auto Xyz_y = m_xy_X + m_yy_y;   Xyz_y = Xyz_y + m_zy_z;
        auto Xyz_z = m_xz_X + m_yz_y;   Xyz_z = Xyz_z + m_zz_z;
        auto Xyz_w = m_xw_X + m_yw_y;   Xyz_w = Xyz_w + m_zw_z;
        result[1].x = Xyz_x;
        result[1].y = Xyz_y;
        result[1].z = Xyz_z;
        result[1].w = Xyz_w;
    }

    {
        auto xYz_x = m_xx_x + m_yx_Y;   xYz_x = xYz_x + m_zx_z;
        auto xYz_y = m_xy_x + m_yy_Y;   xYz_y = xYz_y + m_zy_z;
        auto xYz_z = m_xz_x + m_yz_Y;   xYz_z = xYz_z + m_zz_z;
        auto xYz_w = m_xw_x + m_yw_Y;   xYz_w = xYz_w + m_zw_z;
        result[2].x = xYz_x;
        result[2].y = xYz_y;
        result[2].z = xYz_z;
        result[2].w = xYz_w;
    }

    {
        auto XYz_x = m_xx_X + m_yx_Y;   XYz_x = XYz_x + m_zx_z;
        auto XYz_y = m_xy_X + m_yy_Y;   XYz_y = XYz_y + m_zy_z;
        auto XYz_z = m_xz_X + m_yz_Y;   XYz_z = XYz_z + m_zz_z;
        auto XYz_w = m_xw_X + m_yw_Y;   XYz_w = XYz_w + m_zw_z;
        result[3].x = XYz_x;
        result[3].y = XYz_y;
        result[3].z = XYz_z;
        result[3].w = XYz_w;
    }

    {
        auto xyZ_x = m_xx_x + m_yx_y;   xyZ_x = xyZ_x + m_zx_Z;
        auto xyZ_y = m_xy_x + m_yy_y;   xyZ_y = xyZ_y + m_zy_Z;
        auto xyZ_z = m_xz_x + m_yz_y;   xyZ_z = xyZ_z + m_zz_Z;
        auto xyZ_w = m_xw_x + m_yw_y;   xyZ_w = xyZ_w + m_zw_Z;
        result[4].x = xyZ_x;
        result[4].y = xyZ_y;
        result[4].z = xyZ_z;
        result[4].w = xyZ_w;
    }

    {
        auto XyZ_x = m_xx_X + m_yx_y;   XyZ_x = XyZ_x + m_zx_Z;
        auto XyZ_y = m_xy_X + m_yy_y;   XyZ_y = XyZ_y + m_zy_Z;
        auto XyZ_z = m_xz_X + m_yz_y;   XyZ_z = XyZ_z + m_zz_Z;
        auto XyZ_w = m_xw_X + m_yw_y;   XyZ_w = XyZ_w + m_zw_Z;
        result[5].x = XyZ_x;
        result[5].y = XyZ_y;
        result[5].z = XyZ_z;
        result[5].w = XyZ_w;
    }

    {
        auto xYZ_x = m_xx_x + m_yx_Y;   xYZ_x = xYZ_x + m_zx_Z;
        auto xYZ_y = m_xy_x + m_yy_Y;   xYZ_y = xYZ_y + m_zy_Z;
        auto xYZ_z = m_xz_x + m_yz_Y;   xYZ_z = xYZ_z + m_zz_Z;
        auto xYZ_w = m_xw_x + m_yw_Y;   xYZ_w = xYZ_w + m_zw_Z;
        result[6].x = xYZ_x;
        result[6].y = xYZ_y;
        result[6].z = xYZ_z;
        result[6].w = xYZ_w;
    }

    {
        auto XYZ_x = m_xx_X + m_yx_Y;   XYZ_x = XYZ_x + m_zx_Z;
        auto XYZ_y = m_xy_X + m_yy_Y;   XYZ_y = XYZ_y + m_zy_Z;
        auto XYZ_z = m_xz_X + m_yz_Y;   XYZ_z = XYZ_z + m_zz_Z;
        auto XYZ_w = m_xw_X + m_yw_Y;   XYZ_w = XYZ_w + m_zw_Z;
        result[7].x = XYZ_x;
        result[7].y = XYZ_y;
        result[7].z = XYZ_z;
        result[7].w = XYZ_w;
    }
}
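
The sharing is perhaps easier to see in scalar form. The sketch below is an illustration under my own assumptions (the Vec4/Mat4 types and helper names are hypothetical, and the matrix uses the same row layout as the listing above): each axis contributes only two possible partial products, min or max, so six partial products are enough to build all eight corners.

```cpp
#include <cassert>
#include <cmath>

struct Vec4 { float x, y, z, w; };
// Row-vector convention: p' = p.x * m.x + p.y * m.y + p.z * m.z + m.w (p.w assumed to be 1).
struct Mat4 { Vec4 x, y, z, w; };

inline Vec4 add(const Vec4 &a, const Vec4 &b) { return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w }; }
inline Vec4 scale(const Vec4 &a, float s) { return { a.x * s, a.y * s, a.z * s, a.w * s }; }

// Naive reference: a full transform of one point.
inline Vec4 transform_point(const Mat4 &m, float px, float py, float pz)
{
    return add(add(add(scale(m.x, px), scale(m.y, py)), scale(m.z, pz)), m.w);
}

// Shared version: six partial products, then eight cheap sums.
// Corner index bits: bit 0 selects max x, bit 1 max y, bit 2 max z.
inline void min_max_transform(const Mat4 &m, const Vec4 &mn, const Vec4 &mx, Vec4 result[8])
{
    assert(mn.x <= mx.x && mn.y <= mx.y && mn.z <= mx.z);
    // The translation row is folded into the x partial products.
    const Vec4 x_part[2] = { add(scale(m.x, mn.x), m.w), add(scale(m.x, mx.x), m.w) };
    const Vec4 y_part[2] = { scale(m.y, mn.y), scale(m.y, mx.y) };
    const Vec4 z_part[2] = { scale(m.z, mn.z), scale(m.z, mx.z) };

    for (unsigned i = 0; i < 8; ++i)
        result[i] = add(add(x_part[i & 1], y_part[(i >> 1) & 1]), z_part[(i >> 2) & 1]);
}
```

The naive approach costs eight full point transforms, while the shared one needs six scaled rows plus two additions per corner.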

To get a compact indirection array of all the objects that passed the frustum-OOBB culling, the remove_not_visible function needs to be slightly modified:

uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection, const uint32_t *input_indirection/*new argument*/)
{
    const uint32_t *visibility_flag = object_set.visibility_flag.begin();
    uint32_t n_visible = 0U;
    for (uint32_t i = 0; i < count; ++i) {

        // Each element of `input_indirection` is the index of an object that survived the
        // previous culling pass. If `input_indirection` is not null, look up the actual
        // object index; otherwise use `i` directly.
        const uint32_t index = input_indirection? input_indirection[i] : i;

        // `visibility_flag` is already compacted, so use `i` directly.
        if (visibility_flag[i]) {
            output_indirection[n_visible] = index;
            ++n_visible;
        }
    }

    const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
    const uint32_t last_visible = n_visible? output_indirection[n_visible- 1] : 0;

    // Pad out to the simd alignment.
    for (unsigned i = n_visible; i < n_aligned_visible; ++i)
        output_indirection[i] = last_visible;

    return n_visible;
}
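
Here is a small self-contained sketch of the two-pass compaction, so the in-place second pass can be traced by hand. It is a simplification and not the engine code: it takes the flag array directly as a std::vector instead of an ObjectSet, and it assumes a SIMD width of four lanes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed SIMD width of four lanes, matching the `i += 4` loops.
inline uint32_t align_to_simd_lane_count(uint32_t n) { return (n + 3u) & ~3u; }

// Compacts visible entries. `visibility_flag[i]` belongs to slot `i` of the current pass;
// `input_indirection` (may be null) maps slot `i` back to the real object index.
// `output_indirection` needs room for `align_to_simd_lane_count(count)` entries.
inline uint32_t remove_not_visible(const std::vector<uint32_t> &visibility_flag, uint32_t count,
                                   uint32_t *output_indirection, const uint32_t *input_indirection)
{
    assert(count <= visibility_flag.size());
    uint32_t n_visible = 0;
    for (uint32_t i = 0; i < count; ++i) {
        // Read the input before writing the output; `n_visible <= i`, so running
        // in place (input == output) never overwrites an entry that is still needed.
        const uint32_t index = input_indirection ? input_indirection[i] : i;
        if (visibility_flag[i])
            output_indirection[n_visible++] = index;
    }

    // Pad to the lane count by repeating the last visible index.
    const uint32_t last = n_visible ? output_indirection[n_visible - 1] : 0;
    for (uint32_t i = n_visible; i < align_to_simd_lane_count(n_visible); ++i)
        output_indirection[i] = last;
    return n_visible;
}
```

With six objects where objects 1 and 4 fail the sphere test, the first pass produces {0, 2, 3, 5}. If the OOBB pass then rejects slots 0 and 3, the second pass, run in place, leaves {2, 3} followed by padding.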

Bringing the frustum-sphere and frustum-OOBB code together we get:

{
    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        float4_splat(frustum.planes[0].n.x),
        float4_splat(frustum.planes[0].n.y),
        float4_splat(frustum.planes[0].n.z),
        float4_splat(frustum.planes[0].d),

        float4_splat(frustum.planes[1].n.x),
        float4_splat(frustum.planes[1].n.y),
        float4_splat(frustum.planes[1].n.z),
        float4_splat(frustum.planes[1].d),

        float4_splat(frustum.planes[2].n.x),
        float4_splat(frustum.planes[2].n.y),
        float4_splat(frustum.planes[2].n.z),
        float4_splat(frustum.planes[2].d),

        float4_splat(frustum.planes[3].n.x),
        float4_splat(frustum.planes[3].n.y),
        float4_splat(frustum.planes[3].n.z),
        float4_splat(frustum.planes[3].d),

        float4_splat(frustum.planes[4].n.x),
        float4_splat(frustum.planes[4].n.y),
        float4_splat(frustum.planes[4].n.z),
        float4_splat(frustum.planes[4].d),

        float4_splat(frustum.planes[5].n.x),
        float4_splat(frustum.planes[5].n.y),
        float4_splat(frustum.planes[5].n.z),
        float4_splat(frustum.planes[5].d),
    };

    // Do frustum-sphere culling.
    simd_sphere_culling(planes, object_set);

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    // Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
    Array<uint32_t> indirection(n_aligned_objects);

    const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin(), nullptr);

    const auto &view_proj = camera.view() * camera.proj();

    // Construct the SIMDMatrix `simd_view_proj`.
    const SIMDMatrix simd_view_proj = {
        float4_splat(view_proj.v[xx]),
        float4_splat(view_proj.v[xy]),
        float4_splat(view_proj.v[xz]),
        float4_splat(view_proj.v[xw]),

        float4_splat(view_proj.v[yx]),
        float4_splat(view_proj.v[yy]),
        float4_splat(view_proj.v[yz]),
        float4_splat(view_proj.v[yw]),

        float4_splat(view_proj.v[zx]),
        float4_splat(view_proj.v[zy]),
        float4_splat(view_proj.v[zz]),
        float4_splat(view_proj.v[zw]),

        float4_splat(view_proj.v[tx]),
        float4_splat(view_proj.v[ty]),
        float4_splat(view_proj.v[tz]),
        float4_splat(view_proj.v[tw]),
    };

    // Cull objects via frustum-oobb tests.
    simd_oobb_culling(simd_view_proj, object_set, n_visible, indirection.begin());

    // Build up the indirection array that represents the objects that survived the frustum-oobb culling.
    const uint32_t n_oobb_visible = remove_not_visible(object_set, n_visible, indirection.begin(), indirection.begin());
}

The final call to remove_not_visible populates the indirection array with the objects that passed both the frustum-sphere and the frustum-OOBB culling. indirection together with n_oobb_visible is all that is needed to know what objects should be rendered.

Distributing the work over several threads

In Stingray, work is distributed by submitting jobs to a pool of worker threads -- conveniently called the ThreadPool. Submitted jobs are put in a thread safe work queue from which the worker threads pop jobs to work on. A task is defined as:

typedef void (*TaskCallback)(void *user_data);

struct TaskDefinition
{
    TaskCallback callback;
    void *user_data;
};

For the purpose of this article, the interesting methods of the ThreadPool are:

class ThreadPool
{
    // Adds `count` tasks to the work queue.
    void add_tasks(const TaskDefinition *tasks, uint32_t count);

    // Tries to pop one task from the queue and do that work. Returns true if any work was done.
    bool do_work();

    // Will call `do_work` while `signal` == value.
    void wait_atomic(std::atomic<uint32_t> *signal, uint32_t value);
};

The ThreadPool doesn't dictate how to synchronize when a job is fully processed, but usually a std::atomic<uint32_t> signal is used for that purpose. The value is 0 while the job is being processed and set to 1 when it's done. wait_atomic() is a convenience method that can be used to wait for such values:

void ThreadPool::wait_atomic(std::atomic<uint32_t> *signal, uint32_t value)
{
    while (signal->load(std::memory_order_acquire) == value) {
        if (!do_work())
            YieldProcessor();
    }
}

And do_work() is implemented as:

bool ThreadPool::do_work()
{
    TaskDefinition task;
    if (pop_task(task)) {
        task.callback(task.user_data);
        return true;
    }
    return false;
}
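
For readers who want to experiment with the pattern, here is a minimal sketch of such a queue. It is an assumption-laden toy and not the Stingray ThreadPool: MiniThreadPool, CountJob and count_task are hypothetical names, the queue is a std::deque guarded by a mutex, no worker threads are spawned, and std::this_thread::yield() stands in for YieldProcessor().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>

typedef void (*TaskCallback)(void *user_data);
struct TaskDefinition { TaskCallback callback; void *user_data; };

// Hypothetical minimal pool: only the queue and the helping wait are modeled.
class MiniThreadPool
{
public:
    void add_tasks(const TaskDefinition *tasks, uint32_t count)
    {
        assert(tasks || count == 0);
        std::lock_guard<std::mutex> lock(_mutex);
        _queue.insert(_queue.end(), tasks, tasks + count);
    }

    // Pops one task and runs it. Returns true if any work was done.
    bool do_work()
    {
        TaskDefinition task;
        if (!pop_task(task))
            return false;
        task.callback(task.user_data);
        return true;
    }

    // Helps out with queued work while `*signal == value`.
    void wait_atomic(std::atomic<uint32_t> *signal, uint32_t value)
    {
        while (signal->load(std::memory_order_acquire) == value) {
            if (!do_work())
                std::this_thread::yield();
        }
    }

private:
    bool pop_task(TaskDefinition &task)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_queue.empty())
            return false;
        task = _queue.front();
        _queue.pop_front();
        return true;
    }

    std::mutex _mutex;
    std::deque<TaskDefinition> _queue;
};

// Example job: writes a result, then publishes it with a release store.
struct CountJob { std::atomic<uint32_t> signal{0}; uint32_t result = 0; };

inline void count_task(void *user_data)
{
    auto job = static_cast<CountJob*>(user_data);
    job->result = 42;
    job->signal.store(1, std::memory_order_release);
}
```

The release store in the task pairs with the acquire load in wait_atomic(), so once the waiter sees the signal change it can also safely read the result. Because the waiting thread calls do_work() itself, the wait makes progress even when no worker thread is free.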

Multi-threading the culling only requires a few changes to the code. For the simd_sphere_culling() method we need to add offset and count parameters to specify the range of objects we are processing:

void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set, uint32_t offset, uint32_t count)
{
    const auto all_true = bool4_all_true();
    const uint32_t n_objects = offset + count;

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    // Test each plane of the frustum against each sphere.
    for (uint32_t i = offset; i < n_objects; i += 4)
    {
        const auto ws_pos_x = float4_load_aligned(&object_set.ws_pos_x[i]);
        const auto ws_pos_y = float4_load_aligned(&object_set.ws_pos_y[i]);
        const auto ws_pos_z = float4_load_aligned(&object_set.ws_pos_z[i]);
        const auto radius = float4_load_aligned(&object_set.radius[i]);

        auto inside = all_true;
        for (unsigned p = 0; p < 6; ++p) {
            auto &n_x = planes[p].normal_x;
            auto &n_y = planes[p].normal_y;
            auto &n_z = planes[p].normal_z;
            auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
            auto plane_test_point = n_dot_pos + radius;
            auto plane_test = plane_test_point >= planes[p].d;
            inside = vector_and(plane_test, inside);
        }

        // Store 0 for spheres that didn't intersect or ended up on the positive side of the
        // frustum planes. Store 0xffffffff for spheres that are visible.
        store_aligned(inside, &visibility_flag[i]);
    }
}

Bringing the previous code snippet together with multi-threaded culling:

{
    // Calculate the number of work items, given that each work item processes `work_size` elements.
    const uint32_t work_size = 512;

    // `div_ceil(a, b)` calculates `(a + b - 1) / b`.
    const uint32_t n_work_items = math::div_ceil(n_objects, work_size);

    Array<CullingWorkItem> culling_work_items(n_work_items);
    Array<TaskDefinition> tasks(n_work_items);

    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        // same code as previously shown...
    };

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    for (unsigned i = 0; i < n_work_items; ++i) {

        // The `offset` and `count` for the work item.
        const uint32_t offset = math::min(work_size * i, n_objects);
        const uint32_t count = math::min(work_size, n_objects - offset);

        auto &culling_item = culling_work_items[i];
        memcpy(culling_item.planes, planes, sizeof(planes));
        culling_item.object_set = &object_set;
        culling_item.offset = offset;
        culling_item.count = count;
        culling_item.signal = 0;

        auto &task = tasks[i];
        task.callback = simd_sphere_culling_task;
        task.user_data = &culling_item;
    }

    // Add the tasks to the `ThreadPool`.
    thread_pool.add_tasks(tasks.begin(), n_work_items);
    // Wait for each `item` and if it's not done, help out with the culling work.
    for (auto &item : culling_work_items)
        thread_pool.wait_atomic(&item.signal, 0);
}
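
The offset/count arithmetic can be checked in isolation. This is a sketch with hypothetical names (WorkRange, split_work) and std::min in place of math::min. It assumes work_size is a multiple of the SIMD lane count so that every range starts on a lane boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct WorkRange { uint32_t offset; uint32_t count; };

inline uint32_t div_ceil(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Splits `n_objects` into ranges of at most `work_size` elements.
inline std::vector<WorkRange> split_work(uint32_t n_objects, uint32_t work_size)
{
    assert(work_size > 0 && work_size % 4 == 0);
    const uint32_t n_work_items = div_ceil(n_objects, work_size);
    std::vector<WorkRange> ranges(n_work_items);
    for (uint32_t i = 0; i < n_work_items; ++i) {
        const uint32_t offset = std::min(work_size * i, n_objects);
        const uint32_t count = std::min(work_size, n_objects - offset);
        ranges[i] = { offset, count };
    }
    return ranges;
}
```

For 1300 objects and a work size of 512 this gives three ranges: (0, 512), (512, 512) and (1024, 276). Only the last range can be shorter than work_size, so only the tail of the arrays depends on being padded out to the lane count.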

CullingWorkItem and simd_sphere_culling_task are defined as:

struct CullingWorkItem
{
    SIMDPlane planes[6];
    culling::ObjectSet *object_set;
    uint32_t offset;
    uint32_t count;
    std::atomic<uint32_t> signal;
};

void simd_sphere_culling_task(void *user_data)
{
    auto culling_item = (CullingWorkItem*)(user_data);

    // Call the frustum-sphere culling function.
    simd_sphere_culling(culling_item->planes, *culling_item->object_set, culling_item->offset, culling_item->count);

    // Signal that the work is done.
    culling_item->signal.store(1, std::memory_order_release);
}

The same pattern is used to multi-thread the frustum-OOBB culling. That is "left as an exercise for the reader" ;)

Conclusion

This type of culling is done for all of the objects that can be rendered, i.e. meshes, particle systems, terrain, etc. We also use it to cull light sources. It is used both when rendering the main scene and for rendering shadows.

I've left out a few details of our solution. One thing we also do is something called contribution culling. In the frustum-OOBB culling step, the OOBB corners are projected to the near plane and the screen space extents are derived from them. If the object is smaller than a certain threshold along any axis, it is considered culled. Special care needs to be taken if any of the corners intersects or is behind the near plane, so we don't have to deal with "external line segments" caused by the projection. If you don't know what that is, see: http://www.gamasutra.com/view/news/168577/Indepth_Software_rasterizer_and_triangle_clipping.php. In our case the contribution culling is disabled by expanding the extents to span the entire screen when any corner intersects or is behind the near plane.
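
A rough scalar sketch of the idea, under my own assumptions rather than the engine's actual implementation: the corners are already in clip space, the threshold is expressed in NDC units (the full screen spans 2.0 per axis), and "spanning the entire screen" is modeled by simply returning "not culled". The names Vec4 and contribution_culled are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <limits>

struct Vec4 { float x, y, z, w; };

// Returns true if the projected box is too small on screen to be worth drawing.
inline bool contribution_culled(const std::array<Vec4, 8> &clip_pos, float threshold)
{
    assert(threshold >= 0.0f);
    float min_x = std::numeric_limits<float>::max();
    float min_y = std::numeric_limits<float>::max();
    float max_x = std::numeric_limits<float>::lowest();
    float max_y = std::numeric_limits<float>::lowest();

    for (const Vec4 &c : clip_pos) {
        // A corner on or behind the near plane cannot be projected safely.
        // Treat the box as covering the whole screen, which disables the test.
        if (c.z <= 0.0f || c.w <= 0.0f)
            return false;

        const float ndc_x = c.x / c.w;
        const float ndc_y = c.y / c.w;
        min_x = std::min(min_x, ndc_x);
        max_x = std::max(max_x, ndc_x);
        min_y = std::min(min_y, ndc_y);
        max_y = std::max(max_y, ndc_y);
    }

    // Culled if the extents are below the threshold along any axis.
    return (max_x - min_x) < threshold || (max_y - min_y) < threshold;
}
```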

For our cascaded shadow maps, the extents are also used to detect if an object is fully enclosed by a cascade. If that is the case, then that object is culled from the later cascades. Let me illustrate with some ASCII:

+-----------+-----------+
|           |           |
|     /\    |           |
|    /--\   |           |
+-----------+-----------+
|           |           |
|           |           |
|           |           |
+-----------+-----------+

The squares are the different cascades. The top left square is the first cascade, the top right is the second cascade, bottom left the third and the bottom right is the fourth cascade. In this case the weird triangle shaped object is fully enclosed by the first cascade. What that means is that the object doesn't need to be rendered to any of the later cascades, since the shadow contribution from that object will be fully taken care of from the first cascade.
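
In code, the decision could look something like the sketch below. This is only an illustration: the Extents struct, the helper names and the idea of returning a bit mask are my assumptions, and it presumes the object's extents have already been projected into each cascade's own normalized [-1, 1] space.

```cpp
#include <cassert>
#include <cstdint>

// Object extents projected into one cascade's NDC space (hypothetical layout).
struct Extents { float min_x, min_y, max_x, max_y; };

inline bool overlaps_cascade(const Extents &e)
{
    return e.max_x >= -1.0f && e.min_x <= 1.0f && e.max_y >= -1.0f && e.min_y <= 1.0f;
}

inline bool enclosed_by_cascade(const Extents &e)
{
    return e.min_x >= -1.0f && e.max_x <= 1.0f && e.min_y >= -1.0f && e.max_y <= 1.0f;
}

// Returns a bit mask with bit `c` set if the object should be rendered into cascade `c`.
// `extents[c]` holds the object's extents as seen from cascade `c`.
inline uint32_t cascade_mask(const Extents *extents, uint32_t n_cascades)
{
    assert(n_cascades <= 32);
    uint32_t mask = 0;
    for (uint32_t c = 0; c < n_cascades; ++c) {
        if (!overlaps_cascade(extents[c]))
            continue;
        mask |= 1u << c;
        // Fully enclosed: later cascades add nothing for this object.
        if (enclosed_by_cascade(extents[c]))
            break;
    }
    return mask;
}
```

An object that straddles the edge of the first cascade but sits fully inside the second would end up in cascades one and two only.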

Wednesday, September 7, 2016

State reflection

Overview

The Stingray engine has two controller threads -- the main thread and the render thread. These two threads build up work for our job system, which is distributed on the remaining threads. The main thread and the render thread are pipelined, so that while the main thread runs the simulation/update for frame N, the render thread is processing the rendering work for the previous frame (N-1). This post will dive into the details of how state is propagated from the main thread to the render thread.

I will use code snippets to explain how the state reflection works. It's mostly actual code from the engine but it has been cleaned up to a certain extent. Some stuff has been renamed and/or removed to make it easier to understand what's going on.

The main loop

Here is a slimmed down version of the update loop which is part of the main thread:

while (!quit())
{
    // Calls out to the mandatory user supplied `update` Lua function, Lua is used 
    // as a scripting language to manipulate objects. From Lua worlds, objects etc
    // can be created, manipulated, destroyed, etc. All these changes are recorded
    // on a `StateStream` that is a part of each world.
    _game->update();

    // Flush state changes recorded on the `StateStream` for each world to
    // the rendering world representation.
    unsigned n_worlds = _worlds.size();
    for (uint32_t i = 0; i < n_worlds; ++i) {
        auto &world = *_worlds[i];
        _render_interface->update_world(world);
    }

    // Begin a new render frame.
    _render_interface->begin_frame();

    // Calls out to the user supplied `render` Lua function. It's up to the script
    // to call render on worlds(). The script controls what camera and viewport
    // are used when rendering the world.
    _game->render();

    // Present the frame.
    _render_interface->present_frame();

    // End frame.
    _render_interface->end_frame(_delta_time);

    // Never let the main thread run more than 1 frame ahead of the render thread.
    _render_interface->wait_for_fence(_frame_fence);

    // Create a new fence for the next frame.
    _frame_fence = _render_interface->create_fence();
}

First thing to point out is the _render_interface. This is not a class full of virtual functions that some other class can inherit from and override as the name might suggest. The word "interface" is used in the sense that it's used to communicate from one thread to another. So in this context the _render_interface is used to post messages from the main thread to the render thread.

As said in the first comment in the code snippet above, Lua is used as our scripting language and from Lua things such as worlds, objects, etc can be created, destroyed, manipulated, etc.

State is very rarely shared between the main thread and the render thread. Instead, each thread has its own representation, and when state changes on the main thread it is reflected over to the render thread. E.g., MeshObject, which represents a mesh with vertex buffers, materials, textures, shaders, skinning data, etc. to be rendered, is the main thread representation and RenderMeshObject is the corresponding render thread representation. All objects that have a representation on both the main and render thread are set up to work the same way:

class MeshObject : public RenderStateObject
{
};

class RenderMeshObject : public RenderObject
{
};

The corresponding render thread class is prefixed with Render. We use this naming convention for all objects that have both a main and a render thread representation.

The main thread objects inherit from RenderStateObject and the render thread objects inherit from RenderObject. These structs are defined as:

struct RenderStateObject
{
    uint32_t render_handle;
    StateReflection *state_reflection;
};

struct RenderObject
{
    uint32_t type;
};

The render_handle is an ID that identifies the corresponding object on the render thread. state_reflection is a stream of data that is used to propagate state changes from the main thread to the render thread. type is an enum used to identify the type of render objects.

Object creation

In Stingray a world is a container of renderable objects, physical objects, sounds, etc. On the main thread, it is represented by the World class, and on the render thread by a RenderWorld.

When a MeshObject is created in a world on the main thread, there's an explicit call to WorldRenderInterface::create() to create the corresponding render thread representation:

MeshObject *mesh_object = MAKE_NEW(_allocator, MeshObject);
_world_render_interface.create(mesh_object);

The purpose of the call to WorldRenderInterface::create is to explicitly create the render thread representation, acquire a render_handle and to post that to the render thread:

void WorldRenderInterface::create(MeshObject *mesh_object)
{
    // Get a unique render handle.
    mesh_object->render_handle = new_render_handle();

    // Set the state_reflection pointer, more about this later.
    mesh_object->state_reflection = &_state_reflection;

    // Create the render thread representation.
    RenderMeshObject *render_mesh_object = MAKE_NEW(_allocator, RenderMeshObject);

    // Pass the data to the render thread
    create_object(mesh_object->render_handle, RenderMeshObject::TYPE, render_mesh_object);
}

The new_render_handle function speaks for itself.

uint32_t WorldRenderInterface::new_render_handle()
{
    if (_free_render_handles.any()) {
        uint32_t handle = _free_render_handles.back();
        _free_render_handles.pop_back();
        return handle;
    } else
        return _render_handle++;
}

There is a recycling mechanism for the render handles and a similar pattern reoccurs at several places in the engine. The release_render_handle function together with the new_render_handle function should give the complete picture of how it works.

void WorldRenderInterface::release_render_handle(uint32_t handle)
{
    _free_render_handles.push_back(handle);
}
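
Put together as a standalone class, the recycling scheme looks roughly like this. HandleAllocator and its member names are hypothetical, and a std::vector is used as the free list; the engine's own container and naming will differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hands out small integer handles and reuses released ones before growing.
class HandleAllocator
{
public:
    uint32_t acquire()
    {
        // Prefer a recycled handle so the handle range stays dense.
        if (!_free.empty()) {
            const uint32_t handle = _free.back();
            _free.pop_back();
            return handle;
        }
        return _next++;
    }

    void release(uint32_t handle)
    {
        assert(handle < _next); // only handles that were issued can be released
        _free.push_back(handle);
    }

private:
    std::vector<uint32_t> _free;
    uint32_t _next = 0;
};
```

Keeping the handles dense matters because they are typically used as indices into lookup tables, so the tables never need to be larger than the peak number of live objects.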

There is one WorldRenderInterface per world which contains the _state_reflection that is used by the world and all of its objects to communicate with the render thread. The StateReflection in its simplest form is defined as:

struct StateReflection
{
    StateStream *state_stream;
};

The create_object function needs a bit more explanation though:

void WorldRenderInterface::create_object(uint32_t render_handle, RenderObject::Type type, void *user_data)
{
    // Allocate a message on the `state_stream`.
    ObjectManagementPackage *omp;
    alloc_message(_state_reflection.state_stream, WorldRenderInterface::CREATE, &omp);

    omp->object_type = RenderWorld::TYPE;
    omp->render_handle = render_handle;
    omp->type = type;
    omp->user_data = user_data;
}

What happens here is that alloc_message will allocate enough bytes to make room for a MessageHeader together with the size of ObjectManagementPackage in a buffer owned by the StateStream. The StateStream is defined as:

struct StateStream
{
    void *buffer;
    uint32_t capacity;
    uint32_t size;
};

capacity is the size of the memory pointed to by buffer, and size is the number of bytes currently allocated from buffer.

The MessageHeader is defined as:

struct MessageHeader
{
    uint32_t type;
    uint32_t size;
    uint32_t data_offset;
};

The alloc_message function will first place the MessageHeader and then comes the data, some ASCII to the rescue:

+-------------------------------------------------------------------+
| MessageHeader | data                                              |
+-------------------------------------------------------------------+
<- data_offset ->
<-                          size                                   ->

The size and data_offset mentioned in the ASCII are two of the members of MessageHeader. They are assigned during the alloc_message call:

template<class T>
void alloc_message(StateStream *state_stream, uint32_t type, T **data)
{
    uint32_t data_size = sizeof(T);

    uint32_t message_size = sizeof(MessageHeader) + data_size;

    // Allocate message and fill in the header.
    void *buffer = allocate(state_stream, message_size, alignof(MessageHeader));
    auto header = (MessageHeader*)buffer;

    header->type = type;
    header->size = message_size;
    header->data_offset = sizeof(MessageHeader);

    *data = (T*)memory_utilities::pointer_add(buffer, header->data_offset);
}

The buffer member of the StateStream will contain several consecutive chunks of message headers and data blocks.

+-----------------------------------------------------------------------+
| Header | data | Header | data | Header | data | Header | data | etc   |
+-----------------------------------------------------------------------+
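
To show how such a buffer can be written and then walked, here is a small sketch. It is a simplification built on my own assumptions: MiniStream, write_message, read_message and SetFlagsPayload are hypothetical names, the buffer is a growable std::vector instead of a fixed allocation with capacity/size, and memcpy is used for all reads and writes to sidestep alignment concerns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

struct MessageHeader { uint32_t type; uint32_t size; uint32_t data_offset; };

// Hypothetical growable stream of back-to-back [header | data] chunks.
struct MiniStream { std::vector<unsigned char> buffer; };

// Example payload used for illustration only.
struct SetFlagsPayload { uint32_t handle; uint32_t flags; };

// Appends a header followed by the payload bytes.
template <class T>
void write_message(MiniStream &stream, uint32_t type, const T &payload)
{
    static_assert(std::is_trivially_copyable<T>::value, "payload must be plain data");
    MessageHeader header;
    header.type = type;
    header.size = uint32_t(sizeof(MessageHeader) + sizeof(T));
    header.data_offset = uint32_t(sizeof(MessageHeader));

    const std::size_t at = stream.buffer.size();
    stream.buffer.resize(at + header.size);
    std::memcpy(&stream.buffer[at], &header, sizeof(header));
    std::memcpy(&stream.buffer[at + header.data_offset], &payload, sizeof(T));
}

// Reads the message at `cursor` and advances past it. Returns false at the end of the stream.
inline bool read_message(const MiniStream &stream, std::size_t &cursor,
                         MessageHeader &header, const unsigned char *&data)
{
    if (cursor >= stream.buffer.size())
        return false;
    assert(cursor + sizeof(MessageHeader) <= stream.buffer.size());
    std::memcpy(&header, &stream.buffer[cursor], sizeof(header));
    data = &stream.buffer[cursor + header.data_offset];
    cursor += header.size; // `size` covers header + data, so this lands on the next header
    return true;
}
```

Because every header stores the total size of its chunk, the reader can skip message types it does not understand without knowing their payload layout.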

This is the necessary code on the main thread to create an object and populate the StateStream which will later on be consumed by the render thread. A very similar pattern is used when changing the state of an object on the main thread, e.g.:

void MeshObject::set_flags(renderable::Flags flags)
{
    _flags = flags;

    // Allocate a message on the `state_stream`.
    SetVisibilityPackage *svp;
    alloc_message(state_reflection->state_stream, MeshObject::SET_VISIBILITY, &svp);

    // Fill in message information.
    svp->object_type = RenderMeshObject::TYPE;

    // The render handle that got assigned in `WorldRenderInterface::create`
    // to be able to associate the main thread object with its render thread 
    // representation.
    svp->handle = render_handle;

    // The new flags value.
    svp->flags = _flags;
}

Getting the recorded state to the render thread

Let's take a step back and explain what happens in the main update loop during the following code excerpt:

// Flush state changes recorded on the `StateStream` for each world to
// the rendering world representation.
unsigned n_worlds = _worlds.size();
for (uint32_t i = 0; i < n_worlds; ++i) {
    auto &world = *_worlds[i];
    _render_interface->update_world(world);
}

When Lua is done creating, destroying, manipulating, etc. objects during update(), each world's StateStream, which contains all the recorded changes, is ready to be sent over to the render thread for consumption. The call to RenderInterface::update_world() does just that. It roughly looks like:

void RenderInterface::update_world(World &world)
{
    UpdateWorldMsg uw;

    // Get the render thread representation of the `world`.
    uw.render_world = render_world_representation(world);

    // The world's current `state_stream` that contains all changes made 
    // on the main thread.
    uw.state_stream = world._world_reflection_interface.state_stream;

    // Create and assign a new `state_stream` to the world's `_world_reflection_interface`
    // that will be used for the next frame.
    world._world_reflection_interface.state_stream = new_state_stream();

    // Post a message to the render thread to update the world.
    post_message(UPDATE_WORLD, &uw);
}

This function will create a new message and post it to the render thread. The world being flushed and its StateStream are stored in the message and a new StateStream is created that will be used for the next frame. This new StateStream is set on the WorldRenderInterface of the World, and since all objects being created got a pointer to the same WorldRenderInterface they will use the newly created StateStream when storing state changes for the next frame.

Render thread

The render thread is spinning in a message loop:

void RenderInterface::render_thread_entry()
{
    while (!_quit) {
        // If there's no message -- put the thread to sleep until there's
        // a new message to consume.
        RenderMessage *message = get_message();

        void *message_data = data(message);
        switch (message->type) {
            case UPDATE_WORLD:
                internal_update_world((UpdateWorldMsg*)(message_data));
                break;

            // ... And a lot more case statements to handle different messages. There
            // are other threads than the main thread that also communicate with the
            // render thread. E.g., the resource loading happens on its own thread
            // and will post messages to the render thread.
        }
    }
}

The internal_update_world() function is defined as:

void RenderInterface::internal_update_world(UpdateWorldMsg *uw)
{
    // Call update on the `render_world` with the `state_stream` as argument.
    uw->render_world->update(uw->state_stream);

    // Release and recycle the `state_stream`.
    release_state_stream(uw->state_stream);
}

It calls update() on the RenderWorld with the StateStream and when that is done the StateStream is released to a pool.
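The post does not show the pool itself, so here is a minimal sketch of what releasing and recycling could look like. `StateStreamPool`, `acquire()`, `release()` and `free_count()` are hypothetical names, and the sketch ignores synchronization; a real pool shared between the main thread and the render thread would need a lock or a lock-free queue.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a stream's backing storage.
struct StateStream {
    std::vector<char> buffer;
};

class StateStreamPool {
public:
    // Hand out a recycled stream if one is available, otherwise allocate.
    std::unique_ptr<StateStream> acquire() {
        if (_free.empty())
            return std::make_unique<StateStream>();
        std::unique_ptr<StateStream> stream = std::move(_free.back());
        _free.pop_back();
        return stream;
    }

    // Reset the stream and keep it around for the next acquire().
    void release(std::unique_ptr<StateStream> stream) {
        assert(stream && "cannot release a null stream");
        stream->buffer.clear(); // clear() keeps the allocated capacity
        _free.push_back(std::move(stream));
    }

    std::size_t free_count() const { return _free.size(); }

private:
    std::vector<std::unique_ptr<StateStream>> _free;
};
```

Recycling this way means a stream that grew large during a heavy frame keeps its capacity, so steady-state frames should not need to allocate.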

void RenderWorld::update(StateStream *state_stream)
{
    MessageHeader *message_header;
    StatePackageHeader *package_header;

    // Consume a message and get the `message_header` and `package_header`.
    while (get_message(state_stream, &message_header, (void**)&package_header)) {
        switch (package_header->object_type) {
            case RenderWorld::TYPE:
            {
                auto omp = (WorldRenderInterface::ObjectManagementPackage*)package_header;
                // The call to `WorldRenderInterface::create` created this message.
                if (message_header->type == WorldRenderInterface::CREATE)
                    create_object(omp);
                break;
            }
            case RenderMeshObject::TYPE:
            {
                if (message_header->type == MeshObject::SET_VISIBILITY) {
                    auto svp = (MeshObject::SetVisibilityPackage*)package_header;

                    // The `render_handle` is used to do a lookup in `_object_lut`
                    // to get the `object_index`.
                    uint32_t object_index = _object_lut[package_header->render_handle];

                    // Get the `render_object`.
                    void *render_object = _objects[object_index];

                    // Cast it since the type is already given from the `object_type`
                    // in the `package_header`.
                    auto rmo = (RenderMeshObject*)render_object;

                    // Call update on the `RenderMeshObject`.
                    rmo->update(message_header->type, svp);
                }
                break;
            }
            // ... And a lot more case statements to handle different kind of messages.
        }
    }
}

The above is mostly infrastructure to extract messages from the StateStream. It can be a bit involved since a lot of stuff is written out explicitly but the basic idea is hopefully simple and easy to understand.

On to the create_object call done when (message_header->type == WorldRenderInterface::CREATE) is satisfied:

void RenderWorld::create_object(WorldRenderInterface::ObjectManagementPackage *omp)
{
    // Acquire an `object_index`.
    uint32_t object_index = _objects.size();

    // Same recycling mechanism as seen for render handles.
    if (_free_object_indices.any()) {
        object_index = _free_object_indices.back();
        _free_object_indices.pop_back();
    } else {
        _objects.resize(object_index + 1);
        _object_types.resize(object_index + 1);
    }

    void *render_object = omp->user_data;
    if (omp->type == RenderMeshObject::TYPE) {
        // Cast the `render_object` to a `RenderMeshObject`.
        RenderMeshObject *rmo = (RenderMeshObject*)render_object;

        // If needed, do more stuff with `rmo`.
    }

    // Store the `render_object` and `type`.
    _objects[object_index] = render_object;
    _object_types[object_index] = omp->type;

    if (omp->render_handle >= _object_lut.size())
        _object_lut.resize(omp->render_handle + 1);
    // The `render_handle` is used as the key into `_object_lut`.
    _object_lut[omp->render_handle] = object_index;
}

The takeaway from the code above is the general usage of the render_handle and the object_index. An object's render_handle is used to do a lookup in _object_lut to get the object_index, which in turn indexes _objects and _object_types. Let's look at an example: the same RenderWorld::update code presented earlier, but this time focusing on the case where the message is MeshObject::SET_VISIBILITY:

void RenderWorld::update(StateStream *state_stream)
{
    StateStream::MessageHeader *message_header;
    StatePackageHeader *package_header;

    while (get_message(state_stream, &message_header, (void**)&package_header)) {
        switch (package_header->object_type) {
            case RenderMeshObject::TYPE:
            {
                if (message_header->type == MeshObject::SET_VISIBILITY) {
                    auto svp = (MeshObject::SetVisibilityPackage*)package_header;

                    // The `render_handle` is used to do a lookup in `_object_lut`
                    // to get the `object_index`.
                    uint32_t object_index = _object_lut[package_header->render_handle];

                    // Get the `render_object` from the `object_index`.
                    void *render_object = _objects[object_index];

                    // Cast it since the type is already given from the `object_type`
                    // in the `package_header`.
                    auto rmo = (RenderMeshObject*)render_object;

                    // Call update on the `RenderMeshObject`.
                    rmo->update(message_header->type, svp);
                }
                break;
            }
        }
    }
}
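To see the handle indirection without the message plumbing, the lookup table can be pulled out into a small stand-alone class. This is a sketch under assumptions: `ObjectTable`, `destroy()`, `lookup()` and the `INVALID` sentinel are hypothetical and not taken from the engine, but the create path follows the same steps as `create_object` above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

class ObjectTable {
public:
    enum : uint32_t { INVALID = 0xFFFFFFFFu };

    // Store `object` and map `render_handle` to the slot it ended up in.
    uint32_t create(uint32_t render_handle, void *object, uint32_t type) {
        uint32_t object_index;
        if (!_free_object_indices.empty()) {
            // Reuse a slot that an earlier destroy() gave back.
            object_index = _free_object_indices.back();
            _free_object_indices.pop_back();
        } else {
            object_index = static_cast<uint32_t>(_objects.size());
            _objects.resize(object_index + 1);
            _object_types.resize(object_index + 1);
        }
        _objects[object_index] = object;
        _object_types[object_index] = type;

        if (render_handle >= _object_lut.size())
            _object_lut.resize(render_handle + 1, INVALID);
        _object_lut[render_handle] = object_index;
        return object_index;
    }

    // Unmap the handle and put its slot on the free list.
    void destroy(uint32_t render_handle) {
        assert(render_handle < _object_lut.size());
        const uint32_t object_index = _object_lut[render_handle];
        assert(object_index != INVALID && "handle is not mapped");
        _objects[object_index] = nullptr;
        _object_lut[render_handle] = INVALID;
        _free_object_indices.push_back(object_index);
    }

    // render_handle -> object_index -> object (and optionally its type).
    void *lookup(uint32_t render_handle, uint32_t *type = nullptr) const {
        if (render_handle >= _object_lut.size())
            return nullptr;
        const uint32_t object_index = _object_lut[render_handle];
        if (object_index == INVALID)
            return nullptr;
        if (type)
            *type = _object_types[object_index];
        return _objects[object_index];
    }

private:
    std::vector<void*> _objects;
    std::vector<uint32_t> _object_types;
    std::vector<uint32_t> _object_lut;
    std::vector<uint32_t> _free_object_indices;
};
```

The indirection keeps `_objects` dense and lets slots be reused, while the handle that the main thread holds stays stable for the lifetime of the object.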

The state reflection pattern shown in this post is a fundamental part of the engine. Similar patterns appear in other places as well and having a good understanding of this pattern makes it much easier to understand the internals of the engine.

Tuesday, September 6, 2016

A New Localization System for Stingray

The current Stingray localization system is based around the concept of properties. A property is any period separated part of the file name before the extension. Consider the following three files:

  • trees/larch_03.unit
  • trees/larch_03.fr.unit
  • trees/larch_03.ps4.unit

These three files all have the same type (.unit), and the same name (trees/larch_03), but their properties differ. The first one has no properties set. The second one has the property .fr and the last one has the property .ps4. (Note that resources can have more than one property.)
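As an illustration of that naming rule, a path can be split into name, properties and type like this. It is a sketch, not the engine's parser: `ResourcePath` and `split_resource_path` are hypothetical names, the properties are stored with their leading period to match how they are written in this post, and only the file part of the path is examined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ResourcePath {
    std::string name;                    // e.g. "trees/larch_03"
    std::vector<std::string> properties; // e.g. {".fr"}
    std::string type;                    // e.g. "unit"
};

ResourcePath split_resource_path(const std::string &path)
{
    // Only look for periods after the last directory separator.
    const std::size_t slash = path.find_last_of('/');
    const std::size_t file_start = (slash == std::string::npos) ? 0 : slash + 1;

    // Split the file part on periods.
    std::vector<std::string> parts;
    std::size_t begin = file_start;
    for (;;) {
        const std::size_t dot = path.find('.', begin);
        if (dot == std::string::npos) {
            parts.push_back(path.substr(begin));
            break;
        }
        parts.push_back(path.substr(begin, dot - begin));
        begin = dot + 1;
    }
    assert(parts.size() >= 2 && "expected at least <name>.<type>");

    // First part is the name, last part is the type, the rest are properties.
    ResourcePath result;
    result.name = path.substr(0, file_start) + parts.front();
    result.type = parts.back();
    for (std::size_t i = 1; i + 1 < parts.size(); ++i)
        result.properties.push_back("." + parts[i]);
    return result;
}
```

Because every period is treated as a separator, a file whose base name legitimately contains a period gets misread as having properties, which is the first drawback discussed further down.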

Properties are resolved in slightly different ways, depending on the kind of property. Platform properties are resolved at compile time, so if you compile for PS4, you will get the PS4 version of the resource (or the default version if there is no .ps4 specific version).

Other properties are resolved at resource load time. When you load a bunch of resources, which property variant is loaded depends on a global property preference order set from the script. A property preference order of ['.fr', '.es'] means that resources with the property .fr are preferred, then resources with the property .es (if no .fr resource is available), and finally a resource without any properties at all.
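The preference order boils down to a first-match search. The sketch below assumes that each resource knows which property variants exist for it and that the default variant is represented by an empty string; `resolve_variant` is a hypothetical name, not an engine function.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns the property of the variant to load; "" means the default
// variant. `available` lists the properties for which a variant exists.
std::string resolve_variant(const std::set<std::string> &available,
                            const std::vector<std::string> &preference_order)
{
    assert(!available.empty() && "a resource needs at least one variant");
    for (const std::string &property : preference_order) {
        if (available.count(property))
            return property;
    }
    // Nothing in the preference order matched: fall back to the default.
    return "";
}
```

With the order `['.fr', '.es']`, a resource that only has a default and an `.es` variant resolves to `.es`, and a resource with only a default and a `.de` variant resolves to the default.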

This single mechanism is used for localizing strings, sounds, textures, etc. Strings, for example, are stored in .strings files, which are essentially just key-value stores:

file = "File"
open = "Open"
...

To create a French localized version of this menu.strings resource, you just create a menu.fr.strings resource and fill it with:

file = "Fichier"
open = "Ouvert"
...

This basic localization system has served us well for many years, but it has some drawbacks that are starting to become more pronounced:

  • It doesn't allow file names with periods in them. Since we always interpret periods as properties, periods can't be a part of the regular file name. This isn't a huge problem when users name their own files, but as we are increasing the interoperability between Stingray and other software packages we more and more run into software that has, let's say peculiar, ways of naming its files. Renaming things by hand is cumbersome and can also break things when files cross-reference each other.

  • Switching language requires reloading the resource packages. This seems overly complicated. We have more memory these days than when we started building Stingray. In many cases, especially for strings, it makes more sense to keep them in memory all the time, so we can switch between them easily.

  • Just switching on platform isn't enough. Mobile devices range from very low-end to at least mid-end. Rather than having .ios and .android properties, we might want .low-quality and .high-quality and select which one to use based on the actual capabilities of the hardware.

  • Making editors work well with the property system has been surprisingly complicated. For example, when the editor runs on Windows, what should it show if there is a .win32 specialization of a resource -- the default version or the .win32 one? How would you edit a .ps4 resource when those are normally stripped out of the Windows runtime?

    We used to have this wonky thing where you could sort of cross-compile the resources and say that "I want to run on Windows, but as if I was running on PS4". But to be honest, that system never really worked that well and in the new editor we have gotten rid of it.

Interestingly, out of all these problems, it is the first one -- the most stupid one -- that is the main impetus for change.

The New System

The new system has several parts. First, we decided that for systems that deal with localization a lot, such as strings and sounds, it makes sense to have the system actually be aware of localization. That way, we can provide the best possible experience.

So the .strings format has changed to:

file = {en = "File", fr = "Fichier", ...}
open = {en = "Open", fr = "Ouvert", ...}
...

All the languages are stored in the same file and to switch language you just call Localizer.set_language("fr"). We keep all the different languages in memory at all times. Even for a game with ridiculous amounts of text this still doesn't use much memory and it means we can hot-swap languages instantly.
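A minimal sketch of such a table, with every language resident and the active one selected by a single string, could look like the following. The C++ `Localizer` class here is a hypothetical stand-in that only mirrors the Lua call above, and falling back to the key when a translation is missing is an assumption of the sketch, not documented engine behavior.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

class Localizer {
public:
    // Register the text for `key` in `language`.
    void add(const std::string &key, const std::string &language, std::string text) {
        _strings[key][language] = std::move(text);
    }

    // Switching language is just changing which column lookups read from.
    void set_language(std::string language) {
        assert(!language.empty() && "language code must not be empty");
        _language = std::move(language);
    }

    // Assumption: return the key itself if there is no translation.
    std::string lookup(const std::string &key) const {
        const auto entry = _strings.find(key);
        if (entry == _strings.end())
            return key;
        const auto text = entry->second.find(_language);
        return text == entry->second.end() ? key : text->second;
    }

private:
    std::map<std::string, std::map<std::string, std::string>> _strings;
    std::string _language = "en";
};
```

Nothing is loaded or unloaded when the language changes, which is what makes the hot swap instant.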

This is a nice approach, but it doesn't work for all resources. We don't want to add this deep kind of integration to resources that are normally not localized, such as .unit and .texture. Still, there sometimes is a need to localize such resources. For example, a .texture might have text in it that needs to be localized. We may need a low-poly version of a .unit for a less capable platform. Or a less gory version of an animation for countries with stricter age ratings.

To make things easier for the editor we decided to ditch the property system altogether, and instead go for a substitution strategy. There are no special magical parts of a resource's path -- it is just a name and a type. But if you want to, you can tell the engine that all instances of a certain resource should be replaced with another resource:

trees/larch_03.unit → trees/larch_03_ps4.unit

Note here that there is nothing special or magical about the trees/larch_03_ps4.unit. There is no problem with displaying it on Windows. You just edit it in the editor, like any other unit. However, when you play the game -- any time a trees/larch_03.unit is requested by the engine, a trees/larch_03_ps4.unit is substituted. So if you have authored a level full of larch_03 units, when the override above is in place, you will instead see larch_03_ps4 units.

There are many ways for this scheme to go wrong. The gameplay script might expect to find a certain node branch_43 in the unit -- a node that exists in larch_03.unit, but not in larch_03_ps4.unit and this may lead to unexpected behavior. The same problem existed in the old property system. We don't try to do anything special about this, because it is impossible. In the end, it is only the gameplay script that can know what it means for two things to be similar enough to be used interchangeably. Anyone working with localized resources just has to be careful not to break things.

Overrides can be specified from the Lua script:

Application.set_resource_override("unit", "trees/larch_03", "trees/larch_03_ps4");

Note that this is a much more powerful system than the old property system. Any resource can be set to override any other -- we are not restricted to work within the strict naming scheme required by the property system. Also, the override is dynamic and can be determined at runtime. So it can be based on dynamic properties, such as measured CPU or GPU performance -- or a user setting for the amount of gore they are comfortable with.
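On the engine side, a dynamic override can be thought of as a map that is consulted whenever a resource is requested. The sketch below is an assumption about how that could be structured, not the actual implementation: `ResourceOverrides`, `set_override` and `resolve` are hypothetical names, and the lookup is deliberately a single step, so an override of an override is not followed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

class ResourceOverrides {
public:
    // Request that (`type`, `name`) is substituted with (`type`, `override_name`).
    void set_override(const std::string &type, const std::string &name,
                      const std::string &override_name) {
        assert(name != override_name && "a resource cannot override itself");
        _overrides[std::make_pair(type, name)] = override_name;
    }

    // Returns the name to actually load when `name` is requested.
    std::string resolve(const std::string &type, const std::string &name) const {
        const auto it = _overrides.find(std::make_pair(type, name));
        return it == _overrides.end() ? name : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, std::string> _overrides;
};
```

Keying on both type and name means that overriding the unit `trees/larch_03` leaves a texture with the same name untouched.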

It can even be used for completely different things than localization or platform specific resources -- such as replacing the units in a level for a night-time or psychedelic version of the same level. And I'm sure our users will find many other ways of (ab)using this mechanism.

But this dynamic system is not quite enough to do everything we want to do.

First, since the override is dynamic and only happens at runtime, our packaging system can't be aware of it. Normally, our packaging system figures out all resource dependencies automatically. So when you say that you want a package with the forest level, the packaging system will automatically pull in the larch_03 unit that is used in that level, any textures used by that unit, etc. But since the packaging system can't know that at runtime you will replace larch_03 with larch_03_ps4, it doesn't know that larch_03_ps4 and its dependencies should go into the package as well.

You could add larch_03_ps4 to the package manually, since you know it will be used. That might work if you only have one or two overrides. However, even with a small number of overrides, micromanaging packages in this way becomes incredibly tedious and error-prone.

Second, we don't want to burden the packages with resources that will never be used. If we are making a game for digital distribution on iOS or Android we don't want to include large PS4-only resources in that game.

So we need a static override mechanism that is known by the package manager to make sure it includes and excludes the right resources. The simplest thing would be a big file that just listed all the overrides. For example, to override larch_03 on PS4 we would write something like:

resource_overrides = [
  {
    type = "unit"
    name = "trees/larch_03"
    override = "trees/larch_03_ps4"
    platforms = ["ps4"]
  }
]

This would work, but could again get pretty tedious if there are a lot of overrides. It would be nice with something that was a bit more automatic.

Since our users are already used to using name suffixes such as .fr and .ps4 for localization, we decided to build on the same mechanism -- creating overrides automatically based on suffix rules:

resource_overrides = [
  {suffix = "_ps4", platforms = ["ps4"]}
]

This rule says that when we are compiling for the platform PS4, if we find a resource that has the same name as another resource, but with the added suffix _ps4, that resource will automatically be registered as an override for that resource:

trees/larch_03.unit → trees/larch_03_ps4.unit
leaves/larch_leaves.texture → leaves/larch_leaves_ps4.texture
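The matching step for a single suffix rule can be sketched as follows. `Resource`, `Override` and `find_suffix_overrides` are hypothetical names for this illustration, and the sketch assumes the compiler has a flat list of all resources in the project to scan.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Resource { std::string type, name; };
struct Override { std::string type, name, override_name; };

// For every resource named <base><suffix> for which a resource <base>
// of the same type exists, register <base><suffix> as an override of <base>.
std::vector<Override> find_suffix_overrides(const std::vector<Resource> &resources,
                                            const std::string &suffix)
{
    assert(!suffix.empty() && "an empty suffix would match every resource");

    std::set<std::pair<std::string, std::string>> known;
    for (const Resource &r : resources)
        known.insert(std::make_pair(r.type, r.name));

    std::vector<Override> result;
    for (const Resource &r : resources) {
        if (r.name.size() <= suffix.size())
            continue;
        const std::size_t base_length = r.name.size() - suffix.size();
        if (r.name.compare(base_length, suffix.size(), suffix) != 0)
            continue; // name does not end with the suffix
        const std::string base = r.name.substr(0, base_length);
        if (known.count(std::make_pair(r.type, base)))
            result.push_back(Override{r.type, base, r.name});
    }
    return result;
}
```

A resource that ends in `_ps4` but has no un-suffixed counterpart is left alone, so the rule never invents overrides for resources that merely happen to share the suffix.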

In addition to platform settings, the system also generalizes to support other flags:

resource_overrides = [
  {suffix = "_fr", flags = ["fr"]}
  {suffix = "_4k", flags = ["4K"]}
  {suffix = "_noblood", flags = ["noblood", "PG-13"]}
]

This defines the _fr suffix for French localization, a _4k suffix for high-quality versions of resources suitable for 4K monitors, and a _noblood suffix that selects resources without blood and gore.

The flags can be set at compile time with:

--compile --resource-flag-true 4K

This means that we are compiling a 4K version of the game, so when bundling only the 4K resources will be included and the other versions will be stripped out. Just as if we were compiling for a specific platform.

But we can also choose to resolve the flags at runtime:

--compile --resource-flag-runtime noblood

With this setting, both the regular resource and the _noblood resource will be included in the package and loaded into memory. And we can hot swap between them with:

Application.set_resource_flag("noblood", true)
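The difference between the two modes comes down to what ends up in the package for a resource that has a flagged override. The sketch below summarizes that as a small decision function. `FlagState`, `Packaging` and `plan_packaging` are hypothetical names, and the behavior for a flag that is not set at all (ship only the regular resource) is an assumption, since the post only describes the other two cases.

```cpp
#include <cassert>

enum class FlagState { Unset, True, Runtime };

struct Packaging {
    bool include_base;          // ship the regular resource
    bool include_override;      // ship the suffixed resource
    bool switchable_at_runtime; // can be hot swapped with set_resource_flag
};

Packaging plan_packaging(FlagState state)
{
    switch (state) {
        case FlagState::True:
            // --resource-flag-true: the override replaces the regular
            // resource at compile time and the regular one is stripped.
            return Packaging{false, true, false};
        case FlagState::Runtime:
            // --resource-flag-runtime: both are packaged and loaded.
            return Packaging{true, true, true};
        case FlagState::Unset:
            // Assumption: without the flag only the regular resource ships.
            return Packaging{true, false, false};
    }
    assert(false && "unhandled FlagState");
    return Packaging{true, false, false};
}
```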

I have not decided yet whether in addition to these two alternatives we should also have an option that resolves at package load time. I.e., both variants of the resource would be included on disk, but only one of them would be loaded into memory and if you wanted to switch resource you would have to unload the package and load it back into memory again.

I can see some use cases for this, but on the other hand adding more options complicates the system and I like to keep things as simple as possible.

A nice thing about this suffix mapping is that it can be configured to be backwards compatible with the old property system:

resource_overrides = [
  {suffix = ".fr", flags = ["fr"]}
  {suffix = ".ps4", platforms = ["ps4"]}
  {suffix = ".xb1", platforms = ["xb1"]}
]

Whenever we change something in Stingray we try to make it more flexible and data-driven, while at the same time ensuring that the most common cases are still easy to work with. This rewrite of the localization is a good example:

  • It fixes the problem with periods in file names. Periods are now only an issue if you have made an explicit suffix mapping that matches them.

  • We can switch language (or any other resource setting) at runtime.

  • The new system is more flexible -- it doesn't just handle localization and platform specific resources, we can set up whatever resource categories we want. And we can even dynamically override individual resources.

  • The editor no longer needs to do anything special to deal with the concept of "properties". Resources that are used to override other resources can be edited in the editor just like any other resource.

  • And the system can easily be configured to be backwards compatible with the old localization system.

I still feel slightly queasy about using name matching to drive parts of this system. Name matching is a practice that can go horribly wrong. But in this case, since the name matching is completely user controlled I think it makes a good compromise between purity and usability.

Tuesday, August 16, 2016

Render Config Extensions


The rendering pipe in Stingray is completely data-driven, meaning that everything from which GPU buffers (render targets, etc.) are needed to compose the final rendered frame to the actual flow of the frame is described in the render_config file, a human-readable JSON file. I have covered this in various presentations [1,2] over the years, so I won’t go into more detail about it in this blog post. Instead I’d like to focus on a new feature that we are rolling out in Stingray v1.5: Render Config Extensions.

As Stingray is growing to cater to more industries than game development we see lots of feature requests that don’t necessarily fit in with our ideas of what should go into the default rendering pipe that we ship with Stingray. This has made it apparent that we need a way of doing deep integrations of new rendering features without having to duplicate the entire render_config file.

This is where the render_config_extension files come into play. A render_config_extension is very similar to the main render_config except that instead of having to describe the entire rendering pipe it appends and inserts different JSON blocks into the main render_config.

When the engine starts, the boot ini-file specifies what render_config to use as well as an array of render_config_extensions to load when setting up the renderer.

render_config = "core/stingray_renderer/renderer"
render_config_extensions = ["clouds-resources/clouds", "prism/prism"]

The array describes the initialization order of the extensions, which makes it possible for the project author to control how the different extensions stack on top of each other. It also makes it possible to build extensions that depend on other extensions.

A render_config_extension consists of two root blocks: append and insert_at:

append

The append block is used for everything that is order independent and allows you to append data to the following root blocks of the main render_config:

  • shader_libraries – lists additional shader_libraries to load
  • render_settings – add more render_settings (quality settings, debug flags, etc.)
  • shader_pass_flags – add more shader_pass_flags (used by shader system to dynamically turn on/off passes)
  • global_resources – additional global GPU resources to allocate on boot
  • resource_generators – expose new resource_generators
  • viewports – expose new viewport templates
  • lookup_tables – append to the list of resource_generators to execute when booting the renderer (mainly used for generating lookup tables)

One thing to note about extending these blocks is that we currently do not do any kind of name collision checking, so using a prefix to mimic a namespace for your extension is probably a good idea.

// example append block from JP's volumetric clouds plugin
append = {
  render_settings = {
    clouds_enabled = true
    clouds_raw_data_visualization = false
    clouds_weather_data_visualization = false
  }

  shader_libraries = [
    "clouds-resources/clouds"       
  ]

  global_resources = [
    // Clouds modelling resources:
    { name="clouds_result_texture1" type="render_target" image_type="image_3d" width=256 height=256 layers=256 format="R8G8B8A8" }
    { name="clouds_result_texture2" type="render_target" image_type="image_3d" width=64 height=64 layers=64 format="R8G8B8A8" }
    { name="clouds_result_texture3" type="render_target" image_type="image_2d" width=128 height=128 format="R8G8B8A8" }
    { name="clouds_weather_texture" type="render_target" image_type="image_2d" width=256 height=256 format="R8G8B8A8" }
  ]
}

insert_at

The insert_at block allows you to insert layers and modifiers into already existing layer_configurations and resource_generators, either belonging to the main render_config file or to a render_config_extension listed earlier in the render_config_extensions array of the engine boot ini-file.

// example insert_at block from JP's volumetric clouds plugin
insert_at = {
  post_processing_development = {
    modifiers = [
      { type="dynamic_branch" render_settings={ clouds_weather_data_visualization=true }
        pass = [
          { type="fullscreen_pass" shader="debug_weather" input=["clouds_weather_texture"] output=["output_target"]  }
        ]
      }
    ]
  }

  skydome = {
    layers = [
      { resource_generator="clouds_modifier" profiling_scope="clouds" }
    ]
  }
}

The object names under the insert_at block refer to extension_insertion_points listed in the main render_config file or one of the previously loaded render_config_extension files. We’ve chosen not to allow extensions to inject anywhere they like (using line numbers or similar craziness); instead we expose a bunch of extension “hooks” at various places in the main render_config file. By doing this we hope to have a somewhat better chance of not breaking existing extensions as we continue to develop and potentially do bigger refactorings of the default render_config file.
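To make the ordering rule concrete, here is a small sketch of how named insertion points and load order interact. It is a model for illustration only: `RenderConfig`, `Extension` and `apply_extension` are hypothetical, entries are reduced to plain strings, and appending at the end of a hook (rather than some other position) is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct RenderConfig {
    // Insertion point name -> entries inserted there, in load order.
    std::map<std::string, std::vector<std::string>> insertion_points;
};

struct Extension {
    std::string name;
    // Entries to insert, keyed by the insertion point they target.
    std::map<std::string, std::vector<std::string>> insert_at;
    // Hooks this extension exposes to extensions loaded after it.
    std::vector<std::string> new_insertion_points;
};

// Returns false and leaves `config` untouched if the extension targets
// an insertion point that has not been exposed yet.
bool apply_extension(RenderConfig &config, const Extension &ext)
{
    assert(!ext.name.empty() && "extensions are identified by name");

    for (const auto &entry : ext.insert_at) {
        if (!config.insertion_points.count(entry.first))
            return false;
    }
    for (const auto &entry : ext.insert_at) {
        std::vector<std::string> &target = config.insertion_points[entry.first];
        target.insert(target.end(), entry.second.begin(), entry.second.end());
    }
    for (const std::string &hook : ext.new_insertion_points)
        config.insertion_points[hook]; // creates an empty hook if missing
    return true;
}
```

An extension that targets a hook exposed by another extension only applies cleanly when it appears later in the render_config_extensions array, which is why the array order matters.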

Future work

This extension mechanism is somewhat of an experiment and we might need to rethink parts of it in a later version of Stingray. We’ve briefly discussed a potential need for dealing with versioning, i.e. allowing extensions to explicitly list what versions of Stingray they are compatible with (and maybe also allow extensions to have deviating implementations depending on version). Some kind of enforced name spacing and more aggressive validation to avoid name collisions have also been debated.

In the end we decided to ignore these potential problems for now and instead push for getting a first version out in 1.5 to unblock plugin developers and internal teams wanting to do efficient “deep” integrations of various rendering features. Hopefully we won’t regret this decision too much later on. ;)

References

  • [1] Flexible Rendering for Multiple Platforms (Tobias Persson, GDC 2012)
  • [2] Benefits of data-driven renderer (Tobias Persson, GDC 2011)