bitsquid: development blog

Physical Cameras in Stingray (2017-09-28)
<p>This is a quick post to share some of the progress <a href="https://twitter.com/olivier_dionne">Olivier Dionne</a> and I made lately with Physical Cameras in Stingray. Our goal of implementing a solid physically based pipeline has always been split into three phases. First we <a href="http://bitsquid.blogspot.ca/2017/07/validating-materials-and-lights-in.html">validated our standard material</a>. We then added physical lights. And now we are wrapping it up with a physical camera.</p>
<p>We define a physical camera as an entity controlled by the same parameters a real world camera would use. These parameters are split into two groups which correspond to the two main parts of a camera. The camera <em>body</em> is defined by its sensor size, ISO sensitivity, and a range of available shutter speeds. The camera <em>lens</em> is defined by its focal length, focus range, and range of aperture diameters. Setting all of these parameters should expose the incoming light the same way a real world camera would.</p>
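<p>As a rough illustration (the field names below are made up for this post, not the actual component properties), the two groups of parameters could be described with data along these lines:</p>
<pre><code>-- Hypothetical data only: representative values for the two parameter groups
-- described above. Field names and units are illustrative.
local camera_body = {
    sensor_height = 24.0,        -- sensor height in mm (full-frame)
    iso = 100,                   -- sensor sensitivity
    shutter_time = 1.0 / 125.0   -- seconds, picked from the available range
}
local camera_lens = {
    focal_length = 35.0,         -- mm
    focus = 2.0,                 -- focus distance in meters, within the focus range
    aperture = 2.8               -- f-stop, within the available aperture range
}
</code></pre>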
<h3><a href="#stingray-representation" aria-hidden="true" class="anchor" id="user-content-stingray-representation"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Stingray Representation</h3>
<p>Just like our physical light, our camera is expressed as an entity with a bunch of components. The two main components are the Camera Body and the Camera Lens. We then have a transform component and a camera component which together represent the view projection matrix of the camera. After that we have a list of shading environment components which we deem relevant to be controlled by a physical camera (all post effects relevant to a camera). The state of these shading environment components is controlled through a script component called the "Physical Camera Properties Mapper" (more on this later). Here is a glimpse of what the Physical Camera entity may look like (wip):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNFmtzaMUWBTgrFLfjNJhRW8UtCQVvo_0CQs3fEAINJHT3274ms864Myu3ffgx9_5sn86IMEyIofw4jc6drGnBvfgUGQ4rTvXl8DHupzkyEJoiLJtWj0EPC1hv7nda4kNns75GWi_XO7St/s1600/res11.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNFmtzaMUWBTgrFLfjNJhRW8UtCQVvo_0CQs3fEAINJHT3274ms864Myu3ffgx9_5sn86IMEyIofw4jc6drGnBvfgUGQ4rTvXl8DHupzkyEJoiLJtWj0EPC1hv7nda4kNns75GWi_XO7St/s640/res11.jpg" width="640" height="392" data-original-width="773" data-original-height="473" /></a></div>
<p>So while there are a lot of components that belong to a physical camera, the user is expected to interact mainly with the body and the lens components.</p>
<h3><a href="#post-effects" aria-hidden="true" class="anchor" id="user-content-post-effects"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Post Effects</h3>
<p>A lot of our post effects are dedicated to simulating some sort of camera/lens artifact (DOF, motion blur, film grain, vignetting, bloom, chromatic aberration, etc). One thing we wanted was the ability for physical cameras to override the post processes defined in our global shading environments. We also wanted to let users easily opt out of the physically based mapping that occurs between a camera and its corresponding post-effect. For example a physical camera will generate an accurate circle of confusion for the depth of field effect, but a user might be frustrated by the limitations imposed by a physically correct DOF effect. In this case the user can opt out by simply deleting the "Depth Of Field" component from the camera entity.</p>
<p>It's nice to see how the expressiveness of the Stingray entity system is shaping up and how it enables us to build these complex entities without the need to change much of the engine.</p>
<h3><a href="#properties-mapper" aria-hidden="true" class="anchor" id="user-content-properties-mapper"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Properties Mapper</h3>
<p>All of the mapping occurs in the properties mapper component which I mentioned earlier. This is simply a Lua script that gets executed whenever any of the entity properties are edited.</p>
<p>The most important property we wanted to map was the exposure value. We wanted the f-stop, shutter speed, and ISO values to map to an exposure value which would simulate how a real camera sensor reacts to incoming light. Lucky for us this topic is very well covered by Sebastien Lagarde and Charles de Rousiers in their awesome awesome awesome <a href="https://seblagarde.files.wordpress.com/2015/07/course_notes_moving_frostbite_to_pbr_v32.pdf">Moving Frostbite to Physically Based Rendering</a> document. The mapping basically boils down to:</p>
<pre><code>local function compute_ev(aperture, shutter_time, iso)
    -- EV100 from the camera settings (aperture is the f-number).
    local ev_100 = log2((aperture * aperture * 100) / (shutter_time * iso))
    -- Maximum luminance the sensor can capture at this EV100.
    local max_luminance = 1.2 * math.pow(2, ev_100)
    -- Exposure multiplier applied to the incoming luminance.
    return (1 / max_luminance)
end
</code></pre>
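<p>Continuing the snippet above, a quick sanity check (illustrative values, not engine code) with a "sunny 16" style setting:</p>
<pre><code>-- f/16, 1/100s, ISO 100 gives EV100 = log2(16*16*100 / (0.01*100)) ~= 14.6,
-- so the returned exposure multiplier is roughly 1 / (1.2 * 2^14.6) ~= 3e-5.
local exposure = compute_ev(16.0, 1.0 / 100.0, 100.0)
</code></pre>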
<p>The second property we were really interested in mapping was the field of view of the camera. Usually the horizontal FOV is calculated as <em>2 x atan(h/2f)</em> where <em>h</em> is the camera sensor's width and <em>f</em> is the current focal length of the lens. This by itself gives a good approximation of the FOV of a lens, but as was pointed out by the <a href="https://youtu.be/FQMbxzTUuSg?t=50m12s">MGS5 & Fox Engine presentation</a>, the focus distance of the lens should also be considered when calculating the FOV from the camera properties.</p>
<p>Intuitively we thought that the change in the FOV was caused by a change in the effective focal length of the lens. Adjusting the focus usually shifts a group of lenses up and down the optical axis of a camera lens. Our best guess was that this shift would increase or decrease the effective focal length of the camera lens. Using this idea we were able to simulate the effect that changing the focus point has on the FOV of a camera:</p>
<iframe width="700" height="354" src="https://www.youtube.com/embed/KDwUi-vYYMQ" frameborder="0" allowfullscreen></iframe>
<pre><code>local function compute_fov(focal_length, film_back_height, focus)
    -- Remap the focus distance into [0, 1] over the focus range used here (0.38 to 5.0).
    local normalized_focus = (focus - 0.38)/(5.0 - 0.38)
    -- Offset added to the focal length as the focus moves towards its far end.
    local focal_length_offset = lerp(0.0, 1.0, normalized_focus)
    -- FOV = 2 * atan(h / 2f) using the adjusted focal length.
    return 2.0 * math.atan(film_back_height/(2.0 * (focal_length + focal_length_offset)))
end
</code></pre>
<p>While this gave us plausible results in some cases, it did not map accurately to a real world camera for certain lens settings. For example we can choose a focal length offset that gives a good FOV mapping for a zoom lens set to 24mm but incorrect FOV results when it's set to 70mm (see <a href="https://www.youtube.com/watch?v=KDwUi-vYYMQ&feature=youtu.be">video</a> above). This area of lens optics is one we would like to explore more in the future.</p>
<p>In the future we will map more camera properties to their corresponding post-effects. More on this in a follow-up post.</p>
<h3><a href="#validating-results" aria-hidden="true" class="anchor" id="user-content-validating-results"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Validating Results</h3>
<p>To validate our mappings we designed a small, controlled environment room that we re-created in Stingray. This idea was inspired by the "Conference Room" setup that was presented by Hideo Kojima in the <a href="https://youtu.be/FQMbxzTUuSg?t=20m22s">MGS5 & Fox Engine presentation</a>. We used our simplified environment room to compare our rendered results with real world photographs.</p>
<p>Controlled Environment:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuCurcamsq2Yqo1TUhCUntCChvNorP9JKtuUYb9Oa2ucu5EJR4_HB3REw_jWs197BVcUU5W1aaLKpkxmuv1qhrSMR9M2oTziZSJlbAKC4EdthdRrke530C2vRC2O7PkYKHG2RoPNXwNhbd/s1600/res4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuCurcamsq2Yqo1TUhCUntCChvNorP9JKtuUYb9Oa2ucu5EJR4_HB3REw_jWs197BVcUU5W1aaLKpkxmuv1qhrSMR9M2oTziZSJlbAKC4EdthdRrke530C2vRC2O7PkYKHG2RoPNXwNhbd/s640/res4.jpg" width="640" height="372" data-original-width="1099" data-original-height="638" /></a></div>
<p>Stingray Equivalent:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQ4cV2Y0W0az3X2gMXcfxRPBa_SUSYSx37GENF5Rsfsv8LOxWLgKKU8qCv0qWtRjlwRzdegkUMI8GNt4grfMdsJWPOov_1XXUq8aBGRSMIkDn3tTpj6yWWum1jaUZu0zGo8p_wpqJ1Rcpz/s1600/res3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQ4cV2Y0W0az3X2gMXcfxRPBa_SUSYSx37GENF5Rsfsv8LOxWLgKKU8qCv0qWtRjlwRzdegkUMI8GNt4grfMdsJWPOov_1XXUq8aBGRSMIkDn3tTpj6yWWum1jaUZu0zGo8p_wpqJ1Rcpz/s640/res3.jpg" width="640" height="372" data-original-width="1099" data-original-height="638" /></a></div>
<p>Since there is no convenient way to adjust the white balancing in Stingray, we decided to white balance our camera data and use a pure white light in our Stingray scene. We also decided to compare the photos and renders in linear space in the hope of minimizing potential sources of error.</p>
<p>White balancing our photographs:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfO0d-x94Z4IjaeUKZwyg53lEnQ7xF8JNlM5L54RRTmK94EF-H6XBJlBCkwBfEyfsY2B4uGlwTUDQC-EGgrgE7nq6VrkYImEgDFfqmk9_bnUXHiwOiaQHNclyTvmRR0n5SB8xqZfTeNOXR/s1600/res6.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfO0d-x94Z4IjaeUKZwyg53lEnQ7xF8JNlM5L54RRTmK94EF-H6XBJlBCkwBfEyfsY2B4uGlwTUDQC-EGgrgE7nq6VrkYImEgDFfqmk9_bnUXHiwOiaQHNclyTvmRR0n5SB8xqZfTeNOXR/s640/res6.gif" width="640" height="362" data-original-width="1336" data-original-height="756" /></a></div>
<p>Our very first comparisons were disappointing:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEcL2FwtrVk1tEy1L0gl4owICUeJz1kQh-7zOasQrizVQZADMIX5QR2kNakjay-EHr-DulitXyU7eQt3GSkPGnLK8c3ApB-SBfgd9uqZNB0ykLhTOanvzkR9JZy5DrSGidw5QIL2ppAFC-/s1600/res10.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEcL2FwtrVk1tEy1L0gl4owICUeJz1kQh-7zOasQrizVQZADMIX5QR2kNakjay-EHr-DulitXyU7eQt3GSkPGnLK8c3ApB-SBfgd9uqZNB0ykLhTOanvzkR9JZy5DrSGidw5QIL2ppAFC-/s640/res10.jpg" width="640" height="186" data-original-width="1600" data-original-height="465" /></a></div>
<p>We tracked down the difference in brightness to a problem with how we expressed our light intensity. We discovered that we had made the mistake of using the specified lumen value of our lights as their luminous intensity. The total luminous flux is expressed in lumens, but the luminous intensity (what the material shader is interested in) is actually the luminous flux per solid angle. So while we let users enter the "intensity" of lights in lumens, we need to map this value to luminous intensity. The mapping is done as <em>lumens/2π(1-cos(½α))</em> where <em>α</em> is the apex angle of the light. Lots of details can be found <a href="https://www.compuphase.com/electronics/candela_lumen.htm">here</a>. This works well for point lights and spot lights. In the future our directional lights will be assumed to be the sun or moon and will be expressed in lux, perhaps with a corresponding disk size.</p>
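<p>As a sketch of that mapping (assuming the light emits uniformly inside its cone; the function name is mine, not the engine's):</p>
<pre><code>-- Convert a user-facing lumen value to luminous intensity (candela) for a
-- light with apex angle `apex_angle` in radians. The solid angle of the cone
-- is 2*pi*(1 - cos(apex_angle/2)); a point light (apex angle of 2*pi) covers
-- the full sphere, 4*pi steradians.
local function lumens_to_candela(lumens, apex_angle)
    local solid_angle = 2.0 * math.pi * (1.0 - math.cos(apex_angle * 0.5))
    return lumens / solid_angle
end
</code></pre>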
<p>With this fix in place we started getting more encouraging results:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihU9SjPwlGPKVhz5Hu-FFKrXLDS8h9YEBzzcgwkliCThx81PNAc7IKJS1GkyOdPVu_yhd8jnS5RYTUAA1TAq2fW1xEXdZGRBdh28JRKQvIa9Ge62AglwpmNZHTsIQOxYOJdzTez_pp3GfN/s1600/res1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihU9SjPwlGPKVhz5Hu-FFKrXLDS8h9YEBzzcgwkliCThx81PNAc7IKJS1GkyOdPVu_yhd8jnS5RYTUAA1TAq2fW1xEXdZGRBdh28JRKQvIa9Ge62AglwpmNZHTsIQOxYOJdzTez_pp3GfN/s640/res1.jpg" width="640" height="186" data-original-width="1600" data-original-height="465" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDnywJ6LQEOD18u_SlPOf2NSIRYReaGRWlxeWgoKK2dPjTJ3NZW00zlnnrzPG0WBZ4BfkDojE4VEUjMs2Vrxl3zIUf662qV65V1XdCul_OQVZBnHzBGOA0Y53bxPVG4EHs9QRdxbybrJM/s1600/res2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTDnywJ6LQEOD18u_SlPOf2NSIRYReaGRWlxeWgoKK2dPjTJ3NZW00zlnnrzPG0WBZ4BfkDojE4VEUjMs2Vrxl3zIUf662qV65V1XdCul_OQVZBnHzBGOA0Y53bxPVG4EHs9QRdxbybrJM/s640/res2.jpg" width="640" height="186" data-original-width="1600" data-original-height="465" /></a></div>
<p>There is a lot left to do but this feels like a very good start to our physically based cameras. This is my last post on the Stingray blog but you can follow <a href="https://twitter.com/olivier_dionne">Olivier</a> on Twitter if you want to stay up to date with the advances made by the Stingray rendering team. Cheers!</p>

Notes On Screen Space HIZ Tracing (2017-08-14)
<p>Note: The <a href="https://github.com/greje656/Questions/blob/master/hiz.md">Markdown version</a> of this document is available and might have better formatting on phones/tablets.</p>
<p>The following is a small gathering of notes and findings that we made throughout the implementation of hiz tracing in screen space for ssr in Stingray. I recently heard a few claims regarding hiz tracing which motivated me to share some notes on the topic. Note that I also wrote about how we reproject reflections in a <a href="http://bitsquid.blogspot.ca/2017/06/reprojecting-reflections_22.html">previous entry</a> which might be of interest. Also note that I've included all the code at the bottom of the blog.</p>
<p>The original implementation of our hiz tracing method was basically a straight port of the "Hi-Z Screen-Space Tracing" described in <a href="https://www.crcpress.com/GPU-Pro-5-Advanced-Rendering-Techniques/Engel/p/book/9781482208634">GPU-Pro 5</a> by <a href="https://twitter.com/yasinuludag">Yasin Uludag</a>. The very first results we got looked something like this:</p>
<p>Original scene:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr1.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr1.jpg" alt="" style="max-width:100%;"></a></p>
<p>Traced ssr using hiz tracing:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr2.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr2.jpg" alt="" style="max-width:100%;"></a></p>
<h2><a id="user-content-artifacts" class="anchor" href="#artifacts" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Artifacts</h2>
<p>The weird horizontal stripes were reported when ssr was enabled in the Stingray editor. They only revealed themselves at certain resolutions (they would appear and disappear as the viewport got resized). I started writing some tracing visualization views to help me track each hiz trace event:</p>
<p><a href="https://github.com/greje656/Questions/blob/master/images/ssr-gif7.gif" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr-gif7.gif" alt="" style="max-width:100%;"></a></p>
<p>Using these kinds of debug views I was able to see that for some resolutions, the starting position of a ray traced at half-res happened to lie exactly on the edge of a hiz cell. Since tracing the hiz structure relies on intersecting the current position of a ray with the boundary of the cell it lies in, we need to do a ray/plane intersection. As the numerator of (planes - pos.xy)/dir.xy got closer and closer to zero, the solutions for the intersection started to lose precision until they completely fell apart.</p>
<p>To tackle this problem we snap the origin of each traced rays to the center of a hiz cell:</p>
<pre><code>// Snap the ray origin away from the hiz cell boundary so the ray/plane
// intersection stays well-conditioned.
float2 cell_count_at_start = cell_count(HIZ_START_LEVEL);
float2 aligned_uv = floor(input.uv * cell_count_at_start)/cell_count_at_start + 0.25/cell_count_at_start;
</code></pre>
<p>Rays traced with and without snapping the starting position to the hiz cell center:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr-gif6.gif" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr-gif6.gif" alt="" style="max-width:100%;"></a></p>
<p>This looked better. However it didn't address all of the tracing artifacts we were seeing. The results were still plagued with lots of small pixels whose traced rays failed. When investigating these failing cases I noticed that rays would sometimes get stuck for no apparent reason in a cell along the way. It also occurred more frequently when rays travelled along the screen space axes (±1,0) or (0,±1). After drawing a bunch of ray diagrams on paper I realized that the cell intersection method proposed in GPU-Pro had a failing case! To ensure hiz cells are always crossed, the article offsets the intersection planes of a cell by a small amount. This is to ensure that the intersection point crosses the boundaries of the cell it's intersecting so that the trace continues to make progress.</p>
<p>While this works in most cases there is one scenario which results in a ray that will not cross over into the next hiz cell (see diagram below). When this happens the ray wastes the rest of its allocated trace iterations intersecting the same cell without ever crossing it. To address this we changed the proposed method slightly. Instead of offsetting the bounding planes, we choose the appropriate offset to add depending on which plane was intersected (horizontal or vertical). This ensures that we will always cross a cell when tracing:</p>
<pre><code>// Original GPU-Pro 5 approach: nudge the intersection planes by a small offset.
float2 cell_size = 1.0 / cell_count;
float2 planes = cell_id/cell_count + cell_size * cross_step + cross_offset;
float2 solutions = (planes - pos.xy)/dir.xy;
float3 intersection_pos = pos + dir * min(solutions.x, solutions.y);
return intersection_pos;
</code></pre>
<pre><code>// Modified approach: intersect the un-offset planes, then push the ray across
// whichever boundary (horizontal or vertical) was actually hit.
float2 cell_size = 1.0 / cell_count;
float2 planes = cell_id/cell_count + cell_size * cross_step;
float2 solutions = (planes - pos.xy)/dir.xy;
float3 intersection_pos = pos + dir * min(solutions.x, solutions.y);
intersection_pos.xy += (solutions.x < solutions.y) ? float2(cross_offset.x, 0.0) : float2(0.0, cross_offset.y);
return intersection_pos;
</code></pre>
<p><a href="https://github.com/greje656/Questions/blob/master/images/ssr19.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr19.jpg" alt="" style="max-width:100%;"></a></p>
<p>Incorrect VS correct cell crossing:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr-gif9.gif" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr-gif9.gif" alt="" style="max-width:100%;"></a></p>
<p>Final result:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr6.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr6.jpg" alt="" style="max-width:100%;"></a></p>
<h2><a id="user-content-ray-marching-towards-the-camera" class="anchor" href="#ray-marching-towards-the-camera" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Ray Marching Towards the Camera</h2>
<p>At the end of the GPU-Pro chapter there is a small mention that ray marching towards the camera with hiz tracing would require storing both the minimum and maximum depth value in the hiz structure (requiring the format to be bumped to R32G32F). However if you visualize the trace of a ray leaving the surface and travelling towards the camera (i.e. away from the depth buffer plane), you can simply account for that case and augment the algorithm described in GPU-Pro to navigate up and down the hierarchy until the ray finds the first hit with a hiz cell:</p>
<p><a href="https://github.com/greje656/Questions/blob/master/images/ssr-cam1.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr-cam1.jpg" alt="" style="max-width:100%;"></a></p>
<pre><code>if(v.z > 0) {
    // Ray is marching away from the camera (the case covered in GPU-Pro 5).
    float min_minus_ray = min_z - ray.z;
    tmp_ray = min_minus_ray > 0 ? ray + v_z*min_minus_ray : tmp_ray;
    float2 new_cell_id = cell(tmp_ray.xy, current_cell_count);
    if(crossed_cell_boundary(old_cell_id, new_cell_id)) {
        tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset);
        level = min(HIZ_MAX_LEVEL, level + 2.0f);
    }
} else if(ray.z < min_z) {
    // Ray is marching towards the camera and is still on the near side of the
    // cell's minimum depth plane: no hit in this cell, so cross into the next one.
    tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset);
    level = min(HIZ_MAX_LEVEL, level + 2.0f);
}
</code></pre>
<p>This has proven to be fairly solid and enabled us to trace a wider range of the screen space:</p>
<iframe width="700" height="394" src="https://www.youtube.com/embed/BjoMu-yI3k8" frameborder="0" allowfullscreen></iframe>
<h2><a id="user-content-ray-marching-behind-surfaces" class="anchor" href="#ray-marching-behind-surfaces" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Ray Marching Behind Surfaces</h2>
<p>Another alteration that can be made to the hiz tracing algorithm is to add support for rays to travel behind surfaces. Of course to do this you must give the hiz cells a thickness. So instead of tracing against extruded hiz cells you trace against "floating" hiz cells.</p>
<p><a href="https://github.com/greje656/Questions/blob/master/images/ssr23.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr23.jpg" alt="" style="max-width:100%;"></a></p>
<p>With that in mind we can tighten the tracing algorithm so that it cannot end the trace unless it finds a collision with one of these floating cells:</p>
<pre><code>if(level == HIZ_START_LEVEL && min_minus_ray > depth_threshold) {
    tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset);
    level = HIZ_START_LEVEL + 1;
}
</code></pre>
<p>Tracing behind surfaces disabled VS enabled:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr13.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr13.jpg" alt="" style="max-width:100%;"></a>
<a href="https://github.com/greje656/Questions/blob/master/images/ssr17.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr17.jpg" alt="" style="max-width:100%;"></a></p>
<p>Unfortunately this often means that traced rays travelling behind a surface degenerate into a linear search, and the cost can skyrocket for these pixels:</p>
<p>Number of iterations to complete the trace (black=0, red=64):
<a href="https://github.com/greje656/Questions/blob/master/images/ssr14.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr14.jpg" alt="" style="max-width:100%;"></a></p>
<h2><a id="user-content-the-problem-of-tracing-a-discrete-depth-buffer" class="anchor" href="#the-problem-of-tracing-a-discrete-depth-buffer" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>The Problem of Tracing a Discrete Depth Buffer</h2>
<p>For me the most difficult artifact to understand and deal with when implementing ssr is (by far) the implications of tracing a discrete depth buffer. Unless you can fully commit to the idea of tracing objects with infinite thickness, you will need to use some kind of depth threshold to mask a reflection if its intersection with the geometry is not valid. If you do use a depth threshold then you can (will?) end up getting artifacts like these:</p>
<iframe width="700" height="394" src="https://www.youtube.com/embed/ZftaDG2q3D0" frameborder="0" allowfullscreen></iframe>
<p>The problem, <em>as far as I understand it</em>, is that rays can oscillate between passing and failing the depth threshold test. It is essentially an amplified aliasing problem caused by the finite resolution of the depth buffer:
<a href="https://github.com/greje656/Questions/blob/master/images/ssr24.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr24.jpg" alt="" style="max-width:100%;"></a></p>
<p>I have experimented with adapting the depth threshold based on different properties of the intersection point (direction of the reflected ray, angle of incidence at the intersection, surface inclination at the intersection) but I have never been able to find a silver bullet (or anything that resembles a bullet to be honest). Perhaps a good approach could be to interpolate the depth value of neighboring cells <em>if</em> the neighbors belong to the same geometry? I think that <a href="https://twitter.com/ikarosav">Mikkel Svendsen</a> proposed a solution to this problem while presenting <a href="https://youtu.be/RdN06E6Xn9E?t=40m27s">Low Complexity, High Fidelity: The Rendering of "INSIDE"</a> but I have yet to wrap my head around the proposed solution and try it.</p>
<h2><a id="user-content-all-or-nothing" class="anchor" href="#all-or-nothing" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>All or Nothing</h2>
<p>Finally it's worth pointing out that hiz tracing is a very "all or nothing" way to find an intersection point. Neighboring rays that exhaust their maximum number of allowed iterations without finding an intersection can end up at very different screen space positions, which can cause a noticeable discontinuity in the ssr buffer:</p>
<p><a href="https://github.com/greje656/Questions/blob/master/images/ssr26.jpg" target="_blank"><img src="https://github.com/greje656/Questions/raw/master/images/ssr26.jpg" alt="" style="max-width:100%;"></a></p>
<p>This is something that can be very distracting and is made much worse when dealing with a jittered depth buffer combined with taa. This side-effect should be considered carefully when choosing a tracing solution for ssr.</p>
<h2><a id="user-content-code" class="anchor" href="#code" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewBox="0 0 16 16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a>Code</h2>
<pre><code>float2 cell(float2 ray, float2 cell_count, uint camera) {
    return floor(ray.xy * cell_count);
}

float2 cell_count(float level) {
    return input_texture2_size / (level == 0.0 ? 1.0 : exp2(level));
}

float3 intersect_cell_boundary(float3 pos, float3 dir, float2 cell_id, float2 cell_count, float2 cross_step, float2 cross_offset, uint camera) {
    float2 cell_size = 1.0 / cell_count;
    float2 planes = cell_id/cell_count + cell_size * cross_step;

    float2 solutions = (planes - pos.xy)/dir.xy;
    float3 intersection_pos = pos + dir * min(solutions.x, solutions.y);

    intersection_pos.xy += (solutions.x < solutions.y) ? float2(cross_offset.x, 0.0) : float2(0.0, cross_offset.y);

    return intersection_pos;
}

bool crossed_cell_boundary(float2 cell_id_one, float2 cell_id_two) {
    return (int)cell_id_one.x != (int)cell_id_two.x || (int)cell_id_one.y != (int)cell_id_two.y;
}

float minimum_depth_plane(float2 ray, float level, float2 cell_count, uint camera) {
    return input_texture2.Load(int3(vr_stereo_to_mono(ray.xy, camera) * cell_count, level)).r;
}

float3 hi_z_trace(float3 p, float3 v, in uint camera, out uint iterations) {
    float level = HIZ_START_LEVEL;
    float3 v_z = v/v.z;
    float2 hi_z_size = cell_count(level);
    float3 ray = p;

    float2 cross_step = float2(v.x >= 0.0 ? 1.0 : -1.0, v.y >= 0.0 ? 1.0 : -1.0);
    float2 cross_offset = cross_step * 0.00001;
    cross_step = saturate(cross_step);

    float2 ray_cell = cell(ray.xy, hi_z_size.xy, camera);
    ray = intersect_cell_boundary(ray, v, ray_cell, hi_z_size, cross_step, cross_offset, camera);

    iterations = 0;
    while(level >= HIZ_STOP_LEVEL && iterations < MAX_ITERATIONS) {
        // get the cell number of the current ray
        float2 current_cell_count = cell_count(level);
        float2 old_cell_id = cell(ray.xy, current_cell_count, camera);

        // get the minimum depth plane in which the current ray resides
        float min_z = minimum_depth_plane(ray.xy, level, current_cell_count, camera);

        // intersect only if ray depth is below the minimum depth plane
        float3 tmp_ray = ray;
        if(v.z > 0) {
            float min_minus_ray = min_z - ray.z;
            tmp_ray = min_minus_ray > 0 ? ray + v_z*min_minus_ray : tmp_ray;
            float2 new_cell_id = cell(tmp_ray.xy, current_cell_count, camera);
            if(crossed_cell_boundary(old_cell_id, new_cell_id)) {
                tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset, camera);
                level = min(HIZ_MAX_LEVEL, level + 2.0f);
            } else {
                if(level == 1 && abs(min_minus_ray) > 0.0001) {
                    tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset, camera);
                    level = 2;
                }
            }
        } else if(ray.z < min_z) {
            tmp_ray = intersect_cell_boundary(ray, v, old_cell_id, current_cell_count, cross_step, cross_offset, camera);
            level = min(HIZ_MAX_LEVEL, level + 2.0f);
        }

        ray.xyz = tmp_ray.xyz;
        --level;
        ++iterations;
    }
    return ray;
}
</code></pre>

Validating materials and lights in Stingray (2017-07-16)
<div dir="ltr" style="text-align: left;" trbidi="on">
<article class="markdown-body entry-content" itemprop="text">
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMDbNZB06JZKHgh7-3Ar_i7nwPljqluSXz5mkyJ_8mDGHwFjV5fDY_kw2h3ZS8HjVapDRWBEMx-j-aEgsOCXQ0XtAWsA6IRtnIzdOtaDaKv8nfUlPm33hqWD1-oPyxicWyWQk2RAcnJS6a/s1600/comp-01.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMDbNZB06JZKHgh7-3Ar_i7nwPljqluSXz5mkyJ_8mDGHwFjV5fDY_kw2h3ZS8HjVapDRWBEMx-j-aEgsOCXQ0XtAWsA6IRtnIzdOtaDaKv8nfUlPm33hqWD1-oPyxicWyWQk2RAcnJS6a/s640/comp-01.gif" width="640" height="311" data-original-width="1182" data-original-height="574" /></a></div>
<p>Stingray 1.9 is just around the corner and with it will come our new physical lights. I wanted to write a little bit about the validation process that we went through to increase our confidence in the behaviour of our materials and lights.</p>
<p>Early on we were quite set on building a small controlled "light room", similar to what the <a href="https://youtu.be/FQMbxzTUuSg?t=19m25s">Fox Engine team presented at GDC</a>, as a validation process. But while this seemed like a fantastic way to confirm the entire pipeline is giving plausible results, it felt like identifying the source of discontinuities when comparing photographs vs renders might involve a lot of guesswork. So we decided to delay the validation through a controlled light room and started thinking about comparing our results with a high quality offline renderer. Since <a href="https://www.solidangle.com/">SolidAngle</a> joined Autodesk last year and we had access to an <a href="https://www.solidangle.com/arnold/">Arnold</a> license server, it seemed like a good candidate. Note that the Arnold SDK is extremely easy to use and can be <a href="https://www.solidangle.com/arnold/download">downloaded</a> for free. If you don't have a license you still have access to all the features and the only limitation is that the rendered frames are watermarked.</p>
<p>We started writing a Stingray plugin that supported simple scene reflection into Arnold. We also implemented a custom Arnold Output Driver which allowed us to forward Arnold's linear data directly into the Stingray viewport, where it would be gamma corrected and tonemapped by Stingray (minimizing as many potential sources of error as possible).</p>
<h3>Material parameters mapping</h3>
<p>The trickiest part of the process was to find an Arnold material which we could validate against. When we started this work we used Arnold 4.3 and realized early that Arnold's <a href="https://support.solidangle.com/display/AFMUG/Standard">Standard shader</a> didn't map very well to the Metallic/Roughness model. We had more luck using the <a href="http://www.anderslanglands.com/alshaders/alSurface.html">alSurface shader</a> with the following mapping:</p>
<pre><code>// "alSurface"
// =====================================================================================
AiNodeSetRGB(surface_shader, "diffuseColor", color.x, color.y, color.z);
AiNodeSetInt(surface_shader, "specular1FresnelMode", 0);
AiNodeSetInt(surface_shader, "specular1Distribution", 1);
AiNodeSetFlt(surface_shader, "specular1Strength", 1.0f - metallic);
AiNodeSetRGB(surface_shader, "specular1Color", white.x, white.y, white.z);
AiNodeSetFlt(surface_shader, "specular1Roughness", roughness);
AiNodeSetFlt(surface_shader, "specular1Ior", 1.5f); // ior = (n-1)^2/(n+1)^2 for 0.04
AiNodeSetRGB(surface_shader, "specular1Reflectivity", white.x, white.y, white.z);
AiNodeSetRGB(surface_shader, "specular1EdgeTint", white.x, white.y, white.z);
AiNodeSetInt(surface_shader, "specular2FresnelMode", 1);
AiNodeSetInt(surface_shader, "specular2Distribution", 1);
AiNodeSetFlt(surface_shader, "specular2Strength", metallic);
AiNodeSetRGB(surface_shader, "specular2Color", color.x, color.y, color.z);
AiNodeSetFlt(surface_shader, "specular2Roughness", roughness);
AiNodeSetRGB(surface_shader, "specular2Reflectivity", white.x, white.y, white.z);
AiNodeSetRGB(surface_shader, "specular2EdgeTint", white.x, white.y, white.z);
</code></pre>
<p>Stingray VS Arnold: roughness = 0, metallicness = [0, 1]
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxk0awKUL9YaQFDLSMjcMIzaiMx98ZTDXKUSzK4fJ3orLXsCRdG3xrWf_zzHhVH1kMf3euHB7G4uhEAkHs-z9rTkNW2Zfa2P16CnPSiNYvmM6ZZxN5CE8yoB_qQIQNes_bVTtngUIuQ4SE/s1600/res1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxk0awKUL9YaQFDLSMjcMIzaiMx98ZTDXKUSzK4fJ3orLXsCRdG3xrWf_zzHhVH1kMf3euHB7G4uhEAkHs-z9rTkNW2Zfa2P16CnPSiNYvmM6ZZxN5CE8yoB_qQIQNes_bVTtngUIuQ4SE/s640/res1.jpg" width="640" height="179" data-original-width="1399" data-original-height="391" /></a></div>
<p>Stingray VS Arnold: metallicness = 1, roughness = [0, 1]
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEVHnzRviknia95JQjKD3ZQ9CRJR1pQ7iJbe-SH6CZ2Hw4e53eioKH7azMIaH1vyP-rok3Ks-Zqsf3nPD0wvpgqsV6qPMIj6FUnXZzbYwgmaZvGPL3qMr4LFLcfa02wlCuKieZ9GJrkA2p/s1600/res3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEVHnzRviknia95JQjKD3ZQ9CRJR1pQ7iJbe-SH6CZ2Hw4e53eioKH7azMIaH1vyP-rok3Ks-Zqsf3nPD0wvpgqsV6qPMIj6FUnXZzbYwgmaZvGPL3qMr4LFLcfa02wlCuKieZ9GJrkA2p/s640/res3.jpg" width="640" height="179" data-original-width="1399" data-original-height="391" /></a></div>
<p>Halfway through the validation process Arnold 5.0 got released and with it came the new <a href="https://support.solidangle.com/display/A5AFMUG/Standard+Surface">Standard Surface shader</a> which is based on a Metalness/Roughness workflow. This allowed for a much simpler mapping:</p>
<pre><code>// "aiStandardSurface"
// =====================================================================================
AiNodeSetFlt(standard_shader, "base", 1.f);
AiNodeSetRGB(standard_shader, "base_color", color.x, color.y, color.z);
AiNodeSetFlt(standard_shader, "diffuse_roughness", 0.f); // Use Lambert for diffuse
AiNodeSetFlt(standard_shader, "specular", 1.f);
AiNodeSetFlt(standard_shader, "specular_IOR", 1.5f); // ior = (n-1)^2/(n+1)^2 for 0.04
AiNodeSetRGB(standard_shader, "specular_color", 1, 1, 1);
AiNodeSetFlt(standard_shader, "specular_roughness", roughness);
AiNodeSetFlt(standard_shader, "metalness", metallic);
</code></pre>
<h3>Investigating material differences</h3>
<p>The first thing we noticed was an excess in reflection intensity for reflections with large incident angles. Arnold supports <a href="https://support.solidangle.com/display/A5AFMUG/Introduction+to+Light+Path+Expressions">Light Path Expressions</a> which made it very easy to compare and identify the term causing the differences. In this particular case we quickly identified that we had an energy conservation problem. Specifically, the contribution from the Fresnel reflections was not removed from the diffuse contribution:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhub5bnbhbnwViOsCT-6mI0iY_ag3DvMvAUFmSu0Mv8uibyNxlNVBY9tX4qFpsa5ds0ObeCvODwbnvkEgbQfjz6L-6P3cIdiMOHgOD5NusWLIYCI6j9w-nbiMakkPiMEK2vbKJMRE0XqwU6/s1600/fix1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhub5bnbhbnwViOsCT-6mI0iY_ag3DvMvAUFmSu0Mv8uibyNxlNVBY9tX4qFpsa5ds0ObeCvODwbnvkEgbQfjz6L-6P3cIdiMOHgOD5NusWLIYCI6j9w-nbiMakkPiMEK2vbKJMRE0XqwU6/s640/fix1.jpg" width="640" height="274" data-original-width="1399" data-original-height="599" /></a></div>
<p>Scenes with a lot of smooth reflective surfaces demonstrate the impact of this issue noticeably:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis3QYZA0GI0JeZ6CDn_5S2-f0acFQ2Y6oCNkdyzavikIEZLCZ3qJlKnKgSNWW4b7HA5xmyBw8m_8PNsIw1B3Em8HBY1IW9PyGLFWPmeGF6peupuL_nvf6jCFaAzqBLvG-qG2HymhFGTQJY/s1600/fixa.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis3QYZA0GI0JeZ6CDn_5S2-f0acFQ2Y6oCNkdyzavikIEZLCZ3qJlKnKgSNWW4b7HA5xmyBw8m_8PNsIw1B3Em8HBY1IW9PyGLFWPmeGF6peupuL_nvf6jCFaAzqBLvG-qG2HymhFGTQJY/s640/fixa.gif" width="640" height="256" data-original-width="1424" data-original-height="569" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVx94A6ljXPnBm60rnKwGA2cM8zBJGoOeBcFi86n1HBEtW2ndQwIr5hhb7exRkQZ5gsE5WUl02PCT4wcbKmUNMfLflOVq2zuF04u4BrivOyVAa3nkg51vS4ptcUWjBFO9Ir5RQh9QgwI8b/s1600/fixb.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVx94A6ljXPnBm60rnKwGA2cM8zBJGoOeBcFi86n1HBEtW2ndQwIr5hhb7exRkQZ5gsE5WUl02PCT4wcbKmUNMfLflOVq2zuF04u4BrivOyVAa3nkg51vS4ptcUWjBFO9Ir5RQh9QgwI8b/s640/fixb.gif" width="640" height="256" data-original-width="1424" data-original-height="569" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3OUpgMmb6ztH2l0hbjD0lDk3HQ14x_dKJj_Cw0lx5O7MMa2c9sBDgDdP9AVuqbeEzokMac1hn-N4x9AEW0CRQwxMODSIFlCRLKkiFXZT1j3iRdMy4Spb04Zt0_5salT8tJA5h1BUyoT4B/s1600/fixc.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3OUpgMmb6ztH2l0hbjD0lDk3HQ14x_dKJj_Cw0lx5O7MMa2c9sBDgDdP9AVuqbeEzokMac1hn-N4x9AEW0CRQwxMODSIFlCRLKkiFXZT1j3iRdMy4Spb04Zt0_5salT8tJA5h1BUyoT4B/s640/fixc.gif" width="640" height="256" data-original-width="1424" data-original-height="569" /></a></div>
<p>Another source of differences and confusion came from the tint of the Fresnel term for metallic surfaces. Different shaders I investigated had different behaviors. Some tinted the Fresnel term with the base color while others didn't:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR-YtfQGCg3qRCGGhofvDc20HgIr0S0CLON1cwEFvrDAZ2XAf3OBYo22K67wo5JrmndNutAauQhdHu-XPpBodItKSDG3SerNUp-GHryH7_dpBcBsWjk8Z_RIdAJ0Hi49YaZI9jYnYU8Ci2/s1600/metal3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR-YtfQGCg3qRCGGhofvDc20HgIr0S0CLON1cwEFvrDAZ2XAf3OBYo22K67wo5JrmndNutAauQhdHu-XPpBodItKSDG3SerNUp-GHryH7_dpBcBsWjk8Z_RIdAJ0Hi49YaZI9jYnYU8Ci2/s640/metal3.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<p>It wasn't clear to me how Fresnel's law of reflection applied to metals. I asked on Twitter what people's thoughts were on this and got this simple and elegant <a href="https://twitter.com/BrookeHodgman/status/884532159331028992">claim</a> made by Brooke Hodgman: <em>"Metalic reflections are coloured because their Fresnel is wavelength varying, but Fresnel still goes to 1 at 90deg for every wavelength"</em>. This convinced me instantly that the correct thing to do was indeed to use an un-tinted Fresnel contribution regardless of the metallicness of the material. I later found this <a href="https://en.wikipedia.org/wiki/Reflectance">graph</a> which also confirmed it:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA-TdFvg3pchwGuwZrohGYH9zoMz8b1lRMYYnyqmy23QKAXhl02zlXvA9W09DZcZvVHnO87pONbj76RoNy0aNFmhPRfIXQ7eHsOjmeFNzfVV-gNTn46cT84sUeHaOiJbYHQ6vG6jXj9V9q/s1600/reflectance.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA-TdFvg3pchwGuwZrohGYH9zoMz8b1lRMYYnyqmy23QKAXhl02zlXvA9W09DZcZvVHnO87pONbj76RoNy0aNFmhPRfIXQ7eHsOjmeFNzfVV-gNTn46cT84sUeHaOiJbYHQ6vG6jXj9V9q/s640/reflectance.jpg" width="640" height="451" data-original-width="1100" data-original-height="775" /></a></div>
<p>For the Fresnel term we use a pre-filtered Fresnel offset stored in a 2D LUT (as proposed by Brian Karis in <a href="http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_slides.pdf">Real Shading in Unreal Engine 4</a>). While results can diverge slightly from Arnold's Standard Surface shader (see "the effect of metalness" from Zap Andersson's <a href="https://www.dropbox.com/s/jt8dk65u14n2mi5/Physical%20Material%20-%20Whitepaper%20-%201.01.pdf?dl=0">Physical Material Whitepaper</a>), in most cases we get an edge tint that is pretty close:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6wDknyfbpj_9r_iQ3PxTJTaI9aqlZPEPe0pEfQQkspuMqtq6f58nJyFPB4tgkfNaFZ271DzMbPGL-S7Q_ppyhrIYShe3fanLZmNKL2lIHipbKFUSrSeDq3GV7N3ngA1ENbHemXhAqhyphenhyphenVk/s1600/metal4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6wDknyfbpj_9r_iQ3PxTJTaI9aqlZPEPe0pEfQQkspuMqtq6f58nJyFPB4tgkfNaFZ271DzMbPGL-S7Q_ppyhrIYShe3fanLZmNKL2lIHipbKFUSrSeDq3GV7N3ngA1ENbHemXhAqhyphenhyphenVk/s640/metal4.jpg" width="640" height="320" data-original-width="866" data-original-height="433" /></a></div>
</p>
<h3>Investigating light differences</h3>
<p>With the brdf validated we could start looking into validating our physical lights. Stingray currently supports point, spot, and directional lights (with more to come). The main problem we discovered with our lights is that the attenuation function we use is a bit awkward. Specifically, we attenuate by I/(d+1)^2 as opposed to I/d^2 (where 'I' is the intensity of the light source and 'd' is the distance to the light source from the shaded point). The main reason behind this decision is to manage the overflow that could occur in the light accumulation buffer. Adding the +1 effectively clamps the maximum intensity of the light to the intensity set for that light itself, i.e. as 'd' approaches zero the attenuated intensity approaches the intensity set for that light (as opposed to infinity). Unfortunately this decision also means we can't get physically <a href="https://www.desmos.com/calculator/jydb51epow">correct light falloffs</a> in a scene:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj12PrBd-vvCx_NRfbbCIJBo7FD_yzMM1bu_2_G8A7qNcnqs6g5Ei6JrWrNuctSCcldb97m6gNGkD48eYzL6sEXpCDSLex6xT_f8KwftpjpbvxIhLPkNVkhyphenhyphenSw0PTX5SJsgobGQeeHvxH9g/s1600/graph.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj12PrBd-vvCx_NRfbbCIJBo7FD_yzMM1bu_2_G8A7qNcnqs6g5Ei6JrWrNuctSCcldb97m6gNGkD48eYzL6sEXpCDSLex6xT_f8KwftpjpbvxIhLPkNVkhyphenhyphenSw0PTX5SJsgobGQeeHvxH9g/s640/graph.gif" width="640" height="440" data-original-width="1472" data-original-height="1013" /></a></div>
<p>Even if we scale the intensity of the light to match the intensity for a certain distance (say 1m) we still have a different falloff curve than the physically correct attenuation. It's not too bad in a game context, but in the architectural world this is a bigger issue:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhikoI7lTCWrPGdrnNYtykXIV-S5sezWlL_kH7P9pa0pOBn-MPjM7CZVpGBKa9V3TL_AXzDUdQwjMNEqX2P_mj6EbbNy0iHwQXRQ-lu1iaFn5cgf6mExorDeKl9CtQjTEXiAEoXFztXaAug/s1600/fix-int5.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhikoI7lTCWrPGdrnNYtykXIV-S5sezWlL_kH7P9pa0pOBn-MPjM7CZVpGBKa9V3TL_AXzDUdQwjMNEqX2P_mj6EbbNy0iHwQXRQ-lu1iaFn5cgf6mExorDeKl9CtQjTEXiAEoXFztXaAug/s640/fix-int5.gif" width="640" height="267" data-original-width="1600" data-original-height="667" /></a></div>
<p>This issue will be fixed in Stingray 1.10. Using I/(d+e)^2 (where 'e' is 1/max_value), along with an EV shift up and down while writing to and reading from the accumulation buffer as described by <a href="http://www.reedbeta.com/blog/artist-friendly-hdr-with-exposure-values/">Nathan Reed</a>, is a good step forward.</p>
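<p>To make the difference concrete, here is a sketch of the attenuation variants discussed above (plain Lua for illustration, not the engine's shader code):</p>
<pre><code>-- 'i' is the light intensity, 'd' the distance from the shaded point to the light.
local function attenuation_physical(i, d)
    return i / (d * d)                  -- correct inverse-square falloff
end
local function attenuation_stingray_1_9(i, d)
    return i / ((d + 1.0) * (d + 1.0))  -- clamped: approaches i as d -> 0
end
-- Stingray 1.10 direction: a much smaller bias 'e' (1/max_value of the
-- accumulation buffer), combined with an EV shift when writing/reading the buffer.
local function attenuation_stingray_1_10(i, d, e)
    return i / ((d + e) * (d + e))
end
</code></pre>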
<p>Finally, we were also able to validate that our ies profile parser/shader and our color temperatures behaved as expected:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGS-20uTgAQ8Fm8i8OfiACoPx8rZQyHvpzjx9ldgkULo018MmSzF6Du9jQKTvqTnValqzyHl07Mjd-heqP17eaMJbLzOTW4173N2fgrAac25-6QNAzZ1_f0yUrqaKA8n1E_jX8AxScgnyy/s1600/ies2.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGS-20uTgAQ8Fm8i8OfiACoPx8rZQyHvpzjx9ldgkULo018MmSzF6Du9jQKTvqTnValqzyHl07Mjd-heqP17eaMJbLzOTW4173N2fgrAac25-6QNAzZ1_f0yUrqaKA8n1E_jX8AxScgnyy/s640/ies2.gif" width="640" height="265" data-original-width="1500" data-original-height="620" /></a></div>
<p/>
<h3>Results and final thoughts</h3>
<p/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMDbNZB06JZKHgh7-3Ar_i7nwPljqluSXz5mkyJ_8mDGHwFjV5fDY_kw2h3ZS8HjVapDRWBEMx-j-aEgsOCXQ0XtAWsA6IRtnIzdOtaDaKv8nfUlPm33hqWD1-oPyxicWyWQk2RAcnJS6a/s1600/comp-01.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMDbNZB06JZKHgh7-3Ar_i7nwPljqluSXz5mkyJ_8mDGHwFjV5fDY_kw2h3ZS8HjVapDRWBEMx-j-aEgsOCXQ0XtAWsA6IRtnIzdOtaDaKv8nfUlPm33hqWD1-oPyxicWyWQk2RAcnJS6a/s640/comp-01.gif" width="640" height="311" data-original-width="1182" data-original-height="574" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-jFAPWgpvqsjUJ7g0cekha1sQ-UHh6e014X4deUuplLkMOocLvSCsXIMWehqmFTyR79hKXuu7GgaS_EZzScxWvrvgfVtEmuGZqqgrzu40uaV7CHi69qTViY3zwYhWBkxKK_467j6j2ZG5/s1600/comp-03.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-jFAPWgpvqsjUJ7g0cekha1sQ-UHh6e014X4deUuplLkMOocLvSCsXIMWehqmFTyR79hKXuu7GgaS_EZzScxWvrvgfVtEmuGZqqgrzu40uaV7CHi69qTViY3zwYhWBkxKK_467j6j2ZG5/s640/comp-03.gif" width="640" height="311" data-original-width="1184" data-original-height="575" /></a></div>
<p>Integrating a high quality offline renderer like Arnold has proven invaluable in the process of validating our lights in Stingray. A similar validation process could be applicable to many other aspects of our rendering pipeline (antialiasing, refractive materials, fur, hair, post-effects, volumetrics, etc.)</p>
<p>I also think that it can be a very powerful tool for content creators to build intuition on the impact of indirect lighting in a particular scene. For example in a simple level like this, adding a diffuse plane dramatically changes the lighting on the Buddha statue:</p>
<p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs-p4_wWd0X-ghcIcCTYrBEMCatirZqRRAOsE0S6kswZT57yE4lZfrVojjTPqAKDLRDKykNlulWm2V4JxjXp-Vt2NugPQsU88XoEGWxR-1Fl4A5NKKkzYUEh7YXF3czTPCvLb4wZudyHCz/s1600/diffuse.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs-p4_wWd0X-ghcIcCTYrBEMCatirZqRRAOsE0S6kswZT57yE4lZfrVojjTPqAKDLRDKykNlulWm2V4JxjXp-Vt2NugPQsU88XoEGWxR-1Fl4A5NKKkzYUEh7YXF3czTPCvLb4wZudyHCz/s640/diffuse.gif" width="640" height="424" data-original-width="843" data-original-height="558" /></a></div>
<p>The next step is now to compare our results with photographs gathered from a controlled environment. To be continued...</p>
</article>
<br /></div>
Physically Based Lens Flare (2017-07-03)
<div dir="ltr" style="text-align: left;" trbidi="on">
While playing Horizon Zero Dawn I was inspired by the lens flares it featured and decided to look into implementing some basic ones in Stingray. There were four types of flare I was particularly interested in.
<br/>
<ol>
<li>Anisomorphic flare</li>
<li>Aperture diffraction (Starbursts)</li>
<li>Camera ghosts due to Sun or Moon (High Quality - What this post will cover)</li>
<li>Camera ghosts due to all other light sources (Low Quality - Screen Space Effect)</li>
</ol>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidjDZ0J8kkhP1NCBGy_4QbBTGS0LJsOBBOqO7_qAplg43mb4cyEqRF5PpcuB1i_rD0IPwkXMuPyAxeTIZFkjL75CLcwtRrIkW513s5hXsD8nzdJn6ct8psJ9BpheeesRxbZIfnxEeDTC-B/s1600/intro.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidjDZ0J8kkhP1NCBGy_4QbBTGS0LJsOBBOqO7_qAplg43mb4cyEqRF5PpcuB1i_rD0IPwkXMuPyAxeTIZFkjL75CLcwtRrIkW513s5hXsD8nzdJn6ct8psJ9BpheeesRxbZIfnxEeDTC-B/s640/intro.jpg" width="640" height="265" data-original-width="1600" data-original-height="662" /></a>Image credits:
<a href="http://www.imdb.com/title/tt0095016/">Die Hard</a> (1), <a href="https://blog.lopau.com/uv-lens-and-night-photography/">Just Another Dang Blog</a> (2), <a href="https://www.pexels.com/photo/sunrise-sunset-lens-flare-6889/">PEXELS</a> (3), <a href="http://www.imdb.com/title/tt0234215/">The Matrix Reloaded</a> (4)
</div>
<br/>
Once finished I'll do a follow-up blog post on the Camera Lens Flare plugin, but for now I want to share the implementation details of the high-quality ghosts, which are an implementation of <a href="http://resources.mpi-inf.mpg.de/lensflareRendering">"Physically-Based Real-Time Lens Flare Rendering"</a>.
<br/><br/>
<h3>Code and Results</h3>
All the code used to generate the images and videos of this article can be found on <a href="https://github.com/greje656/PhysicallyBasedLensFlare">github.com/greje656/PhysicallyBasedLensFlare</a>.
<br/>
<br/><iframe width="685" height="343" src="https://www.youtube.com/embed/uMFu2EmPQw8" frameborder="0" allowfullscreen></iframe><br/>
<br/>
<h3>Ghosts</h3>
The basic idea of the "Physically-Based Lens Flare" paper is to ray trace "bundles" into a lens system which will end up on a sensor to form a ghost. A ghost here refers to the de-focused light that reaches the sensor of a camera due to the light reflecting off the lenses. Since a camera lens is not typically made of a single optical lens but of many lenses, there can be many ghosts that form on its sensor. If we only consider the ghosts that are formed from two bounces, that's a total of <a href="https://www.desmos.com/calculator/rsrjo1mhy1">nCr(n,2)</a> possible ghost <a href="https://en.wikipedia.org/wiki/Combination">combinations</a> (where n is the number of lens components in a camera lens).
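<br/><br/>
As a rough sketch (a hypothetical helper, not the plugin's actual code), enumerating the two-bounce ghosts just means picking every pair of reflective lens interfaces:
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Hypothetical sketch: every two-bounce ghost is a pair of interfaces (i, j) with j < i,
// giving nCr(n, 2) combinations for n reflective lens interfaces.
struct GhostList { int bounce1[1024]; int bounce2[1024]; int count; };

void enumerate_two_bounce_ghosts(int n, GhostList &ghosts)
{
    ghosts.count = 0;
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j) {
            ghosts.bounce1[ghosts.count] = i; // light reflects backwards off interface i...
            ghosts.bounce2[ghosts.count] = j; // ...then forwards off interface j, then on to the sensor
            ghosts.count++;
        }
}
</code></span></pre>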
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb5A9w-NjQcrKeb4sMtuO3-0wo6UWv817Ge4X1997eCsv_y6Wev5AimXNL0z2Zo9LMvFVjjBc9sr_5en_8fpe0mOXayB8ObhgSloS3AqiWHm03H7C7EHFsM78-o0k59VNrkft43DFANoHO/s1600/ghost04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb5A9w-NjQcrKeb4sMtuO3-0wo6UWv817Ge4X1997eCsv_y6Wev5AimXNL0z2Zo9LMvFVjjBc9sr_5en_8fpe0mOXayB8ObhgSloS3AqiWHm03H7C7EHFsM78-o0k59VNrkft43DFANoHO/s640/ghost04.jpg" width="640" height="210" data-original-width="1600" data-original-height="524" /></a></div>
<br/>
<h3>Lens Interface Description</h3>
Ok, let's get into it. To trace rays in an optical system we obviously need to build an optical system first. This part can be tedious. Not only do you have to find the "Lens Prescription" you are looking for, you also need to manually parse it. For example, parsing the Nikon 28-75mm patent data might look something like this:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjHtrDwICd2WOj_UrAIXCKX3qQqZy3KE45xdZehOKsopJfc78RfYUf1iqK4Md-gAFBf2w-h5NL0ttNcHInS2ZzLktImI3CJHPzIlIgOYFVvLxUKzURTa74gJjBfMaHiR0Vjbg0M-VSsaAQ/s1600/lens-description.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjHtrDwICd2WOj_UrAIXCKX3qQqZy3KE45xdZehOKsopJfc78RfYUf1iqK4Md-gAFBf2w-h5NL0ttNcHInS2ZzLktImI3CJHPzIlIgOYFVvLxUKzURTa74gJjBfMaHiR0Vjbg0M-VSsaAQ/s640/lens-description.jpg" width="640" height="562" data-original-width="1426" data-original-height="1252" /></a></div>
<br/>
There is no standard way of describing such systems. You may find all the information you need from a lens patent, but often (especially for older lenses) you end up staring at an old document that seems to be missing important information required for the algorithm. For example, the Russian lens MIR-1 apparently produces beautiful lens flare, but the only lens description I could find for it was this:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfDF1gelXKdAfPaEGBHlUfSMtor_XwbMaX29wQc2ViSbMHo4xRmSAHckD-6-ODxNAxt89iRuAcYjalDYW8dEdaj90R_1MuEsfJ_NVds6J3hB-ig49-IB2o_Z3K3sy6P1w58fLtVXZKlfGV/s1600/mir-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfDF1gelXKdAfPaEGBHlUfSMtor_XwbMaX29wQc2ViSbMHo4xRmSAHckD-6-ODxNAxt89iRuAcYjalDYW8dEdaj90R_1MuEsfJ_NVds6J3hB-ig49-IB2o_Z3K3sy6P1w58fLtVXZKlfGV/s640/mir-1.jpg" width="640" height="284" data-original-width="1040" data-original-height="462" /></a>MIP.1B manual<a href="http://allphotolenses.com/public/files/pdfs/ce6dd287abeae4f6a6716e27f0f82e41.pdf"></a></div>
<h3>Ray Tracing</h3>
Once you have parsed your lens description into something your trace algorithm can consume, you can then start to ray trace. The idea is to initialize a tessellated patch at the camera's light entry point and trace through each of the points in the direction of the incoming light. There are a couple of subtleties to note regarding the tracing algorithm.
<br/><br/>
First, when a ray misses a lens component the raytracing routine isn't necessarily stopped. Instead, if the ray can continue along a meaningful path, the trace continues until it reaches the sensor. Only if the ray misses the sphere formed by the radius of the lens do we break the raytracing routine. The idea behind this is to get as many traced points as possible to reach the sensor so that the interpolated data can remain as continuous as possible. Each ray tracks the maximum relative distance it had to a lens component while tracing through the interface. This relative distance will be used in the pixel shader later to determine if a ray left the interface.
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDSyTV6zGB7jq8d9qNoLp-7mkvaHzfKjiArcL_0FFTf_cUCmkjt-8ZEqeR_yf5H5hkZXCI3QSVao17XBYJ47TAhug8SSVJl69fsB_t8JMrqhO6HgoZosS3f2dI_-gdqwp9uLJUEcxB2Y9z/s1600/trace-05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDSyTV6zGB7jq8d9qNoLp-7mkvaHzfKjiArcL_0FFTf_cUCmkjt-8ZEqeR_yf5H5hkZXCI3QSVao17XBYJ47TAhug8SSVJl69fsB_t8JMrqhO6HgoZosS3f2dI_-gdqwp9uLJUEcxB2Y9z/s640/trace-05.jpg" width="640" height="216" data-original-width="1600" data-original-height="539" /></a> Relative distance visualized as green/orange gradient (black means ray missed lens component completely)</div>
<br/>
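Putting these notes together, a simplified sketch of the per-ray trace might look like this (the interface layout and the intersection/refraction helpers are hypothetical here, not the plugin's actual code):
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Hypothetical sketch: trace one ray of a bundle through the lens interfaces for the
// ghost defined by the two reflection indices (bounce1 > bounce2).
// 'interfaces' and NUM_INTERFACES describe the parsed lens prescription (assumed globals).
struct TraceResult {
    float3 sensor_pos;            // where the ray ends up on the sensor
    float max_relative_distance;  // max |hit height| / interface height seen along the path
};

TraceResult trace_ghost_ray(float3 pos, float3 dir, int bounce1, int bounce2)
{
    TraceResult result;
    result.max_relative_distance = 0.f;

    int i = 0;            // current interface
    int step = 1;         // +1 = towards the sensor, -1 = back towards the entrance
    int bounces_done = 0;

    while (i >= 0 && i < NUM_INTERFACES) {
        Intersection hit = intersect_interface(pos, dir, interfaces[i]);

        // Only a complete miss of the lens sphere kills the ray; otherwise keep going
        // and just remember how far outside the component it got.
        if (!hit.valid)
            break;

        float relative_distance = length(hit.pos.xy) / interfaces[i].height;
        result.max_relative_distance = max(result.max_relative_distance, relative_distance);

        bool reflect_here = (bounces_done == 0 && i == bounce1) ||
                            (bounces_done == 1 && i == bounce2);
        if (reflect_here) {
            dir = reflect(dir, hit.normal);
            step = -step;
            ++bounces_done;
        } else {
            dir = refract_or_continue(dir, hit.normal, interfaces[i]);
        }

        pos = hit.pos;
        i += step;
    }

    result.sensor_pos = pos;
    return result;
}
</code></span></pre>
<br/>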
Secondly, a ray bundle carries a fixed amount of energy so it is important to consider the distortion of the bundle area that occurs while tracing. In the paper, the author states:
<center><blockquote><i>"At each vertex, we store the average value of its surrounding neighbours. The regular grid of rays, combined with the transform feedback (or the stream-out) mechanism of modern graphics hardware, makes this lookup of neighbouring quad values very easy"</i></blockquote></center>
I don't understand how the transform feedback, along with the available adjacency information of the geometry shader, could be enough to provide the information of the four surrounding quads of a vertex (if you know please leave a comment). Luckily we now have compute and UAVs which turn this problem into a fairly trivial one. Currently I only calculate an approximation of the surrounding areas by assuming the neighbouring quads are roughly parallelograms: I estimate their bases and heights as the average lengths of their top/bottom and left/right segments (a rough sketch of this estimate follows after the image). The results are seen as caustics forming on the sensor, where some bundles converge into tighter area patches while others dilate:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaY-_KBSsn_RtPk6F_gazFGk7DzDb-zd3VMNI7G6IbfNHcS4Kfyoi1UNzizTepZPaooP7b278v7XK5mQZmfo2Sjt2Ap8pL_NtMvulttZF_700YyYPmK68X4DhQoQfjUBjq9k_U3bVTNDCV/s1600/lens-area.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaY-_KBSsn_RtPk6F_gazFGk7DzDb-zd3VMNI7G6IbfNHcS4Kfyoi1UNzizTepZPaooP7b278v7XK5mQZmfo2Sjt2Ap8pL_NtMvulttZF_700YyYPmK68X4DhQoQfjUBjq9k_U3bVTNDCV/s640/lens-area.jpg" width="640" height="213" data-original-width="1422" data-original-height="474" /></a></div>
<br/>
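A hypothetical compute sketch of that per-vertex area estimate (grid size, buffer names and layout are made up here) could look like this:
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Sketch: approximate the area of the quads surrounding each traced vertex by treating
// them as parallelograms. PATCH_SIZE is the assumed side of the traced grid (e.g. 32).
RWStructuredBuffer<float4> vertices : register(u0); // xy = traced position of each patch vertex
RWStructuredBuffer<float>  areas    : register(u1); // output: approximate area around each vertex

[numthreads(8, 8, 1)]
void estimate_beam_area(uint3 id : SV_DispatchThreadID)
{
    // clamp to the interior so edge vertices reuse their inner neighbours
    int x = clamp((int)id.x, 1, PATCH_SIZE - 2);
    int y = clamp((int)id.y, 1, PATCH_SIZE - 2);

    float2 c = vertices[y * PATCH_SIZE + x].xy;
    float2 l = vertices[y * PATCH_SIZE + (x - 1)].xy;
    float2 r = vertices[y * PATCH_SIZE + (x + 1)].xy;
    float2 u = vertices[(y - 1) * PATCH_SIZE + x].xy;
    float2 d = vertices[(y + 1) * PATCH_SIZE + x].xy;

    // base ~ average horizontal segment length, height ~ average vertical segment length
    float base_len   = 0.5f * (length(c - l) + length(r - c));
    float height_len = 0.5f * (length(c - u) + length(d - c));

    areas[id.y * PATCH_SIZE + id.x] = base_len * height_len;
}
</code></span></pre>
<br/>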
This works fairly well but is <a href="https://github.com/greje656/PhysicallyBasedLensFlare/blob/master/Lens/lens.hlsl#L198">expensive</a>, something I intend to improve in the future.
<br/><br/>
Now that we have a traced patch we need to make some sense out of it. The patch "as is" can look intimidating at first. Due to early exits of some rays the final vertices can sometimes look like something went terribly wrong. Here is a particularly distorted ghost:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgt5lkhJV66v8UQ4-DVAsvm02g59gld_mnZTKKi4TMam3qD3W20kRv06yzWPCXi8PXItt1aisqoMtuOcrJacrb4a93Uanp-7-9VNcdAnM1kqJ09VwOYeqeSfU4IksbkOXgg3jJ8FCeiWroq/s1600/discard03.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgt5lkhJV66v8UQ4-DVAsvm02g59gld_mnZTKKi4TMam3qD3W20kRv06yzWPCXi8PXItt1aisqoMtuOcrJacrb4a93Uanp-7-9VNcdAnM1kqJ09VwOYeqeSfU4IksbkOXgg3jJ8FCeiWroq/s640/discard03.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
The first thing to do is discard pixels that exited the lens system:
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">float intensity1 = max_relative_distance < 1.0f;
float intensity = intensity1;
if(intensity == 0.f) discard;
</code></span></pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRvqKfg69FTZ-7tlkLMqTjOSzmaRlpYCs8HC-pnCASt5bep6YMCTTu5jYM923WQY7avts2pHNgb4nRF1I00Q0asTdAsuz2cC4mAfroSXgWDHV5bIZUo5MtyVmiAbHxxGGyyCh-HFj2vtp_/s1600/discard04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRvqKfg69FTZ-7tlkLMqTjOSzmaRlpYCs8HC-pnCASt5bep6YMCTTu5jYM923WQY7avts2pHNgb4nRF1I00Q0asTdAsuz2cC4mAfroSXgWDHV5bIZUo5MtyVmiAbHxxGGyyCh-HFj2vtp_/s640/discard04.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
Then we can discard the rays that didn't have any energy as they entered to begin with (say outside the sun disk):
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">float lens_distance = length(entry_coordinates.xy);
float sun_disk = 1 - saturate((lens_distance - 1.f + fade)/fade);
sun_disk = smoothstep(0, 1, sun_disk);
//...
float intensity2 = sun_disk;
float intensity = intensity1 * intensity2;
if(intensity == 0.f) discard;
</code></span></pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmiZcFscDFsYT2Bw0go5qmP-MjavtZPvghw_CI5INErWAh3BTM8-jY6yGYY1ouz8hAAmReO2cBoHqIj5XGkaLS-sS7TYhrH9HSKaNpEFmOJag9Gli5x0Dks1qnlkc1Ent8944L2jw6vYIS/s1600/discard05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmiZcFscDFsYT2Bw0go5qmP-MjavtZPvghw_CI5INErWAh3BTM8-jY6yGYY1ouz8hAAmReO2cBoHqIj5XGkaLS-sS7TYhrH9HSKaNpEFmOJag9Gli5x0Dks1qnlkc1Ent8944L2jw6vYIS/s640/discard05.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
Then we can discard the rays that were blocked by the aperture:
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">//...
float intensity3 = aperture_sample;
float intensity = intensity1 * intensity2 * intensity3;
if(intensity == 0.f) discard;
</code></span></pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC26KoIsXCXjCa6k48Tg1fTHXPYHuNEXFir3tL1DDI1DxBY0zMD_lAGwSAlqKDqyEu85wAGS8WMQTSgv1UyELyZAHzkofB_dyXz5G-2TAprhsktuGY5F01IFoHzar_9y6t7wwHDZ_hR4WJ/s1600/discard06.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC26KoIsXCXjCa6k48Tg1fTHXPYHuNEXFir3tL1DDI1DxBY0zMD_lAGwSAlqKDqyEu85wAGS8WMQTSgv1UyELyZAHzkofB_dyXz5G-2TAprhsktuGY5F01IFoHzar_9y6t7wwHDZ_hR4WJ/s640/discard06.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
Finally we adjust the radiance of the beams based on their final areas:
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">//...
float intensity4 = (original_area/(new_area + eps)) * energy;
float intensity = intensity1 * intensity2 * intensity3 * intensity4;
if(intensity == 0.f) discard;
</code></span></pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNa0uHxkWeaecrmcB5YlvAC7MGlWnEwTB3B7yU5wmI-YRxJRcgyrSnpUQION42SG8ZCsCLSuolKsZiSIbwY62PnOwHCkNsLuekIkHhL5tVjCqXhw27C5Vsu6FdAaRdAngkEswe_gmOVuAV/s1600/discard07.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNa0uHxkWeaecrmcB5YlvAC7MGlWnEwTB3B7yU5wmI-YRxJRcgyrSnpUQION42SG8ZCsCLSuolKsZiSIbwY62PnOwHCkNsLuekIkHhL5tVjCqXhw27C5Vsu6FdAaRdAngkEswe_gmOVuAV/s640/discard07.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
The final value is the rgb reflectance value of the ghost modulated by the incoming light color:
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">float3 color = intensity * reflectance.xyz * TempToColor(INCOMING_LIGHT_TEMP);</code></span></pre>
<h3>Aperture</h3>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8rKKzCb07w4MyoIue4WqS-5Z2EWu_QEgJGR6NQ61DZoQ7pv_CmCupvPFRz7ozt5tK-htcqSNoPCxOonhrHjL_AqxF1GGOrygZ3Xo8j83YRl-VnOd74fFRZgKDp6cCPEE4Ki4q9mni_qLd/s1600/apertures4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8rKKzCb07w4MyoIue4WqS-5Z2EWu_QEgJGR6NQ61DZoQ7pv_CmCupvPFRz7ozt5tK-htcqSNoPCxOonhrHjL_AqxF1GGOrygZ3Xo8j83YRl-VnOd74fFRZgKDp6cCPEE4Ki4q9mni_qLd/s640/apertures4.jpg" width="640" height="164" data-original-width="1600" data-original-height="411" /></a>Image credits: <a href="http://6iee.com/755819.html">6iee</a></div>
<br/>
The aperture shape is built procedurally. As suggested by <a href="https://placeholderart.wordpress.com/2015/01/19/implementation-notes-physically-based-lens-flares/">Padraic Hennessy's blog</a> I use a signed distance field confined by "n" segments and threshold it against some distance value. I also experimented with approximating the light diffraction that occurs at the edge of the aperture blades using a <a href="https://www.desmos.com/calculator/munv7q2ez3">simple function</a>:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYh-8Dg8w_uHnJrhyphenhyphen_5B6FeiRCcL_l8ooKweIjh6kUKVek-rTCrvZfZw5IKOpG2DrY2RC22Uu7OsDrNWvTApgxNZQE1t-jMyWXdO63H0B-hJAiVQV76NokanmlyHrsJNE7S3MUlaYE7NVv/s1600/apertures1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYh-8Dg8w_uHnJrhyphenhyphen_5B6FeiRCcL_l8ooKweIjh6kUKVek-rTCrvZfZw5IKOpG2DrY2RC22Uu7OsDrNWvTApgxNZQE1t-jMyWXdO63H0B-hJAiVQV76NokanmlyHrsJNE7S3MUlaYE7NVv/s640/apertures1.jpg" width="640" height="213" data-original-width="1533" data-original-height="511" /></a></div>
<br/>
Finally, I offset the signed distance field with a repeating sine function, which can give curved aperture blades (a rough sketch of the full mask follows after the image):
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU-0YNsysJPv38_WWOEx2DnYhyphenhyphenlFYuHFOnkFJhQtzpUGZUtIU9QWMdTpBmAUy62RWl8M_sQPVoSskxYo-ltv1Oj1ifxGexM2xwUfZiuZ3QX-1Y0Pufb8_TYpCANDtZtlB6HbrXB4ujDT_l/s1600/apertures2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU-0YNsysJPv38_WWOEx2DnYhyphenhyphenlFYuHFOnkFJhQtzpUGZUtIU9QWMdTpBmAUy62RWl8M_sQPVoSskxYo-ltv1Oj1ifxGexM2xwUfZiuZ3QX-1Y0Pufb8_TYpCANDtZtlB6HbrXB4ujDT_l/s640/apertures2.jpg" width="640" height="213" data-original-width="1533" data-original-height="511" /></a></div>
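<br/>
As a rough sketch of the idea (with made-up parameter names and a crude stand-in for the diffraction term, not the plugin's actual code), the whole mask could be built something like this:
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Hypothetical sketch of a procedural n-blade aperture mask built from a signed distance field.
float aperture_mask(float2 uv, int num_blades, float radius, float curve_amount, float smoothness)
{
    const float PI = 3.14159265f;
    float sdf = -1e6;
    for (int i = 0; i < num_blades; ++i) {
        float angle = (i + 0.5f) * (2.f * PI / num_blades);
        float2 blade_normal = float2(cos(angle), sin(angle));

        // signed distance to the half-plane formed by this blade edge
        float d = dot(uv, blade_normal) - radius;

        // offset the edge with a repeating sine along the blade to fake curved blades
        float along = dot(uv, float2(-blade_normal.y, blade_normal.x));
        d += curve_amount * sin(along * PI / radius);

        sdf = max(sdf, d); // intersection of all blade half-planes = polygonal aperture
    }
    // threshold the field; the soft edge is only a very crude stand-in for blade-edge diffraction
    return 1.f - smoothstep(-smoothness, smoothness, sdf);
}
</code></span></pre>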
<br/>
<h3>Starburst</h3>
The starburst phenomenon is due to the diffraction of light that passes through the small aperture hole, a phenomenon known as the "single slit diffraction of light". The author got really convincing results simulating this using the Fraunhofer approximation. The challenge with this approach is that it requires bringing the aperture texture into Fourier space, which is not trivial. In previous projects I used CUDA's math library to perform the FFT of a signal, but since the goal is to bring this into Stingray I didn't want to have such a dependency. Luckily I found this little gem posted by <a href="https://software.intel.com/en-us/articles/fast-fourier-transform-for-image-processing-in-directx-11">Joseph S. from Intel</a>. He provides a clean and elegant compute implementation of the butterfly passes method which brings a signal to and from Fourier space. Using it I can feed in the aperture shape and extract the Fourier Power Spectrum:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtT1kR7Gbb0aSrNtXGdYRh5XzmOXE_rOJgSeFtN-yMlky-1H0noMCYyeAoj_Cmm5oqIZTQ2Z0DVmarh2T8yyeMddNE_FPsS8eujo7zX1ogsliwV0rPDSRMB6eNdLpGZk0G4te9PrcB7GPO/s1600/starburst04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtT1kR7Gbb0aSrNtXGdYRh5XzmOXE_rOJgSeFtN-yMlky-1H0noMCYyeAoj_Cmm5oqIZTQ2Z0DVmarh2T8yyeMddNE_FPsS8eujo7zX1ogsliwV0rPDSRMB6eNdLpGZk0G4te9PrcB7GPO/s640/starburst04.jpg" width="640" height="320" data-original-width="1020" data-original-height="510" /></a></div>
<br/>
This spectrum needs to be filtered further in order to look like a starburst. This is where the Fraunhofer approximation comes in. The idea is to basically reconstruct the diffraction of white light by summing up the diffraction of multiple wavelengths. The key observation is that the same Fourier signal can be used for all wavelengths. The only thing needed is to scale the sampling coordinates of the Fourier power spectrum:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;">(x0,y0) = (u,v)·λ·z0 for λ = 350nm/435nm/525nm/700nm<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyUAixiq8H__9iB2r31me3eZ6Lid92lcMC0-KXXW6myZYUe871bu9EVAvTLR_cPe-5ZiVsjwdWk2Cryk9PzygPrvFPYUlMieaG2ybJlmJeBlsKsrdm_Gwaalo1DL31icf0jm0niGMWzNUy/s1600/starburst05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyUAixiq8H__9iB2r31me3eZ6Lid92lcMC0-KXXW6myZYUe871bu9EVAvTLR_cPe-5ZiVsjwdWk2Cryk9PzygPrvFPYUlMieaG2ybJlmJeBlsKsrdm_Gwaalo1DL31icf0jm0niGMWzNUy/s640/starburst05.jpg" width="640" height="214" data-original-width="1600" data-original-height="534" /></a></div>
<br/>
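In shader terms, the scaling and the sum over wavelengths might look roughly like this (a hedged sketch; the texture, sampler and the wavelength_to_rgb helper are assumptions, not the plugin's actual code):
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Sketch: sample the single FFT power spectrum at wavelength-scaled coordinates and
// sum the contributions to approximate the diffraction of white light.
static const float wavelengths[4] = { 350e-9f, 435e-9f, 525e-9f, 700e-9f };

float3 starburst(float2 uv, Texture2D power_spectrum, SamplerState linear_sampler, float z0)
{
    float3 result = 0;
    for (int i = 0; i < 4; ++i) {
        float lambda = wavelengths[i];
        // invert (x0,y0) = (u,v)·λ·z0: for screen position (x0,y0), sample the spectrum at (u,v) = (x0,y0)/(λ·z0)
        float2 spectrum_uv = (uv - 0.5f) / (lambda * z0) + 0.5f;
        float  power = power_spectrum.SampleLevel(linear_sampler, spectrum_uv, 0).r;
        result += power * wavelength_to_rgb(lambda); // hypothetical helper mapping a wavelength to a colour
    }
    return result / 4.f;
}
</code></span></pre>
<br/>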
Summing up the wavelengths gives the starburst image. To get more interesting results I apply an extra filtering step. I use a spiral pattern mixed with a small rotation to get rid of any leftover radial ringing artifacts (judging by the author's starburst results I suspect this is a step they are also doing):
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz3nfrpFzM61ZRyOS0JJMUR5wXLcwE1889S3Ql3QjU8DQrxa1tWjFLSCASLN1RgmNSrxXQF0U5zDz34EFDU3jfH6wzyvghwizoLBdIp-Riy7CdZ-GnJa6MVoHkp6-WEAIr7KOEq8vBvzDF/s1600/starburst06.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz3nfrpFzM61ZRyOS0JJMUR5wXLcwE1889S3Ql3QjU8DQrxa1tWjFLSCASLN1RgmNSrxXQF0U5zDz34EFDU3jfH6wzyvghwizoLBdIp-Riy7CdZ-GnJa6MVoHkp6-WEAIr7KOEq8vBvzDF/s640/starburst06.jpg" width="640" height="320" data-original-width="1600" data-original-height="800" /></a></div>
<br/>
<h3>Anti Reflection Coating</h3>
While some appreciate the artistic aspect of lens flare, lens manufacturers work hard to minimize it by coating lenses with anti-reflection coatings. The coating applied to each lens is usually designed to minimize the reflection of a specific wavelength. Coatings are defined by their thickness and index of refraction. Given the wavelength to minimize reflections for, and the IORs of the two media involved in the reflection (say n0 and n2), the ideal IOR (n1) and thickness (d) of the coating are defined as n1 = sqrt(n0·n2) and d = λ/(4·n1). This is known as a quarter wavelength anti-reflection coating. I've found <a href="http://www.pveducation.org/pvcdrom/anti-reflection-coatings">this site</a> very helpful to understand this phenomenon.
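<br/><br/>
In code, computing the ideal quarter-wave coating for a given boundary is just those two formulas (a small sketch, with hypothetical struct and parameter names):
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Sketch: ideal quarter-wave anti-reflection coating for the boundary between media n0 and n2.
struct Coating { float ior; float thickness; };

Coating quarter_wave_coating(float n0, float n2, float lambda)
{
    Coating c;
    c.ior = sqrt(n0 * n2);                // n1 = sqrt(n0 * n2)
    c.thickness = lambda / (4.f * c.ior); // physical thickness of a quarter-wavelength optical layer
    return c;
}
</code></span></pre>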
<br/><br/>
In the current implementation each lens coating specifies a wavelength the coating should be optimized for, and the ideal thickness and IOR are used by default. I added a controllable offset to thicken the AR coating layer in order to conveniently reduce its anti-reflection properties:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;">No AR Coating:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq0g2klLXjLjH4lN1oaMEqxs1fbJIyk08fHW6IsuzKJ_pdA6AjDfFCSn-4mjVoYGutcDKcbjeO75q4IMe48_o3D4ioQmvAGk9t5bK8Md5d6KWfwxv5ClXWSHXRl5A23Z9x_3rt1gNFlV8o/s1600/arc01.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq0g2klLXjLjH4lN1oaMEqxs1fbJIyk08fHW6IsuzKJ_pdA6AjDfFCSn-4mjVoYGutcDKcbjeO75q4IMe48_o3D4ioQmvAGk9t5bK8Md5d6KWfwxv5ClXWSHXRl5A23Z9x_3rt1gNFlV8o/s640/arc01.jpg" width="640" height="216" data-original-width="1600" data-original-height="539" /></a></div>
<br/>
<div class="separator" style="clear: both; text-align: center;">Ideal AR Coating:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMYjcHSmVdhyphenhyphen-1N8Y1umhwUAd9fTz1xrZLORwz_iIht2iRq7RF4dwjwDxUj7Gmjjk1NdMA0RbRhAB04hETjgLHpleuxFCbkxGomzrngrpLFaSlokbaFrghu3mSWtAW42EvJD-KnL7Olq9/s1600/arc02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMYjcHSmVdhyphenhyphen-1N8Y1umhwUAd9fTz1xrZLORwz_iIht2iRq7RF4dwjwDxUj7Gmjjk1NdMA0RbRhAB04hETjgLHpleuxFCbkxGomzrngrpLFaSlokbaFrghu3mSWtAW42EvJD-KnL7Olq9/s640/arc02.jpg" width="640" height="216" data-original-width="1600" data-original-height="539" /></a></div>
<br/>
<div class="separator" style="clear: both; text-align: center;">AR Coating with offsetted thickness:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiU3Y2z9qZhuj-WLsI2PVjSQ7CMKTWasC-V0uBWuOQdY0k2mVkoUDFl4IK9UZ7vXeMTi1PH0qcGKGAVuft1OHvA-r9WDWkcJMcifjlJEfF3CFcBN9DcFucpY9NIT7fCkWSSu7nxnlMhB42/s1600/arc03.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiU3Y2z9qZhuj-WLsI2PVjSQ7CMKTWasC-V0uBWuOQdY0k2mVkoUDFl4IK9UZ7vXeMTi1PH0qcGKGAVuft1OHvA-r9WDWkcJMcifjlJEfF3CFcBN9DcFucpY9NIT7fCkWSSu7nxnlMhB42/s640/arc03.jpg" width="640" height="216" data-original-width="1600" data-original-height="539" /></a></div>
<br/>
<h3>Optimisations</h3>
Currently the full cost of the effect for a Nikon 28-75mm lens is 12ms (3ms to ray march 352x32x32 points and 9ms to draw the 352 patches). The performance degrades as the sun disk is made bigger since it results in more and more overshading during the rasterisation of each ghost. With a simpler lens interface like the 1955 Angenieux the cost decreases significantly. In the current implementation every possible "two bounce ghost" is traced and drawn. For a lens system like the Nikon 28-75mm, which has 27 lens components, that's n!/(r!(n-r)!) = 352 ghosts. It's easy to see that this number can <a href="https://www.desmos.com/calculator/rsrjo1mhy1">increase dramatically</a> with the number of components.
<br/><br/>
An obvious optimization would be to skip ghosts that have intensities so low that their contributions are imperceptible. Using Compute/DrawIndirect it would be fairly simple to first run a coarse pass and use it to cull non-contributing ghosts. This would reduce the compute and rasterization pressure on the GPU dramatically, something I intend to do in the future.
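<br/><br/>
Such a culling pass could look roughly like the sketch below (hypothetical buffers and threshold; the coarse intensity estimate itself is left out):
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Sketch: keep only the ghosts whose coarse intensity estimate is above a threshold.
// The resulting list of visible ghost indices could then drive indirect draw submission.
StructuredBuffer<float>      coarse_intensity : register(t0); // one estimate per ghost
AppendStructuredBuffer<uint> visible_ghosts   : register(u0); // indices of ghosts worth drawing

[numthreads(64, 1, 1)]
void cull_ghosts(uint3 id : SV_DispatchThreadID)
{
    const float intensity_threshold = 1e-4f; // tweakable cutoff
    if (id.x < NUM_GHOSTS && coarse_intensity[id.x] > intensity_threshold)
        visible_ghosts.Append(id.x);
}
</code></span></pre>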
<br/><br/>
<h3>Conclusion</h3>
I'm not sure if this approach was ever used in a game. It would probably be hard to justify its heavy cost. I feel this would have a better use case in the context of pre-visualization, where a director might be interested in having early feedback on how a certain lens might behave in a shot.
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Y4usYUd5q-jEUu2akLLyb9hQ6yQy6bZCMLdpAgZqJnjzQPZi8UTM7OYHVRSZfHqT7lQuJQXSW8YeurLFUzSQJ0lrkIIdwH80iZgOA7YrLX2a67wVnfIUMQdQ3N1kfZj2W-pcVL270kuP/s1600/example03.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Y4usYUd5q-jEUu2akLLyb9hQ6yQy6bZCMLdpAgZqJnjzQPZi8UTM7OYHVRSZfHqT7lQuJQXSW8YeurLFUzSQJ0lrkIIdwH80iZgOA7YrLX2a67wVnfIUMQdQ3N1kfZj2W-pcVL270kuP/s640/example03.jpg" width="640" height="427" data-original-width="1600" data-original-height="1067" /></a>Image credits: <a href="http://wallup.net/car-sunset-lexus/">Wallup</a></div>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6b9C9_AjGUTyxPIppSErjavkAGjNa6QXhvrIgBwoMOhXzoo8XlyDMmVT21gcHni4o67mQSRmqpF6DZWolklH5NatrcGUvl4bkZWAF2OByvpTvFcIxgWFT1g2HAYCnWE6AsFBIMe6YCILU/s1600/example02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6b9C9_AjGUTyxPIppSErjavkAGjNa6QXhvrIgBwoMOhXzoo8XlyDMmVT21gcHni4o67mQSRmqpF6DZWolklH5NatrcGUvl4bkZWAF2OByvpTvFcIxgWFT1g2HAYCnWE6AsFBIMe6YCILU/s640/example02.jpg" width="640" height="427" data-original-width="1600" data-original-height="1067" /></a>Image credits: <a href="http://www.wallpapers-web.com/sunset-field-wallpapers/5485554.html/">Wallpapers Web</a></div>
<br/>
Finally, be aware that the author has filed a patent for the algorithm described in his paper, which may put limits on how you may use parts of what is described in my post. Please contact the <a href="http://resources.mpi-inf.mpg.de/lensflareRendering/">paper's author</a> for more information on what restrictions might be in place.
</div>
Jphttp://www.blogger.com/profile/09637484103636420407noreply@blogger.com311tag:blogger.com,1999:blog-1994130783874175266.post-29845867847897517772017-06-22T19:33:00.003+02:002017-06-27T15:30:41.570+02:00Reprojecting Reflections<div dir="ltr" style="text-align: left;" trbidi="on">
Screen space reflections are such a pain. When combined with TAA they are even harder to manage. Raytracing against a jittered depth/normal g-buffer can easily cause reflection rays to have widely different intersection points from frame to frame. When using neighborhood clamping, it can become difficult to handle the flickering caused by too much clipping, especially for surfaces that have normal maps with high frequency patterns in them.
<br/><br/>
On top of this, reflections are very hard to reproject. Since they are view dependent, simply fetching the motion vector at the current pixel tends to make the reprojection "smudge" under camera motion. Here's a small video grab that I did while playing Uncharted 4 (notice how the reflections trail under camera motion):
<br/>
<br/><iframe width="685" height="374" src="https://www.youtube.com/embed/wBO8GX-R4R4" frameborder="0" allowfullscreen></iframe><br/>
<br/>
Last year I spent some time trying to understand this problem a little bit more. I first drew a ray diagram describing how a reflection could be reprojected in theory. Consider the goal of reprojecting the reflection that occurs at incidence point v0 (see diagram below); to reproject the reflection which occurred at that point you would need to:
<br/>
<ol>
<li>Retrieve the surface motion vector (ms) corresponding to the reflection incidence point (v0)</li>
<li>Reproject the incidence point using (ms)</li>
<li>Using the depth buffer history, reconstruct the reflection incidence point (v1)</li>
<li>Retrieve the motion vector (mr) corresponding to the reflected point (p0)</li>
<li>Reproject the reflection point using (mr)</li>
<li>Using the depth buffer history, reconstruct the previous reflection point (p1)</li>
<li>Using the previous view matrix transform, reconstruct the previous surface normal of the incidence point (n1)</li>
<li>Project the camera position (deye) and the reconstructed reflection point (dp1) onto the previous plane (defined by surface normal = n1, and surface point = v1)</li>
<li>Solve for the position of the previous reflection point (r) knowing (deye) and (dp1)</li>
<li>Finally, using the previous view-projection matrix, evaluate (r) in the previous reflection buffer</li>
</ol>
<br/><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWztBN_03f5FOufnU6XXXnsbbZ76DQ7tA581iQQeN3ViAuAuQMjGbZQhslcSEp0CSOeTDaRiM3Q0kfw9iKYGmIeSp0FJ_Qi58F3uVTLJE1LQ41fmaZlIqCBtdj_-jyxLoCW630C8aNlmc/s1600/diagram.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWztBN_03f5FOufnU6XXXnsbbZ76DQ7tA581iQQeN3ViAuAuQMjGbZQhslcSEp0CSOeTDaRiM3Q0kfw9iKYGmIeSp0FJ_Qi58F3uVTLJE1LQ41fmaZlIqCBtdj_-jyxLoCW630C8aNlmc/s640/diagram.jpg" width="640" height="349" data-original-width="1200" data-original-height="655" /></a></div>
<br/>
By adding to Stingray a history depth buffer and using the previous view-projection matrix I was able to confirm this approach could successfully reproject reflections.
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">float3 proj_point_in_plane(float3 p, float3 v0, float3 n, out float d) {
    d = dot(n, p - v0);
    return p - (n * d);
}

float3 find_reflection_incident_point(float3 p0, float3 p1, float3 v0, float3 n) {
    float d0 = 0;
    float d1 = 0;
    float3 proj_p0 = proj_point_in_plane(p0, v0, n, d0);
    float3 proj_p1 = proj_point_in_plane(p1, v0, n, d1);

    if(d1 < d0)
        return (proj_p0 - proj_p1) * d1/(d0+d1) + proj_p1;
    else
        return (proj_p1 - proj_p0) * d0/(d0+d1) + proj_p0;
}

float2 find_previous_reflection_position(
    float3 ss_pos, float3 ss_ray,
    float2 surface_motion_vector, float2 reflection_motion_vector,
    float3 world_normal) {

    float3 ss_p0 = 0;
    ss_p0.xy = ss_pos.xy - surface_motion_vector;
    ss_p0.z = TEX2D(input_texture5, ss_p0.xy).r;

    float3 ss_p1 = 0;
    ss_p1.xy = ss_ray.xy - reflection_motion_vector;
    ss_p1.z = TEX2D(input_texture5, ss_p1.xy).r;

    float3 view_n = normalize(world_to_prev_view(world_normal, 0));
    float3 view_p0 = float3(0,0,0);
    float3 view_v0 = ss_to_view(ss_p0, 1);
    float3 view_p1 = ss_to_view(ss_p1, 1);

    float3 view_intersection =
        find_reflection_incident_point(view_p0, view_p1, view_v0, view_n);

    float3 ss_intersection = view_to_ss(view_intersection, 1);
    return ss_intersection.xy;
}
</code></span></pre>
<br />
You can see in these videos that most of the reprojection distortion in the reflections is addressed:
<br/>
<br/><iframe width="685" height="374" src="https://www.youtube.com/embed/D7eFSL_Q6j8" frameborder="0" allowfullscreen></iframe><br/>
<br/><iframe width="685" height="374" src="https://www.youtube.com/embed/bvGtX0pMEeI" frameborder="0" allowfullscreen></iframe><br/>
<br/>
Ghosting was definitely minimized under camera motion. The video below compares the two reprojection methods side by side.
<br/>
<br/>
LEFT: Simple Reprojection, RIGHT: Correct Reprojection
<br/>(note that I disabled neighborhood clamping in this video to visualize the reprojection better)
<iframe width="685" height="374" src="https://www.youtube.com/embed/XvELB4NnLIk" frameborder="0" allowfullscreen></iframe><br/>
<br/>
So instead I tried a different approach. The new idea was to pick a few reprojection vectors that are likely to be meaningful in the context of a reflection. Originally I looked into:
<ul>
<li>Motion vector at ray incidence</li>
<li>Motion vector at ray intersection</li>
<li>Parallax corrected motion vector at ray incidence</li>
<li>Parallax corrected motion vector at ray intersection</li>
</ul>
<br/>
The idea of doing parallax correction on motion vectors for reflections came from the <a href="https://www.ea.com/frostbite/news/stochastic-screen-space-reflections/">Stochastic Screen-Space Reflections</a> talk presented by Tomasz Stachowiak at Siggraph 2015. Here's how it's currently implemented, although I'm not 100% sure it's as correct as it could be (there's a PARALLAX_FACTOR define which I needed to tweak manually to get optimal results; perhaps there's a better way of doing this):
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">float2 parallax_velocity = velocity * saturate(1.0 - total_ray_length * PARALLAX_FACTOR);
</code></span></pre>
Once all those interesting vectors are retrieved, the one with the smallest magnitude is declared as "the most likely successful reprojection vector". This simple idea alone has improved the reprojection of the SSR buffer quite significantly (note that when casting multiple rays per pixel, averaging the sum of all successful reprojection vectors still gave us a better reprojection than what we had previously).
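<br/>
In shader terms the selection is as simple as it sounds; a minimal sketch (with a made-up candidate array, not the actual Stingray code):
<br/>
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">// Sketch: among the candidate reprojection vectors, pick the one with the smallest magnitude
// and use it to fetch the reflection history buffer.
float2 pick_reprojection_vector(float2 candidates[4])
{
    float2 best = candidates[0];
    float  best_len = dot(best, best);
    for (int i = 1; i < 4; ++i) {
        float len = dot(candidates[i], candidates[i]);
        if (len < best_len) {
            best = candidates[i];
            best_len = len;
        }
    }
    return best;
}
</code></span></pre>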
<br/>
<br/><iframe width="685" height="374" src="https://www.youtube.com/embed/xpkWhUjxWjU" frameborder="0" allowfullscreen></iframe><br/>
<br/>
Screen space reflections are one of the most difficult screen space effects I've had to deal with. They are plagued with artifacts which can often be difficult to explain or understand. In the last couple of years I've seen people propose really creative ways to minimize some of the artifacts that are inherent to SSR. I hope this continues!
<br/>
</div>
Jphttp://www.blogger.com/profile/09637484103636420407noreply@blogger.com201tag:blogger.com,1999:blog-1994130783874175266.post-28921511689477758432017-05-16T23:56:00.001+02:002017-05-17T00:17:16.331+02:00Rebuilding the Entity Index<h2>
Background</h2>
If you are not familiar with the Stingray Entity system you can find good resources to catch up here:<br />
<ul>
<li><a href="https://www.youtube.com/watch?v=PmEeW9hjqrM&">Stingray Engine Code Walkthrough #18 Entities</a></li>
<li><a href="http://bitsquid.blogspot.ch/2014/08/building-data-oriented-entity-system.html">Autodesk Stingray Blog: Building a Data-Oriented Entity System (part 1)</a></li>
<li><a href="http://bitsquid.blogspot.ch/2014/09/building-data-oriented-entity-system.html">Autodesk Stingray Blog: Building a Data-Oriented Entity System (Part 2: Components)</a></li>
<li><a href="http://bitsquid.blogspot.ch/2014/10/building-data-oriented-entity-system.html">Autodesk Stingray Blog: Building a Data-Oriented Entity System (Part 3: The Transform Component)</a></li>
<li><a href="http://bitsquid.blogspot.ch/2014/10/building-data-oriented-entity-system_10.html">Autodesk Stingray Blog: Building a Data-Oriented Entity System (Part 4: Entity Resources)</a></li>
</ul>
The Entity system is a very central part of the future of Stingray, and as we integrate it with more parts new requirements pop up. One of those is the ability to interact with Entity Components via the visual scripting language in Stingray - Flow. We want to provide a generic interface to Entities in Flow without adding weight to the fundamental Entity system.<br />
<br />
To accomplish this we added a “Property” system that Flow and other parts of the Stingray Engine can use, which is optional for each Component to implement in addition to its own specialized API. The Property System enables an API to read and write entity component properties using the name of the component, the property name and the property value. The Property System needs to be able to find a specific Component Instance by name for an Entity, but the Entity System does not directly track an Entity / Component Instance relationship. It does not even track the Entity / Component Manager relationship.<br />
<br />
So what we did was to add the Entity Index, a registry where we add all Component Instances created for an Entity as it is constructed from an Entity Resource. To make it usable we also added the rule that each Component in an Entity Resource should have a unique name within the resource so the user can identify it by name when using the Flow system.<br />
<br />
In order for the Flow system to work we need to be able to find a specific component instance by name for an Entity so we can get and set properties of that instance. This is the job of the Entity Index. In the Entity Index you register an Entity's components by name so you can look them up later.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Property_System_and_Entity_Index_21"></a>Property System and Entity Index</h2>
When creating an Entity we use the name of the component instance together with the component type name, i.e. the Component Manager, and create an <i>Entity Index</i> that maps the name to the component instance and the <i>Component Manager</i>. In the Stingray Entity system an Entity cannot have two component instances with the same name.<br />
<br />
<h3>
<a href="https://www.blogger.com/null" id="Example_27"></a>Example:</h3>
<h4>
<a href="https://www.blogger.com/null" id="Entity_29"></a> </h4>
<h4>
Entity</h4>
<ul>
<li>Transform - Transform Component</li>
<li>Fog - Render Data Component</li>
<li>Vignette - Render Data Component</li>
</ul>
<br />
For this Entity we would instantiate one Transform Component Instance and two Render Data Component Instances. We get back an InstanceId for each Component Instance which can be used to identify which of Fog or Vignette we are talking about even though they are created from the same Entity using the same Component Manager.<br />
<br />
We also register this in the Entity Index as:<br />
<br />
<table border="1">
<thead>
<tr>
<th>Key</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Entity</td>
<td>Array<Components></td>
</tr>
</tbody>
</table>
<br />
The Array<Components> contains one or more entries which each contain the following:<br />
<br />
<table border="1">
<thead>
<tr>
<th>Components</th>
</tr>
</thead>
<tbody>
<tr>
<td>Component Manager</td>
</tr>
<tr>
<td>InstanceId</td>
</tr>
<tr>
<td>Name</td>
</tr>
</tbody>
</table>
<br />
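In C++-ish pseudocode (a sketch of the structure only; the container and id types here are hypothetical, not the engine's actual ones), this legacy index is essentially a per-entity array of entries:<br />
<br />
<pre><code class="language-C++">// Sketch of the Entity Index described above: every entity gets its own array of entries,
// and there is no sharing between entities.
struct ComponentEntry {
    ComponentManager *manager;  // which manager owns the instance
    InstanceId        id;       // instance id assigned by that manager
    IdString32        name;     // hashed component name, e.g. hash("Fog")
};

struct EntityIndex {
    HashMap<Entity, Array<ComponentEntry>> entries;

    void register_component(Entity e, IdString32 name, ComponentManager *manager, InstanceId id) {
        entries[e].push_back({manager, id, name});
    }

    // Find the manager and instance id for a named component of an entity.
    const ComponentEntry *lookup(Entity e, IdString32 name) {
        Array<ComponentEntry> &list = entries[e];
        for (unsigned i = 0; i < list.size(); ++i)
            if (list[i].name == name)
                return &list[i];
        return nullptr;
    }
};
</code></pre>
<br />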
Let's add a few entities and components to the Entity Index:<br />
<br />
<h4>
<a href="https://www.blogger.com/null" id="entity_1id_53"></a>entity_1.id</h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>13</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
<td>4</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
<td>5</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_2id_60"></a> </h4>
<h4>
entity_2.id</h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>14</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
<td>6</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
<td>7</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_3id_67"></a> </h4>
<h4>
entity_3.id</h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>2</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_2</td>
<td>4</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_2</td>
<td>5</td>
</tr>
</tbody>
</table>
<br />
This allows Flow to set and get properties using the Entity and the Component Name. Using the Entity and Component Name we can look up which Component Manager has the component instance and which InstanceId it has assigned to it so we can get the Instance and operate on the data.<br />
<br />
The problem with this implementation is that it becomes very large - we need a registry with one key-array pair for each Entity, where the array contains one entry for each Component Instance of the Entity. That is not very efficient as the number of entities grows. There is no reuse at all in the Entity Index - and there can't be - each entry in the index is unique with no overlap.<br />
<br />
Here are some measurements using a synthetic test that creates entities, adds and looks up components on them, and deletes entities. It deletes parts of the entities as it runs and does garbage collection. The number of entities given in the tables is the total number created during the test, not the number of simultaneous entities, which varies over time. The entities use 75 different component compositions, ranging from a single component up to eleven components. The test is single threaded with no locking besides some in the memory subsystem, which makes the times match up well with CPU usage.<br />
<br />
<table border="1">
<thead>
<tr>
<th style="text-align: right;">Entity Count</th>
<th style="text-align: right;">Test run time (s)</th>
<th style="text-align: right;">Memory used (Mb)</th>
<th style="text-align: right;">Time/Entity (us)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">10k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">5.79</td>
<td style="text-align: right;">0.977</td>
</tr>
<tr>
<td style="text-align: right;">20k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">5.79</td>
<td style="text-align: right;">0.488</td>
</tr>
<tr>
<td style="text-align: right;">40k</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">11.88</td>
<td style="text-align: right;">0.732</td>
</tr>
<tr>
<td style="text-align: right;">80k</td>
<td style="text-align: right;">0.06</td>
<td style="text-align: right;">11.88</td>
<td style="text-align: right;">0.732</td>
</tr>
<tr>
<td style="text-align: right;">160k</td>
<td style="text-align: right;">0.13</td>
<td style="text-align: right;">25.69</td>
<td style="text-align: right;">0.793</td>
</tr>
<tr>
<td style="text-align: right;">320k</td>
<td style="text-align: right;">0.32</td>
<td style="text-align: right;">31.04</td>
<td style="text-align: right;">0.977</td>
</tr>
<tr>
<td style="text-align: right;">640k</td>
<td style="text-align: right;">1.08</td>
<td style="text-align: right;">55.90</td>
<td style="text-align: right;">1.648</td>
</tr>
<tr>
<td style="text-align: right;">1.28m</td>
<td style="text-align: right;">2.58</td>
<td style="text-align: right;">65.82</td>
<td style="text-align: right;">1.922</td>
</tr>
<tr>
<td style="text-align: right;">2.56m</td>
<td style="text-align: right;">6.35</td>
<td style="text-align: right;">65.55</td>
<td style="text-align: right;">2.366</td>
</tr>
<tr>
<td style="text-align: right;">5.12m</td>
<td style="text-align: right;">13.42</td>
<td style="text-align: right;">120.55</td>
<td style="text-align: right;">2.500</td>
</tr>
<tr>
<td style="text-align: right;">10.24m</td>
<td style="text-align: right;">25.69</td>
<td style="text-align: right;">130.55</td>
<td style="text-align: right;">2.393</td>
</tr>
</tbody>
</table>
<br />
As you can see, we take longer and longer and use more and more memory as we double the number of entities, and at the larger counts the time and memory increase pretty dramatically.<br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnJmM6XpUFp_Fcx6yqMywOE_1DNN8WO4StJ-HRJ5BWk1z85gkxNRM883xlB6Y1miDaRhD1B_T-e74Rh2oL6eOD_091bX-VSIQX-qjhd6gHY7KBi_9D5b5LOHCvM2yfl5oB75TAh6Uq007l/s1600/legacy-time-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnJmM6XpUFp_Fcx6yqMywOE_1DNN8WO4StJ-HRJ5BWk1z85gkxNRM883xlB6Y1miDaRhD1B_T-e74Rh2oL6eOD_091bX-VSIQX-qjhd6gHY7KBi_9D5b5LOHCvM2yfl5oB75TAh6Uq007l/s1600/legacy-time-graph.png" /></a><br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6xVjh9LRLz4n4wJ3fh2YJCbR45ViiaJBMel96CXZl75BqBhv-4wyR39qdD0QH0Y5WCPp41Ck9MsF1lv_6DlUcy3Mi_LvflIyeL7qbZRg4zVc19btq5JICtTK-BahqK-QPYqIoswtZ3HDE/s1600/legacy-memory-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6xVjh9LRLz4n4wJ3fh2YJCbR45ViiaJBMel96CXZl75BqBhv-4wyR39qdD0QH0Y5WCPp41Ck9MsF1lv_6DlUcy3Mi_LvflIyeL7qbZRg4zVc19btq5JICtTK-BahqK-QPYqIoswtZ3HDE/s1600/legacy-memory-graph.png" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Since we plan to use the entity system extensively we need an index that is more efficient with memory and scales more linearly in CPU usage.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Shifting_control_of_the_InstanceId_101"></a>Shifting control of the InstanceId</h2>
The InstanceId is defined to be unique to the Entity instance for a specific Component Manager - it does not have to be unique for all components in a Component Manager, nor does it have to be unique across different Component Managers.<br />
<br />
The create and lookup functions for an Component Instance looks like this:<br />
<br />
<pre><span style="font-family: "courier new" , "courier" , monospace;"><code class="language-C++">InstanceWithId instance_with_id = transform_manager.create(entity);
InstanceId my_transform_id = instance_with_id.id;
.....
Instance instance = transform_manager.lookup(entity, my_transform_id);</code></span></pre>
<br />
The interface is somewhat confusing since the create function returns both the component instance id and the instance. This is done so you don't have to do a lookup of the instance directly after create. As you can see we have no knowledge of what the resulting InstanceId will be, so we can't make any assumptions in the Entity Index, forcing us to have unique entries for each Component Instance of every Entity.<br />
<br />
But we already set up the rule that in the Entity Resource, each Component should have a unique name for the Property System to work - this is a new requirement that was added at a later stage than when designing the initial Entity system. Now that it is there we can make use of this to simplify the Entity Index.<br />
<br />
Instead of letting each Component Manager decide the InstanceId we let the caller to the create function decide the InstanceId. We can decide that the InstanceId should be the 32-bit hash of the Component Name from the Entity Resource. Doing this will restrict the possible optimization that a component manager could do if it had control of the InstanceId, but so far we have had no real use case for it and the benefits of changing this are greater than the loss of a possible optimization that we <i>might</i> do sometime in the future.<br />
<br />
So we change the API like this:<br />
<br />
<pre><code class="language-C++">Instance instance = transform_manager.create(entity, hash(<span class="hljs-string">"Transform"</span>));
.....
Instance instance = transform_manager.lookup(entity, hash(<span class="hljs-string">"Transform"</span>)); </code></pre>
<br />
Nice, clean and symmetrical. Note though that the InstanceId is entirely up to the caller to control; it does not have to be a hash of a string. It must be unique for an Entity within a specific component manager. For it to work with the Entity Index and the Property System, the InstanceId needs to be unique across all Component Instances in all Component Managers for each Entity instance. This is enforced when an Entity is created from a resource but not when constructing Component Instances by hand in code. If you want a component added outside the resource construction to work with the Property System, care needs to be taken so its name does not collide with the names of other component instances for the Entity.<br />
<br />
Let's add the entities and components again using the new rule set; the Entity Index now looks like this:<br />
<br />
<h4>
<a href="https://www.blogger.com/null" id="entity_1id_137"></a>entity_1.id</h4>
<h4>
</h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>hash(“Transform”)</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
<td>hash(“Fog”)</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
<td>hash(“Vignette”)</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_2id_144"></a> </h4>
<h4>
entity_2.id </h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>hash(“Transform”)</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
<td>hash(“Fog”)</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
<td>hash(“Vignette”)</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_3id_151"></a> </h4>
<h4>
entity_3.id </h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
<th>InstanceId</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
<td>hash(“Transform”)</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_2</td>
<td>hash(“Fog”)</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_2</td>
<td>hash(“Vignette”)</td>
</tr>
</tbody>
</table>
<br />
As we can now see, the InstanceId column contains redundant data - we only need to store the Component Manager pointer. We use the Entity and the hash of the component name to find the Component Manager, which can then be used to look up the Instance.<br />
<br />
<h4>
<a href="https://www.blogger.com/null" id="entity_1id_160"></a>entity_1.id</h4>
<h4>
</h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_2id_167"></a> </h4>
<h4>
entity_2.id </h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_1</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_1</td>
</tr>
</tbody>
</table>
<h4>
<a href="https://www.blogger.com/null" id="entity_3id_174"></a> </h4>
<h4>
entity_3.id </h4>
<table border="1">
<thead>
<tr>
<th>Name</th>
<th>Component Manager</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash(“Transform”)</td>
<td>&transform_manager</td>
</tr>
<tr>
<td>hash(“Fog”)</td>
<td>&render_data_manager_2</td>
</tr>
<tr>
<td>hash(“Vignette”)</td>
<td>&render_data_manager_2</td>
</tr>
</tbody>
</table>
<br />
<br />
We now also see that the lookup arrays for entity_1 and entity_2 are identical, so two keys could point to the same value.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Options_for_implementation_183"></a>Options for implementation</h2>
We could opt for an index that has a map from entity_id to a list or map of entries for lookup:<br />
<br />
<pre><code>entity_1.id = [ hash("Transform"), &transform_manager ], [ hash("Fog"), &render_data_manager_1 ], [ hash("Vignette"), &render_data_manager_1 ]
entity_2.id = [ hash("Transform"), &transform_manager ], [ hash("Fog"), &render_data_manager_1 ], [ hash("Vignette"), &render_data_manager_1 ]
entity_3.id = [ hash("Transform"), &transform_manager ], [ hash("Fog"), &render_data_manager_2 ], [ hash("Vignette"), &render_data_manager_2 ]
</code></pre>
<br />
We should probably not store the same entry lookup list multiple times if it can be reused by multiple entity instances, as this wastes space. But at any time a new component instance can be added to or removed from an entity, and its entry list would then change - that would mean administrating memory for the lookup lists and detecting when two entities start to diverge so we can make a new extended copy of the entry list for the changed entity. We should probably also remove lookup lists that are no longer used, as keeping them around would waste memory.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Entity_and_component_creation_193"></a>Entity and component creation</h2>
The call sequence for creating entities from resources (or even programmatically) looks something like this:<br />
<br />
<pre><code>Entity e = create();
Instance transform = transform_manager.create(e, hash("Transform"));
Instance fog = render_data_manager_1.create(e, hash("Fog"));
Instance vignette = render_data_manager_1.create(e, hash("Vignette"));
</code></pre>
<br />
In this scenario we could potentially build an entity lookup list for the entity which contains lookups for the transform, fog and vignette instances:<br />
<br />
<pre><code>entity_index.register(e, [ hash("Transform"), &transform_manager ], [ hash("Fog"), &render_data_manager_1 ], [ hash("Vignette"), &render_data_manager_1 ]);
</code></pre>
<br />
But as stated previously - component instances can be added and removed at any point in time making the lookup table change during the lifetime of the Entity. We need to be able to extend it at will, so it should look something like this:<br />
<br />
<pre><code>Entity e = create();
Instance transform = transform_manager.create(e, hash("Transform"));
entity_index.register(e, [ hash("Transform"), &transform_manager ]);
Instance fog = render_data_manager_1.create(e, hash("Fog"));
entity_index.register(e, [ hash("Fog"), &render_data_manager_1 ]);
Instance vignette = render_data_manager_1.create(e, hash("Vignette"));
entity_index.register(e, [ hash("Vignette"), &render_data_manager_1 ]);
</code></pre>
<br />
Now we just extend the lookup list of the entity as we add new components. This means that two entities that started out life as having identical lookup lists after being spawned from a resource might diverge over time so the Entity Index needs to handle that.<br />
<br />
Component Instances can also be destroyed, so we should handle that as well. Even if we do not remove component instances things will still work - if we keep a lookup to an Instance that has been removed we would just fail the lookup in the corresponding Component Manager. It would lead to wasted memory though, something we need to be aware of going forward.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Building_a_Prototype_chain_223"></a>Building a Prototype chain</h2>
Looking at how we build up the Component instances for an Entity it goes something like this: first add the Transform, then add Fog and finally Vignette. This looks sort of like an inheritance chain…<br />
Let's call a lookup list that contains a specific set of entry values a <i>Prototype</i>.<br />
<br />
An entity starts with an empty lookup list that contains nothing, []; this is the base Prototype, let's call it P0.<br />
<ul>
<li>Add the “Transform” component and your prototype is now P0 + [&transform_manager, “Transform”], let's call that prototype P1.</li>
<li>Add the “Fog” component, now the prototype is P1 + [&render_data_manager_1, “Fog”] - call it P2.</li>
<li>Add the “Vignette” component, now the prototype is P2 + [&render_data_manager_1, “Vignette”] - call it P3.</li>
</ul>
Your entity is now using the prototype P3, and from that you can find all the lookup entries you need.<br />
The prototype registry will contain:<br />
<br />
<pre><code>P0 = []
P1 = [] + [&transform_manager, "Transform"]
P2 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"]
P3 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"] + [&render_data_manager_1, "Vignette"]
</code></pre>
<br />
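A small sketch of what that registration could look like (hypothetical types, reusing the ComponentEntry structure from the sketch above; not the engine's actual code):<br />
<br />
<pre><code class="language-C++">// Sketch: a prototype is a specific list of component entries. Registering a component moves
// the entity to the prototype that equals its current prototype plus the new entry, creating
// that prototype only if it does not already exist.
struct PrototypeRegistry {
    Array<Array<ComponentEntry>> prototypes;       // prototypes[0] == P0 == []
    HashMap<Entity, int>         entity_prototype; // which prototype each entity uses (new entities start at 0)

    void register_component(Entity e, const ComponentEntry &entry) {
        Array<ComponentEntry> extended = prototypes[entity_prototype[e]];
        extended.push_back(entry);
        for (unsigned i = 0; i < prototypes.size(); ++i)
            if (prototypes[i] == extended) {        // assumed element-wise comparison
                entity_prototype[e] = (int)i;
                return;
            }
        prototypes.push_back(extended);
        entity_prototype[e] = (int)prototypes.size() - 1;
    }
};
</code></pre>
<br />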
If you create another entity which uses the same Components with the same names you will end up with the same prototype:<br />
<br />
Create entity_2, it will have the empty prototype P0.<br />
<ul>
<li>Add the “Transform” component and your prototype is now P1.</li>
<li>Add the “Fog” component, now the prototype is P2.</li>
<li>Add the “Vignette” component, now the prototype is P3.</li>
</ul>
<br />
We end up with the same prototype P3 as the other entity - as long as we add the components in the same order we end up with the same prototype. For entities created from resources this will be true for all entities created from the same entity resource. For components that are added programmatically it will only work if the code adds components in the same order, but even if they do not <i>always</i> do this we will still have a very large overlap for most of the entities.<br />
<br />
Let's look at the third example where we do not have an exact match, entity_3:<br />
<br />
Create entity_3, it will have the empty prototype P0.<br />
<ul>
<li>Add the “Transform” component and your prototype is now P0 + [&transform_manager, “Transform”] = P1.</li>
<li>Add the “Fog” component - this render data component manager is not the same one used by entity_1 and entity_2, so we get P1 + [&render_data_manager_2, “Fog”]; this does not match P2 so we make a new prototype P4 instead.</li>
<li>Add the “Vignette” component, now the prototype is P4 + [&render_data_manager_2, “Vignette”] -> P5.</li>
</ul>
The prototype registry will contain:<br />
<br />
<pre><code>P0 = []
P1 = [] + [&transform_manager, "Transform"]
P2 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"]
P3 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"] + [&render_data_manager_1, "Vignette"]
P4 = [] + [&transform_manager, "Transform"] + [&render_data_manager_2, "Fog"]
P5 = [] + [&transform_manager, "Transform"] + [&render_data_manager_2, "Fog"] + [&render_data_manager_2, "Vignette"]
</code></pre>
<h2>
<a href="https://www.blogger.com/null" id="Storage_of_the_prototype_271"></a>Storage of the prototype</h2>
One option is to store all the component lookup entries for each prototype - this makes it easy to get all the component instance look-ups in one go, at the expense of memory due to data duplication. Each entity stores which prototype it uses.<br />
<ul>
<li>entity_1 -> P3</li>
<li>entity_2 -> P3</li>
<li>entity_3 -> P5</li>
</ul>
The prototype registry now contains:<br />
<br />
<pre><code>P0 = []
P1 = [] + [&transform_manager, "Transform"]
P2 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"]
P3 = [] + [&transform_manager, "Transform"] + [&render_data_manager_1, "Fog"] + [&render_data_manager_1, "Vignette"]
P4 = [] + [&transform_manager, "Transform"] + [&render_data_manager_2, "Fog"]
P5 = [] + [&transform_manager, "Transform"] + [&render_data_manager_2, "Fog"] + [&render_data_manager_2, "Vignette"]
</code></pre>
<br />
Some of the entries (P2 and P4) could technically be removed since they are not actively used - we would then need to re-create them if new entries with the same structure are added later.<br />
A different option is to actually <i>use</i> the intermediate entries by referencing them, like so:<br />
<br />
<pre><code>P0 = []
P1 = P0 + [&transform_manager, "Transform"]
P2 = P1 + [&render_data_manager_1, "Fog"]
P3 = P2 + [&render_data_manager_1, "Vignette"]
P4 = P1 + [&render_data_manager_2, "Fog"]
P5 = P4 + [&render_data_manager_2, "Vignette"]
</code></pre>
<br />
This is less wasteful but requires walking up the chain to find all the components for an entity. On the other hand we can make this very efficient storage-wise by having a lookup table like this:<br />
Map from Prototype to {base_prototype, component_manager, component_name}. The prototype data is small and has no dynamic size so it can be stored very efficiently.<br />
<br />
All prototypes are added to the same prototype map, and since the HashMap gives us O(1) lookup cost, traversing the chain will only cost us the potential cache misses of each lookup. Since the hashmap is likely to be pretty compact (via prototype reuse) this hopefully should not be a huge issue. If it turns out to be, a different storage approach might be needed, trading memory use for lookup speed.<br />
<br />
Since the amount of data we store for each Prototype would be very small - roughly 16 bytes - we can be a bit more relaxed with unused prototypes - we do not need to remove them as aggressively as we would if each prototype contained a complete lookup table for all components.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Building_the_Prototype_index_307"></a>Building the Prototype index</h2>
So how do we “name” the prototypes effectively for fast lookup? Well, the first lookup would be Entity -> Prototype and then from Prototype -> Prototype definition.<br />
A simple approach would be hashing - use the content of the Prototype as the hash data to get a unique identifier.<br />
<br />
The first base prototype has an empty definition so we let that be zero.<br />
To calculate a prototype, mix the prototype you are basing it off of with the hash of the prototype data - in our case we hash the Component Manager pointer and the Component Name, and mix that with the base prototype.<br />
<br />
<pre><code class="language-C++">Prototype prototype = mix(base_prototype, mix(hash(&component_manager), hash(component_name)))
</code></pre>
<br />
The entry is stored with the prototype as key and the value as [base_prototype, &component_manager, component_name].<br />
<br />
When a new Component is added to an entity we add/find the new prototype and update the Entity -> Prototype map to point to it.<br />
<br />
So, we end up with a structure like this:<br />
<pre><code class="language-C++"><span class="hljs-keyword">struct</span> PrototypeDescription {
Prototype base_prototype;
ComponentMananger *component_manager;
IdString32 component_name;
}
Map<Entity, Prototype> entity_prototype_lookup;
Map<Prototype, PrototypeDescription> prototypes;
<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">register_component</span><span class="hljs-params">(Entity, ComponentManager, component_name)</span>
</span>{
Prototype p = entity_prototype_lookup[Entity];
Prototype new_p = mix(p, mix(hash(ComponentManager), hash(component_name)));
<span class="hljs-keyword">if</span> (!prototypes.has(new_p))
prototypes.insert(new_p, {p, &ComponentManager, component_name});
enity_index[Entity] = new_p;
}
<span class="hljs-function">ComponentMananger *<span class="hljs-title">find_component_manager</span><span class="hljs-params">(Entity, component_name)</span>
</span>{
Prototype p = entity_index[Entity];
<span class="hljs-keyword">while</span> (p != <span class="hljs-number">0</span>)
{
PrototypeDescription description = prototypes[p];
<span class="hljs-keyword">if</span> (description.component_name == component_name)
<span class="hljs-keyword">return</span> description.component_manager;
p = description.base_prototype;
}
<span class="hljs-keyword">return</span> <span class="hljs-literal">nullptr</span>;
}
</code></pre>
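<br />
Hypothetical usage of the two functions above, following the entity_1 example from earlier - the comments show which prototype the entity ends up pointing at:<br />
<br />
<pre><code class="language-C++">Entity e = create();
transform_manager.create(e, hash("Transform"));
register_component(e, &transform_manager, hash("Transform"));     // e -> P1
render_data_manager_1.create(e, hash("Fog"));
register_component(e, &render_data_manager_1, hash("Fog"));       // e -> P2

ComponentManager *fog_manager = find_component_manager(e, hash("Fog"));           // &render_data_manager_1
ComponentManager *vignette_manager = find_component_manager(e, hash("Vignette")); // nullptr, never registered
</code></pre>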
<br />
This could lead to a lot of hashing and look-ups, but we can change the API to register new components for multiple Entities in one go, which leads to dramatically fewer hash computations and look-ups - we already do that kind of optimization when creating entities from resources so it would be a natural fit (see the sketch below). Also, we can easily cache the base prototype index to avoid more of the hash look-ups in <i>find_component_manager</i>.<br />
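<br />
Here is a rough sketch of what that batched registration could look like - the function name, parameter types and Map interface are assumptions layered on top of the pseudocode above, not actual engine code. The component manager and name are hashed once, and the resulting prototype is cached per unique base prototype:<br />
<br />
<pre><code class="language-C++">void register_components(const Entity *entities, unsigned count,
                         ComponentManager *component_manager, IdString32 component_name)
{
    uint64_t component_hash = mix(hash(component_manager), hash(component_name));
    Map<Prototype, Prototype> batch_cache;    // base prototype -> new prototype for this batch
    for (unsigned i = 0; i != count; ++i) {
        Prototype base = entity_prototype_lookup[entities[i]];
        if (!batch_cache.has(base)) {
            Prototype new_p = mix(base, component_hash);
            if (!prototypes.has(new_p))
                prototypes.insert(new_p, {base, component_manager, component_name});
            batch_cache.insert(base, new_p);
        }
        entity_prototype_lookup[entities[i]] = batch_cache[base];
    }
}
</code></pre>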
<br />
<h2>
<a href="https://www.blogger.com/null" id="Measuring_the_results_362"></a>Measuring the results</h2>
Let's run the synthetic test again and see how our new entity index matches up against the old one.<br />
<br />
<table border="1">
<thead>
<tr>
<th style="text-align: right;">Entity Count</th>
<th style="text-align: right;">Test run time (s)</th>
<th style="text-align: right;">Memory used (Mb)</th>
<th style="text-align: right;">Time/Entity (us)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">10k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.26</td>
<td style="text-align: right;">0.977</td>
</tr>
<tr>
<td style="text-align: right;">20k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.51</td>
<td style="text-align: right;">0.488</td>
</tr>
<tr>
<td style="text-align: right;">40k</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.832</td>
</tr>
<tr>
<td style="text-align: right;">80k</td>
<td style="text-align: right;">0.06</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.610</td>
</tr>
<tr>
<td style="text-align: right;">160k</td>
<td style="text-align: right;">0.11</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.671</td>
</tr>
<tr>
<td style="text-align: right;">320k</td>
<td style="text-align: right;">0.23</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.702</td>
</tr>
<tr>
<td style="text-align: right;">640k</td>
<td style="text-align: right;">0.46</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.702</td>
</tr>
<tr>
<td style="text-align: right;">1.28m</td>
<td style="text-align: right;">0.94</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.700</td>
</tr>
<tr>
<td style="text-align: right;">2.56m</td>
<td style="text-align: right;">1.88</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.700</td>
</tr>
<tr>
<td style="text-align: right;">5.12m</td>
<td style="text-align: right;">3.78</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.704</td>
</tr>
<tr>
<td style="text-align: right;">10.24m</td>
<td style="text-align: right;">7.57</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">0.705</td>
</tr>
</tbody>
</table>
<br />
The run time now scales very close to linearly and is overall faster than the old implementation. Most notable is the win when using a lot of entities. Memory usage has gone down as well and the time/entity is also scaling more gracefully.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc4Si61491ZHUiDV0qPejXUH7pxzPJz7Lc-mgJ9NF1dPojgPUSgCY2JKuGnemsmxZj2fEciCmoJRFHrRwdq4_nK17FVNqcVdlMUKeNuRTk34yT1b7QDdQZGOlWw46uAWAsTC3X1bJx-Qnb/s1600/new-time-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc4Si61491ZHUiDV0qPejXUH7pxzPJz7Lc-mgJ9NF1dPojgPUSgCY2JKuGnemsmxZj2fEciCmoJRFHrRwdq4_nK17FVNqcVdlMUKeNuRTk34yT1b7QDdQZGOlWw46uAWAsTC3X1bJx-Qnb/s1600/new-time-graph.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj-aOJaCph4q73_eyyt0FmKVC4USXtGsv69pzeN-J525AOtr2M5r-vW70eBokzMEoKWHZIFIwBuODN3XVKHvMf4_CIWobwEuVcdOhBtvgYBwFBykk6nW8fq1HEAWsqyzpfglyA7B5s3WE7/s1600/new-memory-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj-aOJaCph4q73_eyyt0FmKVC4USXtGsv69pzeN-J525AOtr2M5r-vW70eBokzMEoKWHZIFIwBuODN3XVKHvMf4_CIWobwEuVcdOhBtvgYBwFBykk6nW8fq1HEAWsqyzpfglyA7B5s3WE7/s1600/new-memory-graph.png" /></a></div>
<br />
Memory usage looks a little strange but there is an easy explanation - the mapping from entity to prototype uses almost all of that memory (via a hashmap) and the actual prototypes take less than 30 KB. Note that the old index uses the same amount of memory for the Entity to Prototype mapping.<br />
<br />
Let's compare the old and new implementations side by side:<br />
<br />
<br />
<table border="1">
<thead>
<tr>
<th style="text-align: right;">Entity Count</th>
<th style="text-align: right;">Time New (s)</th>
<th style="text-align: right;">Time Legacy (s)</th>
<th style="text-align: right;">Memory New (Mb)</th>
<th style="text-align: right;">Memory Legacy (Mb)</th>
<th style="text-align: right;">Time/Entity New (us)</th>
<th style="text-align: right;">Time/Entity Legacy (us)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">10k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.26</td>
<td style="text-align: right;">5.79</td>
<td style="text-align: right;">0.977</td>
<td style="text-align: right;">0.977</td>
</tr>
<tr>
<td style="text-align: right;">20k</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.51</td>
<td style="text-align: right;">5.79</td>
<td style="text-align: right;">0.488</td>
<td style="text-align: right;">0.488</td>
</tr>
<tr>
<td style="text-align: right;">40k</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">0.03</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">11.88</td>
<td style="text-align: right;">0.832</td>
<td style="text-align: right;">0.732</td>
</tr>
<tr>
<td style="text-align: right;">80k</td>
<td style="text-align: right;">0.05</td>
<td style="text-align: right;">0.06</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">11.88</td>
<td style="text-align: right;">0.610</td>
<td style="text-align: right;">0.732</td>
</tr>
<tr>
<td style="text-align: right;">160k</td>
<td style="text-align: right;">0.11</td>
<td style="text-align: right;">0.13</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">25.69</td>
<td style="text-align: right;">0.671</td>
<td style="text-align: right;">0.793</td>
</tr>
<tr>
<td style="text-align: right;">320k</td>
<td style="text-align: right;">0.23</td>
<td style="text-align: right;">0.32</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">31.04</td>
<td style="text-align: right;">0.702</td>
<td style="text-align: right;">0.977</td>
</tr>
<tr>
<td style="text-align: right;">640k</td>
<td style="text-align: right;">0.46</td>
<td style="text-align: right;">1.08</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">55.90</td>
<td style="text-align: right;">0.702</td>
<td style="text-align: right;">1.648</td>
</tr>
<tr>
<td style="text-align: right;">1.28m</td>
<td style="text-align: right;">0.94</td>
<td style="text-align: right;">2.58</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">65.82</td>
<td style="text-align: right;">0.700</td>
<td style="text-align: right;">1.922</td>
</tr>
<tr>
<td style="text-align: right;">2.56m</td>
<td style="text-align: right;">1.88</td>
<td style="text-align: right;">6.53</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">65.55</td>
<td style="text-align: right;">0.700</td>
<td style="text-align: right;">2.366</td>
</tr>
<tr>
<td style="text-align: right;">5.12m</td>
<td style="text-align: right;">3.78</td>
<td style="text-align: right;">13.42</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">120.55</td>
<td style="text-align: right;">0.704</td>
<td style="text-align: right;">2.500</td>
</tr>
<tr>
<td style="text-align: right;">10.24m</td>
<td style="text-align: right;">7.57</td>
<td style="text-align: right;">25.69</td>
<td style="text-align: right;">0.99</td>
<td style="text-align: right;">130.55</td>
<td style="text-align: right;">0.705</td>
<td style="text-align: right;">2.393</td>
</tr>
</tbody>
</table>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWohZK9z_oB7jdBLWh0Lh12e-2Yv0cfDR_5-d6mxgGWjxRqRCghxpU2xH3l6d0MU8OGRNTjc9z7kX9W6YQzxMLAoN0CzWxagPcD1FbGow2yK1buVgaljtyKwaRMHlLzLBlQNaCQCWKKSvO/s1600/compare-time-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWohZK9z_oB7jdBLWh0Lh12e-2Yv0cfDR_5-d6mxgGWjxRqRCghxpU2xH3l6d0MU8OGRNTjc9z7kX9W6YQzxMLAoN0CzWxagPcD1FbGow2yK1buVgaljtyKwaRMHlLzLBlQNaCQCWKKSvO/s1600/compare-time-graph.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip8DGp_U831G1-2vqqq4HOwUqLw1FO9pN7G-fNlrWyQolYakM6KCBIpSrFNr_GLnLBizSigP0Poncm2WTqObV5F_0d_GuyXrbhL4QyAWFU5hLTEaBr2Of5rUpr69rgjbDjwQ5gd558eC6v/s1600/compare-memory-graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip8DGp_U831G1-2vqqq4HOwUqLw1FO9pN7G-fNlrWyQolYakM6KCBIpSrFNr_GLnLBizSigP0Poncm2WTqObV5F_0d_GuyXrbhL4QyAWFU5hLTEaBr2Of5rUpr69rgjbDjwQ5gd558eC6v/s1600/compare-memory-graph.png" /></a></div>
<br />
Looks like a pretty good win.<br />
<br />
<h2>
<a href="https://www.blogger.com/null" id="Final_words_408"></a>Final words</h2>
By taking into account the new requirements as the Entity system evolved we were able to create a much more space efficient and more performant Entity Index.<br />
<br />
The implementation chosen here has focused on reducing the amount of data we use in the Entity Index at the cost of lookup complexity. I think this is the right trade-off, especially since it performs better as well. Since the interface for the Entity Index is fairly simple and does not dictate how we store the data, we could change the implementation to optimize for lookup speed if need be.Dan Engelbrechthttp://www.blogger.com/profile/12177635194073845370noreply@blogger.com450tag:blogger.com,1999:blog-1994130783874175266.post-82968706069779853712017-03-14T10:28:00.000+01:002017-03-14T10:28:05.521+01:00Stingray Renderer Walkthrough #8: stingray-renderer & mini-renderer
<h1><a id="Introduction_0"></a>Introduction</h1>
<p>In the last <a href="http://bitsquid.blogspot.com/2017/03/stingray-renderer-walkthrough-7-data.html">post</a> we looked at our systems for doing data-driven rendering in Stingray. Today I will go through the two default rendering pipes we ship as templates with Stingray. Both are entirely described in data using two <code>render_config</code> files and a bunch of <code>shader_source</code> files.</p>
<p>We call them the <strong>“stingray renderer”</strong> and the <strong>“mini renderer”</strong></p>
<h1><a id="Stingray_Renderer_6"></a>Stingray Renderer</h1>
<p>The “stingray renderer” is the default rendering pipe and is used in almost all template and sample projects. It’s a fairly standard “high-end” real-time rendering pipe and supports the regular buzzword features.</p>
<p>The <code>render_config</code> file is approx 1500 lines of <em>sjson</em>. While 1500 might sound a bit massive, it's important to remember that this rendering pipe is highly configurable - pretty much all features can be dynamically switched on/off. It also runs on a broad variety of different platforms (mobile -> consoles -> high-end PC), supports a bunch of different debug visualization modes, and features four different stereo rendering paths in addition to the default mono path.</p>
<p>If you are interested in taking a closer look at the actual implementation you can download stingray and you’ll find it under <code>core/stingray_renderer/renderer.render_config</code>.</p>
<p>Going through the entire file and all the implementation details would require multiple blog posts, instead I will try to do a high-level break down of the default <a href="http://bitsquid.blogspot.se/2017/03/stingray-renderer-walkthrough-7-data.html"><code>layer_configuration</code></a> and talk a bit about the feature set. Before we begin, please keep in mind that this rendering pipe is designed to handle lots of different content and run on lots of different platforms. A game project would typically use it as a base and then extend, optimize and simplify it based on the project specific knowledge of the content and target platforms.</p>
<p>Here’s a somewhat simplified dump of the contents of the <code>layer_configs/default</code> array found in <code>core/stingray_renderer/renderer.render_config</code> in Stingray v1.8:</p>
<pre><code>// run any render_config_extensions that have requested to insert work at the insertion point named "first"
{ extension_insertion_point = "first" }
// kick resource generator for rendering all shadow maps
{ resource_generator="shadow_mapping" profiling_scope="shadow mapping" }
// kick resource generator for assigning light sources to clustered shading structure
{ resource_generator="clustered_shading" profiling_scope="clustered shading" }
// special layer, only responsible for clearing hdr0, gbuffer2 and the depth_stencil_buffer
{ render_targets=["hdr0", "gbuffer2"] depth_stencil_target="depth_stencil_buffer"
clear_flags=["SURFACE", "DEPTH", "STENCIL"] profiling_scope="clears" }
// if vr is supported kick a resource generator laying down a stencil mask to reject pixels outside of the lens shape
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
pass = [
{ resource_generator="vr_mask" profiling_scope="vr_mask" }
]
}
// g-buffer layer, bulk of all materials renders into this
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"]
depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }
{ extension_insertion_point = "gbuffer" }
// linearize depth into a R32F surface
{ resource_generator="stabilize_and_linearize_depth" profiling_scope="linearize_depth" }
// layer for blending decals into the gbuffer0 and gbuffer1
{ name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer"
profiling_scope="decal" sort="EXPLICIT" }
{ extension_insertion_point = "decals" }
// generate and merge motion vectors for non written pixels with motion vectors in gbuffer
{ type="static_branch" platforms=["win", "xb1", "ps4", "web", "linux"]
pass = [
{ resource_generator="generate_motion_vectors" profiling_scope="motion vectors" }
]
}
// render localized reflection probes into hdr1
{ name="reflections" render_targets=["hdr1"] depth_stencil_target="depth_stencil_buffer"
sort="FRONT_BACK" profiling_scope="reflections probes" }
{ extension_insertion_point = "reflections" }
// kick resource generator for screen space reflections
{ type="static_branch" platforms=["win", "xb1", "ps4"]
pass = [
{ resource_generator="ssr_reflections" profiling_scope="ssr" }
]
}
// kick resource generator for main scene lighting
{ resource_generator="lighting" profiling_scope="lighting" }
{ extension_insertion_point = "lighting" }
// layer for emissive materials
{ name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="FRONT_BACK" profiling_scope="emissive" }
// kick debug visualization
{ type="static_branch" render_caps={ development=true }
pass=[
{ resource_generator="debug_visualization" profiling_scope="debug_visualization" }
]
}
// kick resource generator for laying down fog
{ resource_generator="fog" profiling_scope="fog" }
// layer for skydome rendering
{ name="skydome" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="BACK_FRONT" profiling_scope="skydome" }
{ extension_insertion_point = "skydome" }
// layer for transparent materials
{ name="hdr_transparent" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="BACK_FRONT" profiling_scope="hdr_transparent" }
{ extension_insertion_point = "hdr_transparent" }
// kick resource generator for reading back any requested render targets / buffers to the CPU
{ resource_generator="stream_capture_buffers" profiling_scope="stream_capture" }
// kick resource generator for capturing reflection probes
{ type="static_branch" platform=["win"] render_caps={ development=true }
pass = [
{ resource_generator="cubemap_capture" }
]
}
// layer for rendering object selections from the editor
{ type="static_branch" platforms=["win", "ps4", "xb1"]
pass = [
{ type = "static_branch" render_settings={ selection_enabled=true }
pass = [
{ name="selection" render_targets=["gbuffer0" "ldr1_dev_r"]
depth_stencil_target="depth_stencil_buffer_selection" sort="BACK_FRONT"
clear_flags=["SURFACE" "DEPTH"] profiling_scope="selection"}
]
}
]
}
// kick resource generators for AA resolve and post processing
{ resource_generator="post_processing" profiling_scope="post_processing" }
{ extension_insertion_point = "post_processing" }
// layer for rendering LDR materials, primarily used for rendering HUD and debug rendering
{ name="transparent" render_targets=["output_target"] depth_stencil_target="stable_depth_stencil_buffer_alias"
sort="BACK_FRONT" profiling_scope="transparent" }
// kick resource generator for rendering shadow map debug overlay
{ type="static_branch" render_caps={ development=true }
pass = [
{ resource_generator="debug_shadows" profiling_scope="debug_shadows" }
]
}
// kick resource generator for compositing left/right eye
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
pass = [
{ resource_generator="vr_present" profiling_scope="present" }
]
}
{ extension_insertion_point = "last" }
</code></pre>
<p>So what we have above is a fairly standard breakdown of a rendered frame - if you have worked with real-time rendering before there shouldn't be many surprises in there. Something that is kind of cool about having the frame flow in this representation, paired with the hot-reloading functionality of <code>render_configs</code>, is that it really encourages experimentation: move things around, comment stuff out, inject new resource generators, etc.</p>
<p>Let’s go through the frame in a bit more detail:</p>
<h2><a id="Extension_insertion_points_155"></a>Extension insertion points</h2>
<p>First of all there are a bunch of <code>extension_insertion_point</code> at various locations during the frame, these are used by <a href="http://bitsquid.blogspot.se/2016/08/render-config-extensions.html"><code>render_config_extensions</code></a> to be able to schedule work into an existing <code>render_config</code>. You could argue that an extensions system to the <code>render_configs</code> is a bit superfluous, and for an in-house game engine targeting a specific industry that might very well be the case. But for us the extension system allows building features a bit more modular, it also encourages sharing of various rendering features across teams.</p>
<h2><a id="Shadows_159"></a>Shadows</h2>
<pre><code>// kick resource generator for rendering all shadow maps
{ resource_generator="shadow_mapping" profiling_scope="shadow mapping" }
</code></pre>
<p>We start off by rendering shadow maps. As we want to handle shadow receiving on alpha blended geometry there’s no simple way to reuse our shadow maps by interleaving the rendering of them into the lighting code. Instead we simply gather all shadow casting lights, try to prioritize them based on screen coverage, intensity, etc. and then render all shadows into two shadow maps.</p>
<p>One shadow map is dedicated to handle a single directional light which uses a cascaded shadow map approach, rendering each cascade into a region of a larger shadow map atlas. The other shadow map is an atlas for all local light sources, such as spot and point lights (interpreted as 6 spot lights).</p>
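<p>As a rough sketch of what that prioritization could look like - the types, names and scoring heuristic below are assumptions for illustration, not the actual Stingray code - each shadow casting light gets a score from its approximate screen coverage and intensity, and the highest scoring lights get shadow map atlas space first:</p>
<pre><code class="language-C++">#include <algorithm>
#include <vector>

// Minimal stand-in types - the real engine types are of course more involved.
struct Light { float distance_to_camera; float radius; float intensity; };
struct ShadowCaster { const Light *light; float score; };

// Score ~ projected size of the light's bounding sphere, weighted by intensity.
void prioritize_shadow_casters(std::vector<ShadowCaster> &casters)
{
    for (ShadowCaster &c : casters) {
        float coverage = c.light->radius / std::max(c.light->distance_to_camera, 0.001f);
        c.score = coverage * coverage * c.light->intensity;
    }
    // Highest score first; the shadow map atlas is then filled greedily until it runs out of space.
    std::sort(casters.begin(), casters.end(),
              [](const ShadowCaster &a, const ShadowCaster &b) { return a.score > b.score; });
}
</code></pre>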
<h2><a id="Clustered_shading_170"></a>Clustered shading</h2>
<pre><code>// kick resource generator for assigning light sources to clustered shading structure
{ resource_generator="clustered_shading" profiling_scope="clustered shading" }
</code></pre>
<p>We separate local light sources into two kinds: “simple” and “custom”. Simple lights are either spot lights or point lights that don’t have a custom material graph assigned. Simple light sources, which tend to be the bulk of all visible light sources in a frame, get inserted into a <a href="http://www.humus.name/Articles/PracticalClusteredShading.pdf">clustered shading acceleration structure</a>.</p>
<p>While simple lights will affect both opaque and transparent materials, custom lights will only affect opaque geometry as they run a more traditional deferred shading path. We will touch on the lighting a bit more soon.</p>
<h2><a id="Clearing__VR_mask_181"></a>Clearing & VR mask</h2>
<pre><code>// special layer, only responsible for clearing hdr0, gbuffer2 and the depth_stencil_buffer
{ render_targets=["hdr0", "gbuffer2"] depth_stencil_target="depth_stencil_buffer"
clear_flags=["SURFACE", "DEPTH", "STENCIL"] profiling_scope="clears" }
// if vr is supported kick a resource generator laying down a stencil mask to reject pixels outside of the lens shape
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
pass = [
{ resource_generator="vr_mask" profiling_scope="vr_mask" }
]
}
</code></pre>
<p>Here we use the layer system to record a bind and a clear for a few render targets into a <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-3-render.html"><code>RenderContext</code></a> generated by the <a href="http://bitsquid.blogspot.se/2017/03/stingray-renderer-walkthrough-7-data.html"><code>LayerManager</code></a>.</p>
<p>Then, depending on if the <code>vr_supported</code> render setting is true or not we kick a resource generator that marks in the stencil buffer any pixels falling outside of the lens region. This resource generator only does something if the renderer is running in stereo mode. Also note that the branch above is a <code>static_branch</code> so if <code>vr_supported</code> is set to false the execution of the <code>vr_mask</code> resource generator will get eliminated completely during boot up of the renderer.</p>
<h2><a id="Gbuffer_200"></a>G-buffer</h2>
<pre><code>// g-buffer layer, bulk of all materials renders into this
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"]
depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }
{ extension_insertion_point = "gbuffer" }
// linearize depth into a R32F surface
{ resource_generator="stabilize_and_linearize_depth" profiling_scope="linearize_depth" }
// layer for blending decals into the gbuffer0 and gbuffer1
{ name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer"
profiling_scope="decal" sort="EXPLICIT" }
{ extension_insertion_point = "decals" }
// generate and merge motion vectors for non written pixels with motion vectors in gbuffer
{ type="static_branch" platforms=["win", "xb1", "ps4", "web", "linux"]
pass = [
{ resource_generator="generate_motion_vectors" profiling_scope="motion vectors" }
]
}
</code></pre>
<p>Next we lay down the gbuffer. We are using a fairly fat “floating” gbuffer representation. By floating I mean that we interpret the gbuffer channels differently depending on material. I won’t go into details of the gbuffer layout in this post but everything builds upon a standard metallic PBR material model, same as most modern engines run today. We also stash high precision motion vectors to be able to do accurate reprojection for TAA, RGBM encoded irradiance from light maps (if present, else irradiance is looked up from an IBL probe), high precision normals, AO, etc. Things quickly add up - in the default configuration on PC we are looking at 192 bpp for the color targets (i.e. not counting depth/stencil). The gbuffer layout could use some love, I think we should be able to shrink it somewhat without losing any features.</p>
<p>We then kick a resource generator called <code>stabilize_and_linearize_depth</code>, which does two things:</p>
<ol>
<li>It linearizes the depth buffer and stores the result in an R32F target using a <code>fullscreen_pass</code>.</li>
<li>It does a hacky TAA resolve pass for depth in an attempt to remove some intersection flickering for materials rendered after TAA resolve. We call the output of this pass <code>stable_depth</code> and use it when rendering editor selections, gizmos, debug lines, etc. We also use this buffer during post processing for any effects that depend on depth (e.g. depth of field) as those run after AA resolve.</li>
</ol>
<p>After that we have another more minimalistic gbuffer layer for splatting deferred decals.</p>
<p>Last but not least we kick another resource generator that calculates per pixel velocity for any pixels that haven’t been rendered to during the gbuffer pass (i.e. the skydome).</p>
<h2><a id="Reflections__Lighting_236"></a>Reflections & Lighting</h2>
<pre><code>// render localized reflection probes into hdr1
{ name="reflections" render_targets=["hdr1"] depth_stencil_target="depth_stencil_buffer"
sort="FRONT_BACK" profiling_scope="reflections probes" }
{ extension_insertion_point = "reflections" }
// kick resource generator for screen space reflections
{ type="static_branch" platforms=["win", "xb1", "ps4"]
pass = [
{ resource_generator="ssr_reflections" profiling_scope="ssr" }
]
}
// kick resource generator for main scene lighting
{ resource_generator="lighting" profiling_scope="lighting" }
{ extension_insertion_point = "lighting" }
</code></pre>
<p>At this point we are fully done with the gbuffer population and are ready to do some lighting. We start by laying down the indirect specular / reflections into a separate buffer. We use a rather standard three-step fallback scheme for our reflections: screen-space reflections, falling back to localized parallax corrected pre-convoluted radiance cubemaps, falling back to a global pre-convoluted radiance cubemap.</p>
<p>The <code>reflections</code> layer is the target layer for all cubemap based reflections. We are naively rendering the cubemap reflections by treating each reflection probe as a light source with a custom material. These lights get picked up by a resource generator performing traditional deferred shading - i.e. it renders proxy volumes for each light. One thing that some people struggle to wrap their heads around is that the resource generator responsible for running the deferred shading modifier isn’t kicked until a few lines down (in the <code>lighting</code> resource generator). If you’ve paid attention in my previous posts this shouldn’t come as a surprise to you, as what we describe here is the <em>GPU</em> scheduling of a frame, nothing else.</p>
<p>When the reflection probes are laid down we move on and run a resource generator for doing Screen-Space Reflections. As SSR typically runs in half-res we store the result in a separate render target.</p>
<p>We then finally kick the <code>lighting</code> resource generator, which is responsible for the following:</p>
<ol>
<li>Build a screen space mask for sun shadows, this is done by running multiple <code>fullscreen_passes</code>. The <code>fullscreen_passes</code> transform the pixels into cascaded shadow map space and perform PCF. Stencil culling makes sure the shader only runs for pixels within a certain cascade.</li>
<li>SSAO with a bunch of different quality settings.</li>
<li>A fullscreen pass we refer to as the “global lighting” pass. This is the pass that does most of the heavy lifting when it comes to the lighting. It handles mixing SSR with probe reflections, mixing of SSAO with material AO, lighting from all simple lights looked up from the clustered shading structure as well as calculates sun lighting masked with the result from sun shadow mask (step 1).</li>
<li>Run a traditional deferred shading modifier for all light sources that have a material graph assigned. If the shader doesn’t target a specific layer the light’s proxy volume will be rendered at this point, otherwise it will be scheduled to render into whatever layer the shader has specified.</li>
</ol>
<p>At this point we have a fully lit HDR output for all of our opaque materials.</p>
<h2><a id="Various_stuff_273"></a>Various stuff</h2>
<pre><code>// layer for emissive materials
{ name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="FRONT_BACK" profiling_scope="emissive" }
// kick debug visualization
{ type="static_branch" render_caps={ development=true }
pass=[
{ resource_generator="debug_visualization" profiling_scope="debug_visualization" }
]
}
// kick resource generator for laying down fog
{ resource_generator="fog" profiling_scope="fog" }
// layer for skydome rendering
{ name="skydome" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="BACK_FRONT" profiling_scope="skydome" }
{ extension_insertion_point = "skydome" }
// layer for transparent materials
{ name="hdr_transparent" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="BACK_FRONT" profiling_scope="hdr_transparent" }
{ extension_insertion_point = "hdr_transparent" }
// kick resource generator for reading back any requested render targets / buffers to the CPU
{ resource_generator="stream_capture_buffers" profiling_scope="stream_capture" }
// kick resource generator for capturing reflection probes
{ type="static_branch" platform=["win"] render_caps={ development=true }
pass = [
{ resource_generator="cubemap_capture" }
]
}
// layer for rendering object selections from the editor
{ type="static_branch" platforms=["win", "ps4", "xb1"]
pass = [
{ type = "static_branch" render_settings={ selection_enabled=true }
pass = [
{ name="selection" render_targets=["gbuffer0" "ldr1_dev_r"]
depth_stencil_target="depth_stencil_buffer_selection" sort="BACK_FRONT"
clear_flags=["SURFACE" "DEPTH"] profiling_scope="selection"}
]
}
]
}
</code></pre>
<p>Next follows a bunch of layers for doing various stuff, most of this is straightforward:</p>
<ul>
<li><code>emissive</code> - Layer for adding any emissive material influences to the light accumulation target (<code>hdr0</code>)</li>
<li><code>debug_visualization</code> - Kick off a resource generator for doing debug rendering. When debug rendering is enabled, the post processing pipe is disabled so we can render straight to the output target / back buffer here. Note: This doesn’t need to be scheduled exactly here, it could be moved later down the pipe.</li>
<li><code>fog</code> - Kick off a resource generator for blending fog into the accumulation target.</li>
<li><code>skydome</code> - Layer for rendering anything skydome related.</li>
<li><code>hdr_transparent</code> - Layer for rendering transparent materials, traditional forward shading using the clustered shading acceleration structure for lighting. VFX with blending usually also goes into this layer.</li>
<li><code>stream_capture_buffers</code> - Arbitrary location for capturing various render targets and dumping them into system memory.</li>
<li><code>cubemap_capture</code> - Capturing point for reflection cubemap probes.</li>
<li><code>selection</code> - Layer for rendering selection outlines.</li>
</ul>
<p>So basically a bunch of miscellaneous stuff that needs to happen before we enter post processing…</p>
<h2><a id="Post_Processing_337"></a>Post Processing</h2>
<pre><code>// kick resource generators for AA resolve and post processing
{ resource_generator="post_processing" profiling_scope="post_processing" }
{ extension_insertion_point = "post_processing" }
</code></pre>
<p>Up until this point we’ve been in linear color space accumulating lighting into a 4xf16 render target (<code>hdr0</code>). Now it’s time to take that buffer and push it through the post processing resource generator.</p>
<p>The post processing pipe in the Stingray Renderer does:</p>
<ol>
<li>Temporal AA resolve</li>
<li>Depth of Field</li>
<li>Motion Blur</li>
<li>Lens Effects (chromatic aberration, distortion)</li>
<li>Bloom</li>
<li>Auto exposure</li>
<li>Scene Combine (exposure, tone map, sRGB, LUT color grading)</li>
<li>Debug rendering</li>
</ol>
<p>All steps of the post processing pipe can dynamically be enabled/disabled (not entirely true, we will always have to run some variation of step 7 as we need to output our result to the back buffer).</p>
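<p>To make step 7 a bit more concrete, here is a minimal sketch of the kind of math the scene combine step performs. This is CPU-side C++ purely for illustration - the actual pass is a fullscreen shader, and the exact exposure and tonemapping operators are not spelled out here, so a simple Reinhard curve stands in for the tonemapper:</p>
<pre><code class="language-C++">#include <cmath>

struct float3 { float x, y, z; };

// Linear -> sRGB transfer function for a single channel.
static float linear_to_srgb(float c)
{
    return c <= 0.0031308f ? 12.92f * c : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
}

// Exposure -> tonemap -> sRGB. LUT color grading would be applied around this step,
// depending on in which space the LUT was authored.
static float3 scene_combine(float3 hdr, float exposure)
{
    float3 e = { hdr.x * exposure, hdr.y * exposure, hdr.z * exposure };
    float3 t = { e.x / (1.0f + e.x), e.y / (1.0f + e.y), e.z / (1.0f + e.z) };    // Reinhard stand-in
    return { linear_to_srgb(t.x), linear_to_srgb(t.y), linear_to_srgb(t.z) };
}
</code></pre>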
<h2><a id="Final_touches_361"></a>Final touches</h2>
<pre><code>// layer for rendering LDR materials, primarily used for rendering HUD and debug rendering
{ name="transparent" render_targets=["output_target"] depth_stencil_target="stable_depth_stencil_buffer_alias"
sort="BACK_FRONT" profiling_scope="transparent" }
// kick resource generator for rendering shadow map debug overlay
{ type="static_branch" render_caps={ development=true }
pass = [
{ resource_generator="debug_shadows" profiling_scope="debug_shadows" }
]
}
// kick resource generator for compositing left/right eye
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
pass = [
{ resource_generator="vr_present" profiling_scope="present" }
]
}
</code></pre>
<p>Before we present we allow rendering of unlit geometry in LDR (mainly used for HUDs and debug rendering), potentially do some more debug rendering and if we’re in VR mode we kick a resource generator that handles left/right eye combining (if needed).</p>
<p>That’s it - a very high-level breakdown of a rendered frame when running Stingray with the default “Stingray Renderer” <code>render_config</code> file.</p>
<h1><a id="Mini_Renderer_387"></a>Mini Renderer</h1>
<p>We also have a second rendering pipe that we ship with Stingray called the “Mini Renderer” - <em>mini</em> as in <em>minimalistic</em>. It is not as broadly used as the Stingray Renderer so I won’t walk you through it, just wanted to mention it’s there and say a few words about it.</p>
<p>The main design goal behind the mini renderer was to build a rendering pipe with as little overhead from advanced lighting effects and post processing as possible. It’s primarily used for doing mobile VR rendering. High-resolution, high-performance rendering on mobile devices is hard! You pretty much need to avoid all kinds of fullscreen effects to hit target frame rate. Therefore the mini renderer has a very limited feature set:</p>
<ul>
<li>It’s a forward renderer. While it’s capable of doing per pixel lighting through clustered shading it rarely gets used, instead most applications tend to bake their lighting completely or run with only a single directional light source.</li>
<li>No post processing.</li>
<li>While all lighting is done in linear color space we don’t store anything in HDR, instead we expose, tonemap and output sRGB directly into an LDR target (usually directly to the back buffer).</li>
</ul>
<p>The <code>mini_renderer.render_config</code> file is ~400 lines, i.e. less than 1/3 of the stingray renderer. It is still in a somewhat experimental state but is the fastest way to get up and running doing mobile VR. I also feel that it makes sense for us to ship an example of a more lightweight rendering pipe; it is simpler to follow than the <code>render_config</code> for the full stingray renderer, and it makes it easy to grasp the benefits of data-driven rendering compared to a more static hard-coded rendering pipe (especially if you don’t have source access to the full engine as then the hard-coded rendering pipe would likely be a complete black box for the user).</p>
<h1><a id="Wrap_up_399"></a>Wrap up</h1>
<p>I realize that some of you might have hoped for a more complete walkthrough of the various lighting and post processing techniques we use in the Stingray renderer. Unfortunately that would have become a very long post and also it feels a bit out of context as my goal with this blog series has been to focus on the architecture of the stingray rendering pipe rather than specific rendering techniques. Most of the techniques we use can probably be considered “industry standard” within real-time rendering nowadays. If you are interested in learning more there are lots of excellent information available, to name a few:</p>
<ul>
<li>Sébastien Lagarde & Charles de Rousiers amazing course notes from their Siggraph 2014 presentation: “Moving Frostbite to PBR”: <a href="http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/">http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/</a></li>
<li>Morgan McGuire’s excellent Siggraph 2016 presentation: “Peering Through a Glass, Darkly<br>
at the Future of Real-Time Transparency”: <a href="http://graphics.cs.williams.edu/papers/TransparencySIGGRAPH16/">http://graphics.cs.williams.edu/papers/TransparencySIGGRAPH16/</a></li>
<li>Everything from Natalya Tatarchuk’s Siggraph courses: “Advances in Real-Time Rendering in 3D Graphics and Games”: <a href="http://advances.realtimerendering.com/">http://advances.realtimerendering.com/</a></li>
<li>Everything from Stephen Hill’s and Stephen McAuley’s Siggraph courses: “Physically Based Shading in Theory and Practice”: <a href="http://blog.selfshadow.com/publications/s2016-shading-course/">http://blog.selfshadow.com/publications/s2016-shading-course/</a></li>
</ul>
<p>In the next and final post of this series we will take a look at the shader and material system we have in Stingray.</p>
Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com102tag:blogger.com,1999:blog-1994130783874175266.post-72435316126029533602017-03-09T16:21:00.001+01:002017-03-09T16:21:47.264+01:00Stingray Renderer Walkthrough #7: Data-driven rendering
<h1><a id="Introduction_0"></a>Introduction</h1>
<p>With all the low-level stuff in place it’s time to take a look at how we drive rendering in Stingray, i.e. how a final frame comes together. I’ve covered this in various presentations over the years but will try to go through everything again to give a more complete picture of how things fit together.</p>
<p>Stingray features what we call a data-driven rendering pipe, basically what we mean by that is that all shaders, GPU resource creation and manipulation, as well as the entire flow of a rendered frame is defined in data. In our case the data is a set of different <em>json</em> files.</p>
<p>These <em>json</em>-files are hot-reloadable on all platforms, providing a nice workflow with fast iteration times when experimenting with various rendering techniques. It also makes it easy for a project to optimize the renderer for its specific needs (in terms of platforms, features, etc.) and/or to push it in other directions to better suit the art direction of the project.</p>
<p>There are four different types of <em>json</em>-files driving the Stingray renderer:</p>
<ul>
<li><code>.render_config</code> - the heart of a rendering pipe.</li>
<li><code>.render_config_extension</code> - extensions to an existing <code>.render_config</code> file.</li>
<li><code>.shader_source</code> - shader source and meta data for compiling statically declared shaders.</li>
<li><code>.shader_node</code> - shader source and meta data used by the graph based shader system.</li>
</ul>
<p>Today we will be looking at the <code>render_config</code>, both from a user’s perspective as well as how it works on the engine side.</p>
<h1><a id="Meet_the_render_config_17"></a>Meet the <em><code>render_config</code></em></h1>
<p>The <code>render_config</code> is a <a href="http://bitsquid.blogspot.se/2009/10/simplified-json-notation.html"><em>sjson</em></a> file describing everything from which render settings to expose to the user to the flow of an entire rendered frame. It can be broken down into four parts: <em>render settings</em>, <em>resource sets</em>, <em>layer configurations</em> and <em>resource generators</em>. All of which are fairly simple and minimalistic systems on the engine side.</p>
<h1><a id="Render_Settings__Misc_21"></a>Render Settings & Misc</h1>
<p>Render settings is a simple key:value map exposed globally to the entire rendering pipe as well as an interface for the end user to peek and poke at. Here’s an example of how it might look in the <code>render_config</code> file:</p>
<pre><code>render_settings = {
sun_shadows = true
sun_shadow_map_size = [ 2048, 2048 ]
sun_shadow_map_filter_quality = "high"
local_lights_shadow_atlas_size = [ 2048, 2048 ]
local_lights_shadow_map_filter_quality = "high"
particles_local_lighting = true
particles_receive_shadows = true
debug_rendering = false
gbuffer_albedo_visualization = false
gbuffer_normal_visualization = false
gbuffer_roughness_visualization = false
gbuffer_specular_visualization = false
gbuffer_metallic_visualization = false
bloom_visualization = false
ssr_visualization = false
}
</code></pre>
<p>As you will see we have branching logics for most systems in the <code>render_config</code> which allows the renderer to take different paths depending on the state of properties in the <code>render_settings</code>. There is also a block called <code>render_caps</code> which is very similar to the <code>render_settings</code> block except that it is read only and contains knowledge of the capabilities of the hardware (GPU) running the engine.</p>
<p>On the engine side there’s not that much to cover about the <code>render_settings</code> and <code>render_caps</code>, keys are always strings getting murmur hashed to 32 bits and the value can be a <code>bool</code>, <code>float</code>, array of <code>floats</code> or another hashed <code>string</code>.</p>
<p>When booting the renderer we populate the <code>render_settings</code> by first reading them from the <code>render_config</code> file, then looking in the project specific <code>settings.ini</code> file for potential overrides or additions, and lastly allowing certain properties to be overridden again from the user’s configuration file (if loaded).</p>
<p>The <code>render_caps</code> block usually gets populated when the <code>RenderDevice</code> is booted and we’re in a state where we can enumerate all device capabilities. This makes the keys and values of the <code>render_caps</code> block somewhat of a black box with different contents depending on platform; typically there aren’t that many of them though.</p>
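<p>A minimal sketch of what such a key:value store could look like on the engine side - the type and function names below are assumptions for illustration, only the hashed keys, the value variants and the override order are taken from the description above:</p>
<pre><code class="language-C++">#include <cstdint>
#include <unordered_map>
#include <vector>

// A value is a bool, a float, an array of floats or another 32-bit hashed string.
struct RenderSettingValue {
    enum Type { BOOL, FLOAT, FLOAT_ARRAY, HASH } type = BOOL;
    bool b = false;
    float f = 0.f;
    std::vector<float> floats;
    uint32_t hash = 0;
};

// Keys are murmur-hashed strings, represented here as plain uint32_t.
using RenderSettings = std::unordered_map<uint32_t, RenderSettingValue>;

// Later sources simply overwrite earlier ones, giving the override order:
// render_config -> project settings.ini -> user configuration file.
void apply_overrides(RenderSettings &settings, const RenderSettings &overrides)
{
    for (const auto &kv : overrides)
        settings[kv.first] = kv.second;
}
</code></pre>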
<p>So that covers the <code>render_settings</code> and <code>render_caps</code> blocks, we will look at how they are actually used for branching in later sections of this post.</p>
<p>There are also a few other miscellaneous blocks in the <code>render_config</code>, most important being:</p>
<ul>
<li><code>shader_pass_flags</code> - Array of strings building up a bit flag that can be used to dynamically turn on/off various shader passes.</li>
<li><code>shader_libraries</code> - Array of which <code>shader_source</code> files to load when booting the renderer. The <code>shader_source</code> files are libraries of pre-compiled shaders mainly used by the resource generators.</li>
</ul>
<h1><a id="Resource_Sets_63"></a>Resource Sets</h1>
<p>We have the concept of a <code>RenderResourceSet</code> on the engine side, it simply maps a hashed string to a GPU resource. <code>RenderResourceSets</code> can be locally allocated during rendering, creating a form of scoping mechanism. The resources are either allocated by the engine and inserted into a <code>RenderResourceSet</code> or allocated through the <code>global_resources</code> block in a <code>render_config</code> file.</p>
<p>The <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-6.html"><code>RenderInterface</code></a> owns a global <code>RenderResourceSet</code> populated by the <code>global_resources</code> array from the <code>render_config</code> used to boot the renderer.</p>
<p>Here’s an example of a <code>global_resources</code> array:</p>
<pre><code>global_resources = [
{ type="static_branch" platforms=["ios", "android", "web", "linux"]
pass = [
{ name="output_target" type="render_target" depends_on="back_buffer"
format="R8G8B8A8" }
]
fail = [
{ name="output_target" type="alias" aliased_resource="back_buffer" }
]
}
{ name="depth_stencil_buffer" type="render_target" depends_on="output_target"
w_scale=1 h_scale=1 format="DEPTH_STENCIL" }
{ name="gbuffer0" type="render_target" depends_on="output_target"
w_scale=1 h_scale=1 format="R8G8B8A8" }
{ name="gbuffer1" type="render_target" depends_on="output_target"
w_scale=1 h_scale=1 format="R8G8B8A8" }
{ name="gbuffer2" type="render_target" depends_on="output_target"
w_scale=1 h_scale=1 format="R16G16B16A16F" }
{ type="static_branch" render_settings={ sun_shadows = true }
pass = [
{ name="sun_shadow_map" type="render_target" size_from_render_setting="sun_shadow_map_size"
format="DEPTH_STENCIL" }
]
}
{ name="hdr0" type="render_target" depends_on="output_target" w_scale=1 h_scale=1
format="R16G16B16A16F" }
]
</code></pre>
<p>So while the above example mainly shows how to create what we call <code>DependentRenderTargets</code> (i.e. render targets that inherit their properties from another render target and then allow overriding properties locally), it can also create other buffers of various kinds.</p>
<p>We’ve also introduced the concept of a <code>static_branch</code>, there are two types of branching in the <code>render_config</code> file: <code>static_branch</code> and <code>dynamic_branch</code>. In the <code>global_resource</code> block only static branching is allowed as it only runs once, during set up of the renderer. (<em>Note:</em> The branch syntax is far from nice and we nowadays have come up with a much cleaner syntax that we use in the shader system, unfortunately it hasn’t made its way back to the <code>render_config</code> yet.)</p>
<p>So basically what this example boils down to is the creation of a set of render targets. The <code>output_target</code> is a bit special though: on PC and consoles we simply set up an alias for an already created render target - the back buffer - while on gl based platforms we create a new separate render target. (This is because we render the scene up-side-down on gl-platforms to get consistent UV coordinate systems between all platforms.)</p>
<p>The other special case from the example above is the <code>sun_shadow_map</code> which grabs the resolution from a <code>render_setting</code> called <code>sun_shadow_map_size</code>. This is done because we want to expose the ability to tweak the shadow map resolution to the user.</p>
<p>When rendering a frame we typically pipe the global <code>RenderResourceSet</code> owned by the <code>RenderInterface</code> down to the various rendering systems. Any resource declared in the <code>RenderResourceSet</code> is accessible from the shader system by name. Each rendering system can at any point decide to create its own local version of a <code>RenderResourceSet</code> making it possible to scope shader resource access.</p>
<p>Worth pointing out is that the resources declared in the <code>global_resource</code> block of the <code>render_config</code> used when booting the engine are all allocated in the set up phase of the renderer and not released until the renderer is closed.</p>
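<p>To make the scoping idea a bit more concrete, here is a minimal sketch of what a scoped resource set lookup could look like - the class layout and method names are assumptions for illustration, not the actual engine interface. A local set resolves hashed resource names itself and falls back to its parent (ultimately the global set) when it has no local entry:</p>
<pre><code class="language-C++">#include <cstdint>
#include <unordered_map>

struct RenderResource;    // opaque handle to a GPU resource (render target, buffer, ...)

class RenderResourceSet {
public:
    explicit RenderResourceSet(const RenderResourceSet *parent = nullptr) : _parent(parent) {}

    // Register (or shadow) a resource under a hashed name in this scope.
    void set(uint32_t name_hash, RenderResource *resource) { _resources[name_hash] = resource; }

    // Resolve a hashed name, falling back to the parent scope when not found locally.
    RenderResource *lookup(uint32_t name_hash) const {
        auto it = _resources.find(name_hash);
        if (it != _resources.end())
            return it->second;
        return _parent ? _parent->lookup(name_hash) : nullptr;
    }

private:
    const RenderResourceSet *_parent;
    std::unordered_map<uint32_t, RenderResource *> _resources;
};
</code></pre>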
<h1><a id="Layer_Configurations_116"></a>Layer Configurations</h1>
<p>A <code>render_config</code> can have multiple <code>layer_configurations</code>. A Layer Configuration is essentially a description of the flow of a rendered frame; it is responsible for triggering rendering sub-systems and scheduling the GPU work for a frame. Here’s a simple example of a deferred rendering pipe:</p>
<pre><code>
layer_configs = {
simple_deferred = [
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2"]
depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }
{ resource_generator="lighting" profiling_scope="lighting" }
{ name="emissive" render_targets=["hdr0"]
depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="emissive" }
{ name="skydome" render_targets=["hdr0"]
depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="skydome" }
{ name="hdr_transparent" render_targets=["hdr0"]
depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="hdr_transparent" }
{ resource_generator="post_processing" profiling_scope="post_processing" }
{ name="ldr_transparent" render_targets=["output_target"]
depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="transparent" }
]
}
</code></pre>
<p>Each line in the <code>simple_deferred</code> array either specifies a named <em>layer</em> that the shader system can reference to direct rendering into (i.e. a renderable object, like a mesh, has shaders assigned and the shaders know into which <em>layer</em> they want to render - e.g. <code>gbuffer</code>), or it triggers a <code>resource_generator</code>.</p>
<p>The order of execution is top->down and the way the GPU scheduling works is that each line increments a bit in the “Layer System” bit range covered in the post about <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-4-sorting.html">sorting</a>.</p>
<p>On the engine side the layer configurations are managed by a system called the <code>LayerManager</code>, owned by the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-6.html"><code>RenderInterface</code></a>. It is a tiny system that basically just maps the named <code>layer_config</code> to an array of “Layers”:</p>
<pre><code>struct Layer {
uint64_t sort_key;
IdString32 name;
render_sorting::DepthSort depth_sort;
IdString32 render_targets[MAX_RENDER_TARGETS];
IdString32 depth_stencil_target;
IdString32 resource_generator;
uint32_t clear_flags;
#if defined(DEVELOPMENT)
const char *profiling_scope;
#endif
};
</code></pre>
<ul>
<li><code>sort_key</code> - As mentioned above and in the post about how we do <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-4-sorting.html">sorting</a>, each layer gets a <code>sort_key</code> assigned from the “Layer System” bit range. By looking up the layer’s <code>sort_key</code> and using that when recording <code>Commands</code> to <code>RenderContexts</code> we get a simple way to reason about overall ordering of a rendered frame.</li>
<li><code>name</code> - the shader system can use this name to look up the layer’s <code>sort_key</code> to group draw calls into layers.</li>
<li><code>depth_sort</code> - describes how to encode the depth range bits of the sort key when recording a <code>RenderJobPackage</code> to a <code>RenderContext</code>. <code>depth_sort</code> is an enum that indicates if sorting should be done front-to-back or back-to-front.</li>
<li><code>render_targets</code> - array of named render target resources to bind for this layer</li>
<li><code>depth_stencil_target</code> - named render target resource to bind for this layer</li>
<li><code>resource_generator</code> - name of a resource generator to trigger at this point in the frame, if any (see the Resource Generators section below)</li>
<li><code>clear_flags</code> - bit flag hinting if color, depth or stencil should be cleared for this layer</li>
<li><code>profiling_scope</code> - used to record markers on the <code>RenderContext</code> that later can be queried for GPU timings and statistics.</li>
</ul>
<p>When rendering a <code>World</code> (see: <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-6.html">RenderInterface</a>) the user passes a viewport to the <code>render_world</code> function, and the viewport knows which <code>layer_config</code> to use. We look up the array of <code>Layers</code> from the <code>LayerManager</code> and record a <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-3-render.html"><code>RenderContext</code></a> with state commands for binding and clearing render targets, using the <code>sort_keys</code> from the <code>Layer</code>. We do this dynamically each time the user calls <code>render_world</code>, but in theory we could cache the <code>RenderContext</code> between <code>render_world</code> calls.</p>
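<p>As a rough illustration of that flow, here is a simplified sketch of recording the per-layer state commands, using the <code>Layer</code> struct above. The <code>RenderContext</code> functions are hypothetical simplifications (in reality the named render targets are first resolved through the <code>RenderResourceSet</code>), but the important part is that each state command is tagged with the layer’s <code>sort_key</code> so it is scheduled at the layer’s position in the frame:</p>
<pre><code>// Sketch: record bind/clear state commands for each layer using the layer's
// sort_key. RenderContext::set_render_targets() / clear() are assumed helper
// functions for the purpose of this example.
void record_layer_state(RenderContext &rc, const Layer *layers, uint32_t n_layers)
{
    for (uint32_t i = 0; i != n_layers; ++i) {
        const Layer &layer = layers[i];

        // Tagged with the layer's sort_key, these commands end up at the
        // layer's slot in the sorted command stream.
        rc.set_render_targets(layer.sort_key, layer.render_targets,
            layer.depth_stencil_target);

        if (layer.clear_flags)
            rc.clear(layer.sort_key, layer.clear_flags);
    }
}
</code></pre>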
<p>The name <code>Layer</code> is a bit misleading, as a layer can also be responsible for making sure that a <code>ResourceGenerator</code> runs. In practice a <code>Layer</code> is either a target for the shader system to render into or the execution point of a <code>ResourceGenerator</code>; it can in theory be both, but we never use it that way.</p>
<h1><a id="Resource_Generators_185"></a>Resource Generators</h1>
<p>The Resource Generator system is a minimalistic framework for manipulating GPU resources and triggering various rendering sub-systems. Similar to a layer configuration, a resource generator is described as an array of “modifiers”. Modifiers get executed in the order they are declared. Here’s an example:</p>
<pre><code>auto_exposure = {
modifiers = [
{ type="dynamic_branch" render_settings={ auto_exposure_enabled=true } profiling_scope="auto_exposure"
pass = [
{ type="fullscreen_pass" shader="quantize_luma" inputs=["hdr0"]
outputs=["quantized_luma"] profiling_scope="quantize_luma" }
{ type="compute_kernel" shader="compute_histogram" thread_count=[40 1 1] inputs=["quantized_luma"]
uavs=["histogram"] profiling_scope="compute_histogram" }
{ type="compute_kernel" shader="adapt_exposure" thread_count=[1 1 1] inputs=["quantized_luma"]
uavs=["current_exposure" "current_exposure_pos" "target_exposure_pos"] profiling_scope="adapt_exposure" }
]
}
]
}
</code></pre>
<p>The first modifier in the above example is a <code>dynamic_branch</code>. In contrast to a <code>static_branch</code>, which gets evaluated during loading of the <code>render_config</code>, a <code>dynamic_branch</code> is evaluated each time the resource generator runs, making it possible to take different paths through the rendering pipeline based on settings and other game context that might change over time. Dynamic branching is also supported in the <code>layer_config</code> block.</p>
<p>If the branch is taken (i.e if <code>auto_exposure_enabled</code> is true) the modifiers in the <code>pass</code> array will run.</p>
<p>The first modifier in the <code>pass</code> array is of the type <code>fullscreen_pass</code> and is by far the most commonly used modifier type. It simply renders a single triangle covering the entire viewport using the named <code>shader</code>. Any resource listed in the <code>inputs</code> array is exposed to the shader, and any resources listed in the <code>outputs</code> array are bound as render targets.</p>
<p>The second and third modifiers are of the type <code>compute_kernel</code> and dispatch a compute shader. The <code>inputs</code> array works the same way as for the <code>fullscreen_pass</code>, and <code>uavs</code> lists resources to bind as UAVs.</p>
<p>This is obviously a very basic example, but the idea is the same for more complex resource generators. By chaining a bunch of modifiers together you can create interesting rendering effects entirely in data.</p>
<p>Stingray ships with a toolbox of various modifiers, and the user can also extend it with their own modifiers if needed. Here’s a list of some of the other modifiers we ship with:</p>
<ul>
<li><code>cascaded_shadow_mapping</code> - Renders a cascaded shadow map from a directional light.</li>
<li><code>atlased_shadow_mapping</code> - Renders a shadow map atlas from a set of spot and omni lights.</li>
<li><code>generate_mips</code> - Renders a mip chain for a resource by interleaving a resource generator that samples from sub-resource <em>n-1</em> while rendering into sub-resource <em>n</em>.</li>
<li><code>clustered_shading</code> - Assign a set of light sources to a clustered shading structure (on CPU at the moment).</li>
<li><code>deferred_shading</code> - Renders proxy volumes for a set of light sources with specified shaders (i.e. traditional deferred shading).</li>
<li><code>stream_capture</code> - Reads back the specified resource to CPU (usually multi-buffered to avoid stalls).</li>
<li><code>fence</code> - Synchronization of graphics and compute queues.</li>
<li><code>copy_resource</code> - Copies a resource from one GPU to another.</li>
</ul>
<p>In Stingray we encourage building all lighting and post processing using resource generators. So far this has proved very successful for us as it gives great per-project flexibility. To make sharing of various rendering effects easier we also have a system called <a href="http://bitsquid.blogspot.se/2016/08/render-config-extensions.html"><code>render_config_extension</code></a> that we rolled out last year, which is essentially a plugin system for the <code>render_config</code> files.</p>
<p>I won’t go into much detail about how the resource generator system works on the engine side; it’s fairly simple though. There’s a <code>ResourceGeneratorManager</code> that knows about all the generators, and each time the user calls <code>render_world</code> we ask the manager to execute all generators referenced in the <code>layer_config</code>, using the layer’s sort key. We don’t restrain modifiers in any way; they can be implemented to do whatever and have full access to the engine. E.g. they are free to create their own <code>ResourceContexts</code>, spawn worker threads, etc. When the modifiers for all generators are done executing we are handed all the <code>RenderContexts</code> they’ve created and can dispatch them together with the contexts from the regular scene rendering. To get the scheduling between modifiers in a resource generator correct we use the 32-bit “user defined” range in the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-4-sorting.html">sort key</a>.</p>
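<p>As a very rough, single-threaded sketch with a hypothetical <code>Modifier</code> interface, executing one resource generator could look something like this, with each modifier getting its own slot in the “user defined” range of the sort key so modifiers run in declaration order relative to the layer’s position in the frame (the shift amount below is an assumption for illustration):</p>
<pre><code>#include <cstdint>
#include <vector>

class RenderContext; // engine type described in the Render Contexts post

// Hypothetical modifier interface; real modifiers have full access to the
// engine and can spawn worker threads, create their own contexts, etc.
struct Modifier {
    virtual ~Modifier() {}
    virtual void execute(RenderContext &rc, uint64_t sort_key) = 0;
};

struct ResourceGeneratorSketch {
    std::vector<Modifier *> modifiers; // executed in declaration order
};

void execute_generator(ResourceGeneratorSketch &generator, RenderContext &rc,
    uint64_t layer_sort_key)
{
    for (size_t i = 0; i != generator.modifiers.size(); ++i) {
        // Encode the modifier's declaration index into the "user defined"
        // range of the sort key, on top of the layer's own bits.
        uint64_t sort_key = layer_sort_key | ((uint64_t)(i + 1) << 20);
        generator.modifiers[i]->execute(rc, sort_key);
    }
}
</code></pre>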
<h1><a id="Future_improvements_234"></a>Future improvements</h1>
<p>Before we wrap up I’d like to cover some ideas for future improvements.</p>
<p>The Stingray engine has had a data-driven renderer from day one, so the system has been around for quite some time by now. And while the <code>render_config</code> has served us well so far, there are a few things we’ve discovered that could use some attention moving forward.</p>
<h2><a id="Scalability_240"></a>Scalability</h2>
<p>The complexity of the default rendering pipe continues to increase as the demand for new rendering features targeting different industries (games, design visualization, film, etc.) increases. While our data-driven approach addresses the feature-set scalability needs decently well, there is also an increasing demand for feature parity across lots of different hardware. This tends to result in lots of branching in the <code>render_config</code>, making it a bit hard to follow.</p>
<p>In addition to that, we are also starting to see the need to manage multiple paths through the rendering pipe on the same platform; this is especially true when dealing with stereo rendering. On PC we currently have 5 different paths through the default rendering pipe:</p>
<ul>
<li>Mono - Traditional mono rendering.</li>
<li>Stereo - Old school stereo rendering, one <code>render_world</code> call per eye. Almost identical to the mono path, but there is still some stereo-specific work for assembling the final image that needs to happen.</li>
<li>Instanced Stereo - Using “hardware instancing” to do stereo propagation to the left/right eye. Single scene traversal pass, culling using an uber-frustum. A bunch of shader patch-up work and some branching in the <code>render_config</code>.</li>
<li>Nvidia Single Pass Stereo (SPS) - Somewhat similar to instanced stereo but using Nvidia-specific hardware for doing multicasting to the left/right eye.</li>
<li>Nvidia VRSLI - DX11 path for rendering left/right eye on separate GPUs.</li>
</ul>
<p>We expect the number of paths through the rendering pipe to continue to increase also for mono rendering; we’ve already seen that when experimenting with explicit multi-GPU setups under DX12. Things quickly become hairy when you aren’t running on a known platform. Also, depending on hardware it’s likely that you want to schedule the rendered frame differently - i.e. it’s not as simple as saying: here are our 4 different paths and we select one based on whether the user has 1-4 GPUs in their system, as that breaks down as soon as the GPUs in the system aren’t identical.</p>
<p>In the future I think we might want to move to an even higher level of abstraction of the rendering pipe that makes it easier to reason about different paths through it. Something that decouples the strict flow through the rendering pipe and instead only reasons about various “jobs” that need to be executed by the GPUs and what their dependencies are. The engine could then dynamically re-schedule the frame load depending on hardware automatically… at least in theory. In practice I think it’s more likely that we would end up with a few different “frame scheduling configurations” and then select one of them based on benchmarking / hardware setup.</p>
<h2><a id="Memory_256"></a>Memory</h2>
<p>As mentioned earlier, our system for dealing with GPU resources is very static: resources declared in the <code>global_resource</code> set are allocated as the renderer boots up and are not released until the renderer is closed. On last-gen consoles we had support for aliasing the memory of resources of different types, but we removed that when deprecating those platforms. With the rise of DX12/Vulkan and the move to 4K rendering this static resource system is in need of an overhaul. While we can (and do) try to recycle temporary render targets and buffers throughout a frame, it is easy to break some code path without noticing.</p>
<p>We’ve been toying with similar ideas to the “Transient Resource System” described in Yuriy O’Donnell’s excellent GDC2017 presentation: <a href="http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/">FrameGraph: Extensible Rendering Architecture in Frostbite</a> but have so far not got around to test it out in practice.</p>
<h2><a id="DX12_improvements_262"></a>DX12 improvements</h2>
<p>Today our system implicitly deals with binding of input resources to shader stages. We expose pretty much everything to the shader system by name and if a shader stage binds a resource for reading we don’t know about it until we create the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-3-render.html"><code>RenderJobPackage</code></a>. This puts us in a somewhat bad situation when it comes to dealing with resource transitions as we end up having to do some rather complicated tracking to inject resource barriers at the right places during the <em>dispatch</em> stage of the <code>RenderContexts</code> (See: <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-5.html"><code>RenderDevice</code></a>).</p>
<p>We could instead enforce declaration of all writable GPU resources when they get bound as input to a layer or resource generator. As we already have explicit knowledge of when a GPU resource gets written to by a layer or resource generator, adding the explicit knowledge of when we read from one would complete the circle and we would have all the needed information to setup barriers without complicated tracking.</p>
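<p>As a small thought experiment, with explicit read/write declarations the tracking could collapse into a simple state comparison per declared resource, along these lines (the states and types below are made up for illustration):</p>
<pre><code>#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: every layer / resource generator declares which
// writable resources it reads and writes. Barriers are then derived by
// comparing the wanted state against the last known state of each resource.
enum ResourceState { STATE_WRITE, STATE_READ };

struct Barrier { uint32_t resource; ResourceState before, after; };

void request_state(uint32_t resource, ResourceState wanted,
    std::unordered_map<uint32_t, ResourceState> &current_states,
    std::vector<Barrier> &out_barriers)
{
    auto it = current_states.find(resource);
    ResourceState before = (it != current_states.end()) ? it->second : STATE_WRITE;
    if (before != wanted)
        out_barriers.push_back({ resource, before, wanted });
    current_states[resource] = wanted;
}
</code></pre>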
<h1><a id="Wrap_up_268"></a>Wrap up</h1>
<p>Last week at GDC 2017 there were a few presentations (and a lot of discussions) around the concepts of having more high-level representations of a rendered frame and what benefits that brings. If you haven’t already I highly encourage you to check out both Yuriy O’Donnell’s presentation <a href="http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/">“FrameGraph: Extensible Rendering Architecture in Frostbite”</a> and Aras Pranckevičius’s presentation: <a href="http://aras-p.info/texts/files/2017_GDC_UnityScriptableRenderPipeline.pdf">“Scriptable Render Pipeline”</a>.</p>
<p>In the next post I will briefly cover the feature set of the two <code>render_configs</code> that we ship as template rendering pipes with Stingray.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com116tag:blogger.com,1999:blog-1994130783874175266.post-16502856681632337142017-02-22T15:50:00.000+01:002017-02-22T15:50:33.213+01:00Stingray Renderer Walkthrough #6: RenderInterface<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #6: RenderInterface</title><style></style></head><body id="preview">
<p>Today we will be looking at the <code>RenderInterface</code>. I’ve struggled a bit with deciding if it is worth covering this piece of the code or not, as most of the stuff described will likely feel kind of obvious. In the end I still decided to keep it to give a more complete picture of how everything fits together. Feel free to skim through it or sit tight and wait for the coming two posts that will dive into the data-driven aspects of the Stingray renderer.</p>
<h1><a id="The_glue_layer_2"></a>The glue layer</h1>
<p>The <code>RenderInterface</code> is responsible for tying together a bunch of rendering sub-systems: some that we have covered in earlier posts (e.g. the <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-5.html"><code>RenderDevice</code></a>) and a bunch of other, more high-level systems that form the foundation of our data-driven rendering architecture.</p>
<p>The <code>RenderInterface</code> has a number of responsibilities, including:</p>
<ul>
<li>
<p>Tracking of windows and swap chains.</p>
<p>While windows are managed by the simulation thread, swap chains are managed by the render thread. The <code>RenderInterface</code> is responsible for creating the swap chains and keeping track of the mapping between a window and a swap chain. It is also responsible for signaling resizing and other state information from the window to the renderer.</p>
</li>
<li>
<p>Managing of <code>RenderWorlds</code>.</p>
<p>As mentioned in the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a> post, the renderer has its own representation of game <code>Worlds</code> called <code>RenderWorlds</code>. The <code>RenderInterface</code> is responsible for creating, updating and destroying the <code>RenderWorlds</code>.</p>
</li>
<li>
<p>Owner of the four main building blocks of our data-driven rendering architecture: <code>LayerManager</code>, <code>ResourceGeneratorManager</code>, <code>RenderResourceSet</code>, <code>RenderSettings</code></p>
<p>Will be covered in the next post (I’ve talked about them in various presentations before [<a href="https://www.dropbox.com/s/rehpgc9qkzo831k/flexible-rendering-multiple-platforms.pdf?dl=0">1</a>] [<a href="https://www.dropbox.com/s/rjcjiricpc89362/benefits-of-a-data-driven-renderer.pdf?dl=0">2</a>]).</p>
</li>
<li>
<p>Owner of the shader manager.</p>
<p>Centralized repository for all available/loaded shaders. Controls scheduling for loading, unloading and hot-reloading of shaders.</p>
</li>
<li>
<p>Owner of the render resource streamer.</p>
<p>While all resource loading is asynchronous in Stingray (See [<a href="https://www.youtube.com/watch?v=nIxuGy6Jh-0&index=12&list=UU5XCn51L8rqL3XgZfOQN6qQ">3</a>]), the resource streamer I’m referring to in this context is responsible for dynamically loading in/out mip-levels of textures based on their screen coverage. Since this streaming system piggybacks on the view frustum culling system, it is owned and updated by the <code>RenderInterface</code>.</p>
</li>
</ul>
<h1><a id="The_interface_28"></a>The interface</h1>
<p>In addition to being the glue layer, the <code>RenderInterface</code> is also the interface to communicate with the renderer from other threads (simulation, resource streaming, etc.). The renderer operates under its own “controller thread” (as covered in the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a> post), and exposes two different types of functions: blocking and non-blocking.</p>
<h2><a id="Blocking_functions_32"></a>Blocking functions</h2>
<p>Blocking functions will enforce a flush of all outstanding rendering work (i.e. synchronize the calling thread with the rendering thread), allowing the caller to operate directly on the state of the renderer. This is mainly a convenience path when doing bigger state changes / reconfiguring the entire renderer, and should typically not be used during game simulation as it might cause stuttering in the frame rate.</p>
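<p>Internally, a blocking call can be thought of as a posted message followed by a wait on a fence (the <code>create_fence()</code> / <code>wait_for_fence()</code> functions are described further down). A minimal sketch, with <code>post_message()</code> and <code>RegisterWorldMsg</code> as hypothetical stand-ins for the actual ring buffer plumbing:</p>
<pre><code>// Sketch: a blocking RenderInterface operation. The message is posted like any
// non-blocking operation; the fence that follows it guarantees that the render
// thread has consumed everything up to this point before we return.
void register_world_blocking(RenderInterface &ri, World &world)
{
    ri.post_message(RegisterWorldMsg(&world)); // hypothetical message type

    uint32_t fence = ri.create_fence();
    ri.wait_for_fence(fence); // returns once the render thread has caught up
}
</code></pre>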
<p>Typical operations that are blocking:</p>
<ul>
<li>
<p>Opening and closing of the <code>RenderDevice</code>.</p>
<p>Sets up / shuts down the graphics API by calling the appropriate functions on the <code>RenderDevice</code>.</p>
</li>
<li>
<p>Creation and destruction of the swap chains.</p>
<p>Creating and destroying swap chains associated to a <code>Window</code>. Done by forwarding the calls to the <code>RenderDevice</code>.</p>
</li>
<li>
<p>Loading of the <code>render_config</code> / configuring the data-driven rendering pipe.</p>
<p>The <code>render_config</code> is a configuration file describing how the renderer should work for a specific project. It describes the entire flow of a rendered frame, and without it the renderer won’t know what to do. It is the <code>RenderInterface</code>’s responsibility to make sure that all the different sub-systems (<code>LayerManager</code>, <code>ResourceGeneratorManager</code>, <code>RenderResourceSet</code>, <code>RenderSettings</code>) are set up correctly from the loaded <code>render_config</code>. More on this topic in the next post.</p>
</li>
<li>
<p>Loading, unloading and reloading of shaders.</p>
<p>The shader system doesn’t have a thread-safe interface and is only meant to be accessed from the rendering thread. Therefore any loading, unloading and reloading of shaders needs to synchronize with the rendering thread.</p>
</li>
<li>
<p>Registering and unregistering of <code>Worlds</code></p>
<p>Creates or destroys a corresponding <code>RenderWorld</code> and sets up mapping information to go from <code>World*</code> to <code>RenderWorld*</code>.</p>
</li>
</ul>
<h2><a id="Nonblocking_functions_58"></a>Non-blocking functions</h2>
<p>Non-blocking functions communicate by posting messages to a ring-buffer that the rendering thread consumes. Since the renderer has its own representation of a “World” there is not much communication over this ring-buffer; in a normal frame we usually don’t have more than 10-20 messages posted.</p>
<p>Typical operations that are non-blocking:</p>
<ul>
<li>
<p>Rendering of a <code>World</code>.</p>
<pre><code>void render_world(World &world, const Camera &camera, const Viewport &viewport,
const ShadingEnvironment &shading_env, uint32_t swap_chain);
</code></pre>
<p>Main interface for rendering of a world viewed from a certain <code>Camera</code> into a certain <code>Viewport</code>. The <code>ShadingEnvironment</code> is basically just a set of shader constants and resources defined in data (usually containing a description of the lighting environment, post effects and similar). <code>swap_chain</code> is a handle referencing the window that will present the final result.</p>
<p>When the user calls this function a <code>RenderWorldMsg</code> will be created and posted to the ring buffer, holding handles to the rendering representations of the world, camera, viewport and shading environment. When the message is consumed by the rendering thread it will enter the first of the three stages described in the <a href="http://bitsquid.blogspot.se/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a> post - <em>Culling</em>.</p>
</li>
<li>
<p>Reflection of state from a <code>World</code> to the <code>RenderWorld</code>.</p>
<p>Reflects the “state delta” (from the last frame) for all objects on the simulation thread over to the render thread. For more details see [<a href="http://bitsquid.blogspot.com/2016/10/the-implementation-of-frustum-culling.html">4</a>].</p>
</li>
<li>
<p>Synchronization.</p>
<pre><code>uint32_t create_fence();
void wait_for_fence(uint32_t fence);
</code></pre>
<p>Synchronization methods for making sure the renderer is finished processing up to a certain point. Used to handle blocking calls and to make sure the simulation doesn’t run more than one frame ahead of the renderer.</p>
</li>
<li>
<p>Presenting a swap chain.</p>
<pre><code>void present_frame(uint32_t swap_chain = 0);
</code></pre>
<p>When the user is done with all rendering for a frame (i.e. has no more <code>render_world</code> calls to do), the application will present the result by looping over all swap chains touched (i.e. referenced in a previous call to <code>render_world</code>) and posting one or many <code>PresentFrameMsg</code> messages to the renderer.</p>
</li>
<li>
<p>Providing statistics from the <code>RenderDevice</code>.</p>
<p>As mentioned in the <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html"><code>RenderContext</code></a> post, we gather various statistics and (if possible) GPU timings in the <code>RenderDevice</code>. Exactly what is gathered depends on the implementation of the <code>RenderDevice</code>. The <code>RenderInterface</code> is responsible for providing a non-blocking interface for retrieving the statistics. Note: the statistics returned will be 2 frames old, as we update them after the rendering thread is done processing a frame (GPU timings are even older). This typically doesn’t matter though, as they usually don’t fluctuate much from one frame to another.</p>
</li>
<li>
<p>Executing user callbacks.</p>
<pre><code>typedef void (*Callback)(void *user_data);
void run_callback(Callback callback, void *user, uint32_t user_data_size);
</code></pre>
<p>Generic callback mechanism to easily inject code to be executed by the rendering thread.</p>
</li>
<li>
<p>Creation, dispatching and releasing of <code>RenderContexts</code> and <code>RenderResourceContexts</code>.</p>
<p>While most systems tends to create, dispatch and release <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html"><code>RenderContexts</code></a> and <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-2.html"><code>RenderResourceContexts</code></a> from the rendering thread there can be use cases for doing it from another thread (e.g. the resource thread creates <code>RenderResourceContexts</code>). The <code>RenderInterface</code> provides the necessary functions for doing so in a thread-safe way without having to block the rendering thread.</p>
</li>
</ul>
<h1><a id="Wrap_up_112"></a>Wrap up</h1>
<p>The <code>RenderInterface</code> in itself doesn’t get more interesting than that. Something needs to be responsible for coupling of various rendering systems and manage the interface for communicating with the controlling thread of the renderer - the <code>RenderInterface</code> is that something.</p>
<p>In the next post we will walk through the various components building the foundation of the data-driven rendering architecture and go through some examples of how to configure them to do something fun from the <code>render_config</code> file.</p>
<p>Stay tuned.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com48tag:blogger.com,1999:blog-1994130783874175266.post-83146536258093499962017-02-17T13:55:00.000+01:002017-02-17T13:55:03.976+01:00Stingray Renderer Walkthrough #5: RenderDevice<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #5: RenderDevice</title><style></style></head><body id="preview">
<h1><a id="Overview_0"></a>Overview</h1>
<p>The <code>RenderDevice</code> is essentially our abstraction layer for platform specific rendering APIs. It is implemented as an abstract base class that various rendering back-ends (D3D11, D3D12, OGL, Metal, GNM, etc.) implement.</p>
<p>The <code>RenderDevice</code> has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of these are fairly straightforward, so I won’t cover them in this post; instead I will focus on the two <code>dispatch</code> functions consuming <code>RenderResourceContexts</code> and <code>RenderContexts</code>:</p>
<pre><code>
class RenderDevice {
public:
virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc,
uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
virtual void dispatch(uint32_t n_contexts, RenderContext **rc,
uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};
</code></pre>
<h1><a id="Resource_Management_19"></a>Resource Management</h1>
<p>As covered in the post about <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-2.html"><code>RenderResourceContexts</code></a>, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called <code>RenderDevice::dispatch()</code>, handing over the <code>RenderResourceContexts</code>, that their representations get created on the <code>RenderDevice</code> side.</p>
<p>All implementations of a <code>RenderDevice</code> have some form of resource management that deals with creating, updating and destroying of the graphics API specific representations of resources. Typically we track the state of all various types of resources in a single struct, here’s a stripped down example from the DX12 <code>RenderDevice</code> implementation called <code>D3D12ResourceContext</code>:</p>
<pre><code>
struct D3D12VertexBuffer
{
D3D12_VERTEX_BUFFER_VIEW view;
uint32_t allocation_index;
int32_t size;
};
struct D3D12IndexBuffer
{
D3D12_INDEX_BUFFER_VIEW view;
uint32_t allocation_index;
int32_t size;
};
struct D3D12ResourceContext
{
Array<D3D12VertexBuffer> vertex_buffers;
Array<uint32_t> unused_vertex_buffers;
Array<D3D12IndexBuffer> index_buffers;
Array<uint32_t> unused_index_buffers;
// .. lots of other resources
Array<uint32_t> resource_lut;
};
</code></pre>
<p>As you might <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-2.html">remember</a>, the linking between the engine representation and the <code>RenderDevice</code> representation is done using the <code>RenderResource::render_resource_handle</code>. It encodes both the type of the resource as well as a handle. The <code>resource_lut</code> is an indirection to go from the engine handle to a local index for a specific type (e.g. <code>vertex_buffers</code> or <code>index_buffers</code> in the sample above). We also track freed indices for each type (e.g. <code>unused_vertex_buffers</code>) to simplify recycling of slots.</p>
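<p>As an illustration of that indirection, here is a sketch of how a <code>render_resource_handle</code> could be translated into a <code>D3D12VertexBuffer</code> from the struct above. The exact bit layout of the handle (type in the top bits, index in the remaining bits) is an assumption made up for this example:</p>
<pre><code>#include <cstdint>

// Hypothetical handle layout: resource type in the top 8 bits, handle index
// in the lower 24 bits. resource_lut maps the handle index to a slot in the
// type-specific array (vertex_buffers in this case).
inline uint32_t resource_type(uint32_t render_resource_handle) {
    return render_resource_handle >> 24;
}
inline uint32_t resource_index(uint32_t render_resource_handle) {
    return render_resource_handle & 0x00ffffffu;
}

const D3D12VertexBuffer &lookup_vertex_buffer(const D3D12ResourceContext &context,
    uint32_t render_resource_handle)
{
    // assumes resource_type(render_resource_handle) identifies a vertex buffer
    uint32_t local_index = context.resource_lut[resource_index(render_resource_handle)];
    return context.vertex_buffers[local_index];
}
</code></pre>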
<p>The implementation of the dispatch function is fairly straightforward. We simply iterate over all the <code>RenderResourceContexts</code> and, for each context, iterate over its commands and either allocate or deallocate resources in the <code>D3D12ResourceContext</code>. It is important to note that this is a synchronous operation: nothing else is peeking or poking at the <code>D3D12ResourceContext</code> while the dispatch of <code>RenderResourceContexts</code> is happening, which makes our life a lot easier.</p>
<p>Unfortunately that isn’t the case when we dispatch <code>RenderContexts</code>, as in that case we want to go wide (i.e. forking the workload and processing it using multiple worker threads) when translating the commands into API calls. While we don’t allow allocating and deallocating new resources from the <code>RenderContexts</code>, we do allow updating them, which mutates the state of the <code>RenderDevice</code> representations (e.g. a <code>D3D12VertexBuffer</code>).</p>
<p>At the moment our solution for this isn’t very nice: basically we don’t allow asynchronous updates for anything other than <code>DYNAMIC</code> buffers. <code>UPDATABLE</code> buffers are always updated serially before we kick the worker threads, no matter what their <code>sort_key</code> is. All worker threads access resources through their own copy of something we call a <code>ResourceAccessor</code>, which is responsible for tracking the worker thread’s state of dynamic buffers (among other things). In the future I think we should probably generalize this and treat <code>UPDATABLE</code> buffers in a similar way.</p>
<p>(Note: this limitation doesn’t mean you can’t update an <code>UPDATABLE</code> buffer more than once per frame, it simply means you cannot update it more than once per <code>dispatch</code>).</p>
<h2><a id="Shaders_67"></a>Shaders</h2>
<p>Resources in the <code>D3D12ResourceContext</code> are typically buffers. One exception that stands out is the <code>RenderDevice</code> representation of a “shader”. A “shader” on the <code>RenderDevice</code> side maps to a <code>ShaderTemplate::Context</code> on the engine side, or what I guess we could call a multi-pass shader. Here’s some pseudo code:</p>
<pre><code>
struct ShaderPass
{
struct ShaderProgram
{
Array<uint8_t> bytecode;
struct ConstantBufferBindInfo;
struct ResourceBindInfo;
struct SamplerBindInfo;
};
ShaderProgram vertex_shader;
ShaderProgram domain_shader;
ShaderProgram hull_shader;
ShaderProgram geometry_shader;
ShaderProgram pixel_shader;
ShaderProgram compute_shader;
struct RenderStates;
};
struct Shader
{
Vector<ShaderPass> passes;
enum SortMode { IMMEDIATE, DEFERRED };
uint32_t sort_mode;
};
</code></pre>
<p>The pseudo code above is essentially the <code>RenderDevice</code> representation of a shader that we serialize to disk during data compilation. From that we can create all the necessary graphics API specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).</p>
<p>As discussed in the last <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-4-sorting.html">post</a> the <code>sort_key</code> encodes the shader pass index. Using <code>Shader::sort_mode</code>, we know which bit range to extract from the <code>sort_key</code> as pass index, which we then use to look up the <code>ShaderPass</code> from <code>Shader::passes</code>. A <code>ShaderPass</code> contains one <code>ShaderProgram</code> per active shader stage and each <code>ShaderProgram</code> contains the byte code for the shader to compile as well as “bind info” for various resources that the shader wants as input.</p>
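<p>To make that concrete, extracting the pass index from the <code>sort_key</code> could look something like the sketch below. The bit positions follow the <code>sort_key</code> breakdown described in the sorting post and are stated here purely for illustration:</p>
<pre><code>#include <cstdint>

// Sketch: which bits of the sort_key hold the pass index depends on the
// shader's sort mode. The positions (3 LSBs for "Pass Immediate", bits 52-54
// for "Pass Deferred") are assumptions derived from the sort_key breakdown.
enum SortModeSketch { SORT_IMMEDIATE, SORT_DEFERRED };

uint32_t extract_pass_index(uint64_t sort_key, uint32_t sort_mode)
{
    if (sort_mode == SORT_IMMEDIATE)
        return (uint32_t)(sort_key & 0x7);
    return (uint32_t)((sort_key >> 52) & 0x7);
}
</code></pre>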
<p>We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.</p>
<h1><a id="Render_Context_translation_107"></a>Render Context translation</h1>
<p>Let’s move on and look at the dispatch for translating <code>RenderContexts</code> into graphics API calls:</p>
<pre><code>class RenderDevice {
public:
virtual void dispatch(uint32_t n_contexts, RenderContext **rc,
uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};
</code></pre>
<p>The first thing all <code>RenderDevice</code> implementations do when receiving a bunch of <code>RenderContexts</code> is to merge and sort their <code>Commands</code>. All implementations share the same code for doing this:</p>
<pre><code>void prepare_command_list(RenderContext::Commands &output, unsigned n_contexts, RenderContext **contexts);
</code></pre>
<p>This function basically just takes the <code>RenderContext::Commands</code> from all <code>RenderContexts</code> and merges them into a new array, runs a stable radix sort, and returns the sorted commands in <code>output</code>. To avoid memory allocations the <code>RenderDevice</code> implementation owns the memory of the output buffer.</p>
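<p>A minimal sketch of what this boils down to is shown below. For brevity it uses <code>std::vector</code> and <code>std::stable_sort</code> where the real implementation uses the engine’s own containers and a stable radix sort, and <code>contexts[i]->commands()</code> is an assumed accessor for a context’s <code>Command</code> array:</p>
<pre><code>#include <algorithm>
#include <vector>

// Sketch: merge the Command arrays of all RenderContexts into one buffer
// owned by the RenderDevice, then stable-sort on sort_key. A stable sort
// keeps the recording order for commands with identical keys.
void prepare_command_list_sketch(std::vector<Command> &output,
    unsigned n_contexts, RenderContext **contexts)
{
    output.clear();
    for (unsigned i = 0; i != n_contexts; ++i) {
        const std::vector<Command> &commands = contexts[i]->commands(); // assumed accessor
        output.insert(output.end(), commands.begin(), commands.end());
    }

    std::stable_sort(output.begin(), output.end(),
        [](const Command &a, const Command &b) { return a.sort_key < b.sort_key; });
}
</code></pre>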
<p>Now we have all the commands nicely sorted based on their <code>sort_key</code>. Next step is to do the actual translation of the data referenced by the commands into graphics API calls. I will explain this process with the assumption that we are running on a graphics API that allows us to build graphics API command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.</p>
<p>Before we start figuring out our per-thread workloads for going wide, we have one more thing to do: “instance merging”.</p>
<h2><a id="Instance_Merging_131"></a>Instance Merging</h2>
<p>I’ve mentioned the idea behind instance merging before [<a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html">1</a>,<a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-4-sorting.html">2</a>]: basically we want to try to reduce the number of <code>RenderJobPackages</code> (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray “similar enough” basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages; the only thing that is allowed to differ are constant buffer variables. (Note: by today’s standards this can be considered a bit old school, as new graphics APIs and hardware allow tackling this problem more aggressively using “bindless” concepts.)</p>
<p>The way it works is by filtering out ranges of <code>RenderContext::Commands</code> where the “instance bit” of the <code>sort_key</code> is set and all bits above the instance bit are identical. Then for each of those ranges we fork and go wide to analyze the actual <code>RenderJobPackage</code> data to see if the <code>instance_hash</code> and the shader are the same, and if so we know it’s safe to merge them.</p>
<p>The actual merge is done by extracting the instance-specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic <code>RawBuffer</code> that gets bound as input to the vertex shader.</p>
<p>Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger in itself is not graphics API specific and is isolated in its own system; it just happens to be the responsibility of the <code>RenderDevice</code> to call it. The interface looks like this:</p>
<pre><code>namespace instance_merger {
struct ProcessMergedCommandsResult
{
uint32_t n_instances;
uint32_t instanced_batches;
uint32_t instance_buffer_size;
};
ProcessMergedCommandsResult process_merged_commands(Merger &instance_merger,
RenderContext::Commands &merged_commands);
}
</code></pre>
<p>Pass in a reference to the sorted <code>RenderContext::Commands</code> in <code>merged_commands</code> and after the instance merger is done running you hopefully have fewer commands in the array. :)</p>
<p>You could argue that merging, sorting and instance merging should all happen before we enter the world of the <code>RenderDevice</code>. I wouldn’t argue against that.</p>
<h2><a id="Prepare_workloads_161"></a>Prepare workloads</h2>
<p>The last step before we can start translating our commands into state / draw / dispatch calls is to split the workload into reasonable chunks and prepare the execution contexts for our worker threads.</p>
<p>Typically we just divide the number of <code>RenderContext::Commands</code> we have to process by the number of worker threads we have available. We don’t care about the type of the different commands we will be processing or try to load balance differently. The reasoning behind this is that we anticipate that draw calls will always represent the bulk of the commands and that the rest of the commands can be considered unavoidable “noise”. We do, however, make sure that we don’t do less than <em>x</em> number of commands per worker thread, where <em>x</em> can differ a bit depending on the platform but is usually ~128.</p>
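<p>The split itself is trivial; something along these lines, with the minimum batch size as the only tuning parameter:</p>
<pre><code>#include <algorithm>
#include <cstdint>

// Sketch: number of commands each worker thread gets to translate. We round
// up so all commands are covered and clamp to a minimum batch size (~128),
// which for small frames simply means fewer workers end up with work.
uint32_t commands_per_worker(uint32_t n_commands, uint32_t n_workers,
    uint32_t min_commands_per_worker = 128)
{
    uint32_t per_worker = (n_commands + n_workers - 1) / n_workers;
    return std::max(per_worker, min_commands_per_worker);
}
</code></pre>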
<p>For each execution context we create a <code>ResourceAccessor</code> (described above) as well as make sure we have the correct state set up in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous, serial sweep over all the commands to find bigger state-changing commands (such as <code>RenderContext::set_render_target</code>).</p>
<p>This is where the <code>Command::command_flags</code> bit-flag comes into play, instead of having to jump around in memory to figure out what type of command the <code>Command::head</code> points to, we put some hinting about the type in the <code>Command::command_flags</code>, like for example if it is a “state command”. This way the serial sweep doesn’t become very costly even when dealing with large number of commands. During this sweep we also deal with updating of <code>UPDATABLE</code> resources, and on newer graphics APIs we track fences (discussed in the post about <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html">Render Contexts</a>).</p>
<p>The last thing we do is to set up the execution contexts with the graphics API specific representations of command lists (e.g. <code>ID3D12GraphicsCommandList</code> in DX12).</p>
<h2><a id="Translation_173"></a>Translation</h2>
<p>Once we get to this point, doing the actual translation is fairly straightforward. Within each worker thread we simply loop over its dedicated range of commands, fetch the data from <code>Command::head</code> and generate any number of API-specific commands necessary based on the type of command.</p>
<p>For a <code>RenderJobPackage</code> representing a draw call it involves:</p>
<ul>
<li>Look up the correct shader pass and, unless already bound, bind all active shader stages</li>
<li>Look up the state blocks (Rasterizer, Depth stencil, Blending, etc.) from the shader and bind them unless already bound</li>
<li>Look up and bind the resources for each shader stage using the <code>RenderResource::render_resource_handle</code> translated through the <code>D3D12ResourceAccessor</code></li>
<li>Setup the input assembler by looping over the <code>RenderResource::render_resource_handles</code> pointed to by the <code>RenderJobPackage::resource_offset</code> and translated through the <code>D3D12ResourceAccessor</code></li>
<li>Bind and potentially update constant buffers</li>
<li>Issue the draw call</li>
</ul>
<p>The execution contexts also hold most-recently-used caches to avoid unnecessary binds of resources/shaders/states etc.</p>
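<p>Conceptually those caches are nothing more than remembering what was last bound and early-outing on redundant binds, along the lines of this sketch (names are hypothetical):</p>
<pre><code>// Sketch of redundant-bind filtering in an execution context: remember the
// last bound object per slot and skip the graphics API call if it hasn't
// changed since the previous draw call.
template <typename T, typename BindFn>
void bind_if_changed(const T *wanted, const T *&last_bound, BindFn bind)
{
    if (wanted == last_bound)
        return;            // already bound - skip the API call
    bind(wanted);
    last_bound = wanted;
}
</code></pre>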
<p>Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve on this by doing at least parts of the tracking when building the <code>RenderContexts</code>, but we haven’t gotten around to looking into it yet.</p>
<h2><a id="Execute_190"></a>Execute</h2>
<p>When the translation is done we pass the resulting command lists to the correct queues for execution.</p>
<p>Note: In DX12 this is a bit more complicated as we have to interleave signaling / waiting on fences between command list execution (<code>ExecuteCommandList</code>).</p>
<h1><a id="Next_up_196"></a>Next up</h1>
<p>I’ve deliberately not dived into too much detail in this post to make it a bit easier to digest. I think I’ve managed to cover the overall design of a <code>RenderDevice</code> though, enough to make it easier for people diving into the code for the first time.</p>
<p>With this post we’ve reached the half-way point of this series and covered the “low-level” aspects of the Stingray rendering architecture. Starting with the next post we will look at more high-level stuff, beginning with the <code>RenderInterface</code>, which is the main interface for other threads to talk to the renderer.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com60tag:blogger.com,1999:blog-1994130783874175266.post-54951780518470957862017-02-14T14:56:00.002+01:002017-02-14T14:56:40.441+01:00Stingray Renderer Walkthrough #4: Sorting<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #4: Sorting</title><style></style></head><body id="preview">
<h1><a id="Introduction_0"></a>Introduction</h1>
<p>This post will focus on ordering of the commands in the <code>RenderContexts</code>. I briefly touched on this subject in the last <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html">post</a> and if you’ve implemented a rendering engine before you’re probably not new to this problem. Basically we need a way to make sure our <code>RenderJobPackages</code> (draw calls) end up on the screen in the correct order, both from a visual point of view as well as from a performance point of view. Some concrete examples,</p>
<ol>
<li>Make sure g-buffers and shadow maps are rendered before any lighting happens.</li>
<li>Make sure opaque geometry is rendered front to back to reduce overdraw.</li>
<li>Make sure transparent geometry is rendered back to front for alpha blending to generate correct results.</li>
<li>Make sure the sky dome is rendered after all opaque geometry but before any transparent geometry.</li>
<li>All of the above but also strive to reduce state switches as much as possible.</li>
<li>All of the above but depending on GPU architecture maybe shift some work around to better utilize the hardware.</li>
</ol>
<p>There are many ways of tackling this problem, and it’s not uncommon that engines use multiple sorting systems and spend quite a lot of frame time getting this right.</p>
<p>Personally I’m a big fan of explicit ordering with a single stable sort. What I mean by explicit ordering is that every command that gets recorded to a <code>RenderContext</code> already has the knowledge of when it will be executed relative to other commands. For us this knowledge is in the form of a 64-bit <code>sort_key</code>; in the case where we get two commands with the exact same <code>sort_key</code> we rely on the sort being stable so as not to introduce any kind of temporal instability in the final output.</p>
<p>The reasons I like this approach are many,</p>
<ol>
<li>It’s trivial to implement compared to various bucketing schemes and sorting of those buckets.</li>
<li>We only need to visit renderable objects once per view (when calling their <code>render()</code> function), no additional pre-visits for sorting are needed.</li>
<li>The sort is typically fast, and cost is isolated and easy to profile.</li>
<li>Parallel rendering works out of the box, we can just take all the <code>Command</code> arrays of all the <code>RenderContexts</code> and merge them before sorting.</li>
</ol>
<p>To make this work each command needs to know its absolute <code>sort_key</code>. Let’s break down the <code>sort_key</code> we use when working with our data-driven rendering pipe in Stingray. (Note: if the user doesn’t care about playing nicely with our system for data-driven rendering, it is fine to completely ignore the bit allocation patterns described below and roll their own.)</p>
<h1><a id="sort_key_breakdown_24"></a><code>sort_key</code> breakdown</h1>
<p>Most significant bit on the left, here are our bit ranges:</p>
<pre><code>MSB [ 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
^ ^ ^ ^ ^^ ^
| | | | || |- 3 bits - Shader System (Pass Immediate)
| | | | ||- 16 bits - Depth
| | | | |- 1 bit - Instance bit
| | | |- 32 bits - User defined
| | |- 3 bits - Shader System (Pass Deferred)
| - 7 bits - Layer System
|- 2 bits - Unused
</code></pre>
<p><strong><code>2 bits - Unused</code></strong></p>
<p>Nothing to see here, moving on… (Not really sure why these 2 bits are unused, I guess they weren’t at some point but for the moment they are always zero) :)</p>
<p><strong><code>7 bits - Layer System</code></strong></p>
<p>This 7-bits range is managed by the “Layer system”. The Layer system is responsible for controlling the overall scheduling of a frame and is set up in the <code>render_config</code> file. It’s a central part of the data-driven rendering architecture in Stingray. It allows you to configure what layers to expose to the shader system and in which order these layers should be drawn. We will look closer at the implementation of the layer system in a later post but in the interest of clarifying how it interops with the <code>sort_key</code> here’s a small example:</p>
<pre><code>
default = [
// sort_key = [ 00000000 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"]
depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }
// sort_key = [ 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
{ name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer"
profiling_scope="decal" sort="EXPLICIT" }
// sort_key = [ 00000001 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
{ resource_generator="lighting" profiling_scope="lighting" }
// sort_key = [ 00000010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
{ name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
sort="FRONT_BACK" profiling_scope="emissive" }
]
</code></pre>
<p>Above we have three layers exposed to the shader system and one kick of a <code>resource_generator</code> called <code>lighting</code> (more about <code>resource_generators</code> in a later post). The layers are rendered in the order they are declared; this is handled by letting each new layer increment the 7-bit range belonging to the Layer System by 1 (as can be seen in the <code>sort_key</code> comments above).</p>
<p>The shader author dictates into which layer(s) it wants to render. When a <code>RenderJobPackage</code> is recorded to the <code>RenderContext</code> (as described in the last <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html">post</a>) the correct layer <code>sort_keys</code> are looked up from the layer system and the result is bitwise ORed together with the <code>sort_key</code> value piped as argument to <code>RenderContext::render()</code>.</p>
<p><strong><code>3 bits - Shader System (Pass Deferred)</code></strong></p>
<p>The next 3 bits are controlled by the Shader System. These three bits encode the shader pass index <em>within</em> a layer. When I say shader in this context I refer to our <code>ShaderTemplate::Context</code> which is basically a wrapper around multiple linked shaders rendering into one or many layers. (Nathan Reed recently blogged about <a href="http://reedbeta.com/blog/many-meanings-of-shader/">“The Many Meanings of “Shader””</a>, in his analogy our <code>ShaderTemplate</code> is the same as an “Effect”)</p>
<p>Since we can have a multi-pass shader rendering into the same layer we need to encode the pass index into the <code>sort_key</code>, that is what this 3 bit range is used for.</p>
<p><strong><code>32 bits - User defined</code></strong></p>
<p>We then have 32 user-defined bits. These bits are primarily used by our “Resource Generator” system (I will be covering this system in the post about <code>render_config</code> &amp; data-driven rendering later), but the user is free to use them any way they like and still maintain compatibility with the data-driven rendering system.</p>
<p><strong><code>1 bit - Instance bit</code></strong></p>
<p>This single bit also comes from the Shader System and is set if the shader implements support for “Instance Merging”. I will be covering this in a bit more detail in my next post about the <code>RenderDevice</code>, but essentially this bit allows us to scan through all commands and find ranges of commands that can potentially be merged into fewer draw calls.</p>
<p><strong><code>16 bits - Depth</code></strong></p>
<p>One of the arguments piped to <code>RenderContext::render()</code> is an unsigned normalized depth value (0.0-1.0). This value gets quantized into these 16 bits and is what drives the front-to-back vs back-to-front sorting of <code>RenderJobPackages</code>. If the sorting criterion for the layer (see the layer example above) is set to back-to-front we simply flip the bits in this range.</p>
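<p>Putting the ranges described so far together, composing the <code>sort_key</code> for a draw call could look roughly like the sketch below. The exact shift amounts are derived from the bit breakdown above and should be read as illustrative rather than authoritative:</p>
<pre><code>#include <cstdint>

// Sketch: composing a draw call sort_key from the layer bits, the deferred
// pass index, the instance bit and the quantized depth. Shift amounts follow
// the bit breakdown above (layer at bits 55-61, deferred pass at 52-54,
// instance at 19, depth at 3-18) and are for illustration only.
uint64_t make_sort_key(uint64_t layer_sort_key, uint32_t pass_index,
    float normalized_depth, bool back_to_front, bool instanced)
{
    uint32_t depth = (uint32_t)(normalized_depth * 65535.0f + 0.5f) & 0xffff;
    if (back_to_front)
        depth = ~depth & 0xffff;    // flip the range for back-to-front sorting

    uint64_t key = layer_sort_key;                 // layer bits already set
    key |= (uint64_t)(pass_index & 0x7) << 52;     // Shader System (Pass Deferred)
    key |= (uint64_t)(instanced ? 1 : 0) << 19;    // Instance bit
    key |= (uint64_t)depth << 3;                   // 16-bit quantized depth
    return key;
}
</code></pre>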
<p><strong><code>3 bits - Shader System (Pass Immediate)</code></strong></p>
<p>A shader can be configured to run in “Immediate Mode” instead of “Deferred Mode” (default). This forces passes in a multi-pass shader to run immediately after each other and is achieved by moving the pass index bits into the least significant bits of the <code>sort_key</code>. The concept is probably easiest to explain with an artificial example and some pseudo code:</p>
<p>Take a simple scene with a few instances of the same mesh, each mesh recording one <code>RenderJobPackage</code> to one or many <code>RenderContexts</code>, and all <code>RenderJobPackages</code> being rendered with the same multi-pass shader.</p>
<p>In “Deferred Mode” (i.e pass indices encoded in the “Shader System (Pass Deferred)” range) you would get something like this:</p>
<pre><code>foreach (pass in multi-pass-shader)
foreach (render-job in render-job-packages)
render (render-job)
end
end
</code></pre>
<p>If the shader is configured to run in “Immediate Mode” you would instead get something like this:</p>
<pre><code>foreach (render-job in render-job-packages)
foreach (pass in multi-pass-shader)
render (render-job)
end
end
</code></pre>
<p>As you can probably imagine, the latter results in more shader / state switches but can sometimes be necessary to guarantee correctly rendered results. A typical example is when using multi-pass shaders that do alpha blending.</p>
<h1><a id="Wrap_up_122"></a>Wrap up</h1>
<p>The actual sort is implemented using a standard stable radix sort and happens immediately after the user has called <code>RenderDevice::dispatch()</code> handing over <em>n</em>-number of <code>RenderContexts</code> to the <code>RenderDevice</code> for translation into graphics API calls.</p>
<p>Next post will cover this and give an overview of what a typical rendering back-end (<code>RenderDevice</code>) looks like in Stingray. Stay tuned.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com449tag:blogger.com,1999:blog-1994130783874175266.post-29384100176024563872017-02-10T15:11:00.000+01:002017-02-10T15:11:10.216+01:00Stingray Renderer Walkthrough #3: Render Contexts<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #3: Render Contexts</title><style></style></head><body id="preview">
<h1><a id="Render_Contexts_Overview_0"></a>Render Contexts Overview</h1>
<p>In the last post we covered how to create and destroy various GPU resources. In this post we will go through the system we have for recording a stream of rendering commands/packages that later gets consumed by the render backend (<code>RenderDevice</code>) where they are translated into actual graphics API calls. We call this interface <code>RenderContext</code> and similar to <code>RenderResourceContext</code> we can have multiple <code>RenderContexts</code> in flight at the same time to achieve data parallelism.</p>
<p>Let’s back up and reiterate a bit what was said in the <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a> post. Typically in a frame we take the result of the view frustum culling, split it up into a number of chunks, allocate one <code>RenderContext</code> per chunk and then kick one worker thread per chunk. Each worker thread then sequentially iterates over its range of renderable objects and calls their <code>render()</code> function. The <code>render()</code> function takes the chunk’s <code>RenderContext</code> as one of its arguments and is responsible for populating it with commands. When all worker threads are done the resulting <code>RenderContexts</code> get “dispatched” to the <code>RenderDevice</code>.</p>
<p>So essentially the <code>RenderContext</code> is the output data structure for the second stage <code>Render</code> as discussed in the <a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a> post.</p>
<p>The <code>RenderContext</code> is very similar to the <code>RenderResourceContext</code> in the sense that it’s a fairly simple helper class for populating a command buffer. There is one significant difference though: the <code>RenderContext</code> also has a mechanism for reasoning about the ordering of the commands in the buffer before they get translated into graphics API calls by the <code>RenderDevice</code>.</p>
<h1><a id="Ordering__Buffers_10"></a>Ordering & Buffers</h1>
<p>We need a way to reorder commands in one or many <code>RenderContexts</code> to make sure triangles end up on the screen in the right order, or, more generally speaking, to schedule our GPU work.</p>
<p>There are many ways of dealing with this, but my favorite approach is to just associate one or many commands with a 64-bit sort key and, when all commands have been recorded, simply sort them on this key before translating them into actual graphics API calls. The approach we are using in Stingray is heavily inspired by Christer Ericson’s blog post <a href="http://realtimecollisiondetection.net/blog/?p=86">“Order your graphics draw calls around!”</a>. I will be covering our sorting system in more detail in my next post; for now the only thing important to grasp is that while the <code>RenderContext</code> records commands it does so by populating two buffers. One is a simple array of a POD struct called <code>Command</code>:</p>
<pre><code>struct Command
{
uint64_t sort_key;
void *head;
uint32_t command_flags;
};
</code></pre>
<ul>
<li><code>sort_key</code> - 64 bit sort key used for reordering commands before being consumed by the <code>RenderDevice</code>, more on this later.</li>
<li><code>head</code> - Pointer to the actual data for this command.</li>
<li><code>command_flags</code> - A bit flag encoding some hinting about what kind of command <code>head</code> is actually pointing to. This is simply an optimization to reduce pointer chasing in the <code>RenderDevice</code>, it will be covered in more detail in a later post.</li>
</ul>
<h2><a id="Render_Package_Stream_29"></a>Render Package Stream</h2>
<p>The other buffer is what we call a <code>RenderPackageStream</code> and is what holds the actual command data. The <code>RenderPackageStream</code> class is essentially just a few helper functions to put arbitrary length commands into memory. The memory backing system for <code>RenderPackageStreams</code> is somewhat more complex than a simple array though; this is because we need a way to keep its memory footprint under control. For efficiency, we want to recycle the memory instead of reallocating it every frame, but depending on workload we are likely to get some <code>RenderContexts</code> becoming much larger than others. This creates a problem when using simple arrays to store the commands as the workload will shift slightly over time, causing all arrays to grow to fit the worst case scenario, resulting in lots of wasted memory.</p>
<p>To combat this we allocate and return fixed size blocks of memory from a pool. As we know the size of each command before writing them to the buffer we can make sure that a command doesn’t end up spanning multiple blocks; if we detect that we are about to run out of memory in the active block we simply allocate a new block and move on. If we detect that a single command will span multiple blocks we make sure to allocate them sequentially in memory. We return a block to the pool when we are certain that the consumer of the data (in this case the <code>RenderDevice</code>) is done with it. (This memory allocation approach is well described in Christian Gyrling’s excellent GDC 2015 presentation <a href="http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine">Parallelizing the Naughty Dog Engine Using Fibers</a>)</p>
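<p>To make the idea a bit more concrete, here is a minimal sketch of what the allocation path of such a block-based stream could look like. This is purely illustrative; the names (<code>BlockPool</code>, <code>acquire_block()</code> and so on) are made up and not the actual Stingray API, and handling of commands larger than a single block is left out.</p>
<pre><code>#include <cstdint>

// Hypothetical fixed size memory block handed out by a recycling pool. Blocks are
// returned to the pool once the consumer (the RenderDevice) is done with them.
struct Block { uint8_t *begin; uint8_t *end; };
struct BlockPool { Block acquire_block(); };

class PackageStreamSketch
{
public:
    explicit PackageStreamSketch(BlockPool &pool)
        : _pool(pool), _active(pool.acquire_block()), _cursor(_active.begin) {}

    // Returns memory for a command of `size` bytes. A command never straddles a
    // block boundary; if it doesn't fit we simply move on to a fresh block.
    // (Commands larger than a whole block would need sequential blocks, omitted here.)
    void *allocate(uint32_t size)
    {
        if (_cursor + size > _active.end) {
            _active = _pool.acquire_block();
            _cursor = _active.begin;
        }
        void *p = _cursor;
        _cursor += size;
        return p;
    }

private:
    BlockPool &_pool;
    Block _active;
    uint8_t *_cursor;
};
</code></pre>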
<p>You might be wondering why we put the <code>sort_key</code> in a separate array instead of putting it directly into the header data of the packages written to the <code>RenderPackageStream</code>. There are a number of reasons for that:</p>
<ol>
<li>
<p>The actual package data can become fairly large even for regular draw calls. Since we want to make the packages self contained we have to put all data needed to translate the command into a graphics API call inside the package. This includes handles to all resources, constant buffer reflections and similar. I don’t know of any way to efficiently sort an array with elements of varying sizes.</p>
</li>
<li>
<p>Since we allocate the memory in blocks, as described above, we would need to introduce some form of “jump label” and insert that into the buffer to know how and when to jump into the next memory block. This would further complicate the sorting and traversal of the buffers.</p>
</li>
<li>
<p>It allows us to recycle the actual package data from one draw call to another when rendering multi-pass shaders, as we can simply inject multiple <code>Command</code>s pointing to the same package data. (Which shader pass to use when translating the package into graphic API calls can later be extracted from the <code>sort_key</code>.)</p>
</li>
<li>
<p>We can reduce pointer chasing by encoding hints in the <code>Command</code> about the contents of the package data. This is what we do in <code>command_flags</code> mentioned earlier.</p>
</li>
</ol>
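<p>With the sort keys living in their own tightly packed <code>Command</code> array, the scheduling step before translation becomes trivial; conceptually it is nothing more than a sort on the 64 bit key. A minimal sketch of the idea (not the actual dispatch code):</p>
<pre><code>#include <algorithm>
#include <cstdint>

struct Command
{
    uint64_t sort_key;
    void *head;
    uint32_t command_flags;
};

// Merge the command arrays from all RenderContexts, sort them on the 64 bit key and
// then walk the result in order, translating each package into graphics API calls.
void sort_and_dispatch(Command *commands, uint32_t n_commands /*, RenderDevice &device */)
{
    std::stable_sort(commands, commands + n_commands,
        [](const Command &a, const Command &b) { return a.sort_key < b.sort_key; });

    // for (uint32_t i = 0; i != n_commands; ++i)
    //     device.translate(commands[i]);   // hypothetical backend entry point
}
</code></pre>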
<h1><a id="Render_Context_interface_45"></a>Render Context interface</h1>
<p>With the low-level concepts of the <code>RenderContext</code> covered let’s move on and look at how it is used from a user’s perspective.</p>
<p>If we break down the API there are essentially three different types of commands that populate a <code>RenderContext</code>:</p>
<ol>
<li><strong>State commands</strong> - Commands affecting the state of the rendering pipeline (e.g. render target bindings, viewports, scissoring, etc.) + some miscellaneous commands.</li>
<li><strong>Rendering commands</strong> - Commands used to trigger draw calls and compute work on the GPU.</li>
<li><strong>Resource update commands</strong> - Commands for updating GPU resources.</li>
</ol>
<h2><a id="1_State_Commands_55"></a>1. State Commands</h2>
<p>“State commands” are a series of commands getting executed in sequence for a specific <code>sort_key</code>. The interface for starting/stopping the recording looks like this:</p>
<pre><code>class RenderContext
{
    void begin_state_command(uint64_t sort_key, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void end_state_command();
};
</code></pre>
<ul>
<li><code>sort_key</code> - the 64 bit sort key.</li>
<li><code>gpu_affinity_mask</code> - I will cover this towards the end of this post, but for now just think of it as a bit mask for addressing one or many GPUs.</li>
</ul>
<p>Here’s a small example showing what the recording of a few state commands might look like:</p>
<pre><code>rc.begin_state_command(sort_key);
for (uint32_t i=0; i!=MAX_RENDER_TARGETS; ++i)
    rc.set_render_target(i, nullptr);
rc.set_depth_stencil_target(depth_shadow_map);
rc.clear(RenderContext::CLEAR_DEPTH);
rc.set_viewports(1, &viewport);
rc.set_scissor_rects(1, &scissor_rect);
rc.end_state_command();
</code></pre>
<p>While state commands are primarily used for bigger graphics pipeline state changes (e.g. changing render targets), they are also used for some miscellaneous things like clearing of bound render targets, pushing/popping timer markers, and some other stuff. There is no obvious reasoning for grouping these things together under the name “state commands”, it’s just something that has happened over time. Keep that in mind as we go through the list of commands below.</p>
<h3><a id="Common_commands_86"></a>Common commands</h3>
<ul>
<li>
<p><code>set_render_target(uint32_t slot, RenderTarget *target, const SurfaceInfo& surface_info);</code></p>
<ul>
<li><code>slot</code> - Which index of the “Multiple Render Target” (MRT) chain to bind</li>
<li><code>target</code> - What <code>RenderTarget</code> to bind</li>
<li><code>surface_info</code> - <code>SurfaceInfo</code> is a struct describing which surface of the <code>RenderTarget</code> to bind.</li>
</ul>
<pre><code>struct SurfaceInfo {
    uint32_t array_index; // 0 in all cases except if binding a texture array
    uint32_t slice;       // 0 for 2D textures, 0-5 for cube maps, 0-n for volume textures
    uint32_t mip_level;   // 0-n depending on wanted mip level
};
</code></pre>
</li>
<li>
<p><code>set_depth_stencil_target(RenderTarget *target, const SurfaceInfo& surface_info);</code> - Same as above but for depth stencil.</p>
</li>
<li>
<p><code>clear(RenderContext::ClearFlags flags);</code> - Clears currently bound render targets.</p>
<ul>
<li><code>flags</code> - enum bit flag describing what parts of the bound render targets to clear.</li>
</ul>
<pre><code>enum ClearFlags {
    CLEAR_SURFACE = 0x1,
    CLEAR_DEPTH = 0x2,
    CLEAR_STENCIL = 0x4
};
</code></pre>
</li>
<li>
<p><code>set_viewports(uint32_t n_viewports, const Viewport *viewports);</code></p>
<ul>
<li><code>n_viewports</code> - Number of viewports to bind.</li>
<li><code>viewports</code> - Pointer to first <code>Viewport</code> to bind. <code>Viewport</code> is a struct describing the dimensions of the viewport:</li>
</ul>
<pre><code>struct Viewport {
    float x, y, width, height;
    float min_depth, max_depth;
};
</code></pre>
<p>Note that <code>x</code>, <code>y</code>, <code>width</code> and <code>height</code> are in unsigned normalized [0-1] coordinates to decouple render target resolution from the viewport.</p>
</li>
<li>
<p><code>set_scissor_rects(uint32_t n_scissor_rects, const ScissorRect *scissor_rects);</code></p>
<ul>
<li><code>n_scissor_rects</code> - Number of scissor rectangles to bind</li>
<li><code>scissor_rects</code> - Pointer to the first <code>ScissorRect</code> to bind.</li>
</ul>
<pre><code>struct ScissorRect {
    float x, y, width, height;
};
</code></pre>
<p>Note that <code>x</code>, <code>y</code>, <code>width</code> and <code>height</code> are in unsigned normalized [0-1] coordinates to decouple render target resolution from the scissor rectangle.</p>
</li>
</ul>
<h3><a id="A_bit_more_exotic_commands_132"></a>A bit more exotic commands</h3>
<ul>
<li><code>set_stream_out_target(uint32_t slot, RenderResource *resource, uint32_t offset);</code>
<ul>
<li><code>slot</code> - Which index of the stream out buffers to bind</li>
<li><code>resource</code> - Which <code>RenderResource</code> to bind to that slot (has to point to a <code>VertexStream</code>)</li>
<li><code>offset</code> - A byte offset describing where to begin writing in the buffer pointed to by <code>resource</code>.</li>
</ul>
</li>
<li><code>set_instance_multiplier(uint32_t multiplier);</code><br>
Allows the user to scale the number of instances to render for each <code>render()</code> call (described below). This is a convenience function to make it easier to implement things like Instanced Stereo Rendering.</li>
</ul>
<h3><a id="Markers_140"></a>Markers</h3>
<ul>
<li><code>push_marker(const char *name)</code><br>
Starts a new marker scope named <code>name</code>. Marker scopes are used both for gathering <code>RenderDevice</code> statistics (number of draw calls, state switches and similar) and for creating GPU timing events. The user is free to nest markers if they want to better group statistics (a small usage example follows this list). More on this in a later post.</li>
<li><code>pop_marker(const char *name)</code><br>
Stops an existing marker scope named <code>name</code>.</li>
</ul>
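<p>Usage is straightforward. Here is a small made-up example of how a marker scope could be recorded around a pass; exactly which state commands you group inside the scope is of course up to the caller:</p>
<pre><code>rc.begin_state_command(sort_key);
rc.push_marker("lighting");
// ... render target / viewport setup for the lighting pass ...
rc.end_state_command();

// ... record the lighting draw calls ...

rc.begin_state_command(later_sort_key);
rc.pop_marker("lighting");
rc.end_state_command();
</code></pre>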
<h2><a id="2_Rendering_146"></a>2. Rendering</h2>
<p>With most state commands covered let’s move on and look at how to record commands for triggering draw calls and compute work to a <code>RenderContext</code>.</p>
<p>For that we have a single function called <code>render()</code>:</p>
<pre><code>class RenderContext
{
    RenderJobPackage *render(const RenderJobPackage *job,
        const ShaderTemplate::Context &shader_context, uint64_t interleave_sort_key = 0,
        uint64_t shader_pass_branch_key = 0, float job_sort_depth = 0.f,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};
</code></pre>
<p><strong><code>job</code></strong></p>
<p>The first argument piped to <code>render()</code> is a pointer to a <code>RenderJobPackage</code>, and as you can see the function also returns a pointer to a <code>RenderJobPackage</code>. What is going on here is that the <code>RenderJobPackage</code> piped as argument to <code>render()</code> gets copied to the <code>RenderPackageStream</code>, the copy gets patched up a bit and then a pointer to the modified copy is returned to allow the caller to do further tweaks to it. Ok, this probably needs some further explanation…</p>
<p>The <code>RenderJobPackage</code> is basically a header followed by an arbitrary length of data that together contains everything needed to make it possible for the <code>RenderDevice</code> to later translate it into either a draw call or a compute shader dispatch. In practice this means that after the <code>RenderJobPackage</code> header we also pack <code>RenderResource::render_resource_handle</code> for all resources to bind to all different shader stages as well as full representations of all non-global shader constant buffers.</p>
<p>Since we are building multiple <code>RenderContexts</code> in parallel and might be visiting the same renderable object (mesh, particle system, etc) simultaneously from multiple worker threads, we cannot mutate any state of the renderable when calling its <code>render()</code> function.</p>
<p>Typically all renderable objects have static prototypes of all <code>RenderJobPackages</code> they need to be drawn correctly (e.g. a mesh with three materials might have three <code>RenderJobPackages</code> - one per material). Naturally though, the renderable objects don’t know anything about the context in which they will be drawn (e.g. from what camera or in what kind of lighting environment) up until the point where their <code>render()</code> function gets called and the information is provided. At that point their static <code>RenderJobPackage</code> prototypes somehow need to be patched up with this information (which typically is in the form of shader constants and/or resources).</p>
<p>One way to handle that would be to create a copy of the prototype <code>RenderJobPackage</code> on the stack, patch up the stack copy and then pipe that as argument to <code>RenderContext::render()</code>. That is a fully valid approach and would work just fine, but since <code>RenderContext::render()</code> needs to create a copy of the <code>RenderJobPackage</code> anyway it is more efficient to patch up that copy directly instead. This is the reason for <code>RenderContext::render()</code> returning a pointer to the <code>RenderJobPackage</code> on the <code>RenderPackageStream</code>.</p>
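<p>In (simplified, made-up) code, the <code>render()</code> function of a renderable object might therefore look something like this. <code>MeshRenderableSketch</code>, <code>RenderView</code> and <code>patch_view_constants()</code> are hypothetical names introduced just for this illustration:</p>
<pre><code>void MeshRenderableSketch::render(RenderContext &rc, const RenderView &view) const
{
    for (uint32_t i = 0; i != _n_materials; ++i) {
        // Copy the static prototype onto the context's RenderPackageStream...
        RenderJobPackage *copy = rc.render(_job_prototypes[i], *_shader_contexts[i]);

        // ...then patch the returned copy with per-view data (hypothetical helper).
        patch_view_constants(copy, view.view_projection());
    }
}
</code></pre>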
<p>Before diving into the <code>RenderJobPackage</code> struct let’s go through the other arguments of <code>RenderContext::render()</code>:</p>
<p><strong><code>shader_context</code></strong></p>
<p>We will go through this in more detail in the post about our shader system but essentially we have an engine representation called <code>ShaderTemplate</code>, each <code>ShaderTemplate</code> has a number of <code>Contexts</code>.</p>
<p>A <code>Context</code> is basically a description of any rendering passes that needs to run for the <code>RenderJobPackage</code> to be drawn correctly when rendered in a certain “context”. E.g. a simple shader might declare two contexts: <em>“default”</em> and <em>“shadow”</em>. The <em>“default”</em> context would be used for regular rendering from a player camera, while the <em>“shadow”</em> context would be used when rendering into a shadow map.</p>
<p>What I call a “rendering pass” in this scenario is basically all shader stages (vertex, pixel, etc) together with any state blocks (rasterizer, depth stencil, blend, etc) needed to issue a draw call / dispatch a compute shader in the <code>RenderDevice</code>.</p>
<p><strong><code>interleave_sort_key</code></strong></p>
<p><code>RenderContext::render()</code> automatically figures out what sort keys / <code>Commands</code> it needs to create on its command array. Simple shaders usually only render into one layer in a single pass. In those scenarios <code>RenderContext::render()</code> will create a single <code>Command</code> on the command array. When using a more complex shader that renders into multiple layers and/or needs to render in multiple passes; more than one <code>Command</code> will be created, each command referencing the same <code>RenderJobPackage</code> in its <code>Command::head</code> pointer.</p>
<p>This can feel a bit abstract and is hard to explain without giving you the full picture of how the shader system works together with the data-driven rendering system, which in turn dictates the bit allocation patterns of the sort keys. For now it’s enough to understand that the shader system somehow knows what <code>Commands</code> to create on the command array.</p>
<p>The shader author can also decide to bypass the data-driven rendering system and put the scheduling responsibility entirely in the hands of the caller of <code>RenderContext::render()</code>, in this case the sort key of all <code>Commands</code> created will simply become 0. This is where the <code>interleave_sort_key</code> comes into play, this variable will be bitwise ORed with the sort key before being stored in the <code>Command</code>.</p>
<p><strong><code>shader_pass_branch_key</code></strong></p>
<p>The shader system has a feature for allowing users to dynamically turn on/off certain rendering passes. Again this becomes somewhat abstract without providing the full picture but basically this system works by letting the shader author flag certain passes with a “tag”. A tag is simply a string that gets mapped to a bit within a 64 bit bit-mask. By bitwise ORing together multiple of these tags and piping the result in <code>shader_pass_branch_key</code> the user can control what passes to activate/deactivate when rendering the <code>RenderJobPackage</code>.</p>
<p><strong><code>job_sort_depth</code></strong></p>
<p>A normalized [0-1] floating point value used for controlling depth sorting between <code>RenderJobPackages</code>. As you will see in the next post this value simply gets mapped into a bit range of the sort key, removing the need for doing any kind of special trickery to manage things like back-to-front / front-to-back sorting of <code>RenderJobPackages</code>.</p>
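<p>Conceptually this is just a quantization of the depth into a fixed bit range of the key. A tiny sketch of the idea (the number of bits and where they sit in the key is covered in the next post; <code>depth_bit_offset</code> below is made up):</p>
<pre><code>#include <cstdint>

// Quantize a [0-1] depth value into `n_bits` bits so it can be ORed into a sort key.
// Inverting the quantized value (or not) gives back-to-front vs front-to-back ordering.
uint64_t quantize_depth(float depth01, uint32_t n_bits)
{
    if (depth01 < 0.f) depth01 = 0.f;
    if (depth01 > 1.f) depth01 = 1.f;
    const uint64_t max_value = (1ull << n_bits) - 1ull;
    return (uint64_t)(depth01 * (float)max_value);
}

// Example: sort_key |= quantize_depth(job_sort_depth, 16) << depth_bit_offset;
</code></pre>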
<p><strong><code>gpu_affinity_mask</code></strong></p>
<p>Same as the <code>gpu_affinity_mask</code> parameter piped to <code>begin_state_command()</code>.</p>
<h3><a id="RenderJobPackage_204"></a><code>RenderJobPackage</code></h3>
<p>Let’s take a look at the actual <code>RenderJobPackage</code> struct:</p>
<pre><code>struct RenderJobPackage
{
    BatchInfo batch_info;
#if defined(COMPUTE_SUPPORTED)
    ComputeInfo compute_info;
#endif
    uint32_t size;                        // size of entire package including extra data
    uint32_t n_resources;                 // number of resources assigned to job.
    uint32_t resource_offset;             // offset from start of RenderJobPackage to first RenderResource.
    uint32_t shader_resource_data_offset; // offset to shader resource data
    RenderResource::Handle shader;        // shader used to execute job
    uint64_t instance_hash;               // unique hash used for instance merging
#if defined(DEVELOPMENT)
    ResourceID resource_tag;              // debug tag associating job to a resource on disc
    IdString32 object_tag;                // debug tag associating job to an object
    IdString32 batch_tag;                 // debug tag associating job to a sub-batch of an object
#endif
};
</code></pre>
<p><strong><code>batch_info</code> & <code>compute_info</code></strong></p>
<p>First two members are two nestled POD structs mainly containing the parameters needed for doing any kind of drawing or dispatching of compute work in the <code>RenderDevice</code>:</p>
<pre><code>struct BatchInfo
{
    enum PrimitiveType {
        TRIANGLE_LIST,
        LINE_LIST
        // ...
    };
    enum FrontFace {
        COUNTER_CLOCK_WISE = 0,
        CLOCK_WISE = 1
    };
    PrimitiveType primitive_type;
    uint32_t vertex_offset; // Offset to first vertex to read from vertex buffer.
    uint32_t primitives;    // Number of primitives to draw
    uint32_t index_offset;  // Offset to the first index to read from the index buffer
    uint32_t vertices;      // Number of vertices in batch (used if batch isn't indexed)
    uint32_t instances;     // Number of instances of this batch to draw
    FrontFace front_face;   // Defines which triangle winding order to use
};
</code></pre>
<p>Most of these are self-explanatory; I think the only thing worth pointing out is the <code>front_face</code> enum. This is here to dynamically handle flipping of the primitive winding order when dealing with objects that are negatively scaled on an odd number of axes. For typical game content it’s rare to see content creators using mesh mirroring when modeling; for other industries, however, it is a normal workflow.</p>
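<p>A common way to detect the mirrored case is to look at the sign of the determinant of the world matrix’ upper 3x3 part; a negative determinant means an odd number of axes are negatively scaled and the winding should be flipped. A small sketch of that idea (not the actual Stingray code):</p>
<pre><code>// Returns true if the transform mirrors geometry (odd number of negatively scaled axes),
// in which case BatchInfo::front_face should be flipped.
bool is_mirrored(const float m[3][3])
{
    const float det =
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1]) -
        m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0]) +
        m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    return det < 0.f;
}
</code></pre>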
<pre><code>struct ComputeInfo
{
    uint32_t thread_count[3];
    bool async;
};
</code></pre>
<p>So while <code>BatchInfo</code> mostly holds the parameters needed to render something, <code>ComputeInfo</code> holds the parameters needed to dispatch a compute shader. The three element array <code>thread_count</code> contains the thread group counts for x, y and z. If <code>async</code> is true, the graphics API’s “compute queue” will be used instead of the “graphics queue”.</p>
<p><strong><code>resource_offset</code></strong></p>
<p>Byte offset from start of <code>RenderJobPackage</code> to an array of <code>n_resources</code> with <code>RenderResource::Handle</code>. Resources found in this array can be of the type <code>VertexStream</code>, <code>IndexStream</code> or <code>VertexDeclaration</code>. Based on their type and order in the array they get bound to the input assembler stage in the <code>RenderDevice</code>.</p>
<p><strong><code>shader_resource_data_offset</code></strong></p>
<p>Byte offset from start of <code>RenderJobPackage</code> to a data block holding handles to all <code>RenderResources</code> as well as all constant buffer data needed by all the shader stages. The layout of this data blob will be covered in the post about the shader system.</p>
<p><strong><code>instance_hash</code></strong></p>
<p>We have a system for doing what we call “instance merging”, this system figures out if two <code>RenderJobPackages</code> only differ on certain shader constants and if so merges them into the same draw call. The shader author is responsible but not required to implement support for this feature. If the shader supports “instance merging” the system will use the <code>instance_hash</code> to figure out if two <code>RenderJobPackages</code> can be merged or not. Typically the <code>instance_hash</code> is simply a hash of all <code>RenderResource::Handle</code> that the shader takes as input.</p>
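<p>As an illustration, such a hash could be built by simply mixing all bound resource handles together. The helper below is a made-up sketch (FNV-1a style), not the actual Stingray hashing code:</p>
<pre><code>#include <cstdint>

// FNV-1a style mix of the resource handles a draw call binds. Two RenderJobPackages
// that hash to the same value bind the same buffers/textures and are merge candidates.
uint64_t instance_hash(const uint32_t *resource_handles, uint32_t n_handles)
{
    uint64_t h = 14695981039346656037ull;
    for (uint32_t i = 0; i != n_handles; ++i) {
        h ^= resource_handles[i];
        h *= 1099511628211ull;
    }
    return h;
}
</code></pre>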
<p><strong><code>resource_tag</code> & <code>object_tag</code> & <code>batch_tag</code></strong></p>
<p>Three levels of debug information to make it easier to backtrack errors/warnings inside the <code>RenderDevice</code> to the offending content.</p>
<h2><a id="3_Resource_updates_291"></a>3. Resource updates</h2>
<p>The last type of commands are for dynamically updating various <code>RenderResources</code> (Vertex/Index/Raw buffers, Textures, etc).</p>
<p>The interface for updating a buffer with new data looks like this:</p>
<pre><code>class RenderContext
{
    void *map_write(RenderResource *resource, render_sorting::SortKey sort_key,
        const ShaderTemplate::Context *shader_context = 0,
        shader_pass_branching::Flags shader_pass_branch_key = 0,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};
</code></pre>
<p><strong><code>resource</code></strong></p>
<p>This function basically returns a pointer to the first byte of the buffer that will replace the contents of the <code>resource</code>. <code>map_write()</code> figures out the size of the buffer by casting the <code>resource</code> to the correct type (using the type information encoded in the <code>RenderResource::render_resource_handle</code>). It then allocates memory for the buffer and a small header on the <code>RenderPackageStream</code> and returns a pointer to the buffer.</p>
<p><strong><code>sort_key</code> & <code>shader_context</code> & <code>shader_pass_branch_key</code></strong></p>
<p>In some rare situations you might need to update the same buffer with different data multiple times within a frame. A typical example could be the vertex buffer of a particle system implementing some kind of level-of-detail system causing the buffers to change depending on e.g. camera position. To support that the user can provide a bunch of extra parameters to make sure the contents of the GPU representation of the buffer is updated right before the graphics API draw calls are triggered for the different rendering passes. This works in a similar way to how <code>RenderContext::render()</code> can create multiple <code>Commands</code> on the command array referencing the same data.</p>
<p>Unless you need to update the buffer multiple times within the frame it is safe to just set all of the above mentioned parameters to 0, making it very simple to update a buffer:</p>
<pre><code>void *buf = rc.map_write(resource, 0);
// .. fill bits in buffer ..
</code></pre>
<p>Note: To shorten the length of this post I’ve left out a few other flavors of updating resources, but <code>map_write</code> is the most important one to grasp.</p>
<h1><a id="GPU_Queues_Fences__Explicit_MGPU_programming_325"></a>GPU Queues, Fences & Explicit MGPU programming</h1>
<p>Before wrapping up I’d like to touch on a few recent additions to the Stingray renderer, namely how we’ve exposed control for dealing with different GPU Queues, how to synchronize between them and how to control, communicate and synchronize between multiple GPUs.</p>
<p>New graphics APIs such as DX12 and Vulkan expose three different types of command queues: <em>Graphics</em>, <em>Compute</em> and <em>Copy</em>. There’s plenty of information on the web about this so I won’t cover it here; the only thing important to understand is that these queues can execute asynchronously on the GPU, hence we need to have a way to synchronize between them.</p>
<p>To handle that we have exposed a simple fence API that looks like this:</p>
<pre><code>class RenderContext
{
    struct FenceMessage
    {
        enum Operation { SIGNAL, WAIT };
        Operation operation;
        IdString32 fence_name;
    };

    void signal_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void wait_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
};
</code></pre>
<p>Here’s a pseudo code snippet showing how to synchronize between the graphics queue and the compute queue:</p>
<pre><code>uint64_t sort_key = 0;
// record a draw call
rc.render(graphics_job, graphics_shader, sort_key++);
// record an asynchronous compute job
// (ComputeInfo::async bool in async_compute_job is set to true to target the graphics APIs compute queue)
rc.render(async_compute_job, compute_shader, sort_key++);
// now lets assume the graphics queue wants to use the result of the async_compute_job,
// for that we need to make sure that the compute shader is done running
rc.wait_fence(IdString32("compute_done"), sort_key++, GRAPHICS_QUEUE);
rc.signal_fence(IdString32("compute_done"), sort_key++, COMPUTE_QUEUE);
rc.render(graphics_job_using_result_from_compute, graphics_shader2, sort_key++);
</code></pre>
<p>As you might have noticed all methods for populating a <code>RenderContext</code> described in this post also takes an extra parameter called <code>gpu_affinity_mask</code>. This is a bit-mask used for directing commands to one or many GPUs. The idea is simple, when we boot up the renderer we enumerate all GPUs present in the system and decide which one to use as our default GPU (<code>GPU_DEFAULT</code>) and assign that to bit 1. We also let the user decide if there are other GPUs present in the system that should be available to Stingray and if so assign them bit 2, 3, 4, and so on. By doing so we can explicitly direct control of all commands put on the <code>RenderContext</code> to one or many GPUs in a simple way.</p>
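<p>In other words, <code>gpu_affinity_mask</code> is just an ordinary bit mask. A hypothetical usage example (the <code>GPU_0</code>/<code>GPU_1</code> constants are made up for illustration, with <code>GPU_0</code> corresponding to <code>GPU_DEFAULT</code>):</p>
<pre><code>enum { GPU_0 = 1 << 0, GPU_1 = 1 << 1 };   // one bit per enumerated GPU

// Record a clear that only runs on the second GPU...
rc.begin_state_command(sort_key, GPU_1);
rc.clear(RenderContext::CLEAR_SURFACE);
rc.end_state_command();

// ...and a draw call that runs on both.
rc.render(job, shader_context, sort_key, 0, 0.f, GPU_0 | GPU_1);
</code></pre>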
<p>As you can see, that is also true for the fence API described above. On top of that, there’s also a need for a copy interface for copying resources between GPUs:</p>
<pre><code>class RenderContext
{
    void copy(RenderResource *dst_resource, RenderResource *src_resource,
        render_sorting::SortKey sort_key, Box *src_box = 0, uint32_t dst_offsets[3] = 0,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT,
        uint32_t gpu_source = GPU_DEFAULT, uint32_t gpu_destination = GPU_DEFAULT);
};
</code></pre>
<p>Even though this work isn’t fully completed I still wanted to share the high-level idea of what we are working towards for exposing explicit MGPU control to the Stingray renderer. We are actively working on this right now and with some luck I might be able to revisit this with more concrete examples when getting to the post about the <em>render_config</em> & data-driven rendering.</p>
<h1><a id="Next_up_386"></a>Next up</h1>
<p>With that I think I’ve covered the most important aspects of the <code>RenderContext</code>. Next post will dive a bit deeper into bit allocation ranges of the sort keys and the system for sorting in general, hopefully that post will become a bit shorter.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com55tag:blogger.com,1999:blog-1994130783874175266.post-76513632512585302452017-02-01T21:07:00.002+01:002017-02-01T21:23:58.784+01:00Stingray Renderer Walkthrough #2: Resources & Resource Contexts<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #2: Resources & Resource Contexts</title><style></style></head><body id="preview">
<h1><a id="Render_Resources_0"></a>Render Resources</h1>
<p>Before any rendering can happen we need a way to reason about GPU resources. Since we want all graphics API specific code to stay isolated we need some kind of abstraction on the engine side; for that we have an interface called <code>RenderDevice</code>. All calls to graphics APIs like D3D, OGL, GNM, Metal, etc. stay behind this interface. We will be covering the <code>RenderDevice</code> in a later post so for now just know that it is there.</p>
<p>We want to have a graphics API agnostic representation for a bunch of different types of resources and we need to link these representations to their counterparts on the <code>RenderDevice</code> side. This linking is handled through a POD-struct called <code>RenderResource</code>:</p>
<pre><code>struct RenderResource
{
    enum {
        TEXTURE, RENDER_TARGET, DEPENDENT_RENDER_TARGET, BACK_BUFFER_WRAPPER,
        CONSTANT_BUFFER, VERTEX_STREAM, INDEX_STREAM, RAW_BUFFER,
        BATCH_INFO, VERTEX_DECLARATION, SHADER,
        NOT_INITIALIZED = 0xFFFFFFFF
    };
    uint32_t render_resource_handle;
};
</code></pre>
<p>Any engine resource that also needs a representation on the RenderDevice side inherits from this struct. It contains a single member <code>render_resource_handle</code> which is used to lookup the correct graphics API specific representation in the RenderDevice.</p>
<p>The most significant 8 bits of <code>render_resource_handle</code> hold the type enum; the lower 24 bits are simply an index into an array for that specific resource type inside the RenderDevice.</p>
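<p>A small sketch of what that packing boils down to (the helper names are made up, but the bit layout is the one described above):</p>
<pre><code>#include <cstdint>

inline uint32_t make_render_resource_handle(uint32_t type, uint32_t index)
{
    return (type << 24) | (index & 0x00FFFFFFu);
}

inline uint32_t resource_type(uint32_t render_resource_handle)
{
    return render_resource_handle >> 24;
}

inline uint32_t resource_index(uint32_t render_resource_handle)
{
    return render_resource_handle & 0x00FFFFFFu;
}
</code></pre>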
<h1><a id="Various_Render_Resources_24"></a>Various Render Resources</h1>
<p>Let’s take a look at the different render resources that can be found in Stingray:</p>
<ul>
<li><code>Texture</code> - A regular texture, this object wraps all various types of different texture layouts such as 2D, Cube, 3D.</li>
<li><code>RenderTarget</code> - Basically the same as <code>Texture</code> but writable from the GPU.</li>
<li><code>DependentRenderTarget</code> - Similar to <code>RenderTarget</code> but with logic for inheriting properties from another <code>RenderTarget</code>. This is used for creating render targets that need to be reallocated when the output window (swap chain) is resized.</li>
<li><code>BackBufferWrapper</code> - Special type of <code>RenderTarget</code> created inside the <code>RenderDevice</code> as part of the swap chain creation. Almost all render targets are explicitly created by the user, this is the only exception as the back buffer associated with the swap chain is typically created together with the swap chain.</li>
<li><code>ShaderConstantBuffer</code> - Shader constant buffers designed for explicit update and sharing between multiple shaders, mainly used for “view-global” state.</li>
<li><code>VertexStream</code> - A regular Vertex Buffer.</li>
<li><code>VertexDeclaration</code> - Describes the contents of one or many <code>VertexStreams</code>.</li>
<li><code>IndexStream</code> - A regular Index Buffer.</li>
<li><code>RawBuffer</code> - A linear memory buffer, can be set up for GPU writing through a UAV (Unordered Access View).</li>
<li><code>Shader</code> - For now just think of this as something containing everything needed to build a full pipeline state object (PSO). Basically a wrapper over a number of shaders, render states, sampler states etc. I will cover the shader system in a later post.</li>
</ul>
<p>Most of the above resources have a few things in common:</p>
<ul>
<li>They describe a buffer either populated by the CPU or by the GPU</li>
<li>CPU populated buffers have a validity field describing their update frequency:
<ul>
<li><code>STATIC</code> - The buffer is immutable and won’t change after creation, typically most buffers coming from DCC assets are <code>STATIC</code>.</li>
<li><code>UPDATABLE</code> - The buffer can be updated but changes less than once per frame, e.g: UI elements, post processing geometry and similar.</li>
<li><code>DYNAMIC</code> - The buffer frequently changes, at least once per frame but potentially many times in a single frame e.g: particle systems.</li>
</ul>
</li>
<li>They have enough data for creating a graphics API specific representation inside the RenderDevice, i.e. they know about strides, sizes, view requirements (e.g. should a UAV be created or not), etc.</li>
</ul>
<h1><a id="Render_Resource_Context_48"></a>Render Resource Context</h1>
<p>With the <code>RenderResource</code> concept sorted, we’ll go through the interface for creating and destroying the <code>RenderDevice</code> representation of the resources. That interface is called <code>RenderResourceContext</code> (RRC).</p>
<p>We want resource creation to be thread safe and while the <code>RenderResourceContext</code> in itself isn’t, we can achieve free threading by allowing the user to create any number of RRCs they want, and as long as they don’t touch the same RRC from multiple threads everything will be fine.</p>
<p>Similar to many other rendering systems in Stingray the RRC is basically just a small helper class wrapping an abstract “command buffer”. On this command buffer we put what we call “packages” describing everything that is needed for creating/destroying <code>RenderResource</code> objects. These packages have variable length depending on what kind of object they represent. In addition to that the RRC can also hold platform specific allocators that allow allocating/deallocating GPU mapped memory directly, avoiding any additional memory shuffling in the <code>RenderDevice</code>. This kind of mechanism allows for streaming e.g. textures and other immutable buffers directly into GPU memory on platforms that provide that kind of low-level control.</p>
<p>Typically the only two functions the user needs to care about are:</p>
<pre><code>class RenderResourceContext
{
public:
    void alloc(RenderResource *resource);
    void dealloc(RenderResource *resource);
};
</code></pre>
<p>When the user is done allocating/deallocating resources they hand over the RRC either directly to the <code>RenderDevice</code> or to the <code>RenderInterface</code>.</p>
<pre><code>class RenderDevice
{
public:
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc,
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};
</code></pre>
<p>Handing it over directly to the <code>RenderDevice</code> requires the caller to be on the controller thread for rendering as <code>RenderDevice::dispatch()</code> isn’t thread safe. If the caller is on any other thread (like e.g. one of the worker threads or the resource streaming thread) <code>RenderInterface::dispatch()</code> should be used instead. We will cover the <code>RenderInterface</code> in a later post so for now just think of it as a way of piping data into the renderer from an arbitrary thread.</p>
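<p>Putting it together, typical usage from e.g. a resource streaming thread could look roughly like this. <code>make_texture()</code> stands in for whatever engine code fills out the texture description, and I’m assuming <code>RenderInterface::dispatch()</code> mirrors the <code>RenderDevice</code> signature shown above:</p>
<pre><code>RenderResourceContext rrc;

Texture *texture = make_texture(/* ... */);   // hypothetical: fills out a Texture render resource
rrc.alloc(texture);                           // records a creation package, no graphics API calls yet

// Not on the render controller thread, so hand the context over via the RenderInterface.
RenderResourceContext *contexts[] = { &rrc };
render_interface.dispatch(1, contexts);
</code></pre>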
<h1><a id="Wrap_up_80"></a>Wrap up</h1>
<p>The main reason for having the <code>RenderResourceContext</code> concept instead of exposing <code>allocate()/deallocate()</code> functions directly in the <code>RenderDevice/RenderInterface</code> interfaces is efficiency. We have a need for allocating and deallocating lots of resources, sometimes in parallel from multiple threads. Decoupling the interface for doing so makes it easy to schedule when in the frame the actual RenderDevice representations get created; it also makes the code easier to maintain as we don’t have to worry about thread-safety of the <code>RenderResourceContext</code>.</p>
<p>In the next post we will discuss the <code>RenderJobs</code> and <code>RenderContexts</code> which are the two main building blocks for creating and scheduling draw calls and state changes.</p>
<p>Stay tuned.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com140tag:blogger.com,1999:blog-1994130783874175266.post-25942089989920372242017-02-01T21:07:00.001+01:002017-02-01T21:23:41.809+01:00Stingray Renderer Walkthrough #1: Overview<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough #1: Overview</title><style></style></head><body id="preview">
<h1><a id="Introduction_0"></a>Introduction</h1>
<p>When we started writing Bitsquid back in mid 2009 all platforms we intended to run on were already multi-core architectures. This and the fact that we had some prior experience trying to get our last engine to run efficiently on the PS3 answered the question of how <em>not</em> to architect an efficient renderer that scales to many cores. We knew we needed more than functional parallelism; we wanted data-parallelism.</p>
<p>To solve that we divide the CPU view of a rendered frame into three stages:</p>
<ol>
<li><code>Culling</code> - Filter out visible renderable objects with respect to a camera from a potentially huge set of different types of objects (meshes, particle systems, lights, etc).</li>
<li><code>Render</code> - Iterate over the filtered result from <code>Culling</code> and “record” an intermediate representation of draw calls/state switches to a command buffer.</li>
<li><code>Dispatch</code> - Take the result from <code>Render</code> and translate that into actual render API calls (D3D, OGL, Metal, GNM, etc).</li>
</ol>
<p>As you can see each stage pipes its result into the next. Rendering is typically very simple in that sense; we tend to have a one way flow of our data: user input or time affects state, state propagates into changes of the renderable objects (transforms, shader constants, etc.), we figure out what needs to be rendered, iterate over that and finally generate render API calls. Rinse & repeat.</p>
<p>If we ignore the problem of ordering the final API calls in the rendering backend it’s fairly easy to see how we can achieve data parallelism in this scenario. Just fork at each stage, splitting the workload into <em>n</em> chunks (where <em>n</em> is however many worker threads you can throw at it). When all workers are done for a stage, take the result and pipe it into the next stage.</p>
<p>In essence this is how all rendering in Stingray works. Obviously I’ve glossed over some rather important and challenging details, but as you will see they are not too hard to solve if you have good control over your data flows and are picky about when mutation of the data happens.</p>
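<p>As a rough sketch of the fork/join pattern for the <code>Render</code> stage (the job system API, ownership of the contexts and the final dispatch call are all made up for the example):</p>
<pre><code>// Illustrative fork/join for the Render stage: one chunk of visible objects per worker,
// one RenderContext per chunk. JobSystem (kick/wait_all) is a hypothetical job API.
void render_stage(JobSystem &jobs, RenderDevice &device,
    RenderableObject **visible, uint32_t n_visible, uint32_t n_workers)
{
    Array<RenderContext*> contexts(n_workers);
    const uint32_t chunk_size = (n_visible + n_workers - 1) / n_workers;

    for (uint32_t w = 0; w != n_workers; ++w) {
        contexts[w] = new RenderContext();   // recycling/ownership omitted in this sketch
        RenderContext *rc = contexts[w];
        const uint32_t begin = w * chunk_size;
        const uint32_t end = begin + chunk_size < n_visible ? begin + chunk_size : n_visible;
        jobs.kick([rc, visible, begin, end]() {
            for (uint32_t i = begin; i != end; ++i)
                visible[i]->render(*rc /*, view, etc. */);   // render() is const, no mutation
        });
    }

    jobs.wait_all();

    // Hand the recorded contexts over to the backend (the Dispatch stage).
    for (uint32_t w = 0; w != n_workers; ++w)
        device.dispatch(*contexts[w]);   // hypothetical; the real interface is covered in later posts
}
</code></pre>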
<h1><a id="Design_Philosophies__Concepts_16"></a>Design Philosophies & Concepts</h1>
<p>The rendering code in Stingray tends to be heavily influenced by Data Oriented Programming principles. When designing new systems our biggest efforts usually goes into structuring our data efficiently and thinking about its flow through the systems, more so than writing the actual code that transforms the data from one form to another.</p>
<p>To achieve data-parallelism throughout the rendering code the first thing to realize is that we have to be very picky about when mutation of the renderable objects happens. Multiple worker threads will run over our objects and it’s not unlikely that more than one thread visits the same object at the same time; hence we must not mutate the state of our objects in their render functions. Therefore all of our <code>render()</code> functions are <code>const</code>.</p>
<p>To further guard ourselves from the outer world (i.e. gameplay, physics, etc.) the renderer operates in complete isolation from the game logic. It has its own representation of the data it needs, and only the data relevant for rendering. While the gameplay logic usually wants to reason about high-level concepts such as game entities (which basically group a number of meshes, particle systems, lights, etc. together), we on the rendering side don’t really care about that. We are much more interested in just having an array of all renderable objects in a game world, in a memory layout that makes it efficient to access.</p>
<p>Another nice thing with decoupling the representation of the renderable objects from the game objects is that it allows us to run simulation in parallel with rendering (functional parallelism). So while simulation is updating frame <em>n</em> the renderer is processing frame <em>n-1</em>. Some of you might argue that overlaying rendering on top of simulation doesn’t give any performance improvements if the work in all systems is nicely parallelized. In reality though this isn’t really the case. We still have systems that don’t go wide, or have certain sections where they need to do synchronous processing (last generation graphics APIs, e.g. DX11 and OpenGL, are good examples). This creates bubbles in the frame, slowing us down.</p>
<p>By overlaying simulation and rendering we get a form of bubble filling among the worker threads which in most cases gives a big enough speed improvement to justify the added complexity that comes from this architecture. More specifically:</p>
<ol>
<li>Double buffering of state - since the simulation might mutate the state of an object for frame <code>n</code> at the same time as the renderer is processing frame <code>n-1</code>, any mutable state needs to be double buffered (see the small sketch after this list).</li>
<li>Lifetime tracking of immutable data - while immutable/read-only state such as static vertex and index buffers is safe to read by both simulation and renderer, we still need to be careful not to pull the rug out from under the renderer’s feet by freeing anything still in use by the renderer.</li>
</ol>
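<p>The first point, double buffering, boils down to something like this minimal sketch: keep two copies of any state the simulation can write while the renderer reads, indexed by frame parity:</p>
<pre><code>// Minimal illustration of double buffered mutable state (not the actual Stingray code).
// The simulation writes the slot for frame `n` while the renderer still reads the slot
// written for frame `n-1`.
struct DoubleBufferedTransform
{
    float4x4 world[2];

    void write(uint32_t simulation_frame, const float4x4 &m) { world[simulation_frame & 1] = m; }
    const float4x4 &read(uint32_t render_frame) const { return world[render_frame & 1]; }
};
</code></pre>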
<p>Here’s a conceptual graph showing the benefits of overlaying simulation and rendering:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVODNgfSOk6I1R9QaZMMQub_p6641HgBSSw3DUFl88xdsy4vF5cGtRWiRlIcy3myTm_lN15bi84NRqP6IqpVk0XaME3fsbCiyrul6LhCy7s-59Vr7aO8Qcm5ib748j6OVUXi8JEb74EYU/s1600/frame-overlaying.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVODNgfSOk6I1R9QaZMMQub_p6641HgBSSw3DUFl88xdsy4vF5cGtRWiRlIcy3myTm_lN15bi84NRqP6IqpVk0XaME3fsbCiyrul6LhCy7s-59Vr7aO8Qcm5ib748j6OVUXi8JEb74EYU/s1600/frame-overlaying.png" width="640" height="111" /></a></div>
<p>So basically what we have here is two “controller threads”: <code>simulation</code> and <code>render</code>, both offloading work to the worker threads. In the case that a controller thread is blocked waiting for some work to finish, it will assist the worker threads, striving to never sit idle. One thing to note is that to prevent frames from stacking up, we never allow the simulation thread to run more than one frame ahead of the render thread.</p>
<p>As a comparison here’s the same workload with simulation and rendering running in sequence.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjevYgDYrKO7FAn_lQf8_9Hq_hbxzlvgJ3FrPhhx0erBzIHhgPtGcrdZsEdvz1AQtSkuqXFykk1GqNIyhXge7jMd7uLuXjagaGbIaQFstvp4PLU7BCWYkRO6fg9Afz-u2V3K5vb0q3SSwA/s1600/sequential-simulation-rendering.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjevYgDYrKO7FAn_lQf8_9Hq_hbxzlvgJ3FrPhhx0erBzIHhgPtGcrdZsEdvz1AQtSkuqXFykk1GqNIyhXge7jMd7uLuXjagaGbIaQFstvp4PLU7BCWYkRO6fg9Afz-u2V3K5vb0q3SSwA/s1600/sequential-simulation-rendering.png" width="640" height="111" /></a></div>
<p>As you can see we get significantly more idle time (bubbles) on the worker threads due to certain parts of both the simulation and rendering not being able to go wide.</p>
<h1><a id="Next_up_43"></a>Next up</h1>
<p>I think this pretty much covers the high level view of the core rendering architecture in Stingray. Now let’s go into some more detail.</p>
<p>Since Andreas Asplund recently covered both how we handle propagation of state from simulation to the renderer (we call this “State reflection” in Stingray): <a href="http://bitsquid.blogspot.se/2016/09/state-reflection.html">http://bitsquid.blogspot.se/2016/09/state-reflection.html</a> as well as how our view frustum culling system(s) works: <a href="http://bitsquid.blogspot.se/2016/10/the-implementation-of-frustum-culling.html">http://bitsquid.blogspot.se/2016/10/the-implementation-of-frustum-culling.html</a> I won’t be covering that in this series.</p>
<p>Instead I will jump straight into how creating and destroying GPU resources works, and from there go through all the building blocks needed to implement the second stage <code>Render</code> mentioned above.</p>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com50tag:blogger.com,1999:blog-1994130783874175266.post-3867669073677532912017-02-01T21:07:00.000+01:002017-03-14T10:29:37.820+01:00Stingray Renderer Walkthrough<!DOCTYPE html><html><head><meta charset="utf-8"><title>Stingray Renderer Walkthrough</title><style></style></head><body id="preview">
<h1><a id="Welcome_0"></a>Welcome</h1>
<p>To simplify knowledge transferring inside the Autodesk development teams and in an attempt to improve my writing skills I’ve decided to do a walkthrough of the Stingray rendering architecture. The idea is to do this as a series of blog posts over the coming weeks starting from the low-level aspects of the renderer chewing my way up to more high-level concepts as I go.</p>
<p>I’ve covered some of these topics before in various presentations over the years but those have been more focused on how our data driven aspects of the renderer works and less on the core architecture behind it. This is an attempt to do a more complete walk-through of the entire rendering architecture.</p>
<p>When I started thinking about this it felt like an almost impossible undertaking considering how much slower I am at expressing myself in text than in code, but after spending a couple of days going through the entire stingray code base doing some spring cleaning it felt a bit more manageable so I’ve now decided to give it a try.</p>
<p>(Note: this has nothing at all to do with me feeling the pressure from Niklas Frykholm who’s currently doing a complete walk-through of the entire Stingray engine code base (well everything except rendering) as a series of youtube videos [<a href="https://www.youtube.com/playlist?list=PLUxuJBZBzEdxzVpoBQY9agA8JUgNkeYSV">1</a>]. Not at all… I feel no pressure, no guilt, nothing… I promise… Thanks Niklas for pushing me!)</p>
<h1><a id="Outline_10"></a>Outline</h1>
<p>Below is some kind of outline of what I intend to cover and in what order, I might swap things around as I go if I discover it makes more sense. This post will work as an index and I will link to the posts as they come online.</p>
<ol>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-1-overview.html">Overview</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-2.html">Resources & Resource Contexts</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-3-render.html">Render Contexts</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-4-sorting.html">Sorting</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-5.html">RenderDevice</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/02/stingray-renderer-walkthrough-6.html">RenderInterface</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/03/stingray-renderer-walkthrough-7-data.html">Data-driven rendering</a></li>
<li><a href="http://bitsquid.blogspot.com/2017/03/stingray-renderer-walkthrough-8.html">Stingray-renderer & Mini-renderer</a></li>
<li>Shaders & Materials</li>
</ol>
</body></html>Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com380tag:blogger.com,1999:blog-1994130783874175266.post-3535974906423371812016-10-04T21:40:00.000+02:002016-10-04T21:40:17.513+02:00The Implementation of Frustum Culling in Stingray<h1>Overview</h1>
<p>Frustum culling can be an expensive operation. Stingray accelerates it by making heavy use of SIMD and distributing the workload over several threads. The basic workflow is:</p>
<ul>
<li>Kick jobs to do frustum vs sphere culling
<ul>
<li>For each frustum plane, test plane vs sphere</li>
</ul>
</li>
<li>Wait for sphere culling to finish</li>
<li>For objects that pass sphere test, kick jobs to do frustum vs object-oriented bounding box (OOBB) culling
<ul>
<li>For each frustum plane, test plane vs OOBB</li>
</ul>
</li>
<li>Wait for OOBB culling to finish</li>
</ul>
<p>Frustum vs sphere tests are significantly faster than frustum vs OOBB. By rejecting objects that fail sphere culling first, we have fewer objects to process in the more expensive OOBB pass.</p>
<p>Why go over all objects brute force instead of using some sort of spatial partition data structure? We like to keep things simple and with the current setup we have yet to encounter a case where we've been bound by the culling. Brute force sphere culling followed by OOBB culling is fast enough for all cases we've encountered so far. That might of course change in the future, but we'll take care of that when it's an actual problem.</p>
<p>The brute force culling is pretty fast, because:</p>
<ol>
<li>The sphere and the OOBB culling use SIMD and only load the minimum amount of needed data.</li>
<li>The workload is distributed over several threads.</li>
</ol>
<p>In this post, we will first look at the single threaded SIMD code and then how the culling is distributed over multiple threads.</p>
<p>I'll use a lot of code to show how it's all done. It's mostly actual code from the engine, but it has been cleaned up to a certain extent. Some stuff has been renamed and/or removed to make it easier to understand what's going on.</p>
<h1>Data structures used</h1>
<p>If you go back to my previous post about state reflection, <a href="http://bitsquid.blogspot.com/2016/09/state-reflection.html" title="State reflection">http://bitsquid.blogspot.ca/2016/09/state-reflection.html</a> you can read that each object on the main thread is associated with a render thread representation via a <code>render_handle</code>. The <code>render_handle</code> is used to get the <code>object_index</code> which is the index of an object in the <code>_objects</code> array. </p>
<p>Take a look at the following code for a refresher:</p>
<pre><code>void RenderWorld::create_object(WorldRenderInterface::ObjectManagementPackage *omp)
{
    // Acquire an `object_index`.
    uint32_t object_index = _objects.size();
    // Same recycling mechanism as seen for render handles.
    if (_free_object_indices.any()) {
        object_index = _free_object_indices.back();
        _free_object_indices.pop_back();
    } else {
        _objects.resize(object_index + 1);
        _object_types.resize(object_index + 1);
    }

    void *render_object = omp->user_data;
    if (omp->type == RenderMeshObject::TYPE) {
        // Cast the `render_object` to a `MeshObject`.
        RenderMeshObject *rmo = (RenderMeshObject*)render_object;
        // If needed, do more stuff with `rmo`.
    }

    // Store the `render_object` and `type`.
    _objects[object_index] = render_object;
    _object_types[object_index] = omp->type;

    if (omp->render_handle >= _object_lut.size())
        _object_lut.resize(omp->render_handle + 1);

    // The `render_handle` is used to map from the main thread representation
    // to the `object_index` of the render thread representation.
    _object_lut[omp->render_handle] = object_index;
}
</code></pre>
<p>The <code>_objects</code> array stores objects of all kinds of different types. It is defined as:</p>
<pre><code>Array<void*> _objects;
</code></pre>
<p>The types of the objects are stored in a corresponding <code>_object_types</code> array, defined as:</p>
<pre><code>Array<uint32_t> _object_types;
</code></pre>
<p>From <code>_object_types</code>, we know the actual type of the objects and we can use that to cast the <code>void *</code> into the proper type (mesh, terrain, gui, particle_system, etc).</p>
<p>The culling happens in the <code>// If needed, do more stuff with rmo</code> section above. It looks like this:</p>
<pre><code>void *render_object = omp->user_data;
if (omp->type == RenderMeshObject::TYPE) {
    // Cast the `render_object` to a `MeshObject`.
    RenderMeshObject *rmo = (RenderMeshObject*)render_object;
    // If needed, do more stuff with `rmo`.
    if (!(rmo->flags() & renderable::CULLING_DISABLED)) {
        culling::Object o;

        // Extract necessary information to do culling.
        // The index of the object.
        o.id = object_index;
        // The type of the object.
        o.type = rmo->type;
        // Get the minimum and maximum corner positions of a bounding box in object space.
        o.min = float4(rmo->bounding_volume().min, 1.f);
        o.max = float4(rmo->bounding_volume().max, 1.f);
        // World transform matrix.
        o.m = float4x4(rmo->world());

        // Depending on the value of `flags` add the culling representation to different culling sets.
        if (rmo->flags() & renderable::VIEWPORT_VISIBLE)
            _cullable_objects.add(o, rmo->node());
        if (rmo->flags() & renderable::SHADOW_CASTER)
            _cullable_shadow_casters.add(o, rmo->node());
        if (rmo->flags() & renderable::OCCLUDER)
            _occluders.add(o, rmo->node());
    }
}
</code></pre>
<p>For culling, <code>MeshObject</code>s and other cullable types are represented by <code>culling::Object</code>s that are used to populate the culling data structures. As can be seen in the code, the culling sets are <code>_cullable_objects</code>, <code>_cullable_shadow_casters</code> and <code>_occluders</code>, and they are all represented by an <code>ObjectSet</code>:</p>
<pre><code>struct ObjectSet
{
    // Minimum bounding box corner position.
    Array<float> min_x;
    Array<float> min_y;
    Array<float> min_z;
    // Maximum bounding box corner position.
    Array<float> max_x;
    Array<float> max_y;
    Array<float> max_z;
    // Object->world matrix.
    Array<float> world_xx;
    Array<float> world_xy;
    Array<float> world_xz;
    Array<float> world_xw;
    Array<float> world_yx;
    Array<float> world_yy;
    Array<float> world_yz;
    Array<float> world_yw;
    Array<float> world_zx;
    Array<float> world_zy;
    Array<float> world_zz;
    Array<float> world_zw;
    Array<float> world_tx;
    Array<float> world_ty;
    Array<float> world_tz;
    Array<float> world_tw;
    // World space center position of bounding sphere.
    Array<float> ws_pos_x;
    Array<float> ws_pos_y;
    Array<float> ws_pos_z;
    // Radius of bounding sphere.
    Array<float> radius;
    // Flag to indicate if an object is culled or not.
    Array<uint32_t> visibility_flag;
    // The type and id of an object.
    Array<uint32_t> type;
    Array<uint32_t> id;
    uint32_t n_objects;
};
</code></pre>
<p>When an object is added to, e.g. <code>_cullable_objects</code> the <code>culling::Object</code> data is added to the <code>ObjectSet</code>. The <code>ObjectSet</code> flattens the data into a structure-of-arrays representation. The arrays are padded to the SIMD lane count to make sure there's valid data to read.</p>
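<p>As an illustration, adding a <code>culling::Object</code> to a set amounts to appending one element to each array, including deriving the world space bounding sphere from the object space bounding box and the world matrix. The sketch below is simplified (a conservative sphere from the transformed box, using hypothetical <code>float3</code>, <code>transform()</code> and <code>max_scale()</code> helpers and a hypothetical <code>push_back()</code> on <code>Array</code>) and not the exact Stingray code:</p>
<pre><code>void add_to_object_set(ObjectSet &set, const culling::Object &o)
{
    // Object space bounds, stored as structure-of-arrays.
    set.min_x.push_back(o.min.x); set.min_y.push_back(o.min.y); set.min_z.push_back(o.min.z);
    set.max_x.push_back(o.max.x); set.max_y.push_back(o.max.y); set.max_z.push_back(o.max.z);
    // ... push the 16 world matrix elements into world_xx .. world_tw in the same way ...

    // Conservative world space bounding sphere: transformed box center + scaled half extents.
    const float3 center_os = 0.5f * (float3(o.min.x, o.min.y, o.min.z) + float3(o.max.x, o.max.y, o.max.z));
    const float3 center_ws = transform(o.m, center_os);          // hypothetical point transform
    const float3 half = 0.5f * (float3(o.max.x, o.max.y, o.max.z) - float3(o.min.x, o.min.y, o.min.z));
    const float radius = length(half) * max_scale(o.m);          // max_scale(): hypothetical helper

    set.ws_pos_x.push_back(center_ws.x);
    set.ws_pos_y.push_back(center_ws.y);
    set.ws_pos_z.push_back(center_ws.z);
    set.radius.push_back(radius);

    set.visibility_flag.push_back(0);
    set.type.push_back(o.type);
    set.id.push_back(o.id);
    ++set.n_objects;

    // Note: the arrays also need to stay padded to the SIMD lane count, omitted here.
}
</code></pre>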
<h1>Frustum-sphere culling</h1>
<p>The world space positions and sphere radii of objects are represented by the following members of the <code>ObjectSet</code>:</p>
<pre><code>Array<float> ws_pos_x;
Array<float> ws_pos_y;
Array<float> ws_pos_z;
Array<float> radius;
</code></pre>
<p>This is all we need to do frustum-sphere culling.</p>
<p>The frustum-sphere culling needs the planes of the frustum defined in world space. Information on how to find that can be found in: <a href="http://gamedevs.org/uploads/fast-extraction-viewing-frustum-planes-from-world-view-projection-matrix.pdf">http://gamedevs.org/uploads/fast-extraction-viewing-frustum-planes-from-world-view-projection-matrix.pdf</a>.</p>
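<p>For completeness, here is roughly what that extraction looks like. Row/column conventions, the clip space depth range and the sign of <code>d</code> depend on your math library, so treat this as a sketch rather than drop-in code (it assumes column vectors and a GL-style clip space):</p>
<pre><code>#include <cmath>

struct Plane { float nx, ny, nz, d; };   // plane stored so that "inside" means n.x*x + n.y*y + n.z*z >= d

// m[row][col] is the combined world -> clip (view-projection) matrix.
// Left/right/bottom/top/near/far planes come from sums and differences of rows 0-2 with row 3.
void extract_frustum_planes(const float m[4][4], Plane planes[6])
{
    const int rows[6]  = { 0, 0, 1, 1, 2, 2 };
    const int signs[6] = { 1, -1, 1, -1, 1, -1 };   // +row => left/bottom/near, -row => right/top/far

    for (int p = 0; p != 6; ++p) {
        const int r = rows[p];
        const float s = (float)signs[p];
        planes[p].nx = m[3][0] + s * m[r][0];
        planes[p].ny = m[3][1] + s * m[r][1];
        planes[p].nz = m[3][2] + s * m[r][2];
        planes[p].d  = -(m[3][3] + s * m[r][3]);

        // Normalize so that `d` and sphere radii are in the same units.
        const float len = sqrtf(planes[p].nx * planes[p].nx +
                                planes[p].ny * planes[p].ny +
                                planes[p].nz * planes[p].nz);
        planes[p].nx /= len; planes[p].ny /= len; planes[p].nz /= len; planes[p].d /= len;
    }
}
</code></pre>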
<p>The frustum-sphere intersection code tests one plane against several spheres using SIMD instructions. The <code>ObjectSet</code> data is already laid out in a SIMD friendly way. To test one plane against several spheres, the plane's data is splatted out in the following way: </p>
<pre><code>// `float4` is our cross platform abstraction of SSE, NEON etc.
struct SIMDPlane
{
    float4 normal_x; // the normal's x value replicated 4 times.
    float4 normal_y; // the normal's y value replicated 4 times.
    float4 normal_z; // etc.
    float4 d;
};
</code></pre>
<p>The single threaded code needed to do frustum-sphere culling is:</p>
<pre><code>void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set)
{
const auto all_true = bool4_all_true();
const uint32_t n_objects = object_set.n_objects;
uint32_t *visibility_flag = object_set.visibility_flag.begin();
// Test each plane of the frustum against each sphere.
for (uint32_t i = 0; i < n_objects; i += 4)
{
const auto ws_pos_x = float4_load_aligned(&object_set->ws_pos_x[i]);
const auto ws_pos_y = float4_load_aligned(&object_set->ws_pos_y[i]);
const auto ws_pos_z = float4_load_aligned(&object_set->ws_pos_z[i]);
const auto radius = float4_load_aligned(&object_set->radius[i]);
auto inside = all_true;
for (unsigned p = 0; p < 6; ++p) {
auto &n_x = planes[p].normal_x;
auto &n_y = planes[p].normal_y;
auto &n_z = planes[p].normal_z;
auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
auto plane_test_point = n_dot_pos + radius;
auto plane_test = plane_test_point >= planes[p].d;
inside = vector_and(plane_test, inside);
}
// Store 0 for spheres that ended up completely outside any of the frustum
// planes. Store 0xffffffff for spheres that are visible.
store_aligned(inside, &visibility_flag[i]);
}
}
</code></pre>
<p>After the <code>simd_sphere_culling</code> call, the <code>visibility_flag</code> array contains <code>0</code> for all objects that failed the test and <code>0xffffffff</code> for all objects that passed. We chain this together with the OOBB culling by doing a compaction pass over the <code>visibility_flag</code> array and populating an <code>indirection</code> array:</p>
<pre><code>{
// Splat out the planes to be able to do plane-sphere test with SIMD.
const auto &frustum = camera.frustum();
const SIMDPlane planes[6] = {
float4_splat(frustum.planes[0].n.x),
float4_splat(frustum.planes[0].n.y),
float4_splat(frustum.planes[0].n.z),
float4_splat(frustum.planes[0].d),
float4_splat(frustum.planes[1].n.x),
float4_splat(frustum.planes[1].n.y),
float4_splat(frustum.planes[1].n.z),
float4_splat(frustum.planes[1].d),
float4_splat(frustum.planes[2].n.x),
float4_splat(frustum.planes[2].n.y),
float4_splat(frustum.planes[2].n.z),
float4_splat(frustum.planes[2].d),
float4_splat(frustum.planes[3].n.x),
float4_splat(frustum.planes[3].n.y),
float4_splat(frustum.planes[3].n.z),
float4_splat(frustum.planes[3].d),
float4_splat(frustum.planes[4].n.x),
float4_splat(frustum.planes[4].n.y),
float4_splat(frustum.planes[4].n.z),
float4_splat(frustum.planes[4].d),
float4_splat(frustum.planes[5].n.x),
float4_splat(frustum.planes[5].n.y),
float4_splat(frustum.planes[5].n.z),
float4_splat(frustum.planes[5].d),
};
// Do frustum-sphere culling.
simd_sphere_culling(planes, object_set);
// Make sure to align the size to the simd lane count.
const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);
// Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
Array<uint32_t> indirection(n_aligned_objects);
const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin());
}
</code></pre>
<p>Where <code>remove_not_visible</code> is:</p>
<pre><code>uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection)
{
const uint32_t *visibility_flag = object_set.visibility_flag.begin();
uint32_t n_visible = 0U;
for (uint32_t i = 0; i < count; ++i) {
if (visibility_flag[i]) {
output_indirection[n_visible] = i;
++n_visible;
}
}
const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
const uint32_t last_visible = n_visible ? output_indirection[n_visible - 1] : 0;
// Pad out to the simd alignment.
for (unsigned i = n_visible; i < n_aligned_visible; ++i)
output_indirection[i] = last_visible;
return n_visible;
}
</code></pre>
<p><code>n_visible</code> together with <code>indirection</code> provides the input for doing the frustum-OOBB culling on the objects that survived the frustum-sphere culling.</p>
<h1>Frustum-OOBB culling</h1>
<p>The frustum-OOBB culling takes ideas from Fabian Giesen's <a href="https://fgiesen.wordpress.com/2010/10/17/view-frustum-culling/">https://fgiesen.wordpress.com/2010/10/17/view-frustum-culling/</a> and Arseny Kapoulkine's <a href="http://zeuxcg.org/2009/01/31/view-frustum-culling-optimization-introduction/">http://zeuxcg.org/2009/01/31/view-frustum-culling-optimization-introduction/</a>.</p>
<p>More specifically we use <code>Method 2: Transform box vertices to clip space, test against clip-space planes</code> that both Fabian and Arseny write about. But we also go with <code>Method 2b: Saving arithmetic ops</code> that Fabian mentions. I won't delve into how the culling actually works; to understand that, please read their posts.</p>
<p>The code is SIMDified to process several OOBBs at the same time. The same corner of four OOBBs is tested against one frustum plane in a single SIMD operation.</p>
<p>To be able to write the SIMD code in a more intuitive form a few data structures and functions are used:</p>
<pre><code>struct SIMDVector
{
float4 x; // stores x0, x1, x2, x3
float4 y; // stores y0, y1, y2, y3
float4 z; // etc.
float4 w;
};
</code></pre>
<p>A <code>SIMDVector</code> stores <code>x</code>, <code>y</code>, <code>z</code> & <code>w</code> for four objects. To store a matrix for four objects a <code>SIMDMatrix</code> is used:</p>
<pre><code>struct SIMDMatrix
{
SIMDVector x;
SIMDVector y;
SIMDVector z;
SIMDVector w;
};
</code></pre>
<p>A <code>SIMDMatrix</code>-<code>SIMDVector</code> multiplication can then be written as a regular matrix-vector multiplication:</p>
<pre><code>SIMDVector simd_multiply(const SIMDVector &v, const SIMDMatrix &m)
{
float4 x = v.x * m.x.x; x = v.y * m.y.x + x; x = v.z * m.z.x + x; x = v.w * m.w.x + x;
float4 y = v.x * m.x.y; y = v.y * m.y.y + y; y = v.z * m.z.y + y; y = v.w * m.w.y + y;
float4 z = v.x * m.x.z; z = v.y * m.y.z + z; z = v.z * m.z.z + z; z = v.w * m.w.z + z;
float4 w = v.x * m.x.w; w = v.y * m.y.w + w; w = v.z * m.z.w + w; w = v.w * m.w.w + w;
SIMDVector res = { x, y, z, w };
return res;
}
</code></pre>
<p>A <code>SIMDMatrix</code>-<code>SIMDMatrix</code> multiplication is:</p>
<pre><code>SIMDMatrix simd_multiply(const SIMDMatrix &lhs, const SIMDMatrix &rhs)
{
SIMDVector x = simd_multiply(lhs.x, rhs);
SIMDVector y = simd_multiply(lhs.y, rhs);
SIMDVector z = simd_multiply(lhs.z, rhs);
SIMDVector w = simd_multiply(lhs.w, rhs);
SIMDMatrix res = { x, y, z, w };
return res;
}
</code></pre>
<p>The code needed to do the actual frustum-OOBB culling is:</p>
<pre><code>void simd_oobb_culling(const SIMDMatrix &view_proj, culling::ObjectSet &object_set, uint32_t n_objects, const uint32_t *indirection)
{
// Get pointers to the necessary members of the object set.
const float *min_x = object_set.min_x.begin();
const float *min_y = object_set.min_y.begin();
const float *min_z = object_set.min_z.begin();
const float *max_x = object_set.max_x.begin();
const float *max_y = object_set.max_y.begin();
const float *max_z = object_set.max_z.begin();
const float *world_xx = object_set.world_xx.begin();
const float *world_xy = object_set.world_xy.begin();
const float *world_xz = object_set.world_xz.begin();
const float *world_xw = object_set.world_xw.begin();
const float *world_yx = object_set.world_yx.begin();
const float *world_yy = object_set.world_yy.begin();
const float *world_yz = object_set.world_yz.begin();
const float *world_yw = object_set.world_yw.begin();
const float *world_zx = object_set.world_zx.begin();
const float *world_zy = object_set.world_zy.begin();
const float *world_zz = object_set.world_zz.begin();
const float *world_zw = object_set.world_zw.begin();
const float *world_tx = object_set.world_tx.begin();
const float *world_ty = object_set.world_ty.begin();
const float *world_tz = object_set.world_tz.begin();
const float *world_tw = object_set.world_tw.begin();
uint32_t *visibility_flag = object_set.visibility_flag.begin();
for (uint32_t i = 0; i < n_objects; i += 4) {
SIMDMatrix world;
// Load the world transform matrix for four objects via the indirection table.
const uint32_t i0 = indirection[i];
const uint32_t i1 = indirection[i + 1];
const uint32_t i2 = indirection[i + 2];
const uint32_t i3 = indirection[i + 3];
world.x.x = float4(world_xx[i0], world_xx[i1], world_xx[i2], world_xx[i3]);
world.x.y = float4(world_xy[i0], world_xy[i1], world_xy[i2], world_xy[i3]);
world.x.z = float4(world_xz[i0], world_xz[i1], world_xz[i2], world_xz[i3]);
world.x.w = float4(world_xw[i0], world_xw[i1], world_xw[i2], world_xw[i3]);
world.y.x = float4(world_yx[i0], world_yx[i1], world_yx[i2], world_yx[i3]);
world.y.y = float4(world_yy[i0], world_yy[i1], world_yy[i2], world_yy[i3]);
world.y.z = float4(world_yz[i0], world_yz[i1], world_yz[i2], world_yz[i3]);
world.y.w = float4(world_yw[i0], world_yw[i1], world_yw[i2], world_yw[i3]);
world.z.x = float4(world_zx[i0], world_zx[i1], world_zx[i2], world_zx[i3]);
world.z.y = float4(world_zy[i0], world_zy[i1], world_zy[i2], world_zy[i3]);
world.z.z = float4(world_zz[i0], world_zz[i1], world_zz[i2], world_zz[i3]);
world.z.w = float4(world_zw[i0], world_zw[i1], world_zw[i2], world_zw[i3]);
world.w.x = float4(world_tx[i0], world_tx[i1], world_tx[i2], world_tx[i3]);
world.w.y = float4(world_ty[i0], world_ty[i1], world_ty[i2], world_ty[i3]);
world.w.z = float4(world_tz[i0], world_tz[i1], world_tz[i2], world_tz[i3]);
world.w.w = float4(world_tw[i0], world_tw[i1], world_tw[i2], world_tw[i3]);
// Create the matrix to go from object->world->view->clip space.
const auto clip = simd_multiply(world, view_proj);
SIMDVector min_pos;
SIMDVector max_pos;
// Load the mininum and maximum corner positions of the bounding box in object space.
min_pos.x = float4(min_x[i0], min_x[i1], min_x[i2], min_x[i3]);
min_pos.y = float4(min_y[i0], min_y[i1], min_y[i2], min_y[i3]);
min_pos.z = float4(min_z[i0], min_z[i1], min_z[i2], min_z[i3]);
min_pos.w = float4_splat(1.0f);
max_pos.x = float4(max_x[i0], max_x[i1], max_x[i2], max_x[i3]);
max_pos.y = float4(max_y[i0], max_y[i1], max_y[i2], max_y[i3]);
max_pos.z = float4(max_z[i0], max_z[i1], max_z[i2], max_z[i3]);
max_pos.w = float4_splat(1.0f);
SIMDVector clip_pos[8];
// Transform each bounding box corner from object to clip space by sharing calculations.
simd_min_max_transform(clip, min_pos, max_pos, clip_pos);
const auto zero = float4_zero();
const auto all_true = bool4_all_true();
// Initialize test conditions.
auto all_x_less = all_true;
auto all_x_greater = all_true;
auto all_y_less = all_true;
auto all_y_greater = all_true;
auto all_z_less = all_true;
auto any_z_less = bool4_all_false();
auto all_z_greater = all_true;
// Test the corners of the OOBBs against the clip-space planes. An object is culled
// only if all of its corners are outside the same plane.
for (unsigned cs = 0; cs < 8; ++cs) {
const auto neg_cs_w = negate(clip_pos[cs].w);
auto x_le = clip_pos[cs].x <= neg_cs_w;
auto x_ge = clip_pos[cs].x >= clip_pos[cs].w;
all_x_less = vector_and(x_le, all_x_less);
all_x_greater = vector_and(x_ge, all_x_greater);
auto y_le = clip_pos[cs].y <= neg_cs_w;
auto y_ge = clip_pos[cs].y >= clip_pos[cs].w;
all_y_less = vector_and(y_le, all_y_less);
all_y_greater = vector_and(y_ge, all_y_greater);
auto z_le = clip_pos[cs].z <= zero;
auto z_ge = clip_pos[cs].z >= clip_pos[cs].w;
all_z_less = vector_and(z_le, all_z_less);
all_z_greater = vector_and(z_ge, all_z_greater);
any_z_less = vector_or(z_le, any_z_less);
}
const auto any_x_outside = vector_or(all_x_less, all_x_greater);
const auto any_y_outside = vector_or(all_y_less, all_y_greater);
const auto any_z_outside = vector_or(all_z_less, all_z_greater);
auto outside = vector_or(any_x_outside, any_y_outside);
outside = vector_or(outside, any_z_outside);
auto inside = vector_xor(outside, all_true);
// Store the result in the `visibility_flag` array in a compacted way.
store_aligned(inside, &visibility_flag[i]);
}
}
</code></pre>
<p>The function <code>simd_min_max_transform</code> used above transforms each OOBB corner from object space to clip space while sharing some of the calculations. For completeness, the function is:</p>
<pre><code>void simd_min_max_transform(const SIMDMatrix &m, const SIMDVector &min, const SIMDVector &max, SIMDVector result[])
{
auto m_xx_x = m.x.x * min.x; m_xx_x = m_xx_x + m.w.x;
auto m_xy_x = m.x.y * min.x; m_xy_x = m_xy_x + m.w.y;
auto m_xz_x = m.x.z * min.x; m_xz_x = m_xz_x + m.w.z;
auto m_xw_x = m.x.w * min.x; m_xw_x = m_xw_x + m.w.w;
auto m_xx_X = m.x.x * max.x; m_xx_X = m_xx_X + m.w.x;
auto m_xy_X = m.x.y * max.x; m_xy_X = m_xy_X + m.w.y;
auto m_xz_X = m.x.z * max.x; m_xz_X = m_xz_X + m.w.z;
auto m_xw_X = m.x.w * max.x; m_xw_X = m_xw_X + m.w.w;
auto m_yx_y = m.y.x * min.y;
auto m_yy_y = m.y.y * min.y;
auto m_yz_y = m.y.z * min.y;
auto m_yw_y = m.y.w * min.y;
auto m_yx_Y = m.y.x * max.y;
auto m_yy_Y = m.y.y * max.y;
auto m_yz_Y = m.y.z * max.y;
auto m_yw_Y = m.y.w * max.y;
auto m_zx_z = m.z.x * min.z;
auto m_zy_z = m.z.y * min.z;
auto m_zz_z = m.z.z * min.z;
auto m_zw_z = m.z.w * min.z;
auto m_zx_Z = m.z.x * max.z;
auto m_zy_Z = m.z.y * max.z;
auto m_zz_Z = m.z.z * max.z;
auto m_zw_Z = m.z.w * max.z;
{
auto xyz_x = m_xx_x + m_yx_y; xyz_x = xyz_x + m_zx_z;
auto xyz_y = m_xy_x + m_yy_y; xyz_y = xyz_y + m_zy_z;
auto xyz_z = m_xz_x + m_yz_y; xyz_z = xyz_z + m_zz_z;
auto xyz_w = m_xw_x + m_yw_y; xyz_w = xyz_w + m_zw_z;
result[0].x = xyz_x;
result[0].y = xyz_y;
result[0].z = xyz_z;
result[0].w = xyz_w;
}
{
auto Xyz_x = m_xx_X + m_yx_y; Xyz_x = Xyz_x + m_zx_z;
auto Xyz_y = m_xy_X + m_yy_y; Xyz_y = Xyz_y + m_zy_z;
auto Xyz_z = m_xz_X + m_yz_y; Xyz_z = Xyz_z + m_zz_z;
auto Xyz_w = m_xw_X + m_yw_y; Xyz_w = Xyz_w + m_zw_z;
result[1].x = Xyz_x;
result[1].y = Xyz_y;
result[1].z = Xyz_z;
result[1].w = Xyz_w;
}
{
auto xYz_x = m_xx_x + m_yx_Y; xYz_x = xYz_x + m_zx_z;
auto xYz_y = m_xy_x + m_yy_Y; xYz_y = xYz_y + m_zy_z;
auto xYz_z = m_xz_x + m_yz_Y; xYz_z = xYz_z + m_zz_z;
auto xYz_w = m_xw_x + m_yw_Y; xYz_w = xYz_w + m_zw_z;
result[2].x = xYz_x;
result[2].y = xYz_y;
result[2].z = xYz_z;
result[2].w = xYz_w;
}
{
auto XYz_x = m_xx_X + m_yx_Y; XYz_x = XYz_x + m_zx_z;
auto XYz_y = m_xy_X + m_yy_Y; XYz_y = XYz_y + m_zy_z;
auto XYz_z = m_xz_X + m_yz_Y; XYz_z = XYz_z + m_zz_z;
auto XYz_w = m_xw_X + m_yw_Y; XYz_w = XYz_w + m_zw_z;
result[3].x = XYz_x;
result[3].y = XYz_y;
result[3].z = XYz_z;
result[3].w = XYz_w;
}
{
auto xyZ_x = m_xx_x + m_yx_y; xyZ_x = xyZ_x + m_zx_Z;
auto xyZ_y = m_xy_x + m_yy_y; xyZ_y = xyZ_y + m_zy_Z;
auto xyZ_z = m_xz_x + m_yz_y; xyZ_z = xyZ_z + m_zz_Z;
auto xyZ_w = m_xw_x + m_yw_y; xyZ_w = xyZ_w + m_zw_Z;
result[4].x = xyZ_x;
result[4].y = xyZ_y;
result[4].z = xyZ_z;
result[4].w = xyZ_w;
}
{
auto XyZ_x = m_xx_X + m_yx_y; XyZ_x = XyZ_x + m_zx_Z;
auto XyZ_y = m_xy_X + m_yy_y; XyZ_y = XyZ_y + m_zy_Z;
auto XyZ_z = m_xz_X + m_yz_y; XyZ_z = XyZ_z + m_zz_Z;
auto XyZ_w = m_xw_X + m_yw_y; XyZ_w = XyZ_w + m_zw_Z;
result[5].x = XyZ_x;
result[5].y = XyZ_y;
result[5].z = XyZ_z;
result[5].w = XyZ_w;
}
{
auto xYZ_x = m_xx_x + m_yx_Y; xYZ_x = xYZ_x + m_zx_Z;
auto xYZ_y = m_xy_x + m_yy_Y; xYZ_y = xYZ_y + m_zy_Z;
auto xYZ_z = m_xz_x + m_yz_Y; xYZ_z = xYZ_z + m_zz_Z;
auto xYZ_w = m_xw_x + m_yw_Y; xYZ_w = xYZ_w + m_zw_Z;
result[6].x = xYZ_x;
result[6].y = xYZ_y;
result[6].z = xYZ_z;
result[6].w = xYZ_w;
}
{
auto XYZ_x = m_xx_X + m_yx_Y; XYZ_x = XYZ_x + m_zx_Z;
auto XYZ_y = m_xy_X + m_yy_Y; XYZ_y = XYZ_y + m_zy_Z;
auto XYZ_z = m_xz_X + m_yz_Y; XYZ_z = XYZ_z + m_zz_Z;
auto XYZ_w = m_xw_X + m_yw_Y; XYZ_w = XYZ_w + m_zw_Z;
result[7].x = XYZ_x;
result[7].y = XYZ_y;
result[7].z = XYZ_z;
result[7].w = XYZ_w;
}
}
</code></pre>
<p>To get a compact indirection array of all the objects that passed the frustum-OOBB culling, the <code>remove_not_visible</code> function needs to be slightly modified:</p>
<pre><code>uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection, const uint32_t *input_indirection/*new argument*/)
{
const uint32_t *visibility_flag = object_set.visibility_flag.begin();
uint32_t n_visible = 0U;
for (uint32_t i = 0; i < count; ++i) {
// Each element of `input_indirection` maps a compacted index to an actual object index.
// If it's not null, do a lookup to get the actual object index, else use `i` directly.
const uint32_t index = input_indirection ? input_indirection[i] : i;
// `visibility_flag` is already compacted, so it is indexed with `i` directly.
if (visibility_flag[i]) {
output_indirection[n_visible] = index;
++n_visible;
}
}
const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
const uint32_t last_visible = n_visible ? output_indirection[n_visible - 1] : 0;
// Pad out to the simd alignment.
for (unsigned i = n_visible; i < n_aligned_visible; ++i)
output_indirection[i] = last_visible;
return n_visible;
}
</code></pre>
<p>Bringing the frustum-sphere and frustum-OOBB code together we get:</p>
<pre><code>{
// Splat out the planes to be able to do plane-sphere test with SIMD.
const auto &frustum = camera.frustum();
const SIMDPlane planes[6] = {
float4_splat(frustum.planes[0].n.x),
float4_splat(frustum.planes[0].n.y),
float4_splat(frustum.planes[0].n.z),
float4_splat(frustum.planes[0].d),
float4_splat(frustum.planes[1].n.x),
float4_splat(frustum.planes[1].n.y),
float4_splat(frustum.planes[1].n.z),
float4_splat(frustum.planes[1].d),
float4_splat(frustum.planes[2].n.x),
float4_splat(frustum.planes[2].n.y),
float4_splat(frustum.planes[2].n.z),
float4_splat(frustum.planes[2].d),
float4_splat(frustum.planes[3].n.x),
float4_splat(frustum.planes[3].n.y),
float4_splat(frustum.planes[3].n.z),
float4_splat(frustum.planes[3].d),
float4_splat(frustum.planes[4].n.x),
float4_splat(frustum.planes[4].n.y),
float4_splat(frustum.planes[4].n.z),
float4_splat(frustum.planes[4].d),
float4_splat(frustum.planes[5].n.x),
float4_splat(frustum.planes[5].n.y),
float4_splat(frustum.planes[5].n.z),
float4_splat(frustum.planes[5].d),
};
// Do frustum-sphere culling.
simd_sphere_culling(planes, object_set);
// Make sure to align the size to the simd lane count.
const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);
// Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
Array<uint32_t> indirection(n_aligned_objects);
const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin(), nullptr);
const auto &view_proj = camera.view() * camera.proj();
// Construct the SIMDMatrix `simd_view_proj`.
const SIMDMatrix simd_view_proj = {
float4_splat(view_proj.v[xx]),
float4_splat(view_proj.v[xy]),
float4_splat(view_proj.v[xz]),
float4_splat(view_proj.v[xw]),
float4_splat(view_proj.v[yx]),
float4_splat(view_proj.v[yy]),
float4_splat(view_proj.v[yz]),
float4_splat(view_proj.v[yw]),
float4_splat(view_proj.v[zx]),
float4_splat(view_proj.v[zy]),
float4_splat(view_proj.v[zz]),
float4_splat(view_proj.v[zw]),
float4_splat(view_proj.v[tx]),
float4_splat(view_proj.v[ty]),
float4_splat(view_proj.v[tz]),
float4_splat(view_proj.v[tw]),
};
// Cull objects via frustum-oobb tests.
simd_oobb_culling(simd_view_proj, object_set, n_visible, indirection.begin());
// Build up the indirection array that represents the objects that survived the frustum-oobb culling.
const uint32_t n_oobb_visible = remove_not_visible(object_set, n_visible, indirection.begin(), indirection.begin());
}
</code></pre>
<p>The final call to <code>remove_not_visible</code> populates the <code>indirection</code> array with the objects that passed both the frustum-sphere and the frustum-OOBB culling. <code>indirection</code> together with <code>n_oobb_visible</code> is all that is needed to know what objects should be rendered.</p>
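<p>To illustrate how that result might be consumed, a renderer could walk the <code>indirection</code> array and use the <code>type</code> and <code>id</code> arrays of the <code>ObjectSet</code> to find the actual objects to draw. This is only a sketch; <code>gather_for_rendering</code> is a hypothetical function, not part of the engine:</p>
<pre><code>// Sketch: hand all objects that survived both culling passes over to the renderer.
for (uint32_t i = 0; i < n_oobb_visible; ++i) {
    const uint32_t object_index = indirection[i];
    const uint32_t type = object_set.type[object_index];
    const uint32_t id = object_set.id[object_index];
    // `gather_for_rendering` is hypothetical -- whatever collects draw calls for this view.
    gather_for_rendering(type, id);
}
</code></pre>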
<h1>Distributing the work over several threads</h1>
<p>In Stingray, work is distributed by submitting jobs to a pool of worker threads -- conveniently called the <code>ThreadPool</code>. Submitted jobs are put in a thread-safe work queue from which the worker threads pop jobs to work on. A task is defined as:</p>
<pre><code>typedef void (*TaskCallback)(void *user_data);
struct TaskDefinition
{
TaskCallback callback;
void *user_data;
};
</code></pre>
<p>For the purpose of this article, the interesting methods of the <code>ThreadPool</code> are:</p>
<pre><code>class ThreadPool
{
// Adds `count` tasks to the work queue.
void add_tasks(const TaskDefinition *tasks, uint32_t count);
// Tries to pop one task from the queue and do that work. Returns true if any work was done.
bool do_work();
// Will call `do_work` while `signal` == value.
void wait_atomic(std::atomic<uint32_t> *signal, uint32_t value);
};
</code></pre>
<p>The <code>ThreadPool</code> doesn't dictate how to synchronize when a job is fully processed, but usually a <code>std::atomic<uint32_t> signal</code> is used for that purpose. The value is <code>0</code> while the job is being processed and set to <code>1</code> when it's done. <code>wait_atomic()</code> is a convenience method that can be used to wait for such values:</p>
<pre><code>void ThreadPool::wait_atomic(std::atomic<uint32_t> *signal, uint32_t value)
{
while (signal->load(std::memory_order_acquire) == value) {
if (!do_work())
YieldProcessor();
}
}
</code></pre>
<p><code>do_work</code>:</p>
<pre><code>bool ThreadPool::do_work()
{
TaskDefinition task;
if (pop_task(task)) {
task.callback(task.user_data);
return true;
}
return false;
}
</code></pre>
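<p><code>pop_task</code> is not shown in this post and its implementation details don't matter here. To make the contract that <code>do_work</code> relies on explicit, here is a minimal mutex-based sketch; the actual engine queue may well look different (e.g. lock-free), and the <code>_queue_mutex</code> and <code>_task_queue</code> members are assumptions for this example:</p>
<pre><code>// Minimal sketch of a thread-safe task queue, not the actual engine code.
// Assumes the ThreadPool has `std::mutex _queue_mutex` and
// `std::deque<TaskDefinition> _task_queue` members.
bool ThreadPool::pop_task(TaskDefinition &out_task)
{
    std::lock_guard<std::mutex> lock(_queue_mutex);
    if (_task_queue.empty())
        return false;
    out_task = _task_queue.front();
    _task_queue.pop_front();
    return true;
}
</code></pre>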
<p>Multi-threading the culling only requires a few changes to the code. For the <code>simd_sphere_culling()</code> method we need to add <code>offset</code> and <code>count</code> parameters to specify the range of objects we are processing:</p>
<pre><code>void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set, uint32_t offset, uint32_t count)
{
const auto all_true = bool4_all_true();
const uint32_t n_objects = offset + count;
uint32_t *visibility_flag = object_set.visibility_flag.begin();
// Test each plane of the frustum against each sphere.
for (uint32_t i = offset; i < n_objects; i += 4)
{
const auto ws_pos_x = float4_load_aligned(&object_set.ws_pos_x[i]);
const auto ws_pos_y = float4_load_aligned(&object_set.ws_pos_y[i]);
const auto ws_pos_z = float4_load_aligned(&object_set.ws_pos_z[i]);
const auto radius = float4_load_aligned(&object_set.radius[i]);
auto inside = all_true;
for (unsigned p = 0; p < 6; ++p) {
auto &n_x = planes[p].normal_x;
auto &n_y = planes[p].normal_y;
auto &n_z = planes[p].normal_z;
auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
auto plane_test_point = n_dot_pos + radius;
auto plane_test = plane_test_point >= planes[p].d;
inside = vector_and(plane_test, inside);
}
// Store 0 for spheres that ended up completely outside any of the frustum
// planes. Store 0xffffffff for spheres that are visible.
store_aligned(inside, &visibility_flag[i]);
}
}
</code></pre>
<p>Bringing the previous code snippet together with multi-threaded culling:</p>
<pre><code>{
// Calculate the number of work items, where each work item processes `work_size` elements.
const uint32_t work_size = 512;
// `div_ceil(a, b)` calculates `(a + b - 1) / b`.
const uint32_t n_work_items = math::div_ceil(n_objects, work_size);
Array<CullingWorkItem> culling_work_items(n_work_items);
Array<TaskDefinition> tasks(n_work_items);
// Splat out the planes to be able to do plane-sphere test with SIMD.
const auto &frustum = camera.frustum();
const SIMDPlane planes[6] = {
same code as previously shown...
};
// Make sure to align the size to the simd lane count.
const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);
for (unsigned i = 0; i < n_work_items; ++i) {
// The `offset` and `count` for the work item.
const uint32_t offset = math::min(work_size * i, n_objects);
const uint32_t count = math::min(work_size, n_objects - offset);
auto &culling_item = culling_work_items[i];
memcpy(culling_item.planes, planes, sizeof(planes));
culling_item.object_set = &object_set;
culling_item.offset = offset;
culling_item.count = count;
culling_item.signal = 0;
auto &task = tasks[i];
task.callback = simd_sphere_culling_task;
task.user_data = &culling_item;
}
// Add the tasks to the `ThreadPool`.
thread_pool.add_tasks(tasks.begin(), n_work_items);
// Wait for each `item` and if it's not done, help out with the culling work.
for (auto &item : culling_work_items)
thread_pool.wait_atomic(&item.signal, 0);
}
</code></pre>
<p><code>CullingWorkItem</code> and <code>simd_sphere_culling_task</code> are defined as:</p>
<pre><code>struct CullingWorkItem
{
SIMDPlane planes[6];
const culling::ObjectSet *object_set;
uint32_t offset;
uint32_t count;
std::atomic<uint32_t> signal;
};
void simd_sphere_culling_task(void *user_data)
{
auto culling_item = (CullingWorkItem*)(user_data);
// Call the frustum-sphere culling function.
simd_sphere_culling(culling_item->planes, *culling_item->object_set, culling_item->offset, culling_item->count);
// Signal that the work is done.
culling_item->signal.store(1, std::memory_order_release);
}
</code></pre>
<p>The same pattern is used to multi-thread the frustum-OOBB culling. That is "left as an exercise for the reader" ;)</p>
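<p>For reference, such a task could mirror <code>simd_sphere_culling_task</code> almost line for line. The sketch below is an assumption of what it might look like; the work item struct, its fields and the extra <code>offset</code>/<code>count</code> parameters on <code>simd_oobb_culling</code> are illustrative, not the actual engine code:</p>
<pre><code>// Hypothetical work item for the frustum-OOBB pass, mirroring CullingWorkItem.
struct OOBBCullingWorkItem
{
    SIMDMatrix view_proj;
    culling::ObjectSet *object_set;
    const uint32_t *indirection;
    uint32_t offset;
    uint32_t count;
    std::atomic<uint32_t> signal;
};

void simd_oobb_culling_task(void *user_data)
{
    auto item = (OOBBCullingWorkItem*)user_data;
    // Assumes `simd_oobb_culling` has been given `offset`/`count` parameters in the same
    // way as `simd_sphere_culling`, so each work item writes its own range of
    // `visibility_flag` entries.
    simd_oobb_culling(item->view_proj, *item->object_set, item->offset, item->count, item->indirection);
    // Signal that the work is done.
    item->signal.store(1, std::memory_order_release);
}
</code></pre>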
<h1>Conclusion</h1>
<p>This type of culling is done for all of the objects that can be rendered, i.e. meshes, particle systems, terrain, etc. We also use it to cull light sources. It is used both when rendering the main scene and for rendering shadows.</p>
<p>I've left out a few details of our solution. One thing we also do is something called <em>contribution culling</em>. In the frustum-OOBB culling step, the extents of the OOBB corners are projected to the near plane and from that the screen space extents are derived. If the object is smaller than a certain threshold in any axis, the object is considered culled. Special care needs to be taken if any of the corners intersects or is behind the near plane, so we don't have to deal with "external line segments" caused by the projection. If you don't know what that is, see: <a href="http://www.gamasutra.com/view/news/168577/Indepth_Software_rasterizer_and_triangle_clipping.php">http://www.gamasutra.com/view/news/168577/Indepth_Software_rasterizer_and_triangle_clipping.php</a>. In our case, contribution culling is disabled by expanding the extents to span the entire screen when any corner intersects or is behind the near plane.</p>
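<p>To make the idea more concrete, here is a rough scalar sketch of a contribution test for a single object, based on the clip-space corners that the OOBB pass already computes. The function, its threshold parameter and the <code>Vector4</code> type are made up for this example; the real implementation is SIMDified and folded into the OOBB test:</p>
<pre><code>// Rough scalar sketch of contribution culling for one object, not the engine code.
// `clip_corners` are the 8 clip-space corners of the OOBB. Requires <cfloat>.
bool contribution_culled(const Vector4 clip_corners[8], float screen_w, float screen_h, float min_pixels)
{
    float min_x = FLT_MAX, max_x = -FLT_MAX;
    float min_y = FLT_MAX, max_y = -FLT_MAX;
    for (unsigned i = 0; i < 8; ++i) {
        // If any corner intersects or is behind the near plane, disable contribution
        // culling by treating the object as covering the entire screen.
        if (clip_corners[i].w <= 0.0f || clip_corners[i].z <= 0.0f)
            return false;
        // Project to normalized device coordinates.
        const float x = clip_corners[i].x / clip_corners[i].w;
        const float y = clip_corners[i].y / clip_corners[i].w;
        min_x = x < min_x ? x : min_x; max_x = x > max_x ? x : max_x;
        min_y = y < min_y ? y : min_y; max_y = y > max_y ? y : max_y;
    }
    // Screen space extents in pixels (NDC spans [-1, 1] in both axes).
    const float extent_x = (max_x - min_x) * 0.5f * screen_w;
    const float extent_y = (max_y - min_y) * 0.5f * screen_h;
    // Considered culled if smaller than the threshold in any axis.
    return extent_x < min_pixels || extent_y < min_pixels;
}
</code></pre>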
<p>For our cascaded shadow maps, the extents are also used to detect if an object is fully enclosed by a cascade. If that is the case, then that object is culled from the later cascades. Let me illustrate with some ASCII:</p>
<pre><code>+-----------+-----------+
| | |
| /\ | |
| /--\ | |
+-----------+-----------+
| | |
| | |
| | |
+-----------+-----------+
</code></pre>
<p>The squares are the different cascades. The top left square is the first cascade, the top right is the second cascade, bottom left the third and the bottom right is the fourth cascade. In this case the weird triangle-shaped object is fully enclosed by the first cascade. What that means is that the object doesn't need to be rendered to any of the later cascades, since the shadow contribution from that object will be fully taken care of by the first cascade.</p>Andreas Asplundhttp://www.blogger.com/profile/07360893562204949248noreply@blogger.com487tag:blogger.com,1999:blog-1994130783874175266.post-51234853644765878742016-09-07T15:38:00.000+02:002016-09-23T19:57:29.325+02:00State reflection<h1>Overview</h1>
<p>The Stingray engine has two controller threads -- the main thread and the render thread. These two threads build up work for our job system, which is distributed on the remaining threads. The main thread and the render thread are pipelined, so that while the main thread runs the simulation/update for frame <em>N</em>, the render thread is processing the rendering work for the previous frame (<em>N-1</em>). This post will dive into the details of how state is propagated from the main thread to the render thread.</p>
<p>I will use code snippets to explain how the state reflection works. It's mostly actual code from the engine but it has been cleaned up to a certain extent. Some stuff has been renamed and/or removed to make it easier to understand what's going on.</p>
<h1>The main loop</h1>
<p>Here is a slimmed down version of the update loop which is part of the main thread:</p>
<pre><code>while (!quit())
{
// Calls out to the mandatory user-supplied `update` Lua function. Lua is used
// as a scripting language to manipulate objects. From Lua, worlds, objects, etc.
// can be created, manipulated and destroyed. All these changes are recorded
// on a `StateStream` that is a part of each world.
_game->update();
// Flush state changes recorded on the `StateStream` for each world to
// the rendering world representation.
unsigned n_worlds = _worlds.size();
for (uint32_t i = 0; i < n_worlds; ++i) {
auto &world = *_worlds[i];
_render_interface->update_world(world);
}
// Begin a new render frame.
_render_interface->begin_frame();
// Calls out to the user-supplied `render` Lua function. It's up to the script
// to call render() on the worlds it wants rendered. The script controls what
// camera and viewport are used when rendering a world.
_game->render();
// Present the frame.
_render_interface->present_frame();
// End frame.
_render_interface->end_frame(_delta_time);
// Never let the main thread run more than 1 frame ahead of the render thread.
_render_interface->wait_for_fence(_frame_fence);
// Create a new fence for the next frame.
_frame_fence = _render_interface->create_fence();
}
</code></pre>
<p>First thing to point out is the <code>_render_interface</code>. This is not a class full of virtual functions that some other class can inherit from and override as the name might suggest. The word "interface" is used in the sense that it's used to communicate from one thread to another. So in this context the <code>_render_interface</code> is used to post messages from the main thread to the render thread.</p>
<p>As mentioned in the first comment in the code snippet above, Lua is used as our scripting language, and from Lua things such as worlds and objects can be created, destroyed and manipulated.</p>
<p>State is very rarely shared between the main thread and the render thread; instead, each thread has its own representation, and when state is changed on the main thread that state is reflected over to the render thread. E.g., the <code>MeshObject</code>, which is the representation of a mesh with vertex buffers, materials, textures, shaders, skinning data, etc. to be rendered, is the main thread representation, and <code>RenderMeshObject</code> is the corresponding render thread representation. All objects that have a representation on both the main and render thread are set up to work the same way:</p>
<pre><code>class MeshObject : public RenderStateObject
{
};
class RenderMeshObject : public RenderObject
{
};
</code></pre>
<p>The corresponding render thread class is prefixed with <code>Render</code>. We use this naming convention for all objects that have both a main and a render thread representation.</p>
<p>The main thread objects inherit from <code>RenderStateObject</code> and the render thread objects inherit from <code>RenderObject</code>. These structs are defined as:</p>
<pre><code>struct RenderStateObject
{
uint32_t render_handle;
StateReflection *state_reflection;
};
struct RenderObject
{
uint32_t type;
};
</code></pre>
<p>The <code>render_handle</code> is an ID that identifies the corresponding object on the render thread. <code>state_reflection</code> is a stream of data that is used to propagate state changes from the main thread to the render thread. <code>type</code> is an enum used to identify the type of render objects.</p>
<h1>Object creation</h1>
<p>In Stingray a <em>world</em> is a container of renderable objects, physical objects, sounds, etc. On the main thread, it is represented by the <code>World</code> class, and on the render thread by a <code>RenderWorld</code>.</p>
<p>When a <code>MeshObject</code> is created in a world on the main thread, there's an explicit call to <code>WorldRenderInterface::create()</code> to create the corresponding render thread representation:</p>
<pre><code>MeshObject *mesh_object = MAKE_NEW(_allocator, MeshObject);
_world_render_interface.create(mesh_object);
</code></pre>
<p>The purpose of the call to <code>WorldRenderInterface::create</code> is to explicitly create the render thread representation, acquire a <code>render_handle</code> and post that to the render thread:</p>
<pre><code>void WorldRenderInterface::create(MeshObject *mesh_object)
{
// Get a unique render handle.
mesh_object->render_handle = new_render_handle();
// Set the state_reflection pointer, more about this later.
mesh_object->state_reflection = &_state_reflection;
// Create the render thread representation.
RenderMeshObject *render_mesh_object = MAKE_NEW(_allocator, RenderMeshObject);
// Pass the data to the render thread
create_object(mesh_object->render_handle, RenderMeshObject::TYPE, render_mesh_object);
}
</code></pre>
<p>The <code>new_render_handle</code> function speaks for itself.</p>
<pre><code>uint32_t WorldRenderInterface::new_render_handle()
{
if (_free_render_handles.any()) {
uint32_t handle = _free_render_handles.back();
_free_render_handles.pop_back();
return handle;
} else
return _render_handle++;
}
</code></pre>
<p>There is a recycling mechanism for the render handles, and a similar pattern recurs in several places in the engine. The <code>release_render_handle</code> function together with the <code>new_render_handle</code> function should give the complete picture of how it works.</p>
<pre><code>void WorldRenderInterface::release_render_handle(uint32_t handle)
{
_free_render_handles.push_back(handle);
}
</code></pre>
<p>There is one <code>WorldRenderInterface</code> per world which contains the <code>_state_reflection</code> that is used by the world and all of its objects to communicate with the render thread. The <code>StateReflection</code> in its simplest form is defined as:</p>
<pre><code>struct StateReflection
{
StateStream *state_stream;
};
</code></pre>
<p>The <code>create_object</code> function needs a bit more explanation though:</p>
<pre><code>void WorldRenderInterface::create_object(uint32_t render_handle, RenderObject::Type type, void *user_data)
{
// Allocate a message on the `state_stream`.
ObjectManagementPackage *omp;
alloc_message(_state_reflection.state_stream, WorldRenderInterface::CREATE, &omp);
omp->object_type = RenderWorld::TYPE;
omp->render_handle = render_handle;
omp->type = type;
omp->user_data = user_data;
}
</code></pre>
<p>What happens here is that <code>alloc_message</code> will allocate enough bytes to make room for a <code>MessageHeader</code> together with the size of <code>ObjectManagementPackage</code> in a buffer owned by the <code>StateStream</code>. The <code>StateStream</code> is defined as:</p>
<pre><code>struct StateStream
{
void *buffer;
uint32_t capacity;
uint32_t size;
};
</code></pre>
<p><code>capacity</code> is the size of the memory pointed to by <code>buffer</code>, <code>size</code> is the current amount of bytes allocated from <code>buffer</code>.</p>
<p>The <code>MessageHeader</code> is defined as:</p>
<pre><code>struct MessageHeader
{
uint32_t type;
uint32_t size;
uint32_t data_offset;
};
</code></pre>
<p>The <code>alloc_message</code> function places the <code>MessageHeader</code> first, followed by the <code>data</code>. Some ASCII to the rescue:</p>
<pre><code>+-------------------------------------------------------------------+
| MessageHeader | data |
+-------------------------------------------------------------------+
<- data_offset ->
<- size ->
</code></pre>
<p>The <code>size</code> and <code>data_offset</code> mentioned in the ASCII are two of the members of <code>MessageHeader</code>; they are assigned during the <code>alloc_message</code> call:</p>
<pre><code>template<class T>
void alloc_message(StateStream *state_stream, uint32_t type, T **data)
{
uint32_t data_size = sizeof(T);
uint32_t message_size = sizeof(MessageHeader) + data_size;
// Allocate message and fill in the header.
void *buffer = allocate(state_stream, message_size, alignof(MessageHeader));
auto header = (MessageHeader*)buffer;
header->type = type;
header->size = message_size;
header->data_offset = sizeof(MessageHeader);
*data = (T*)memory_utilities::pointer_add(buffer, header->data_offset);
}
</code></pre>
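<p>The <code>allocate</code> function used by <code>alloc_message</code> above is essentially a bump allocator on top of the <code>StateStream</code> buffer. A minimal sketch of it, where the growth policy and the <code>grow</code> helper are assumptions rather than the engine's actual implementation, could look like:</p>
<pre><code>// Minimal sketch of bump allocation from the StateStream buffer.
void *allocate(StateStream *state_stream, uint32_t size, uint32_t align)
{
    // Align the current write position up to the requested alignment.
    const uint32_t offset = (state_stream->size + align - 1) & ~(align - 1);
    // Grow the buffer if needed (`grow` is a hypothetical helper).
    if (offset + size > state_stream->capacity)
        grow(state_stream, offset + size);
    state_stream->size = offset + size;
    return (char *)state_stream->buffer + offset;
}
</code></pre>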
<p>The <code>buffer</code> member of the <code>StateStream</code> will contain several consecutive chunks of message headers and data blocks.</p>
<pre><code>+-----------------------------------------------------------------------+
| Header | data | Header | data | Header | data | Header | data | etc |
+-----------------------------------------------------------------------+
</code></pre>
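<p>On the consuming side, walking this buffer is just a matter of stepping from header to header. The <code>get_message</code> helper used by the render thread later in this post could be implemented roughly like the sketch below; it assumes a read cursor (here an explicit <code>read_offset</code>) that is not part of the simplified <code>StateStream</code> definition above:</p>
<pre><code>// Minimal sketch of message extraction, not the actual engine code.
// `read_offset` tracks how far into `state_stream->buffer` we have consumed.
bool get_message(const StateStream *state_stream, uint32_t &read_offset, MessageHeader **header, void **data)
{
    if (read_offset >= state_stream->size)
        return false;
    *header = (MessageHeader *)((char *)state_stream->buffer + read_offset);
    *data = (char *)*header + (*header)->data_offset;
    // Step to the next chunk, aligned the same way `alloc_message` aligned it.
    read_offset += (*header)->size;
    read_offset = (read_offset + alignof(MessageHeader) - 1) & ~uint32_t(alignof(MessageHeader) - 1);
    return true;
}
</code></pre>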
<p>This is the necessary code on the main thread to create an object and populate the <code>StateStream</code>, which will later on be consumed by the render thread. A very similar pattern is used when changing the state of an object on the main thread, e.g.:</p>
<pre><code>void MeshObject::set_flags(renderable::Flags flags)
{
_flags = flags;
// Allocate a message on the `state_stream`.
SetVisibilityPackage *svp;
alloc_message(state_reflection->state_stream, MeshObject::SET_VISIBILITY, &svp);
// Fill in message information.
svp->object_type = RenderMeshObject::TYPE;
// The render handle that got assigned in `WorldRenderInterface::create`
// to be able to associate the main thread object with its render thread
// representation.
svp->render_handle = render_handle;
// The new flags value.
svp->flags = _flags;
}
</code></pre>
<h1>Getting the recorded state to the render thread</h1>
<p>Let's take a step back and explain what happens in the main update loop during the following code excerpt:</p>
<pre><code>// Flush state changes recorded on the `StateStream` for each world to
// the rendering world representation.
unsigned n_worlds = _worlds.size();
for (uint32_t i = 0; i < n_worlds; ++i) {
auto &world = *_worlds[i];
_render_interface->update_world(world);
}
</code></pre>
<p>When Lua is done creating, destroying and manipulating objects during <code>update()</code>, each world's <code>StateStream</code>, which contains all the recorded changes, is ready to be sent over to the render thread for consumption. The call to <code>RenderInterface::update_world()</code> does just that. It roughly looks like:</p>
<pre><code>void RenderInterface::update_world(World &world)
{
UpdateWorldMsg uw;
// Get the render thread representation of the `world`.
uw.render_world = render_world_representation(world);
// The world's current `state_stream` that contains all changes made
// on the main thread.
uw.state_stream = world._world_render_interface.state_stream;
// Create and assign a new `state_stream` to the world's `_world_render_interface`
// that will be used for the next frame.
world._world_render_interface.state_stream = new_state_stream();
// Post a message to the render thread to update the world.
post_message(UPDATE_WORLD, &uw);
}
</code></pre>
<p>This function will create a new message and post it to the render thread. The world being flushed and its <code>StateStream</code> are stored in the message and a new <code>StateStream</code> is created that will be used for the next frame. This new <code>StateStream</code> is set on the <code>WorldRenderInterface</code> of the <code>World</code>, and since all objects being created got a pointer to the same <code>WorldRenderInterface</code> they will use the newly created <code>StateStream</code> when storing state changes for the next frame.</p>
<h1>Render thread</h1>
<p>The render thread is spinning in a message loop:</p>
<pre><code>void RenderInterface::render_thread_entry()
{
while (!_quit) {
// If there's no message -- put the thread to sleep until there's
// a new message to consume.
RenderMessage *message = get_message();
void *data = data(message);
switch (message->type) {
case UPDATE_WORLD:
internal_update_world((UpdateWorldMsg*)(data));
break;
// ... And a lot more case statements to handle different messages. There
// are other threads than the main thread that also communicate with the
// render thread. E.g., the resource loading happens on its own thread
// and will post messages to the render thread.
}
}
}
</code></pre>
<p>The <code>internal_update_world()</code> function is defined as:</p>
<pre><code>void RenderInterface::internal_update_world(UpdateWorldMsg *uw)
{
// Call update on the `render_world` with the `state_stream` as argument.
uw->render_world->update(uw->state_stream);
// Release and recycle the `state_stream`.
release_state_stream(uw->state_stream);
}
</code></pre>
<p>It calls <code>update()</code> on the <code>RenderWorld</code> with the <code>StateStream</code> and when that is done the <code>StateStream</code> is released to a pool.</p>
<pre><code>void RenderWorld::update(StateStream *state_stream)
{
MessageHeader *message_header;
StatePackageHeader *package_header;
// Consume a message and get the `message_header` and `package_header`.
while (get_message(state_stream, &message_header, (void**)&package_header)) {
switch (package_header->object_type) {
case RenderWorld::TYPE:
{
auto omp = (WorldRenderInterface::ObjectManagementPackage*)package_header;
// The call to `WorldRenderInterface::create` created this message.
if (message_header->type == WorldRenderInterface::CREATE)
create_object(omp);
break;
}
case RenderMeshObject::TYPE:
{
if (message_header->type == MeshObject::SET_VISIBILITY) {
auto svp = (MeshObject::SetVisibilityPackage*)package_header;
// The `render_handle` is used to do a lookup in `_object_lut` to
// get the `object_index`.
uint32_t object_index = _object_lut[package_header->render_handle];
// Get the `render_object`.
void *render_object = _objects[object_index];
// Cast it since the type is already given from the `object_type`
// in the `package_header`.
auto rmo = (RenderMeshObject*)render_object;
// Call update on the `RenderMeshObject`.
rmo->update(message_header->type, svp);
}
break;
}
// ... And a lot more case statements to handle different kinds of messages.
}
}
}
</code></pre>
<p>The above is mostly infrastructure to extract messages from the <code>StateStream</code>. It can be a bit involved since a lot of stuff is written out explicitly but the basic idea is hopefully simple and easy to understand.</p>
<p>On to the <code>create_object</code> call done when <code>(message_header->type == WorldRenderInterface::CREATE)</code> is satisfied:</p>
<pre><code>void RenderWorld::create_object(WorldRenderInterface::ObjectManagementPackage *omp)
{
// Acquire an `object_index`.
uint32_t object_index = _objects.size();
// Same recycling mechanism as seen for render handles.
if (_free_object_indices.any()) {
object_index = _free_object_indices.back();
_free_object_indices.pop_back();
} else {
_objects.resize(object_index + 1);
_object_types.resize(object_index + 1);
}
void *render_object = omp->user_data;
if (omp->type == RenderMeshObject::TYPE) {
// Cast the `render_object` to a `MeshObject`.
RenderMeshObject *rmo = (RenderMeshObject*)render_object;
// If needed, do more stuff with `rmo`.
}
// Store the `render_object` and `type`.
_objects[object_index] = render_object;
_object_types[object_index] = omp->type;
if (omp->render_handle >= _object_lut.size())
_object_lut.resize(omp->render_handle + 1);
// The `render_handle` is used to map to the `object_index`.
_object_lut[omp->render_handle] = object_index;
}
</code></pre>
<p>So the takeaway from the code above lies in the general usage of the <code>render_handle</code> and the <code>object_index</code>. The <code>render_handle</code> of an object is used to do a lookup in <code>_object_lut</code> to get the <code>object_index</code>, which in turn gives the object and its <code>type</code>. Let's look at an example, the same <code>RenderWorld::update</code> code presented earlier but this time the focus is when the message is <code>MeshObject::SET_VISIBILITY</code>:</p>
<pre><code>void RenderWorld::update(StateStream *state_stream)
{
StateStream::MessageHeader *message_header;
StatePackageHeader *package_header;
while (get_message(state_stream, &message_header, (void**)&package_header)) {
switch (package_header->object_type) {
case RenderMeshObject::TYPE:
{
if (message_header->type == MeshObject::SET_VISIBILITY) {
auto svp = (MeshObject::SetVisibilityPackage*)package_header;
// The `render_handle` is used to do a lookup in `_object_lut` to
// get the `object_index`.
uint32_t object_index = _object_lut[package_header->render_handle];
// Get the `render_object` from the `object_index`.
void *render_object = _objects[object_index];
// Cast it since the type is already given from the `object_type`
// in the `package_header`.
auto rmo = (RenderMeshObject*)render_object;
// Call update on the `RenderMeshObject`.
rmo->update(message_header->type, svp);
}
break;
}
}
}
}
</code></pre>
<p>The state reflection pattern shown in this post is a fundamental part of the engine. Similar patterns appear in other places as well and having a good understanding of this pattern makes it much easier to understand the internals of the engine.</p>Andreas Asplundhttp://www.blogger.com/profile/07360893562204949248noreply@blogger.com66tag:blogger.com,1999:blog-1994130783874175266.post-82473440980044024842016-09-06T13:23:00.000+02:002016-09-06T13:23:13.820+02:00A New Localization System for Stingray<p>The current Stingray localization system is based around the concept of
<em>properties</em>. A property is any period separated part of the file name
before the extension. Consider the following three files:</p>
<ul>
<li><code>trees/larch_03.unit</code></li>
<li><code>trees/larch_03.fr.unit</code></li>
<li><code>trees/larch_03.ps4.unit</code></li>
</ul>
<p>These three files all have the same type (<code>.unit</code>), and the same name
(<code>trees/larch_03</code>), but their properties differ. The first one has
no properties set. The second one has the property <code>.fr</code> and the last
one has the property <code>.ps4</code>. (Note that resources can have more than
one property.)</p>
<p>Properties are resolved in slightly different ways, depending on
the kind of property. <em>Platform properties</em> are resolved at compile
time, so if you compile for PS4, you will get the PS4 version of
the resource (or the default version if there is no <code>.ps4</code> specific
version).</p>
<p>Other properties are resolved at <em>resource load time</em>. When you load
a bunch of resources, which property variant is loaded
depends on a global <em>property preference order</em> set
from the script. A property preference order of <code>['.fr', '.es']</code> means
that resources with the property <code>.fr</code> are be preferred, then resources
with the property <code>.es</code> (if no <code>.fr</code> resource is available), and finally
a resource without any properties at all.</p>
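<p>Conceptually, resolving which variant to load is just a linear scan over the preference order. The following sketch only illustrates the idea; the types and the function are made up for this example and are not the engine's resource lookup code:</p>
<pre><code>// Illustration only: pick which property variant of a resource to load, given a
// global preference order such as {".fr", ".es"} and the set of variants that
// actually exist for the resource. Requires <string>, <vector> and <set>.
std::string resolve_property_variant(const std::vector<std::string> &preference_order,
    const std::set<std::string> &available_properties)
{
    for (const std::string &p : preference_order) {
        if (available_properties.count(p))
            return p;
    }
    // Fall back to the resource without any properties.
    return "";
}
</code></pre>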
<p>This single mechanism is used for localizing strings, sounds, textures,
etc. Strings, for example, are stored in <code>.strings</code> files, which are
essentially just key-value stores:</p>
<pre><code>file = "File"
open = "Open"
...
</code></pre>
<p>To create a French localized version of this <code>menu.strings</code> resource, you
just create a <code>menu.fr.strings</code> resource and fill it with:</p>
<pre><code>file = "Fichier"
open = "Ouvert"
...
</code></pre>
<p>This basic localization system has served us well for many years, but
it has some drawbacks that are starting to become more pronounced:</p>
<ul>
<li><p>It doesn't allow file names with periods in them. Since we always
interpret periods as properties, periods can't be a part of the
regular file name. This isn't a huge problem when users name their own
files, but as we are increasing the interoperability between Stingray
and other software packages we more and more run into software that has,
let's say <em>peculiar</em>, ways of naming its files. Renaming things by hand
is cumbersome and can also break things when files cross-reference
each other.</p></li>
<li><p>Switching language requires reloading the resource packages. This seems
overly complicated. We have more memory these days than when we started
building Stingray. In many cases,
especially for strings, it makes more sense to keep them in memory all
the time, so we can switch between them easily.</p></li>
<li><p>Just switching on platform isn't enough. Mobile devices range
from very low-end to at least mid-end. Rather than having <code>.ios</code> and <code>.android</code>
properties, we might want <code>.low-quality</code> and <code>.high-quality</code> and
select which one to use based on the actual capabilities of the
hardware.</p></li>
<li>
<p>Making editors work well with the property system has been surprisingly
complicated. For example, when the editor runs on Windows,
what should it show if there is a <code>.win32</code> specialization of a
resource -- the default version or the <code>.win32</code> one? How would you
edit a <code>.ps4</code> resource when those are normally stripped out of the
Windows runtime?</p>
<p>We used to have this wonky thing where you could sort of
cross-compile the resources and say "I want to run on Windows,
but <em>as if</em> I was running on PS4." But to be honest, that system
never really worked that well, and in the new editor we have
gotten rid of it.</p>
</li>
</ul>
<p>Interestingly, out of all these problems, it is the first one -- the
most stupid one -- that is the main impetus for change.</p>
<h2>
<a id="user-content-the-new-system" class="anchor" href="#the-new-system" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>The New System</h2>
<p>The new system has several parts. First, we decided that for systems
that deal with localization a lot, such as strings and sounds it makes
sense to have the system actually be aware of localization. That way,
we can provide the best possible experience.</p>
<p>So the <code>.strings</code> format has changed to:</p>
<pre><code>file = {en = "File", fr = "Fichier", ...}
open = {en = "Open", fr = "Ouvert", ...}
...
</code></pre>
<p>All the languages are stored in the same file and to switch language
you just call <code>Localizer.set_language("fr")</code>. We keep all the different
languages in memory at all times. Even for a game with ridiculous
amounts of text this still doesn't use much memory and it means we
can hot-swap languages instantly.</p>
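<p>Under the hood, this can be as simple as a two-level lookup that is fully resident in memory. The sketch below only illustrates the idea; the names and structure are made up and this is not the actual engine implementation:</p>
<pre><code>// Illustration of keeping every language in memory and switching instantly.
// Requires <string> and <unordered_map>.
struct StringTable
{
    // key -> (language -> localized string), e.g. "open" -> {"en": "Open", "fr": "Ouvert"}.
    std::unordered_map<std::string, std::unordered_map<std::string, std::string>> strings;
    std::string language = "en";

    void set_language(const std::string &lang) { language = lang; }

    const std::string &lookup(const std::string &key) const
    {
        const auto &variants = strings.at(key);
        const auto it = variants.find(language);
        // Fall back to English if the key has no translation for the current language.
        return it != variants.end() ? it->second : variants.at("en");
    }
};
</code></pre>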
<p>This is a nice approach, but it doesn't work for all resources. We
don't want to add this deep kind of integration to resources that are
normally not localized, such as <code>.unit</code> and <code>.texture</code>. Still, there
sometimes is a need to localize such resources. For example, a <code>.texture</code>
might have text in it that needs to be localized. We may need a low-poly
version of a <code>.unit</code> for a less capable platform. Or a less gory version
of an animation for countries with stricter age ratings.</p>
<p>To make things easier for the editor we decided to ditch the property
system altogether, and instead go for a substitution strategy. There
are no special magical parts of a resource's path -- it is just a name
and a type. But if you want to, you can say to the engine that all
instances of a certain resource should be replaced with another resource:</p>
<pre><code>trees/larch_03.unit → trees/larch_03_ps4.unit
</code></pre>
<p>Note here that there is nothing special or magical about the
<code>trees/larch_03_ps4.unit</code>. There is no problem with displaying it on Windows.
You just edit it in the editor, like any other unit. However, when you play
the game -- any time a <code>trees/larch_03.unit</code> is requested by the engine, a
<code>trees/larch_03_ps4.unit</code> is substituted. So if you have authored a level
full of <code>larch_03</code> units, when the override above is in place, you will instead
see <code>larch_03_ps4</code> units.</p>
<p>There are many ways for this scheme to go wrong. The gameplay script might
expect to find a certain node <code>branch_43</code> in the unit -- a node that exists in
<code>larch_03.unit</code>, but not in <code>larch_03_ps4.unit</code> and this may lead to unexpected
behavior. The same problem existed in the old property system. We don't try to
do anything special about this, because it is impossible. In the end, it is only
the gameplay script that can know what it means for two things to be <em>similar
enough</em> to be used interchangeably. Anyone working with localized resources just
has to be careful not to break things.</p>
<p>Overrides can be specified from the Lua script:</p>
<div class="highlight highlight-source-lua"><pre>Application.<span class="pl-c1">set_resource_override</span>(<span class="pl-s"><span class="pl-pds">"</span>unit<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>trees/larch_03<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>trees/larch_03_ps4<span class="pl-pds">"</span></span>);</pre></div>
<p>Note that this is a much more powerful system than the old property system.
Any resource can be set to override any other -- we are not restricted to work
within the strict naming scheme required by the property system. Also, the
override is dynamic and can be determined at runtime. So it can be based on dynamic
properties, such as measured CPU or GPU performance -- or a user setting
for the amount of gore they are comfortable with.</p>
<p>It can even be used for completely different things than localization or platform
specific resources -- such as replacing the units in a level for a night-time
or psychedelic version of the same level. And I'm sure our users will find many
other ways of (ab)using this mechanism.</p>
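<p>On the engine side, a dynamic override like this boils down to a lookup table that is consulted every time a resource is requested. The sketch below is only an illustration of the idea; the names and structure are assumptions, not Stingray's actual resource manager:</p>
<pre><code>// Illustration of a dynamic resource override table. Requires <map> and <string>.
struct ResourceOverrides
{
    // (type, name) -> overriding resource name.
    std::map<std::pair<std::string, std::string>, std::string> overrides;

    void set(const std::string &type, const std::string &name, const std::string &override_name)
    {
        overrides[{type, name}] = override_name;
    }

    // Called whenever a resource is requested; returns the name that should actually be used.
    const std::string &resolve(const std::string &type, const std::string &name) const
    {
        const auto it = overrides.find({type, name});
        return it != overrides.end() ? it->second : name;
    }
};
</code></pre>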
<p>But this dynamic system is not quite enough to do everything we want to do.</p>
<p>First, since the override is dynamic and only happens at runtime, our packaging
system can't be aware of it. Normally, our packaging system figures out all
resource dependencies automatically. So when you say that you want a package with
the <code>forest</code> level, the packaging system will automatically pull in the
<code>larch_03</code> unit that is used in that level, any textures used by that unit, etc.
But since the packaging system can't know that at runtime you will replace
<code>larch_03</code> with <code>larch_03_ps4</code>, it doesn't know that <code>larch_03_ps4</code> and its
dependencies should go into the package as well.</p>
<p>You could add <code>larch_03_ps4</code> to the package manually, since <em>you</em> know it will be
used. That might work if you only have one or two overrides. However,
even with a fairly small number of overrides, micromanaging packages in this way
becomes incredibly tedious and error-prone.</p>
<p>Second, we don't want to burden the packages with resources that will never
be used. If we are making a game for digital distribution on iOS or Android
we don't want to include large PS4-only resources in that game.</p>
<p>So we need a static override mechanism that is known by the package manager
to make sure it includes and excludes the right resources. The simplest thing
would be a big file that just listed all the overrides. For example, to
override <code>larch_03</code> on PS4 we would write something like:</p>
<pre><code>resource_overrides = [
{
type = "unit"
name = "trees/larch_03"
override = "trees/larch_03_ps4"
platforms = ["ps4"]
}
]
</code></pre>
<p>This would work, but could again get pretty tedious if there are a lot of overrides.
It would be nice to have something that was a bit more automatic.</p>
<p>Since our users are already used to using name suffixes such as <code>.fr</code> and <code>.ps4</code>
for localization, we decided to build on the same mechanism -- creating overrides
automatically based on suffix rules:</p>
<pre><code>resource_overrides = [
{suffix = "_ps4", platforms = ["ps4"]}
]
</code></pre>
<p>This rule says that when we are compiling for the platform PS4, if we find a resource
that has the same name as another resource, but with the added suffix <code>_ps4</code>, that
resource will automatically be registered as an override for that resource:</p>
<pre><code>trees/larch_03.unit → trees/larch_03_ps4.unit
leaves/larch_leaves.texture → leaves/larch_leaves_ps4.texture
</code></pre>
<p>In addition to platform settings, the system also generalizes to support other flags:</p>
<pre><code>resource_overrides = [
{suffix = "_fr", flags = ["fr"]}
{suffix = "_4k", flags = ["4K"]}
{suffix = "_noblood", flags = ["noblood", "PG-13"]}
]
</code></pre>
<p>This defines the <code>_fr</code> suffix for French localization. A 4K suffix <code>_4k</code> for high-quality
versions of resources suitable for 4K monitors. And a <code>_noblood</code> suffix that selects
resources without blood and gore.</p>
<p>The flags can be set at compile time with:</p>
<pre><code>--compile --resource-flag-true 4K
</code></pre>
<p>This means that we are compiling a <code>4K</code> version of the game, so when bundling <em>only</em> the
4K resources will be included and the other versions will be stripped out. Just as if
we were compiling for a specific platform.</p>
<p>But we can also choose to resolve the flags at runtime:</p>
<pre><code>--compile --resource-flag-runtime noblood
</code></pre>
<p>With this setting, both the regular resource and the <code>_noblood</code> resource will be included
in the package and loaded into memory. And we can hot swap between them with:</p>
<div class="highlight highlight-source-lua"><pre>Application.<span class="pl-c1">set_resource_flag</span>(<span class="pl-s"><span class="pl-pds">"</span>noblood<span class="pl-pds">"</span></span>, <span class="pl-c1">true</span>)</pre></div>
<p>I have not decided yet whether, in addition to these two alternatives, we should also
have an option that resolves at <em>package load time</em>. I.e., both variants of the resource
would be included on disk, but only one of them would be loaded into memory, and if you
wanted to switch resources you would have to unload the package and load it back into
memory again.</p>
<p>I can see some use cases for this, but on the other hand adding more options complicates
the system and I like to keep things as simple as possible.</p>
<p>A nice thing about this suffix mapping is that it can be configured to be backwards
compatible with the old property system:</p>
<pre><code>resource_overrides = [
{suffix = ".fr", flags = ["fr"]}
{suffix = ".ps4", platforms = ["ps4"]}
{suffix = ".xb1", platforms = ["xb1"]}
]
</code></pre>
<p>Whenever we change something in Stingray we try to make it more flexible and data-driven,
while at the same time ensuring that the most common cases are still easy to work with.
This rewrite of the localization system is a good example:</p>
<ul>
<li><p>It fixes the problem with periods in file names. Periods are now only an issue if you have
made an explicit suffix mapping that matches them.</p></li>
<li><p>We can switch language (or any other resource setting) at runtime.</p></li>
<li><p>The new system is more flexible -- it doesn't just handle localization and platform
specific resources, we can set up whatever resource categories we want.
And we can even dynamically override individual resources.</p></li>
<li><p>The editor no longer needs to do anything special to deal with the concept of "properties".
Resources that are used to override other resources can be edited in the editor just
like any other resource.</p></li>
<li><p>And the system can easily be configured to be backwards compatible with the old
localization system.</p></li>
</ul>
<p>I still feel slightly queasy about using name matching to drive parts of this system.
Name matching is a practice that can go horribly wrong. But in this case, since the
name matching is completely user controlled I think it makes a good compromise between
purity and usability.</p>
Niklashttp://www.blogger.com/profile/10055379994557504977noreply@blogger.com31tag:blogger.com,1999:blog-1994130783874175266.post-34371133218910644202016-08-16T12:26:00.000+02:002016-08-16T12:26:57.493+02:00Render Config Extensions
<p>The rendering pipe in Stingray is completely data-driven, meaning that everything from which GPU buffers (render targets etc.) are needed to compose the final rendered frame, to the actual flow of the frame, is described in the <code>render_config</code> file - a human-readable JSON file. I have covered this in various presentations [1,2] over the years, so I won’t go into more detail about it in this blog post. Instead, I’d like to focus on a new feature that we are rolling out in Stingray v1.5 - Render Config Extensions.</p>
<p>As Stingray is growing to cater to more industries than game development, we see lots of feature requests that don’t necessarily fit in with our ideas of what should go into the default rendering pipe that we ship with Stingray.
This has made it apparent that we need a way of doing deep integrations of new rendering features without having to duplicate the entire <code>render_config</code> file.</p>
<p>This is where the <code>render_config_extension</code> files come into play. A <code>render_config_extension</code> is very similar to the main <code>render_config</code>, except that instead of having to describe the entire rendering pipe, it appends and inserts different JSON blocks into the main <code>render_config</code>.</p>
<p>When the engine starts, the boot <code>ini</code>-file specifies which <code>render_config</code> to use, as well as an array of <code>render_config_extensions</code> to load when setting up the renderer.</p>
<pre><code>render_config = "core/stingray_renderer/renderer"
render_config_extensions = ["clouds-resources/clouds", "prism/prism"]
</code></pre>
<p>The array describes the initialization order of the extensions, which makes it possible for the project author to control how the different extensions stack on top of each other. It also makes it possible to build extensions that depend on other extensions.</p>
<p>A <code>render_config_extension</code> consists of two root blocks: <em>append</em> and <em>insert_at</em>:</p>
<h2><a id="append_18"></a><em>append</em></h2>
<p>The <em>append</em> block is used for everything that is order independent and allows you to append data to the following root blocks of the main <code>render_config</code>:</p>
<ul>
<li><em>shader_libraries</em> – lists additional shader_libraries to load</li>
<li><em>render_settings</em> – add more render_settings (quality settings, debug flags, etc.)</li>
<li><em>shader_pass_flags</em> – add more shader_pass_flags (used by shader system to dynamically turn on/off passes)</li>
<li><em>global_resources</em> – additional global GPU resources to allocate on boot</li>
<li><em>resource_generators</em> – expose new resource_generators</li>
<li><em>viewports</em> – expose new viewport templates</li>
<li><em>lookup_tables</em> – append to the list of resource_generators to execute when booting the renderer (mainly used for generating lookup tables)</li>
</ul>
<p>One thing to note about extending these blocks is that we currently do not do any kind of name collision checking, so using a prefix to mimic a namespace for your extension is probably a good idea.</p>
<pre><code>// example append block from JPs volumetric clouds plugin
append = {
    render_settings = {
        clouds_enabled = true
        clouds_raw_data_visualization = false
        clouds_weather_data_visualization = false
    }
    shader_libraries = [
        "clouds-resources/clouds"
    ]
    global_resources = [
        // Clouds modelling resources:
        { name="clouds_result_texture1" type="render_target" image_type="image_3d" width=256 height=256 layers=256 format="R8G8B8A8" }
        { name="clouds_result_texture2" type="render_target" image_type="image_3d" width=64 height=64 layers=64 format="R8G8B8A8" }
        { name="clouds_result_texture3" type="render_target" image_type="image_2d" width=128 height=128 format="R8G8B8A8" }
        { name="clouds_weather_texture" type="render_target" image_type="image_2d" width=256 height=256 format="R8G8B8A8" }
    ]
}
</code></pre>
<h2><a id="insert_at_55"></a><em>insert_at</em></h2>
<p>The <em>insert_at</em> block allows you to insert layers and modifiers into already existing <em>layer_configurations</em> and <em>resource_generators</em>, either belonging to the main <code>render_config</code> file or to a <code>render_config_extension</code> listed earlier in the <code>render_config_extensions</code> array of the engine boot <code>ini</code>-file.</p>
<pre><code>// example insert_at block from JPs volumetric clouds plugin
insert_at = {
    post_processing_development = {
        modifiers = [
            { type="dynamic_branch" render_settings={ clouds_weather_data_visualization=true }
                pass = [
                    { type="fullscreen_pass" shader="debug_weather" input=["clouds_weather_texture"] output=["output_target"] }
                ]
            }
        ]
    }
    skydome = {
        layers = [
            { resource_generator="clouds_modifier" profiling_scope="clouds" }
        ]
    }
}
</code></pre>
<p>The object names under the <em>insert_at</em> block refer to <code>extension_insertion_points</code> listed in the main <code>render_config</code> file or in one of the previously loaded <code>render_config_extension</code> files. We’ve chosen not to allow extensions to inject anywhere they like (using line numbers or similar craziness); instead we expose a bunch of extension “hooks” at various places in the main <code>render_config</code> file. By doing this we hope to have a somewhat better chance of not breaking existing extensions as we continue to develop and potentially do bigger refactorings of the default <code>render_config</code> file.</p>
<h2><a id="Future_work_82"></a>Future work</h2>
<p>This extension mechanism is somewhat of an experiment and we might need to rethink parts of it in a later version of Stingray. We’ve briefly discussed a potential need for dealing with versioning, i.e. allowing extensions to explicitly list which versions of Stingray they are compatible with (and maybe also allowing extensions to have deviating implementations depending on the version). Some kind of enforced namespacing and more aggressive validation to avoid name collisions have also been debated.</p>
<p>In the end we decided to ignore these potential problems for now and instead push for getting a first version out in 1.5 to unblock plugin developers and internal teams wanting to do efficient “deep” integrations of various rendering features. Hopefully we won’t regret this decision too much later on. ;)</p>
<h2><a id="References_88"></a>References</h2>
<ul>
<li>[1] Flexible Rendering for Multiple Platforms (Tobias Persson, GDC 2012)</li>
<li>[2] Benefits of data-driven renderer (Tobias Persson, GDC 2011)</li>
</ul>
Tobiashttp://www.blogger.com/profile/16240529312060411542noreply@blogger.com249tag:blogger.com,1999:blog-1994130783874175266.post-88092063594548827192016-07-31T13:19:00.000+02:002016-07-31T13:19:03.301+02:00Volumetric Clouds<p>
There has been a lot of progress made recently with volumetric clouds in games. The folks from <a href="http://reset-game.net/">Reset</a> have posted a great <a href="http://reset-game.net/?p=284">article</a> regarding their custom dynamic clouds solution, Egor Yusov published <a href="http://gpupro.blogspot.ca/2015/01/gpu-pro-6-real-time-rendering-of.html">Real-time Rendering of Physics-Based Clouds using Precomputed Scattering</a> in GPU Pro 6, last year Andrew Schneider presented <a href="http://advances.realtimerendering.com/s2015/index.html">Real-time Volumetric Cloudscapes of Horizon: Zero Dawn</a>, and just last week Sébastien Hillaire presented <a href="http://s2016.siggraph.org/courses/sessions/physically-based-shading-theory-and-practice">Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite</a>. Inspired by all this latest progress we decided to implement a Stingray plugin to get a feel for the challenge that is real time clouds rendering.
</p>
<iframe align="middle" src="https://player.vimeo.com/video/176699598" width="685" height="275" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
<p>
Note: This article isn't an introduction to volumetric cloud rendering but more of a small log of the development process of the plugin. Also, you can try it out for yourself or look at the code by downloading the <a href="https://github.com/greje656/clouds">Stingray plugin</a>. Feel free to contribute!
</p>
<p></p>
<h3>Modeling</h3>
<p>
The modeling of our clouds is heavily inspired by the <a href="http://patapom.com/topics/Revision2013/Revision%202013%20-%20Real-time%20Volumetric%20Rendering%20Course%20Notes.pdf">Real-time Volumetric Rendering Course Notes</a> and <a href="http://advances.realtimerendering.com/s2015/index.html">Real-time Volumetric Cloudscapes of Horizon: Zero Dawn</a>. It uses a set of 3d and 2d noises that are modulated by a coverage and altitude term to generate the 3d volume to be rendered.
</p>
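<p>
To make the recipe a bit more concrete, here is a rough JavaScript sketch of the kind of density function described in the course notes. The helper names, remap ranges and altitude gradient are illustrative assumptions, not the plugin's actual code.
</p>
<pre><code class="lang-js">// Classic remap helper used throughout the cloud-modeling literature.
function remap(v, oldMin, oldMax, newMin, newMax) {
    return newMin + ((v - oldMin) / (oldMax - oldMin)) * (newMax - newMin);
}

function clamp01(v) { return Math.min(Math.max(v, 0.0), 1.0); }

// Simple altitude term: density fades in near the bottom of the cloud layer
// and out near the top (h is the normalized height inside the layer, 0..1).
function altitudeGradient(h) {
    return clamp01(remap(h, 0.0, 0.1, 0.0, 1.0)) * clamp01(remap(h, 0.8, 1.0, 1.0, 0.0));
}

// Combine the noises with the coverage and altitude terms into a density value.
function cloudDensity(baseNoise, detailNoise, coverage, h) {
    // Carve the low-frequency base shape with the coverage term...
    let d = remap(baseNoise * altitudeGradient(h), 1.0 - coverage, 1.0, 0.0, 1.0);
    // ...then erode the edges with the high-frequency detail noise.
    d = remap(d, detailNoise * 0.2, 1.0, 0.0, 1.0);
    return Math.max(d, 0.0);
}
</code></pre>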
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_OzJPrmNdTyvc2X3ubnb76HOALukJ4GoLl9QlzVW602dzbFuYAisD3lgOIRrxicQ4cXxrjG3EBQij4cUyK4_V98hNrCt2hDjMCRggU7gJeXIJC8YtIKCBj9uqyZQ6QYRE4DJCs3W9GL7R/s1600/noises.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_OzJPrmNdTyvc2X3ubnb76HOALukJ4GoLl9QlzVW602dzbFuYAisD3lgOIRrxicQ4cXxrjG3EBQij4cUyK4_V98hNrCt2hDjMCRggU7gJeXIJC8YtIKCBj9uqyZQ6QYRE4DJCs3W9GL7R/s640/noises.png" width="640" height="478" /></a></div>
<p>
I was really impressed by the shapes that can be created from such simple building blocks. While you can definitely see cases where some tiling occurs, it’s not as bad as you would imagine. Once the textures are generated, the tough part is to find the right spaces and scales at which they should be sampled in the atmosphere. It’s difficult to strike a good balance between tiling artifacts and getting enough high-frequency detail in the clouds. On top of that, cache hit rates are greatly affected by the sampling scale used, so that is another factor to consider.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzWr-gQZkGvGlQckJh8wtf9p9zbq-i3u6dnANKolQC1OJ7ZBaW9kqxLK9NPslwUIJWKPS6MvUk3e7ry_PbNbond8fVLryvZXyKyZrEtn4HkQTrkS10ZkK0vG3yanLiZC3Hf-GlVjJEaKlf/s1600/scales.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzWr-gQZkGvGlQckJh8wtf9p9zbq-i3u6dnANKolQC1OJ7ZBaW9kqxLK9NPslwUIJWKPS6MvUk3e7ry_PbNbond8fVLryvZXyKyZrEtn4HkQTrkS10ZkK0vG3yanLiZC3Hf-GlVjJEaKlf/s640/scales.png" width="640" height="640" /></a></div>
<p>
Finding good sampling scales for all of these textures and choosing how much the extrusion texture should affect the low-frequency clouds is very time consuming. With some time you eventually build intuition for what will look good in most scenarios, but it’s definitely a difficult part of the process.
</p>
<p>
We also generate some curl noise, which is used to perturb and animate the clouds slightly. I've found that adding noise to the sampling position also reduces the linear filtering artifacts that can arise when ray marching these low-resolution 3d textures.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggVH_GCXdKlBvT2U3j5NTpAKeePXxdKtPsHpVJotzlK5DxX-X9f43v-FOhHCTKMv_0V4wD7ETUHFiCTjzUXEMeARpCNhul9mGY-Vkic-GYv5Q2hJKTxlbTEIubcnpMd67FDqu_7tqph71c/s1600/curl.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggVH_GCXdKlBvT2U3j5NTpAKeePXxdKtPsHpVJotzlK5DxX-X9f43v-FOhHCTKMv_0V4wD7ETUHFiCTjzUXEMeARpCNhul9mGY-Vkic-GYv5Q2hJKTxlbTEIubcnpMd67FDqu_7tqph71c/s640/curl.png" width="640" height="190" /></a></div>
<p>
One thing that often bothered me is the oddly shaped cumulus clouds that can arise from tiled 3d noise. Those cases are particularly noticeable for distant clouds. Adding extra cloud coverage for lower-altitude sampling positions minimizes this artifact.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKTCz529KGrgp9xyXZX1oZ9ibCUcdsgTm8djH-s8WN_-az8wNzJb5heLVbJfe4jp8XMk-5jkqKBRzqxE9RKadXMP3nhE0bI4FLh5zSi1uv6-fcuxrrq4M6KPhseGaYfcZnnYb-P6hTbyD/s1600/extra-cov.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKTCz529KGrgp9xyXZX1oZ9ibCUcdsgTm8djH-s8WN_-az8wNzJb5heLVbJfe4jp8XMk-5jkqKBRzqxE9RKadXMP3nhE0bI4FLh5zSi1uv6-fcuxrrq4M6KPhseGaYfcZnnYb-P6hTbyD/s640/extra-cov.png" width="640" height="190" /></a></div>
<p>
Raymarching the volume at full resolution is too expensive even for high end graphics cards. So as suggested by <a href="http://advances.realtimerendering.com/s2015/index.html">Real-time Volumetric Cloudscapes of Horizon: Zero Dawn</a> we reconstruct a full frame over 16 frames. I've found that to retain enough high frequency details of the clouds, we need a fairly high number of samples. We are currently using 256 steps when raymarching. We offset the starting position of the ray by a 4x4 Bayer matrix pattern to reduce banding artifacts that might appear due to undersampling. Mikkel Gjoel shared some great tips for banding reduction while presenting <a href="http://www.gdcvault.com/play/1023002/Low-Complexity-High-Fidelity-INSIDE">The Rendering Of Inside</a> and encouraged the use of blue noise to remove banding patterns. While this gives better results there is a nice advantage of using a 4x4 pattern here: since we are rendering interleaved pixels it means that when rendering one frame we are rendering all pixels with the same Bayer offset. This yields a significant improvement in cache coherency compared to using a random noise offset per pixel. We also use an animated offset which allows us to gather a few extra samples through time. We use a 1d Halton sequence of 8 values and instead of using 100% of the 16ᵗʰ frame we use something like 75% to absorb the Halton samples.
</p>
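<p>
For illustration, the offsetting boils down to something like the following JavaScript sketch; the function name and how the offset is applied to the ray are assumptions.
</p>
<pre><code class="lang-js">// 4x4 Bayer (ordered dither) matrix, normalized to [0, 1).
const BAYER_4X4 = [
     0,  8,  2, 10,
    12,  4, 14,  6,
     3, 11,  1,  9,
    15,  7, 13,  5
].map(function (v) { return v / 16.0; });

// Offset the ray start by a fraction of one step. Because the full frame is
// reconstructed from interleaved pixels, every pixel rendered in a given frame
// shares the same Bayer offset, which keeps the texture fetches coherent
// compared to a per-pixel random offset.
function rayStartOffset(pixelX, pixelY, stepSize) {
    return BAYER_4X4[(pixelY % 4) * 4 + (pixelX % 4)] * stepSize;
}
</code></pre>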
<p>
To re-project the cloud volume we try to find a good approximation of the cloud's world position. While raymarching we track a weighted sum of the absorption position and generate a motion vector from it.
</p>
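<p>
The bookkeeping looks roughly like this in JavaScript; how each step is weighted is an assumption here, and the plugin may weight the samples differently.
</p>
<pre><code class="lang-js">// Opacity-weighted average of the sample positions along the ray.
// "positions" are world-space sample points, "weights" are how much each
// step contributed to the final opacity of the pixel.
function absorptionPosition(positions, weights, rayEnd) {
    let sum = [0, 0, 0];
    let totalWeight = 0;
    for (let i = 0; i < positions.length; i++) {
        sum[0] += positions[i][0] * weights[i];
        sum[1] += positions[i][1] * weights[i];
        sum[2] += positions[i][2] * weights[i];
        totalWeight += weights[i];
    }
    if (totalWeight === 0.0)
        return rayEnd; // nothing was absorbed, fall back to the end of the ray
    return [sum[0] / totalWeight, sum[1] / totalWeight, sum[2] / totalWeight];
}
// The returned position is reprojected with the previous frame's view-projection
// matrix to build the motion vector used when accumulating frames.
</code></pre>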
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkqTf8Fru90f-mHp2Qekew9tu9-BJikSN2eNXzBz4pG7nWZp6gbQJRhzD5c0GJWTkBNePxDhwZdlVeUD5mIiwgQ-KwXUpSEsOIFoe2kSpSPWHw4LMwjAGV_J9A6Tf6Tr0rAF_Cmq3QjIAp/s1600/world-pos.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkqTf8Fru90f-mHp2Qekew9tu9-BJikSN2eNXzBz4pG7nWZp6gbQJRhzD5c0GJWTkBNePxDhwZdlVeUD5mIiwgQ-KwXUpSEsOIFoe2kSpSPWHw4LMwjAGV_J9A6Tf6Tr0rAF_Cmq3QjIAp/s640/world-pos.png" width="640" height="316" /></a></div>
<p>
This allows us to reproject clouds with <em>some</em> degree of accuracy. Since we only build one full-resolution frame every 16 frames, it’s important to track the samples as precisely as possible. This is especially true when the clouds are animated. Finding the right number of temporal samples to integrate over time is a compromise between getting a smoother signal for trackable pixels and having a noisier signal for invalidated pixels.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD8IZDmrjh0cvtjmii3Srtb3ds-v2pcer0a4XjAayhGXnQHOM4wHdeb47Rc1RTZoDhIVKkv8UWhf2UviknsARX97Q5FOFM4HYCY4EBvPmaQGRxGvrsX02I54w4HzKWR-eFK5GcuxgomFrj/s1600/reprojection.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD8IZDmrjh0cvtjmii3Srtb3ds-v2pcer0a4XjAayhGXnQHOM4wHdeb47Rc1RTZoDhIVKkv8UWhf2UviknsARX97Q5FOFM4HYCY4EBvPmaQGRxGvrsX02I54w4HzKWR-eFK5GcuxgomFrj/s1600/reprojection.gif" /></a></div>
<p></p>
<h3>Lighting</h3>
<p>
To light the volume we use the "Beer-Powder" term described by <a href="http://advances.realtimerendering.com/s2015/index.html">Real-time Volumetric Cloudscapes of Horizon: Zero Dawn</a>. It's a nice model since it simulates some of the out-scattering that occurs at the edges of the clouds. We discovered early on that it was going to be difficult to find terms that looked good for both close and distant clouds. So (for now anyways) a lot of the scattering and extinction coefficients are view dependent. This proved to be a useful way of building intuition for how each term affects the lighting of the clouds.
</p>
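<p>
For reference, the "Beer-Powder" term from that presentation boils down to something like the sketch below. The constants, and how the plugin combines it with the phase function and the view-dependent coefficients, are assumptions.
</p>
<pre><code class="lang-js">// Beer-Lambert extinction darkens thick clouds, while the "powder" term
// darkens the thin edges facing the light, approximating lost out-scattering.
function beerPowder(opticalDepth) {
    const beer = Math.exp(-opticalDepth);
    const powder = 1.0 - Math.exp(-2.0 * opticalDepth);
    return 2.0 * beer * powder;
}
</code></pre>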
<p>
We also added the ambient term described by the <a href="http://patapom.com/topics/Revision2013/Revision%202013%20-%20Real-time%20Volumetric%20Rendering%20Course%20Notes.pdf">Real-time Volumetric Rendering Course Notes</a> which is very useful to add detail where all light is absorbed by the volume.
</p>
<p>
The ambient function described takes three parameters: sampling altitude, bottom color and top color. Instead of using constant values, we calculate these values by sampling the atmosphere at a few key locations. This means our ambient term is dynamic and will reflect the current state of the atmosphere. We use two pairs of samples perpendicular to the sun vector and average them to get the bottom and top ambient colors respectively.
</p>
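<p>
One possible reading of that setup, sketched in JavaScript -- the atmosphere sampler signature, the choice of perpendicular directions and the altitudes are all assumptions made for illustration.
</p>
<pre><code class="lang-js">function cross(a, b) { return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]; }
function normalize(v) { const l = Math.hypot(v[0], v[1], v[2]); return [v[0]/l, v[1]/l, v[2]/l]; }
function average(a, b) { return [(a[0]+b[0])*0.5, (a[1]+b[1])*0.5, (a[2]+b[2])*0.5]; }

// sampleAtmosphere(direction, altitude) is assumed to return the sky color
// [r, g, b] seen along the given direction at the given altitude.
function ambientColors(sampleAtmosphere, sunDir) {
    // Two directions perpendicular to the sun vector (and to each other).
    const helper = Math.abs(sunDir[2]) < 0.99 ? [0, 0, 1] : [1, 0, 0];
    const side1 = normalize(cross(sunDir, helper));
    const side2 = normalize(cross(sunDir, side1));
    const bottomAltitude = 1500.0; // illustrative values, in meters
    const topAltitude = 4000.0;
    return {
        bottom: average(sampleAtmosphere(side1, bottomAltitude), sampleAtmosphere(side2, bottomAltitude)),
        top: average(sampleAtmosphere(side1, topAltitude), sampleAtmosphere(side2, topAltitude))
    };
}
</code></pre>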
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikBvaBJTzn2wXCsklo3xe6d2kWCMPm_kU3aCNPdRRJ5BwDbtfQFMeN441D3pi-Tg69JQr7xX5LEoMjfoYRves3VcA3TiJQntY_6p__Lzv3ubBEorlAqZUr7YuEI8GIpWbTgAeBG_WsRFtX/s1600/ambient.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikBvaBJTzn2wXCsklo3xe6d2kWCMPm_kU3aCNPdRRJ5BwDbtfQFMeN441D3pi-Tg69JQr7xX5LEoMjfoYRves3VcA3TiJQntY_6p__Lzv3ubBEorlAqZUr7YuEI8GIpWbTgAeBG_WsRFtX/s640/ambient.png" width="640" height="190" /></a></div>
<p>
Since we already calculated an approximate absorption position for the reprojection, we use this position to change the absorption color based on the absorption altitude.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix5wZJWp7eHZAVk4-iBsMhHZy2OuvRrfbB9IEckJdOLbhbG7pwxkriwRHT6LAOiW0tYpHwAJB22c718M9os80JdZyPVoLQwW-GwLH99FI2OIkYNnggivN8lfm_smkaFyyRgUV8IEmcJ73E/s1600/dark-bottom.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix5wZJWp7eHZAVk4-iBsMhHZy2OuvRrfbB9IEckJdOLbhbG7pwxkriwRHT6LAOiW0tYpHwAJB22c718M9os80JdZyPVoLQwW-GwLH99FI2OIkYNnggivN8lfm_smkaFyyRgUV8IEmcJ73E/s640/dark-bottom.png" width="640" height="190" /></a></div>
<p>
Finally, we can reduce the alpha term by a constant amount to skew the absorption color towards the overlaid atmospheric color. By default this is disabled, but it can be interesting for creating some very hazy skyscapes. If this hack is used, it's important to protect the scattering highlight colors somewhat.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_2lkQX1nJRt4-nnfVEskWSev4UBe22Kw9_gsPzJx9ygIXZDzVZulQlbcqZqTvJltlVpNxcdZ8XSxKXv31xYxh5is3Cv-s53NJ7QpM5zEqKEAmZmDbMl5vZkrcShozSmBcGsyB2F_TpNox/s1600/blend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_2lkQX1nJRt4-nnfVEskWSev4UBe22Kw9_gsPzJx9ygIXZDzVZulQlbcqZqTvJltlVpNxcdZ8XSxKXv31xYxh5is3Cv-s53NJ7QpM5zEqKEAmZmDbMl5vZkrcShozSmBcGsyB2F_TpNox/s640/blend.png" width="640" height="190" /></a></div>
<p></p>
<h3>Animation</h3>
<p>
The animation of the clouds consists of a 2d wind vector, a vertical draft amount and a weather system.
</p>
<p>
We dynamically calculate a 512x512 weather map which consists of 5 octaves of animated Perlin noise. We remap the noise value differently for each RGB component. This weather map is then sampled during the raymarch to update the coverage, cloud type and wetness terms of the current cloud sample. Right now we resample this weather term for each ray step, but a possible optimization would be to sample the weather data at the start and end positions of the ray and interpolate these values at each step. All of the weather terms come in sunny/stormy pairs so that we can lerp them based on a probability-of-rain percentage. This allows the weather system to have storms coming in and out.
</p>
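<p>
The sunny/stormy blending amounts to something like this minimal sketch; the term names are illustrative.
</p>
<pre><code class="lang-js">function lerp(a, b, t) { return a + (b - a) * t; }

// Every weather term exists as a sunny/stormy pair. Blending the pairs by the
// current probability of rain lets storms fade in and out over time.
function blendWeatherTerms(sunny, stormy, rainProbability) {
    const blended = {};
    for (const key in sunny) {
        blended[key] = lerp(sunny[key], stormy[key], rainProbability);
    }
    return blended;
}

// Example: blendWeatherTerms({coverage: 0.4, cloudType: 0.2}, {coverage: 0.9, cloudType: 0.7}, 0.5)
</code></pre>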
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5dbianUxVc52QPBOleeV_qlXXBsjLGzjrORJSDkIvMCone47SDwMiRzhLCk2Vr2v8595T3bciMUxkydJqZt5YZgkR-amUBFUhn49oofN7C77K76xORC07fb5iRxlBgcUsMXpv8pf4ufNH/s1600/storm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5dbianUxVc52QPBOleeV_qlXXBsjLGzjrORJSDkIvMCone47SDwMiRzhLCk2Vr2v8595T3bciMUxkydJqZt5YZgkR-amUBFUhn49oofN7C77K76xORC07fb5iRxlBgcUsMXpv8pf4ufNH/s640/storm.png" width="640" height="316" /></a></div>
<p>
The wetness term is used to update a structure of terms which defines how the clouds look based on how much humidity they carry. This is a very expensive lerp which happens at every ray-march step and should be reduced to the bare minimum (the raymarch is instruction bound, so each removed lerp is a big win optimization-wise). But for the current exploratory phase it’s proving useful to be able to tweak a lot of these terms individually.
</p>
<h3>Future work</h3>
<p>
I think that as hardware gets more powerful, realtime cloudscape solutions will be used more and more. There is a ton of work left to do in this area. It is absolutely fascinating, challenging and beautiful. I am personally interested in improving the sense of scale the rendered clouds can have. To do so, I feel that the key is to reveal more and more of the high-frequency details that shape the clouds. I think smaller cloud features are key to putting the larger cloud features around them in perspective. But extracting higher-frequency details usually comes at the cost of increasing the sampling rate.
</p>
<p>
We also need to think of how to handle shadows and reflections. We've done some quick tests by updating a 512x512 opacity shadow map which seemed to work ok. Since it is not a view frustum dependent term we can absorb the cost of updating the map over a much longer period of time than 16 frames. Also, we could generate this map by taking fewer samples in a coarser representation of the clouds. The same approach would work for generating a global specular cubemap.
</p>
<p>
I hope we continue to see more awesome presentations at GDC and Siggraph in the coming years regarding this topic!
</p>
<h3>Links</h3>
<ul>
<li><a href="http://s2016.siggraph.org/courses/sessions/physically-based-shading-theory-and-practice">Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite</a></li>
<li><a href="http://gpupro.blogspot.ca/2015/01/gpu-pro-6-real-time-rendering-of.html">Real-time Rendering of Physics-Based Clouds using Precomputed Scattering</a></li>
<li><a href="http://advances.realtimerendering.com/s2015/index.html">Real-time Volumetric Cloudscapes of Horizon: Zero Dawn</a></li>
<li><a href="http://patapom.com/topics/Revision2013/Revision%202013%20-%20Real-time%20Volumetric%20Rendering%20Course%20Notes.pdf">Real-time Volumetric Rendering Course Notes</a></li>
<li><a href="http://www.markmark.net/clouds/index.html">Real-time Cloud Rendering</a></li>
<li><a href="http://reset-game.net/?p=284">In Praxis: Atmosphere</a></li>
<li><a href="http://nenes.eas.gatech.edu/Cloud/Clouds.pdf">Common Cloud Names, Shapes, and Altitudes</a></li>
<li><a href="https://www.shadertoy.com/view/XslGRr">"Clouds" by iq</a></li>
</ul>
Jphttp://www.blogger.com/profile/09637484103636420407noreply@blogger.com783tag:blogger.com,1999:blog-1994130783874175266.post-84866906932304541042016-04-01T22:46:00.003+02:002016-05-10T00:57:14.182+02:00The Poolroom<h1>
The Poolroom
</h1>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH-dxZmt8Dx_1uzCbncG3UHQQY_akoF5MGs2SHJKH0LDRkp3o1ZyRv1RaAG-zr2NuyKDgaPG4cZWl1M3KgAEL-BT4IopRmYAhrjXyRFqhy6yqqZqnrOwPW4NIV4UTVzxlFxilmuDP2j-qn/s1600/Poolroom_final.PNG" imageanchor="1">
<img border="0" width: 100%; height: auto; max-width: 100%; align="middle" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH-dxZmt8Dx_1uzCbncG3UHQQY_akoF5MGs2SHJKH0LDRkp3o1ZyRv1RaAG-zr2NuyKDgaPG4cZWl1M3KgAEL-BT4IopRmYAhrjXyRFqhy6yqqZqnrOwPW4NIV4UTVzxlFxilmuDP2j-qn/s640/Poolroom_final.PNG" />
</a>
<p style="clear: both;">
<i>Figure 1 : Poolroom Pool Table</i>
</p>
<p style="clear: both;">
The poolroom was my first attempt at creating a truly rich
environmental experience with Stingray.
Most architectural visualization scenes you see are
antiseptically clean and uncomfortably modern. I wanted to break away from that.
I wanted an environment I would feel at home with, not one that a movie star
would buy for sheer resale value to another movie star. I also wanted the
challenge of working with natural and texturally rich materials. Not white on
white, as is generally the case.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVlX4rmN3q4Wg3mg9rb44sT_5vapMxUWDof18kplIT83FHyKd4hNmj2AlPReQDQikt-O8ejjV8zbdEAFjVODkgxUSaUUlWwYjsqB7EaFhCR0_yiagIDyJ-lgoZvLEC-bVwY2y1OL6bVanz/s1600/Poolroom_final4.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVlX4rmN3q4Wg3mg9rb44sT_5vapMxUWDof18kplIT83FHyKd4hNmj2AlPReQDQikt-O8ejjV8zbdEAFjVODkgxUSaUUlWwYjsqB7EaFhCR0_yiagIDyJ-lgoZvLEC-bVwY2y1OL6bVanz/s640/Poolroom_final4.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 2 : Poolroom Clock</i>
</p>
<p style="clear: both;">
To this end, I started looking for cozy but luxurious spaces
on Google and eventually came across a nice reference photo I could work with. Warm,
rich woods, lots of games, a bar, and well... those all speak to me. For better
or worse, I felt this room was one I would personally feel comfortable in. So I
took on the challenge of re-creating that environment in 3D inside Stingray.
</p>
<h2 style="text-align: left;">
The challenges
</h2>
<hr />
<p style="clear: both;">
The poolroom gave me some major challenges. Some I knew
would be trouble from the start, but some I didn’t realize until I started
rendering lightmaps. Most of my difficulties came down to handling materials
properly.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8Q7tACbHbXD_ENWqiMze0NJNK6hQ_WIokgPm5Yg2iS4nJ5P1Qodl7MCtYUvwxfjyVCEdVRI_uU1rUNHy6HoSvMUlEnDq8GnTFpjCgLgfY4BeOhy3AXXNgGHARXot6VivQeaxQtiOllHf7/s1600/Poolroom_final3.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8Q7tACbHbXD_ENWqiMze0NJNK6hQ_WIokgPm5Yg2iS4nJ5P1Qodl7MCtYUvwxfjyVCEdVRI_uU1rUNHy6HoSvMUlEnDq8GnTFpjCgLgfY4BeOhy3AXXNgGHARXot6VivQeaxQtiOllHf7/s640/Poolroom_final3.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 3 : Poolroom Bar</i>
</p>
<h3>Coming to grips with physically based shaders</h3>
<p style="clear: both;">
In addition to being my first complete Arch-Viz scene in
Stingray, this was also my first real stab at using physically based shading
(PBS). Although physically based shading is similar in many regards to traditional
texturing, it has its own set of tricks and gotchas. I actually had to re-do
the scene's materials more than once as I learned the proper way to do things.
</p>
<p style="clear: both;">
For example, my scene was
predominantly dark woods. With dark woods, you really have to be sure you get
the albedo material in the correct luminosity range or you end up with
difficulties when you light the scene. In my first attempts, I found my light
being just eaten up by the darkness of the wood’s color map. I kept cranking up
the light intensities, but this would flood the scene and lead to harsh and
broken light bakes.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQTR8fl7j4EasoOtB9hAVa9DTTXGvXWu97W5-y6ALH4c5agfGyH-kn11o51jOUuqdMff8c-delxYZ2t8mujxT3jTIA0g0PvEC1YyRn0Q4N1cmPdW90p2EVRORZRFRm0WRbsXPTHRoKr620/s1600/Poolroom_final2.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQTR8fl7j4EasoOtB9hAVa9DTTXGvXWu97W5-y6ALH4c5agfGyH-kn11o51jOUuqdMff8c-delxYZ2t8mujxT3jTIA0g0PvEC1YyRn0Q4N1cmPdW90p2EVRORZRFRm0WRbsXPTHRoKr620/s640/Poolroom_final2.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 4 : Arcade Game</i>
</p>
<p style="clear: both;">
Eventually, once I understood the effect of the color map’s
luminosity and got the values in line, I started getting great results with
normalized light intensities. My lighting began responding favorably with deep,
rich lightmap bakes. When you get the physical properties of the materials
right, Stingray’s light baker is both fast and very good. But I can’t stress
enough: with PBS, you must ensure that your luminosity values are accurate.
</p>
<h3>Reference photo was HDR</h3>
<p style="clear: both;">
When I was building out the scene and trying to mimic the
reference photo’s lighting, I realized that the original image was made using
some high-dynamic range techniques. I couldn’t seem to get the same level of
exposure and visual detail in the shadowed areas of my scene.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqi3-BcJAtyfTPgNcx_exeFdTAyAGC042b-BRiqVAAFub6fThc2O8sQSOyvhlcmyH50dlZyL7aU7werqj76X5Ik3c0AcwEh4GD3tapRWV8LZz8NJ3AyzZZqQotisrnfR-psyZrRfdwFKzV/s1600/NoAmbients.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqi3-BcJAtyfTPgNcx_exeFdTAyAGC042b-BRiqVAAFub6fThc2O8sQSOyvhlcmyH50dlZyL7aU7werqj76X5Ik3c0AcwEh4GD3tapRWV8LZz8NJ3AyzZZqQotisrnfR-psyZrRfdwFKzV/s640/NoAmbients.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 5 : Before Ambient Fills</i>
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Z9GNIOCjEz5WysA4hrGt-hG1sa0mUR99U8UZ2n71R0nd7g9Ir-5ieXrhgRlwfTMgqwNbdYVJFs-tId8qdH6PwkIA0DL5SKiWWQORpT3wfT_Pni9Zsg4qkT1Ad1nXFl0yZyHjn8gxaz6M/s640/Ambient.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Z9GNIOCjEz5WysA4hrGt-hG1sa0mUR99U8UZ2n71R0nd7g9Ir-5ieXrhgRlwfTMgqwNbdYVJFs-tId8qdH6PwkIA0DL5SKiWWQORpT3wfT_Pni9Zsg4qkT1Ad1nXFl0yZyHjn8gxaz6M/s640/Ambient.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 6 : After Ambient Fills</i>
</p>
<p style="clear: both;">
Because of this, I had to do
some pretty fun trickery with my scene lighting. In the end, I got it by placing
some subtle, non-shadow casting lights in key areas to bring up the brightness
a little in those areas.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_zG8u9aZjX2lskFbaAX3DtFfpqGI1RRnu77E2vo8yG_dNfMlliq8ECmq-J0upGd38vjPNpQX2j15Lcj3O0tyixUvCBoOmCxAhRJsJj7usoDnIJMLCtcIqssrfHYPB6EsyHzxaj3WGQ8q7/s1600/livedin.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_zG8u9aZjX2lskFbaAX3DtFfpqGI1RRnu77E2vo8yG_dNfMlliq8ECmq-J0upGd38vjPNpQX2j15Lcj3O0tyixUvCBoOmCxAhRJsJj7usoDnIJMLCtcIqssrfHYPB6EsyHzxaj3WGQ8q7/s640/livedin.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 7 : Soft Controlled Lighting</i>
</p>
<p style="clear: both;">
All in all, the scene took a lot of lighting work to get
just right. I have to say that I was very happy with how closely I was able to
match the lighting, given that the original photo was HDR.
</p>
<h3>Lived-in but not dirty</h3>
<p style="clear: both;">
The last big challenge was also related to materials. I had
to find that fine balance of a room that is clean and tidy but also obviously
lived-in. So often I find Arch-Viz work feels unnaturally smooth and clean, which
can destroy the believability of the space. I really wanted my scene to
break through the uncanny valley and feel real.
</p>
<p style="clear: both;">
I handled this mostly by creating some very simple grunge
maps, and applying them to the roughness maps using a simple custom shader.
This was easy to build in Stingray’s node-based shader graph:
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgGrus-5iPnzdvGzUBhuS10gSf-MJrORhh9E95gbdutm3vUohzo0bj-EQrLVnJGC8ocH7gaF7IT-SEdqZ1oFOwbVMYhHFvhX9mghuT5U4t6sUxC5hRSY7-K4hjLCRRIG4SERqun227vb_9/s1600/simplerma_shader_withgrunge.PNG" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgGrus-5iPnzdvGzUBhuS10gSf-MJrORhh9E95gbdutm3vUohzo0bj-EQrLVnJGC8ocH7gaF7IT-SEdqZ1oFOwbVMYhHFvhX9mghuT5U4t6sUxC5hRSY7-K4hjLCRRIG4SERqun227vb_9/s640/simplerma_shader_withgrunge.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 8 : Simple RMA style shader with tiling and grunge map with adjustment.</i>
</p>
<p style="clear: both;">
I have this shader set up so I can control the tiling of the
color map, normals and other textures. The grunge map, on the other hand, is
sampled using UV coordinates from the lightmap channel. This helps to hide the
tiling over large areas like the walls, because the grunge value that gets multiplied
into the roughness is always different each time the other textures repeat.
</p>
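<p style="clear: both;">
In code form, the graph boils down to something like the small sketch below (a hedged JavaScript transcription of the idea; the adjustment curve is an assumption):
</p>
<pre><code class="lang-js">// The base textures repeat with a small tiling factor, but the grunge value is
// fetched with the lightmap UVs, which never repeat across a surface, so the
// roughness is broken up differently on every tile.
function finalRoughness(baseRoughness, grungeSample, grungeStrength) {
    const grunge = 1.0 - grungeStrength * (1.0 - grungeSample); // fades toward 1 as strength goes to 0
    return Math.min(Math.max(baseRoughness * grunge, 0.0), 1.0);
}
</code></pre>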
<p style="clear: both;">
Balancing the grunge properly was the biggest challenge
here, but in the end, some still shots even get me doing a double-take. When
that happens, I know I’m doing well. I also posted progress along the way on my
Facebook page — when I had friends saying, “whoa, when can I come visit?” I
knew I was nailing it.
</p>
<h2>
3D modeling</h2>
<hr />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJmGfqNCc_ZT6GWSlPL7WwOYZR4DYhNIFYX1pFMpcqkVsskKK3wJCJnVD1XEc5vCNaG8_uhzwCA2KOPs9r_NEy_7r3H_ckrgdzKAkZ8-vxY99LeH4bgJvpVz0S8s3zrEL_O-ZdYIm-r61b/s1600/recordplayer.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJmGfqNCc_ZT6GWSlPL7WwOYZR4DYhNIFYX1pFMpcqkVsskKK3wJCJnVD1XEc5vCNaG8_uhzwCA2KOPs9r_NEy_7r3H_ckrgdzKAkZ8-vxY99LeH4bgJvpVz0S8s3zrEL_O-ZdYIm-r61b/s640/recordplayer.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 9 : Record Player Model in Maya LT</i>
</p>
<p style="clear: both;">
I don’t have much that’s
special to say about the 3D modeling process. I simply modeled all my assets
the same way anyone would. Attention to detail is really the trick, and making
sure that I created hand-made lightmap UVs for every object was critical to
ensure the best light baking. Otherwise it was just simple modeling.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhex6Qyx49dke07Tt4-3a_8g2MHl6bJ6mwKppG3iNJyaRuh7k_qh0X1D1-Vkx1Z-1r0pV90w9kjcoR8MCu8KO96Vd7k0h0trGiNjVTnBZ4XUH3wEDd_P_in_88b0op_vXHX-TgLx2kM9G9j/s1600/Capture.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhex6Qyx49dke07Tt4-3a_8g2MHl6bJ6mwKppG3iNJyaRuh7k_qh0X1D1-Vkx1Z-1r0pV90w9kjcoR8MCu8KO96Vd7k0h0trGiNjVTnBZ4XUH3wEDd_P_in_88b0op_vXHX-TgLx2kM9G9j/s640/Capture.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 10 : Poolroom Model in MayaLT</i>
</p>
<p style="clear: both;">
One thing to note, however, is that I only used 3D tools
that came with the Stingray package, except for Substance Designer and a little
Photoshop. I did the entire scene’s modeling in MayaLT. Sometimes people think
cheap is not good, but I believe this proves otherwise. MayaLT is incredible. I
am super happy with the results and speed at which you can work with it. Best
of all, it’s part of the package, so no additional costs.
</p>
<h2>Material design</h2>
<hr />
<p style="clear: both;">
Laying out the materials in
the scene was pretty straightforward for the most part. At one point, I
experimented with using more species of wood, but the different parts of the room
started to feel disconnected. I started removing materials from my list, and
eventually when I ended up with only a small handful the room came together as
you see it.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBFy_bRAXUNLmAfOlWLfdb7_rE8x05tsXA6p1gOZZrrQSlC9v3jFsfIiffnXeUaHoADBuaOYMhY6hR01cRXGE9ybNTBV1-nLoR-Uy4cWJSHMgPzLrSkbsPm1sHctFceVg9o85yAZZtucyQ/s1600/recordplayer3.PNG" imageanchor="1"><img border="0" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBFy_bRAXUNLmAfOlWLfdb7_rE8x05tsXA6p1gOZZrrQSlC9v3jFsfIiffnXeUaHoADBuaOYMhY6hR01cRXGE9ybNTBV1-nLoR-Uy4cWJSHMgPzLrSkbsPm1sHctFceVg9o85yAZZtucyQ/s640/recordplayer3.PNG" width="640" align="middle" /></a>
<p style="clear: both;">
<i>Figure 11 : Record Player Material Design in Substance</i>
</p>
<p style="clear: both;">
I guess something else I should mention is performance
shaders. Stingray comes with a great, flexible standard shader, but I wanted to
eke out every little bit of performance I could on this scene while keeping the
quality very high. Without much trouble, I created a library of my own purpose-built
shaders (like the one mentioned earlier). I used these for various tasks. Simple
colors, RMA (roughness-metallic-ambient occlusion), RMA-tiling shaders and a
few others came together really quickly. From this handful of shaders, I was
able to increase performance while simplifying my design process. I find it
comforting how Stingray deals with shaders… it is just very easy to iterate and
save a version. Much better usability than other systems I have tried.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguoRnTgzjXLi_vbZaHuRhE9ayOhQr7V6B_-C9E5gajHZAj1ElA_Cvdu57zCGk6Dj5YVSaNnbSjHWDOazrfYH3l-f6DCBSpzBW5TiSwMz6s6CsWGGAEV6-y2x6XPsMO0hzx0vvAfShWnJQc/s1600/shadergraphs.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguoRnTgzjXLi_vbZaHuRhE9ayOhQr7V6B_-C9E5gajHZAj1ElA_Cvdu57zCGk6Dj5YVSaNnbSjHWDOazrfYH3l-f6DCBSpzBW5TiSwMz6s6CsWGGAEV6-y2x6XPsMO0hzx0vvAfShWnJQc/s640/shadergraphs.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 12 : Shader Library</i>
</p>
<h2>Fun stuff</h2>
<hr />
<p style="clear: both;">
Well, most game dev is hard
work; the fun is at the end when you finally get to relax and see your efforts
pay off. But there were definitely some really fun parts of making the
poolroom.
</p>
<p style="clear: both;">
One was the clock. It’s a
small, almost easter-egg kind of thing, but I programmed the clock fully. Meaning,
its hands move, the pendulum swings, and it also rings the hour. So if you are
exploring the poolroom and it happens to be when the hour changes in your
system clock, the clock in the game rings the hour for you. So two o’clock
rings two times, four o’clock rings four
times, etc. The half-hour always strikes once. I modeled the clock after one
that my father gave me, so I put some extra love into it. It is basically
exactly the clock that hangs in my living room.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhap8ZqWwUS1Vv-aLM1ak_KOSGvFweKfaTrtgfmhkTEwWtq-GbhIPBMjgsMZp791eQZvn3gqjC4gxN1G6ALEVYe0Nj0HF-BWL26xsQ1yaJvorngGilDYv1uFk_QJlSSLCSLTxxwHWn9ulq/s1600/clock.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhap8ZqWwUS1Vv-aLM1ak_KOSGvFweKfaTrtgfmhkTEwWtq-GbhIPBMjgsMZp791eQZvn3gqjC4gxN1G6ALEVYe0Nj0HF-BWL26xsQ1yaJvorngGilDYv1uFk_QJlSSLCSLTxxwHWn9ulq/s640/clock.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 13 : Clock Model in MayaLT</i>
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYk__PdRpwiHqzKI_yeTyBZUkaO5XZfASj95dqAh4zry7ijwKjgLB-2CTY8QQI0t2pW81wBlgXQtdx3KtZ_kMhO1psNPxjvXQGL4lRrRc38150Kb2pJtbU0ZIIOq0BR2MDQ4vk5eW8llVc/s1600/clock2.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYk__PdRpwiHqzKI_yeTyBZUkaO5XZfASj95dqAh4zry7ijwKjgLB-2CTY8QQI0t2pW81wBlgXQtdx3KtZ_kMhO1psNPxjvXQGL4lRrRc38150Kb2pJtbU0ZIIOq0BR2MDQ4vk5eW8llVc/s640/clock2.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 14 : Clock Model in Stingray</i>
</p>
<p style="clear: both;">
I also gave the record player
some extra attention, because my good friend Mathew Harwood was kind enough to
do all the audio for the project. I felt the music really set the scene, and he
even worked on it over my Twitch stream so we could get feedback from some
people who were watching. So yeah, press <b>+</b>
or <b>-</b> in the game to start and stop
the record player, complete with animated tone arm. Nothing super crazy, just a
nice little touch.
</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoqFJGvWVUzZatcr3xJ1Kq2vLLZxUI6IqE68ukYgT4ojIctJjllc7Jex0v6OmZBxgFzMzthFRgoSs3qBqDm-Uvcg_6rzb5mVUYL69GBxaCRpHlNb_bjHG2P4VDFCwRlCRV3X1owwa_SCZR/s1600/recordplayer3.PNG" imageanchor="1">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoqFJGvWVUzZatcr3xJ1Kq2vLLZxUI6IqE68ukYgT4ojIctJjllc7Jex0v6OmZBxgFzMzthFRgoSs3qBqDm-Uvcg_6rzb5mVUYL69GBxaCRpHlNb_bjHG2P4VDFCwRlCRV3X1owwa_SCZR/s640/recordplayer3.PNG" width: 100%; height: auto; max-width: 100%; align="middle" /></a>
<p style="clear: both;">
<i>Figure 15 : Record Player in Stingray</i>
</p>
<h2>Community effort</h2>
<hr />
<p style="clear: both;">
One thing I found really neat about this project was that I
streamed the entire creation process on my Twitch channel. I have never
streamed much before this project, but it made the process much more fun. I had
people to talk with, and often my viewers were helpful to me in suggesting
ideas and noticing things I had not noticed. It was very collaborative and a
great learning exercise for me and for my viewers. We got to learn from each
other, which is the dream!
</p>
<p style="clear: both;">
For example, the record player likely would not have been
done to the level it was if one of my viewers had not pushed me to make a really
detailed player. Because of this push, it ended up being a focus of the level,
and even has some animation and basic controls a user can interact with.
</p>
<p style="clear: both;">
Stop by my Twitch channel sometime at <a href="http://twitch.tv/paulkind3d" target="_blank">twitch.tv/paulkind3d </a>and
say hi, I’d love to meet you.
</p>Anonymoushttp://www.blogger.com/profile/02745555817482590720noreply@blogger.com134tag:blogger.com,1999:blog-1994130783874175266.post-77335192477528183332016-01-31T22:02:00.000+01:002016-08-09T21:27:49.948+02:00Hot Reloadable JavaScript, Batman!<p>JavaScript is my new favorite prototyping language. Not because the language itself is fantastic. I mean, it's not too bad. It actually has a lot of similarity to Lua, but it's hidden under a heavy layer of <a href="https://www.destroyallsoftware.com/talks/wat">WAT!?</a>, like:</p>
<ul>
<li>Browser incompatibilities!?</li>
<li>Semi-colons are optional, but you "should" put them there anyway!?</li>
<li>Propagation of <code>null</code>, <code>undefined</code> and <code>NaN</code> until they cause an error very far from where they originated!?</li>
<li>Weird type conversions!? <code>"0" == false</code>!?</li>
<li>Every function is also an object constructor!? <code>x = new add(5,7)</code>!?</li>
<li>Every function is also a method!?</li>
<li>You must check everything with <code>hasOwnProperty()</code> when iterating over objects!?</li>
</ul>
<p>But since Lua is a work of genius and beauty, being a half-assed version of Lua is still pretty good. You could do worse, as languages go.</p>
<p>And JavaScript is actually getting better. Browser compatibility is improving; automatic updates are a big factor in this. And if your goal is just to prototype and play, as opposed to building robust web applications, you can just pick your favorite browser, go with that and not worry about compatibility. The ES6 standard also adds a lot of nice little improvements, like <code>let</code>, <code>const</code>, <code>class</code>, lexically scoped <code>this</code> (for arrow functions), etc.</p>
<p>But more than the language, the nice thing about JavaScript is that it comes with a lot of the things you need to do interesting stuff -- a user interface, 2D and 3D drawing, a debugger, a console REPL, etc. And it's ubiquitous -- everybody has a web browser. If you do something interesting and want to show it to someone else, it is as easy as sending a link.</p>
<p>OK, so it doesn't have file system access (unless you run it through <a href="https://nodejs.org/en/">node.js</a>), but who cares? What's so fun about reading and writing files anyway? The 60's called, they want their programming textbooks back!</p>
<p>I mean in JavaScript I can quickly whip up a little demo scene, add some UI controls and then share it with a friend. That's more exciting. I'm sure someone will tell me that I can do that in Ruby too. I'm sure I could, if I found the right gems to install, picked what UI library I wanted to use and learned how to use that, found some suitable bundling tools that could package it up in an executable, preferably cross-platform. But I would probably run into some annoying and confusing error along the way and just give up.</p>
<p>With increasing age I have less and less patience for the <em>sysadmin</em> part of programming. Installing libraries. Making sure that the versions work together. Converting a <code>configure.sh</code> script to something that works with our build system. Solving <code>PATH</code> conflicts between multiple installed <code>cygwin</code> and <code>mingw</code> based toolchains. Learning the intricacies of some weird framework that will be gone in 18 months anyway. There is enough of that stuff that I <em>have to</em> deal with, just to do my job. I don't need any more. When I can avoid it, I do.</p>
<p>One thing I've noticed since I started to prototype in JavaScript is that since drawing and UI work is so simple to do, I've started to use programming for things that I previously would have done in other ways. For example, I no longer do graphs like this in a drawing program:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6DwqK7xM6S6weZ50ubB1GR_NCg9V2vR-1-BsqDbNWrvxQHh3VAggnw98bxOFmAPqBjrQnfPOlHilykFnkXDcIaU_XvUCclGEO7hjl94hVeGlnyjaVhd-Zl6IcMO8DP8Xs5AoCwoHKyq0/s1600/javascript-hot-reloading-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6DwqK7xM6S6weZ50ubB1GR_NCg9V2vR-1-BsqDbNWrvxQHh3VAggnw98bxOFmAPqBjrQnfPOlHilykFnkXDcIaU_XvUCclGEO7hjl94hVeGlnyjaVhd-Zl6IcMO8DP8Xs5AoCwoHKyq0/s1600/javascript-hot-reloading-1.png" width="600" /></a></div>
<p>Instead I write a little piece of JavaScript code that draws the graph on an HTML canvas (code here: <a href="https://jsbin.com/xurego/edit?js,output">pipeline.js</a>).</p>
<p>JavaScript canvas drawing can replace not only traditional drawing programs, but also Visio (for process diagrams), Excel (graphs and charts), Photoshop and <a href="http://graphviz.org">Graphviz</a>. And it can do more advanced forms of visualization and styling that are not possible in any of these programs.</p>
<p>For simple graphs, you could ask if this really saves any time in the long run, as compared to using a regular drawing program. My answer is: I don't know and I don't care. I think it is more important to do something interesting and fun with time than to save it. And for me, using drawing programs stopped being fun some time around when <a href="https://en.wikipedia.org/wiki/AppleWorks">ClarisWorks</a> was discontinued. If you ask me, so called "productivity software" has just become less and less productive since then. These days, I can't open a Word document without feeling my pulse racing. You can't even print the damned things without clicking through a security warning. Software PTSD. Programmers, we should be ashamed of ourselves. Thank god for <a href="https://daringfireball.net/projects/markdown/">Markdown</a>.</p>
<p>Another thing I've stopped using is slide show software. That was never any fun either. Keynote was at least tolerable, which is more than you can say about Powerpoint. Now I just use <a href="http://remarkjs.com/#1">Remark.js</a> instead and write my slides directly in HTML. I'm much happier and I've lost 10 pounds! Thank you, JavaScript!</p>
<p>But I think for my next slide deck, I'll write it directly in JavaScript instead of using Remark. That's more fun! Frameworks? I don't need no stinking frameworks! Then I can also finally solve the issue of auto-adapting between 16:9 and 4:3 so I don't have to letterbox my entire presentation when someone wants me to run it on a 1995 projector. Seriously, people!</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcdv5fdzrR3g1B0RGC5Xemy43yJA2nLSL34xQOv40Ijfnh91xhTVivwFckGHmapgXkbVbtVJr9UgWu7A7IthYsLaAi-UD3ORE9CyrHXj4OkvF-Zdey5jZfLDgAVuSDNWzf45V6_itWmMM/s1600/svga-connector.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcdv5fdzrR3g1B0RGC5Xemy43yJA2nLSL34xQOv40Ijfnh91xhTVivwFckGHmapgXkbVbtVJr9UgWu7A7IthYsLaAi-UD3ORE9CyrHXj4OkvF-Zdey5jZfLDgAVuSDNWzf45V6_itWmMM/s1600/svga-connector.png" /></a></div>
<p><em>This is not the connector you are looking for!</em></p>
<p>And I can put HTML 5 videos directly in my presentation, so I don't have to shut down my slide deck to open a video in a separate program. Have you noticed that this is something that almost every speaker does at big conferences? Because apparently they haven't succeeded in getting their million dollar <em>presentation</em> software to reliably <em>present</em> a video file! Software! Everything is broken!</p>
<p>Anyhoo... to get back off topic, one thing that surprised me a bit about JavaScript is that there doesn't seem to be a lot of interest in hot-reloading workflows. Online there is <a href="https://jsbin.com/?html,output">JSBin</a>, which is great, but not really practical for writing bigger things. If you start googling for something you can use offline, with your own favorite text editor, you don't find that much. This is a bit surprising, since JavaScript is a dynamic language -- hot reloading should be a hot topic.</p>
<p>There are some node modules that can do this, like <a href="https://www.npmjs.com/package/budo">budo</a>. But I'd like something that is small and hackable, that works instantly and doesn't require installing a bunch of frameworks. By now, you know how I feel about that.</p>
<p>After some experimentation I found that adding a script node dynamically to the DOM will cause the script to be evaluated. What is a bit surprising is that you can remove the script node immediately afterwards and everything will still work. The code will still run and update the JavaScript environment. Again, since this is only for my personal use I've not tested it on Internet Explorer 3.0, only on the browsers I play with on a daily basis, Safari and <a href="https://www.google.com/chrome/browser/canary.html">Chrome Canary</a>.</p>
<p>What this means is that we can write a <code>require</code> function for JavaScript like this:</p>
<pre><code class="lang-js">function require(s)
{
    var script = document.createElement("script");
    script.src = s + "?" + performance.now();
    script.type = "text/javascript";

    var head = document.getElementsByTagName("head")[0];
    head.appendChild(script);
    head.removeChild(script);
}
</code></pre>
<p>We can use this to load script files, which is kind of nice. It means we don't need a lot of <code>&lt;script&gt;</code> tags in the HTML file. We can just put one there for our main script, <code>index.js</code>, and then require in the other scripts we need from there.</p>
<p>Also note the deft use of <code>+ "?" + performance.now()</code> to prevent the browser from caching the script files. That becomes important when we want to reload them.</p>
<p>Since, for dynamic languages, reloading a script is the same thing as running it, we can get automatic reloads by just calling <code>require</code> on our own script from a timer:</p>
<pre><code class="lang-javascript">function reload()
{
require("index.js");
render();
}
if (!window.has_reload) {
window.has_reload = true;
window.setInterval(reload, 250);
}
</code></pre>
<p>This reloads the script every 250 ms.</p>
<p>I use the <code>has_reload</code> flag on the window to ensure that I set the reload timer only the first time the file is run. Otherwise we would create more and more reload timers with every reload, which in turn would cause even more reloads. If I had enough power in my laptop, the resulting chain reaction would vaporize the universe in under three minutes. Sadly, since I don't, all that will happen is that my fans will spin up a bit. Damnit, I need more power!</p>
<p>After each <code>reload()</code> I call my <code>render()</code> function to recreate the DOM, redraw the canvas, etc. with the new code. That function might look something like this:</p>
<pre><code class="lang-js">function render()
{
var body = document.getElementsByTagName("body")[0];
while (body.hasChildNodes()) {
body.removeChild(body.lastChild);
}
var canvas = document.createElement("canvas");
canvas.width = 650;
canvas.height = 530;
var ctx = canvas.getContext("2d");
drawGraph(ctx);
body.appendChild(canvas);
}
</code></pre>
<p>Note that I start by removing all the DOM elements under <code><body></code>. Otherwise each reload would create more and more content. That's still linear growth, so it is better than the exponential chain reaction you can get from the reload timer. But linear growth of the DOM is still pretty bad.</p>
<p>You might think that reloading all the scripts and redrawing the DOM every 250 ms would create a horrible flickering display. But so far, for my little play projects, everything works smoothly in both Safari and Chrome. Glad to see that they are double buffering properly.</p>
<p>If you do run into problems with flickering you could try using the <a href="http://tonyfreed.com/blog/what_is_virtual_dom">Virtual DOM</a> method that is so popular with JavaScript UI frameworks these days. But try it without that first and see if you really need it, because ugh frameworks, amirite?</p>
<p>Obviously it would be better to reload only when the files actually change and not every 250 ms. But to do that you would need to do something like adding a file system watcher connected to a web socket that could send a message when a reload was needed. Things would start to get complicated, and I like it simple. So far, this works well enough for my purposes.</p>
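<p>For the curious, here is a rough sketch of what that could look like. This assumes Node.js and the third-party <code>ws</code> package on the server side; the port number and the watched file name are arbitrary:</p>
<pre><code class="lang-js">// watch.js -- run with: node watch.js
// Pushes a message to connected browsers whenever index.js changes.
var fs = require("fs");
var WebSocket = require("ws");

var server = new WebSocket.Server({port: 35729});
fs.watch("index.js", function () {
    // fs.watch can fire more than once per save, but for a hot-reload
    // sketch the extra reloads are harmless.
    server.clients.forEach(function (client) {
        client.send("reload");
    });
});

// In the browser, the timer is replaced by a socket listener:
// var socket = new WebSocket("ws://localhost:35729");
// socket.onmessage = function () { require("index.js"); render(); };
</code></pre>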
<p>As a middle ground you could have a small bootstrap script for doing the reload:</p>
<pre><code class="lang-js">window.version = 23;
if (window.version != window.last_version) {
window.last_version = window.version;
reload();
}
</code></pre>
<p>You would reload this small bootstrap script every 250 ms. But it would only trigger a reload of the other scripts and a re-render when you change the version number. This avoids the reload spamming, but it also removes the immediate feedback loop -- change something and see the effect immediately -- which I think is <a href="https://vimeo.com/36579366">really important</a>.</p>
<p>As always with script reloads, you must be a bit careful with how you write your scripts to ensure they work nicely with the reload feature. For example, if you write:</p>
<pre><code class="lang-js">class Rect
{
...
};
</code></pre>
<p>This works well in Safari, but Chrome Canary complains on the second reload that you are redefining a class. You can get around that by instead writing:</p>
<pre><code class="lang-js">var Rect = class {
</code></pre>
<p>Now Chrome doesn't complain anymore, because obviously you are allowed to change the content of a variable.</p>
<p>To preserve state across reloads, I just put all the state in a global variable on the window:</p>
<pre><code class="lang-js">window.state = window.state || {}
</code></pre>
<p>The first time this is run, we get an empty state object, but on future reloads we keep the old state. The <code>render()</code> function uses the state to determine what to draw. For example, for a slide deck I would put the current slide number in the <code>state</code>, so that we stay on the same page after a reload.</p>
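<p>For the slide deck case, that might look something like this -- <code>drawSlide</code> is a made-up placeholder for whatever the real drawing code is:</p>
<pre><code class="lang-js">window.state = window.state || {};
window.state.slide = window.state.slide || 0;

// Reassigning the handler on every reload is fine -- it just overwrites
// the previous one.
document.onkeydown = function (e) {
    if (e.key === "ArrowRight")
        window.state.slide++;
    if (e.key === "ArrowLeft" && window.state.slide > 0)
        window.state.slide--;
    render();
};

// Inside render(), the current page comes from the state:
// drawSlide(ctx, window.state.slide);
</code></pre>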
<p>Here is a GIF of the hot reloading in action. Note that the browser view changes as soon as I save the file in Atom:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9fTETns2xTUHwU6IfRSzEDayC0yh6anZA-d0lCDz1qG-VfEPZ-RV7qwb7Vni7MAofz9hHkDwgR7xkiAh9w9ph-EX6e6j-AAxfipGs7YEB7nw7T7FiH6841t5SwbUtoP7s3Jv8MBLoPd8/s1600/hot-reload.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9fTETns2xTUHwU6IfRSzEDayC0yh6anZA-d0lCDz1qG-VfEPZ-RV7qwb7Vni7MAofz9hHkDwgR7xkiAh9w9ph-EX6e6j-AAxfipGs7YEB7nw7T7FiH6841t5SwbUtoP7s3Jv8MBLoPd8/s1600/hot-reload.gif" width="600" /></a></div>
<p>(No psychoactive substances were consumed during the production of this blog post. Except caffeine. Maybe I should stop drinking coffee?)</p>
Niklashttp://www.blogger.com/profile/10055379994557504977noreply@blogger.com31tag:blogger.com,1999:blog-1994130783874175266.post-85011142584535208252016-01-29T23:00:00.002+01:002016-05-23T18:26:17.984+02:00Stingray Support -- Hello, I Am Someone Who Can Help<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="MsoNormal">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizVPes33314ez59IPOPG95TxxZKn64gmLrIsIDv1cL6JlFkUJmRwjwIdvbEg8J-NwXkTulJ55m8PzzLRM1nE_9Dj5HHmuYQ9a-aJjZBe_yme0JR6iUwWi52ZqmCmRgd_NKipM4LNXhaog/s1600/canhelp.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizVPes33314ez59IPOPG95TxxZKn64gmLrIsIDv1cL6JlFkUJmRwjwIdvbEg8J-NwXkTulJ55m8PzzLRM1nE_9Dj5HHmuYQ9a-aJjZBe_yme0JR6iUwWi52ZqmCmRgd_NKipM4LNXhaog/s200/canhelp.png" width="200" /></a></div>
<h2>Hello, I am someone who can help.</h2>
<p>Here at the Autodesk Games team, we pride ourselves on supporting users of the Stingray game engine in the best ways possible -- so to start, let's cover where you can find information!</p>
<h3>General Information Here!</h3>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Games Solutions Learning Channel on YouTube:</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">This is a series of videos about Stingray by the Autodesk Learning
Team. They'll be updating the playlist with new videos over time. They're
pretty responsive to community requests on the videos, so feel free to log in
and comment if there's something specific you'd like to see.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">Check out the<span class="apple-converted-space"> </span></span><a href="https://www.youtube.com/user/autodeskgameshowtos/playlists?view=50&sort=dd&shelf_id=7" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">playlist on YouTube</span></a><span style="font-family: , serif; font-size: 10.5pt;">.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Autodesk Stingray Quick Start Series, with Josh from Digital
Tutors:</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">We enlisted the help from Digital Tutors to set up a video series
that runs through the major sections of Stingray so you can get up and running
quickly.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">Check out the<span class="apple-converted-space"> </span></span><a href="https://www.youtube.com/playlist?list=PL_6ApchKwjN9mPXtRhL5za_KINjqbWaGV" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">playlist on YouTube</span></a><span style="font-family: , serif; font-size: 10.5pt;">.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Autodesk Make Games learning site:</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">This is a site that we've made for people who are brand new to
making games. If you've never made a game before, or never touched complex 3D
tools or a game engine, this is a good place to start. We run you through
Concept Art and Design phases, 3D content creation, and then using a game
engine. We've also made a bunch of assets available to help brand new game
makers get started.<o:p></o:p></span></div>
<div style="background: white; margin-bottom: .0001pt; margin: 0in; mso-line-height-alt: 11.25pt;">
<a href="http://www.autodesk.com/MakeGames" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">www.autodesk.com/MakeGames</span></a><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Creative Market:</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">The Creative Market is a storefront where game makers can buy or
sell 3D content. We've got a page set up just for Stingray, and it includes
some free assets to help new game makers get started.<o:p></o:p></span></div>
<div style="background: white; margin-bottom: .0001pt; margin: 0in; mso-line-height-alt: 11.25pt;">
<a href="https://creativemarket.com/apps/stingray" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">https://creativemarket.com/apps/stingray</span></a><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Stingray Online Help</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">Here you'll find more getting started movies, how-to topics, and
references for the scripting and visual programming interfaces. We're working
hard to get you all the info you need, and we're really excited to hear your
feedback.<o:p></o:p></span></div>
<div style="background: white; margin-bottom: .0001pt; margin: 0in; mso-line-height-alt: 11.25pt;">
<a href="http://help.autodesk.com/view/Stingray/ENU/" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">http://help.autodesk.com/view/Stingray/ENU/</span></a><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<strong><u><span style="font-family: , serif; font-size: 10.5pt;">Forum Support Tutorial Channel on YouTube:</span></u></strong><span style="font-family: , serif; font-size: 10.5pt;"><o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">This is a series of videos that answers recurring forums
questions by the Autodesk Support Team. They'll be updating the
playlist with new videos over time. They're pretty responsive to community
requests on the videos, so feel free to log in and comment if there's something
specific you'd like to see.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<span style="font-family: , serif; font-size: 10.5pt;">Check out the<span class="apple-converted-space"> </span></span><a href="https://www.youtube.com/channel/UC0fIe6XV1PjilADTei9JMOA" target="_self"><span style="color: #1858a8; font-family: "frutigernextw04-regular" , "serif"; font-size: 10.5pt; text-decoration: none;">playlist on YouTube</span></a><span style="font-family: , serif; font-size: 10.5pt;">.<o:p></o:p></span></div>
<div style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div class="MsoNormal">
You should also visit the Stingray Public Forums <a href="http://forums.autodesk.com/t5/stingray/bd-p/800">here</a>, as there is a
growing wealth of information and knowledge to search from.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwRlTVFVIHyfsFkkWQhADOGDRdJO1jEyqllbmxVhP7l_jl7pp9p0-_p9Bm5GXdE_CAlj0ysrm7WzqEheWRPJlXYFd2Y_JdkdT1xUeH9ZaeQQCw7SUUmzjOu-xMX-db0TL4eoKDA4_F4I/s1600/help-me-help-you.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwRlTVFVIHyfsFkkWQhADOGDRdJO1jEyqllbmxVhP7l_jl7pp9p0-_p9Bm5GXdE_CAlj0ysrm7WzqEheWRPJlXYFd2Y_JdkdT1xUeH9ZaeQQCw7SUUmzjOu-xMX-db0TL4eoKDA4_F4I/s320/help-me-help-you.gif" width="320" /></a></div>
<h3>Let's Get Started</h3>
<div class="MsoNormal">
Let’s get started. Hi, I’m Dan, nice to meet you. I am super
happy to help you with any of your Stingray problems, issues, needs or general
questions! However, I’m going to need to ask you to HELP ME, HELP YOU!!<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p><br /></o:p></div>
<div class="MsoNormal">
<o:p><br /></o:p></div>
<div class="MsoNormal">
<o:p><br /></o:p></div>
<div class="MsoNormal">
<o:p><br /></o:p></div>
<div class="MsoNormal">
It’s not always apparent when a user asks for help just
exactly what that user is asking for. That being the case, here is some useful
information on how to ask for help and what to provide us so that we can help
you better and more quickly!</div>
<div class="MsoNormal">
</div>
<ul>
<li>Make sure you are very clear on what your specific problem is and describe it as best you can.
<ul><li>Include pictures or screen shots you may have.</li></ul></li>
<li>Tell us how you came to have this problem.
<ul><li>Give us detailed reproduction steps on how to arrive at the issue you are seeing.</li></ul></li>
<li>Attach your log files!
<ul><li>They can be found here: C:\Users\"USERNAME"\AppData\Local\Autodesk\Stingray\Logs</li></ul></li>
<li>Attach any file that is a specific problem (zip it so it attaches to the forum post).</li>
<li>Make sure to let us know your system specifications.</li>
<li>Make sure to let us know what Stingray engine version you are using.</li>
</ul>
<br />
<div class="MsoNormal">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1OT4UJI2Dc-EY1bHIZQHBH8CTg4awe6HPklrUOQUUKahnJTe0BX586cCSogSZZUkRn5Ge9eGJvIzeadkDx2ARHX_mklIfvPbqfZLfRzGI9R7eUDI8hdsxnbMKPvm1fzoHQQW78Avmvts/s1600/translate-large.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1OT4UJI2Dc-EY1bHIZQHBH8CTg4awe6HPklrUOQUUKahnJTe0BX586cCSogSZZUkRn5Ge9eGJvIzeadkDx2ARHX_mklIfvPbqfZLfRzGI9R7eUDI8hdsxnbMKPvm1fzoHQQW78Avmvts/s200/translate-large.jpg" width="200" /></a></div>
<div class="MsoListParagraphCxSpFirst" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<!--[if !supportLists]--><o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="margin-left: 1.0in; mso-add-space: auto; mso-list: l0 level2 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="margin-left: 1.0in; mso-add-space: auto; mso-list: l0 level2 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="margin-left: 1.0in; mso-add-space: auto; mso-list: l0 level2 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpMiddle" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpLast" style="mso-list: l0 level1 lfo1; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
On another note … traduire, traduzir, 翻<span style="font-family: "mingliu"; mso-bidi-font-family: MingLiU;">译</span>, Übersetzen,
þýða, переведите, <span style="font-family: "raavi" , "sans-serif";">ਅਨੁਵਾਦ</span>,
, and ... translate! We use English as our main support language, however,
these days – translate.google.com is really, really good! If English is not
your first language, please feel free to write your questions and issues in
your native language and we will translate it and get back to you. I often find
that it is easier to understand from a translation and this helps us get you
help just that much more quickly!<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<h3>In Conclusion</h3>
<p>So just to recap, make sure you are ready when you come to ask us a question! Have your issue sorted out, know how to reproduce it, know what engine version you are running and your system specs, and attach your log files. This will help us help you just that much faster, and we can get you on your way to making super awesome content in the Stingray game engine. Thanks!</p>
<div class="MsoNormal" style="background: white;">
<b><span style="color: #1f497d; font-family: "candara" , "sans-serif"; font-size: 10.0pt;">Dan Matlack</span></b><span style="color: #500050; font-family: "candara" , "sans-serif"; font-size: 10.0pt;"><o:p></o:p></span></div>
<div class="MsoNormal" style="background: white;">
<i><span style="font-family: "candara" , "sans-serif"; font-size: 10.0pt;">Product Support Specialist –
Games Solutions<o:p></o:p></span></i></div>
<div class="MsoNormal" style="background: white;">
<b><span style="color: #7f7f7f; font-family: "candara" , "sans-serif"; font-size: 10.0pt;">Autodesk, Inc.<o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.45pt; text-autospace: ideograph-other;">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<br />
<div class="MsoNormal">
<br /></div>
Anonymoushttp://www.blogger.com/profile/07591556737927082834noreply@blogger.com11tag:blogger.com,1999:blog-1994130783874175266.post-36605457689018987952016-01-20T11:48:00.001+01:002016-08-09T21:23:48.192+02:00Introducing the Stingray Package Manager (spm)<p>The <em>Stingray Package Manager</em>, or <code>spm</code>, is a small Ruby program that is responsible for downloading specific versions of the external artifacts (libraries, sample projects, etc) that are needed to build the Stingray engine and tools. It's a small but important piece of what makes <em>one-button builds</em> possible.</p>
<p>By <em>one-button builds</em> I mean that it should be possible to build Stingray with a single console command and no human intervention. It should work for any version in the code history. It should build all tools, plugins and utilities that are part of the project (as well as meaningful subsets of those for faster builds). In addition, it should work for all target platforms, build configurations (<em>debug</em>, <em>development</em>, <em>release</em>) and options (enabling/disabling Steam, Oculus, AVX, etc).</p>
<p>Before you have experienced <em>one-button builds</em> it's easy to think: So what? What's the big deal? I can download a few libraries manually, change some environment variables when needed, open a few Visual Studio projects and build them. Sure, it is a little bit of work every now and then, but not too bad.</p>
<p>In fact, there are big advantages to having a one-button build system in place:</p>
<ul>
<li><p>New developers and anyone else interested in the code can dive right in and don't have to spend days trying to figure out how to compile the damned thing.</p>
</li><li><p>Build farms don't need as much babysitting (of course build farms always need <em>some</em> babysitting).</p>
</li><li><p>All developers build the engine in the same way, the results are repeatable and you don't get bugs from people building against the wrong libraries.</p>
</li><li><p>There is a known way to build any previous version of the engine, so you can fix bugs in old releases, do bisect tests to locate bad commits, etc.</p>
</li></ul>
<p>But more than these specific things, having one-button builds also gives you <em>one less thing to worry about</em>. As programmers we are always trying to fit too much stuff into our brains. We should just come to terms with the fact that as a species, we're really not smart enough to be doing this. That is why I think that simplicity is the most important virtue of programming. Any way we can find to reduce cognitive load and context switching will allow us to focus more on the problem at hand.</p>
<p>In addition to <code>spm</code> there are two other important parts of our one-button build system:</p>
<ul>
<li><p>The <code>cmake</code> configuration files for building the various targets.</p>
</li><li><p>A front-end ruby script (<code>make.rb</code>) that parses command-line parameters specifying which configuration to build and makes the appropriate calls to <code>spm</code> and <code>cmake</code>.</p>
</li></ul>
<p>But let's get back to <code>spm</code>. As I said at the top, the responsibility of <code>spm</code> is to download and install external artifacts that are needed by the build process. There are some things that are important:</p>
<ul>
<li><p>Exact versions of these artifacts should be specified so that building a specific version of the source (git hash) will always use the same exact artifacts and yield a predictable result.</p>
</li><li><p>Since some of these libraries are big, hundreds of megabytes, even when zipped (computers are a sadness), it is important not to download more than absolutely necessary for making the current build.</p>
</li><li><p>For the same reason we also need control over how we cache older versions of the artifacts. We don't want to evict them immediately, because then we have to download hundreds of megabytes every time we switch branch. But we don't want to keep all old versions either, because then we would pretty soon run out of space on small disks.</p>
</li></ul>
<p>The last two points are the reason why something like <code>git-lfs</code> doesn't solve this problem out of the box and some more sophisticated package management is needed.</p>
<p><code>spm</code> takes inspiration from popular package managers like <code>npm</code> and <code>gem</code> and offers a similar set of subcommands: <code>spm install</code> to install a package, <code>spm uninstall</code> to uninstall, etc. At its heart, what <code>spm</code> does is a pretty simple operation:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRoaFPOmsd0rXrRPWRdoTepzxUiBqQtiYaKantO0hQ0GTQ5mAaC_dWrZC-Z3NqQaHfM9tQeh4C32YFb18PfBlJmBUs4E4yWjnvanp-sHYwntt4ZJvyx4trYeq06xt767ifP8e_jhR7FQs/s1600/spm-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRoaFPOmsd0rXrRPWRdoTepzxUiBqQtiYaKantO0hQ0GTQ5mAaC_dWrZC-Z3NqQaHfM9tQeh4C32YFb18PfBlJmBUs4E4yWjnvanp-sHYwntt4ZJvyx4trYeq06xt767ifP8e_jhR7FQs/s1600/spm-1.png" /></a></div>
<p>Upon request, <code>spm</code> downloads a specific artifact version (identified by a hash) from an artifact repository. We support multiple artifact repositories, such as S3, git and Artifactory. The artifact is unzipped and stored in a local library folder where it can be accessed by the build scripts. As specific artifact versions are activated and deactivated we move them in and out of the local artifact cache.</p>
<p>We don't use unique folder names for artifact versions. So the folder name of an artifact (e.g., <code>luajit-2.1.0-windows</code>) doesn't tell us the exact version (<code>y0dqqY640edvzOKu.QEE4Fjcwxc8FmlM</code>). <code>spm</code> keeps track of that in internal data structures.</p>
<p>There are advantages and disadvantages to this approach:</p>
<ul>
<li>We don't have to change the build scripts when we do minor fixes to a library, only the version hash used by <code>spm</code>.</li><li>We avoid ugly looking hashes in the folder names and don't have to invent our own version numbering scheme, in addition to the official one.</li><li>We can't see at a glance which specific library versions are installed without asking <code>spm</code>.</li><li>We can't have two versions of the same library installed simultaneously, since their names could collide, so we can't run parallel builds that use different library versions.</li><li>If library version names were unique we wouldn't even need the cache folder, we could just keep all the versions in the library folder.</li></ul>
<p>I'm not 100 % sure we have made the right choice; it might be better to enforce unique names. But it is not a huge deal, so unless there is a big impetus for change we will stay on the current track.</p>
<p><code>spm</code> knows which versions of the artifacts to install by reading configuration files that are checked in as part of the source code. These configuration files are simple JSON files with entries like this:</p>
<pre><code class="lang-json">cmake = {
groups = ["cmake", "common"]
platforms = ["win64", "win32", "osx", "ios", "android", "ps4", "xb1", "webgl"]
lib = "cmake-3.4.0-r1"
version = "CZRgSJOqdzqVXey1IXLcswEuUkDtmwvd"
source = {
type = "s3"
bucket = "..."
access-key-id = "..."
secret-access-key = "..."
}
}
</code></pre>
<p>This specifies the name of the package (<code>cmake</code>), the folder name to use for the install (<code>cmake-3.4.0-r1</code>), the version hash and how to retrieve it from the artifact repository (these source parameters can be re-used between different libraries).</p>
<p>To update a library, you simply upload the new version to the repository, modify the version hash and check in the updated configuration file.</p>
<p>The <code>platforms</code> parameter specifies which platforms this library is used on and <code>groups</code> is used to group packages together in meaningful ways that make <code>spm</code> easier to use. For example, there is an <code>engine</code> group that contains all the packages needed to build the engine runtime and a corresponding <code>editor</code> group for building the editor.</p>
<p>So if you want to install all libraries needed to build the engine on Xbox One, you would do:</p>
<pre><code>spm install-group -p xb1 engine
</code></pre><p>This will install only the libraries needed to build the engine for Xbox One and nothing else. For each library, <code>spm</code> will:</p>
<ul>
<li>If the library is already installed -- do nothing.</li><li>If the library is in the cache -- move it to the library folder.</li><li>Otherwise -- download it from the repository.</li></ul>
<p>Downloads are done in parallel, for maximum speed, with a nice command-line based progress report:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_xVove9CPDBoDa-R20GCIOHoIB5acUBQs1c-bDWc_OqhBGEYHCaR59CF0aFkcsK36-SxeK3an1XLUQBK8FCFyW8FAp1yakKGpYj_QZZDdR-MwHTeRg9twU8ilTWvN5hfy2_Hpicurwew/s1600/spm-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_xVove9CPDBoDa-R20GCIOHoIB5acUBQs1c-bDWc_OqhBGEYHCaR59CF0aFkcsK36-SxeK3an1XLUQBK8FCFyW8FAp1yakKGpYj_QZZDdR-MwHTeRg9twU8ilTWvN5hfy2_Hpicurwew/s1600/spm-2.png" /></a></div>
<p>The cache is a simple MRU cache that can be pruned either by time (throw away anything I haven't used in a month) or by size (trim the cache down to 10 GB, keeping only the most recently used stuff).</p>
<p>Of course, you usually never even have to worry about calling <code>spm</code> directly, because <code>make.rb</code> will automatically call it for you with the right arguments, based on the build parameters you have specified to <code>make.rb</code>. It all happens behind the scenes.</p>
<p>Even the <code>cmake</code> binary itself is installed by the <code>spm</code> system, so the build is able to bootstrap itself to some extent. Unfortunately, the bootstrap process is not 100 % complete -- there are still some things that you need to do manually before you can start using the one-button builds:</p>
<ul>
<li>Install Ruby (for running <code>spm.rb</code> and <code>make.rb</code>).</li><li>Specify the location of your library folder (with an <code>SR_LIB_DIR</code> environment variable).</li><li>Install a suitable version of Visual Studio and/or XCode.</li><li>Install the platform specific SDKs and toolchains for the platforms you want to target.</li></ul>
<p>I would like to get rid of all of this and have a zero-configuration bootstrap procedure. You sync the repository, give one command and bam -- you have everything you need.</p>
<p>But some of these things are a bit tricky. Without Ruby we need something else for the initial step that at least is capable of downloading and installing Ruby. We can't put restricted software in public repositories and it might be considered hostile to automatically run installers on the users' behalf. Also, some platform SDKs need to be installed globally and don't offer any way of switching quickly between different SDK versions, thwarting any attempt to support quick branch switching.</p>
<p>But we will continue to whittle away at these issues, taking the simplifications where we can find them.</p>
Niklashttp://www.blogger.com/profile/10055379994557504977noreply@blogger.com9tag:blogger.com,1999:blog-1994130783874175266.post-78638310678585008842015-12-18T20:36:00.000+01:002015-12-18T20:36:16.620+01:00Data Driven Rendering in Stingray<br />
We're all familiar with the benefits that a data driven architecture brings to gameplay: code is decoupled from data, enabling live linking and rapid iteration. Placing new objects in the editor or modifying the speed of a character has an immediate effect on a live game instance. This really speeds up the development process as you fine-tune scripts, gameplay and other content.<br />
<br />
What about graphics programming? It turns out that the same architecture and associated benefits apply to Stingray’s renderer.<br />
<br />
Just by modifying configuration files (albeit somewhat complex configuration files) we can implement new shader programs, post-processing effects and even different cascading shadow map implementations. All in real time, on a live game instance. Which is a big win for graphics programmers: try out new ideas, fine tune shaders all with real-time feedback. No more of that long edit/compile/run/debug cycle. And this applies to the entire rendering pipeline: everything from the object space to world space transforms to shadow casting and the final rendering pass is all exposed as config file data, not as C++ code as with traditional architectures.<br />
<br />
I gave a presentation on this topic a while back, which has now found its way to our YouTube channel:<br />
<br />
<a href="https://www.youtube.com/channel/UC0fIe6XV1PjilADTei9JMOA">https://www.youtube.com/channel/UC0fIe6XV1PjilADTei9JMOA</a><br />
<br />
By the way, there’s a lot of other great Stingray content up there so please check it out! The renderer presentation can be found under “Stingray Render Config Tutorial.”<br />
<br />
The details as well as a PowerPoint can be found there. The code changes to add a trivial greyscale post-processing effect involve:<br />
<br />
<b>settings.ini: </b><br />
<br />
The render_config variable points to the renderer.render_config file (covered next), and settings.ini also provides a section to override default settings found in that file.<br />
<br />
<b>core/stingray_renderer/renderer.render_config:</b><br />
<br />
This points to our shader libraries: text files containing the actual shader programs. A section called global_resources allocates graphics buffers, such as scratch buffers for the cascading shadow maps and G-buffers for deferred rendering, along with the main framebuffer. And most of the actual rendering is invoked in the resource_generators section. Again, more details are in the YouTube video, though a surprising amount can be learned just by grepping through the various config files and playing with the settings. Which is easy to do since it's all data driven!<br />
<br />
<b>core/stingray_renderer/shader_libraries/development.shader_source:</b><br />
<br />
One of several shader libraries. While shader code can be entered as text here, Stingray also provides a graphical node-based shader editor. And we support ShaderFX materials from Max or Maya. It’s often easier (and more portable) to implement shaders graphically.<br />
<br />
But whatever method you choose to implement shaders in, the key point is that Stingray's entire rendering pipeline is fully accessible through configuration files. With our data driven architecture, making complex rendering changes, while still non-trivial, is a whole lot faster and easier (and portable!) than working with platform-specific C++ code.<br />
<div>
<br /></div>
Ben Moweryhttp://www.blogger.com/profile/12579246590295341299noreply@blogger.com69