Variable Rate Shading: first impressions

Variable Rate Shading (VRS) is a new DX12 feature introduced recently, that can be used to control shading rate. To be more precise, it is used to reduce shading rate, as opposed to the Multi Sampling Anti Aliasing (MSAA) technique which is used to increase it. When using MSAA, every pixel gets allocated multiple samples in the render target, but unless multiple triangles touch it, it will be only shaded once. VRS on the other hand doesn’t allocate multiple samples per pixel, instead it can broadcast one shaded pixel to nearby pixels, and only shade a group of pixels once. The shading rate means how big is the group of pixels that can get shaded as one.

Basics

DirectX 12 lets the developer specify the shading rate as a block of pixels, it can be 1×1 (default, most detailed), 1×2, 2×1, 2×2 (least detailed) in the basic hardware implementation. Optionally, hardware can also support 2×4, 4×2, 4×4 pixel group at an additional capability level. The granularity of the shading rate selection can be controlled per draw call by the basic Tier1 VRS hardware. Controlling by draw call is already a huge improvement over MSAA, because that means shading rate is not consistent across the screen. To set the shading rate, it couldn’t be easier:

commandlist5->RSSetShadingRate(D3D12_SHADING_RATE_2X2, nullptr); // later about second parameter

That’s it, unlike MSAA, we don’t need to do any resolve passes, it just works as is.

The Tier2 VRS feature level lets the developer specify the shading rate granularity even per triangle by using the SV_ShadingRate HLSL semantic for a uint shader input parameter. The SV_ShadingRate can be written as output from the vertex shader, domain shader, geometry shader and mesh shader. In all of the cases, the shading rate will be set per primitive, not per vertex, even though vertex and domain shaders only support the per vertex execution model. The triangle will receive the shading rate of the provoking vertex, which is the first vertex of the three vertices that make up the triangle. The pixel shader can also read the shading rate as an input parameter, which could be helpful in visualizing the rate.

The Tier2 VRS implementation also supports controlling the shading rate by a screen aligned texture. The screen aligned texture is a R8_UINT formatted texture, which contains the shading rate information per tile. A tile can be 8×8, 16×16 or 32×32 pixel block, it can be queried from DX12 as part of the D3D12_FEATURE_DATA_D3D12_OPTIONS6 structure:

D3D12_FEATURE_DATA_D3D12_OPTIONS6 features_6;
device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6, &features_6, sizeof(features_6));
features_6.VariableShadingRateTier; // shading rate image and per primitive selection only on tier2
features_6.ShadingRateImageTileSize; // tile size will be 8, 16 or 32
features_6.AdditionalShadingRatesSupported; // Whether 2x4, 4x2 and 4x4 rate is supported

Which means, that the shading rate image resolution will be:

width = (screen_width + tileSize - 1) / tileSize;
height = (screen_height + tileSize - 1) / tileSize;

To bind the shading rate image, a call exists:

commandlist5->RSSetShadingRateImage(texture);

The shading rate image will need to be written from a compute shader through an Unordered Access View (RWTexture2D<uint>). Before binding it with the RSSetShadingRateImage command, it needs to be in the D3D12_RESOURCE_STATE_SHADING_RATE_SOURCE.

So there are multiple ways to set the shading rate: RSSetShadingRate, SV_ShadingRate, RSSetShadingRateImage, but which one will be in effect? This can be specified through the second parameter of the RSSetShadingRate() call with an array of combiners. The combiners can specify which shading rate selector will be chosen. For example, one that specified the least detailed shading rate (D3D12_SHADING_RATE_COMBINER_MAX), or the most detailed (D3D12_SHADING_RATE_COMBINER_MIN), or by other logic. Right now, I just want to apply the coarsest shading rate that was selected at all times, so I call this at the beginning of every command list:

D3D12_SHADING_RATE_COMBINER combiners[] =
{
	D3D12_SHADING_RATE_COMBINER_MAX,
	D3D12_SHADING_RATE_COMBINER_MAX,
};
GetDirectCommandList(cmd)->RSSetShadingRate(D3D12_SHADING_RATE_1X1, combiners);

Next, I’d like to show some of the potential effects these can play into.

  • Materials
    For example, if there is an expensive material or not very important, lower the shading rate of it. It can easily be a per material setting that can be set at authoring time.
Comparison of native (full rate) and variable rate shading (4×4 reduction). Texture sampling quality is reduced with VRS.
Lighting quality is reduced with VRS (below) compared to full resolution shading (above), but geometry edges are retained.
  • Particle systems
    When drawing off screen particles into a low resolution render target, we can save performance easily, but it will be difficult to compose back the particles and retain smooth edges with the geometry in the depth buffer. Instead, we can choose to render with full resolution, and reduce shading rate. We can also keep using hardware depth testing this way and improve performance.
4×4 shading rate reduction for the particles. Large overlapping particles with lighting must reduce shading rate or rendered lower resolution for good performance. (ground plane also using VRS here)
  • Objects in the distance
    Objects in distance can easily reduce shading rate by draw call or per primitive rate selection.
  • Objects behind motion blur or depth of field
    Fast moving or out of focus objects can be shaded more coarsely and the shading rate image features can be used for this. In my first integration, I am using a compute shader that dispatches one thread group for each tile and each thread in the group reads a pixel’s velocity until all pixels in the tile are read. Each pixel’s velocity is mapped to a shading rate and then the most detailed one is determined via atomic operation inside the tile. The shader internally must write values that are taken from D3D12_SHADING_RATE struct, so in order to keep an API independent implementations, these values are not hard coded, but provided in a constant buffer. [My classification shader source code here, but it will probably change in the future]
Shading rate classification by velocity buffer and moving camera.
  • Alpha testing
    For vegetations, we often want to do alpha testing, and often the depth prepass is used with alpha testing. In that case, we don’t have to alpha test in the second pass when we render colors and use more expensive pixel shaders, because the depth buffer is computed and we can rely on depth testing against previous results. Then the idea is that we will be able to reduce shading rate for the alpha tested vegetation only in the second pass, while retaining high resolution alpha testing quality from the depth prepass.
Vegetation particle system using alpha testing

Problems:

  • One of my observations is that when zoomed in to a reduced shading rate object on the screen, we may see some blockiness, as if it used point/nearest neighbor sampling method, but after some thinking it makes sense, because only a single pixel value is broadcasted to to all neighbors, and no filtering or resolving takes place.
  • Also, mip level selection will be different in coarse shaded regions, because derivatives are larger when using larger pixel blocks. This is because samples are farther away from each other. For me personally, it doesn’t matter because the result is blocky anyways. This should be applied at places where the users will notice these less likely anyway. I am not sure how I would handle it with the Tier2 features, but the per draw call rate selection could be balanced with setting a lower mip LOD bias for the samplers in the draw call when there is coarse shading selected.
  • Although off screen particles can retain more correct composition with the depth buffer with VRS, if we are rendering soft particles (blending computed in the shader from linear depth buffer and particle plane difference), the soft regions where the depth test is not happening will produce some blockiness:
There are particles in the foreground, not depth tested and instead blended in the shader which causes blockiness with VRS enabled (left)
  • Classification
    There are many more aspects to consider when classifying tiles for the image based shading rate selection. Right now, the simple thing to try was select increasingly coarser shading rates with increasing minimum title velocity. The other things to consider would be that I can think of and heard of: depth of field focus, depth discontinuity, visible surface detail. All these are most likely fed from the previous frame and reprojected with the current camera matrices. Tweaking and trying all of these will have to wait for me and probably depending on the kind of game/experience one is making. Strategy games will likely not care about motion blur, unlike racing games.

Performance

Enabling the VRS gets me a significant performance boost, especially when applying to large geometries, such as the floor in Sponza (that also uses an expensive parallax occlusion mapping shader for displacement mapping), or the large billboard particles that are overlapping and using an expensive lighting shader. Some performance results using RTX 2060, 4k resolution:

  • Classification:
    0.18ms – from velocity buffer only
  • Forward rendering:
    5.6ms – stationary camera (full res shading)
    1.8ms – moving camera (variable rate shading)
    4ms – only floor (with parallax occlusion mapping) set to constant 4×4 shading rate
  • Motion blur:
    0.75ms – stationary camera (plus curtain moving on screen)
    3.6ms – moving camera
left: visualizing variable shading rate
right: motion blur amount
Motion blur increases cost when blur amount increases, but VRS reduces cost at the same time
  • Billboard particle system (large particles close to camera)
    4ms – unlit, full resolution shading
    3ms – unlit, 4×4 shading rate
    24.7ms – shaded, full resolution shading
    3.4ms – shaded, 4×4 shading rate
particle performance test – very large particles on screen, overlapping, sampling shadow map and lighting calculation

Thanks for reading, you can read about VRS in more detail in the DX12 specs.

As Philip Hammer called out on Twitter, the Nvidia VRS extension is also available in Vulkan, OpenGL and DX11:

Capsule Collision Detection

Capsule shapes are useful tools for handling simple game physics. Here you will find out why and how to detect and handle collisions between capsule and triangle mesh, as well as with other capsules, without using a physics engine.

(Note on not using a physics engine: There is reason for doing this without physics engine, even if you have a physics engine up and running within your game engine. Physics engine is probably configured for physically correct behaviour which is not necessarily the most gameplay friendly. It also comes with a lot of parameters and lot of overhead, such as the need to set up and manage a physics collision world, probably separately from your own entity system. Also, it is fun to learn how these things work.)

The capsule collisions in action

Capsules are often a better choice to represent game entities than spheres or boxes. Think about a human player character for example:

The bounding sphere doesn’t let the player close to the wall

As you can see, the sphere collision shape will not let the player close to walls if it is large enough to contain the player. But it might work well for other types of more rounded characters. For example imagine this type of roundish entity:

And now more seriously, take a look at the bounding box collision:

Bouding box is better than sphere, but it has corner cases (literally)

The box shape has another problem, that it is not rounded, so it will not be able to traverse stairs, and can be jerky when sliding along walls (shown later). These will lead to weird or unsatisfying gameplay experience.

Capsule is even better

A capsule shape will provide a result that match player expectation more closely. From programming standpoint, it will also be more complicated to implement, but not by much. The implementation is just an extension to sphere collision detection, so let’s start with that.

Sphere – Sphere collision

This is quite simple, we take the distance between the two sphere centres, and if it is less than the added radius of both, they are intersecting. We are also interested in the penetration depth and normal vector to handle game physics such as sliding on top of surfaces/walls (shown later).

Sphere – Triangle collision

This is probably the most complicated part and we should break it down. First, we determine the triangle’s plane from its corner points and see whether the sphere intersects the plane (if not, we do not continue). This is done by taking the distance between the plane and sphere centre. I tried to make the following sample codes to resemble HLSL code, but look at it like pseudo-code:

float3 p0, p1, p2; // triangle corners
float3 center; // sphere center
float3 N = normalize(cross(p1 – p0, p2 – p0)); // plane normal
float dist = dot(center – p0, N); // signed distance between sphere and plane
if(!mesh.is_double_sided() && dist > 0)
  continue; // can pass through back side of triangle (optional)
if(dist < -radius || dist > radius)
  continue; // no intersection

If the code is still running, that means the sphere is intersecting the infinite plane of the triangle. Next, we determine if there is any point inside the triangle that is inside the sphere or not. There are two cases. The first case is that if the sphere centre projected to the plane is inside the triangle perimeter, then the sphere is intersecting the triangle:

When sphere centre projection is inside the triangle…
float3 point0 = center – N * dist; // projected sphere center on triangle plane

// Now determine whether point0 is inside all triangle edges: 
float3 c0 = cross(point0 – p0, p1 – p0) 
float3 c1 = cross(point0 – p1, p2 – p1) 
float3 c2 = cross(point0 – p2, p0 – p2)
bool inside = dot(c0, N) <= 0 && dot(c1, N) <= 0 && dot(c2, N) <= 0;

If inside is true, then we can just take the point0 (projected sphere centre as the closest point on the triangle to the sphere. Otherwise, we are not sure we are intersecting with the triangle, so find the closest point to the sphere centre on all three edges, choose the closest of the three and determine if it is within the sphere radius. So for all three edge line segments, we will determine the closest point to the sphere centre and take the closest one:

When sphere centre projection is not inside triangle…
float3 ClosestPointOnLineSegment(float3 A, float3 B, float3 Point)
{
  float3 AB = B – A;
  float t = dot(Point – A, AB) / dot(AB, AB);
  return A + saturate(t) * AB; // saturate(t) can be written as: min((max(t, 0), 1)
}

float radiussq = radius * radius; // sphere radius squared

// Edge 1:
float3 point1 = ClosestPointOnLineSegment(p0, p1, center);
float3 v1 = center – point1;
float distsq1 = dot(v1, v1);
bool intersects = distsq1 < radiussq;

// Edge 2:
float3 point2 = ClosestPointOnLineSegment(p1, p2, center);
float3 v2 = center – point2;
float distsq2 = dot(vec2, vec2);
intersects |= distsq2 < radiussq;

// Edge 3:
float3 point3 = ClosestPointOnLineSegment(p2, p0, center);
float3 v3 = center – point3;
float distsq3 = dot(v3, v3);
intersects |= distsq3 < radiussq;

At the end of this, we will know whether we intersect with the triangle or not. If we intersect, we must compute the penetration depth and normal so that we will be able to properly handle the collision (bounce back or slide on surface). Like I said, we need to determine the closest intersection point for this. It is either inside the triangle, or on one of the edges:

if(inside || intersects)
{
  float3 best_point = point0;
  float3 intersection_vec;

  if(inside)
  {
    intersection_vec = center – point0;
  }
  else  
  {
    float3 d = center – point1;
    float best_distsq = dot(d, d);
    best_point = point1;
    intersection_vec = d;

    d = center – point2;
    float distsq = dot(d, d);
    if(distsq < best_distsq)
    {
      distsq = best_distsq;
      best_point = point2;
      intersection_vec = d;
    }

    d = center – point3;
    float distsq = dot(d, d);
    if(distsq < best_distsq)
    {
      distsq = best_distsq;
      best_point = point3; 
      intersection_vec = d;
    }
  }

  float3 len = length(intersection_vec);  // vector3 length calculation: 
  sqrt(dot(v, v))
  float3 penetration_normal = penetration_vec / len;  // normalize
  float penetration_depth = radius – len; // radius = sphere radius
  return true; // intersection success
}

That’s it for the sphere – triangle intersection!

Capsule collision

The capsule collision is a simple extension to the sphere intersection test. I define the capsule with two points, base and tip, and the radius:

Points A an B are derived from Base, Tip points and radius. They will also be useful in computations. In game code I found that it makes more sense to handle the Base and Tip points instead of A and B sphere centers.

The main idea with the implementation, is that we want to do a sphere intersection test by selecting the most relevant sphere along the capsule axis, clamped to the A-B line segment. Let’s see it in action in the following cases:

Capsule – Capsule collision

We are looking for BestA and BestB points.

As you can see, the intersection test is at the end performed against two spheres. To select the most relevant spheres, the previously defined ClosestPointOnLineSegment() function will come in handy. We will compute the A and B line endpoints on both capsules, then take vectors and distances between all endpoints between the two capsules, to finally compute the closest points between two lines:

// capsule A:
float3 a_Normal = normalize(a.tip – a.base); 
float3 a_LineEndOffset = a_Normal * a.radius; 
float3 a_A = a.base + a_LineEndOffset; 
float3 a_B = a.tip - a_LineEndOffset;

// capsule B:
float3 b_Normal = normalize(b.tip – b.base); 
float3 b_LineEndOffset = b_Normal * b.radius; 
float3 b_A = b.base + b_LineEndOffset; 
float3 b_B = b.tip - b_LineEndOffset;

// vectors between line endpoints:
float3 v0 = b_A – a_A; 
float3 v1 = b_B – a_A; 
float3 v2 = b_A – a_B; 
float3 v3 = b_B – a_B;

// squared distances:
float d0 = dot(v0, v0); 
float d1 = dot(v1, v1); 
float d2 = dot(v2, v2); 
float d3 = dot(v3, v3);

// select best potential endpoint on capsule A:
float3 bestA;
if (d2 < d0 || d2 < d1 || d3 < d0 || d3 < d1)
{
  bestA = a_B;
}
else
{
  bestA = a_A;
}

// select point on capsule B line segment nearest to best potential endpoint on A capsule:
float3 bestB = ClosestPointOnLineSegment(b_A, b_B, bestA);

// now do the same for capsule A segment:
bestA = ClosestPointOnLineSegment(a_A, a_B, bestB);

We selected the two best possible candidates on both capsule axes. What remains is to place spheres on those points and perform the sphere intersection routine:

float3 penetration_normal = bestA – bestB;
float len = length(penetration_normal);
penetration_normal /= len;  // normalize
float penetration_depth = a.radius + b.radius – len;
bool intersects = penetration_depth > 0;

Done, we can now detect collision between two capsules.

Capsule – Triangle collision

The principle is very similar to the previous example, we select the closest point on the capsule line to the triangle, place a sphere on that point and then perform the sphere – triangle test. We can find the closest point on a line to triangle with ray tracing. First we trace the plane with the capsule line (ray). Then we determine the closest point on the triangle to the trace point. The reference point is the closest point on the capsule line to that. We will place the sphere on the reference point and do a normal triangle – sphere intersection test as usual.

Only tracing the plane and finding closest point on capsule line is not enough. We must find closest point on capsule line to the triangle. We are looking for the Reference point (pink).
It seems that this image has a mistake: the segment marked with “penetration depth”, is really (radius – penetrationdepth)
// Compute capsule line endpoints A, B like before in capsule-capsule case:
float3 CapsuleNormal = normalize(tip – base); 
float3 LineEndOffset = CapsuleNormal * radius; 
float3 A = base + LineEndOffset; 
float3 B = tip - LineEndOffset;

// Then for each triangle, ray-plane intersection:
//  N is the triangle plane normal (it was computed in sphere – triangle intersection case)
float t = dot(N, (p0 - base) / abs(dot(N, CapsuleNormal)));
float3 line_plane_intersection = base + CapsuleNormal * t;

float3 reference_point = {find closest point on triangle to line_plane_intersection};

// The center of the best sphere candidate:
float3 center = ClosestPointOnLineSegment(A, B, reference_point);

The “find closest point on triangle to line_plane_intersection” refers to the same approach that was used in the sphere – triangle intersection case, when we try to determine if the sphere centre projection is inside the triangle, or find a closest edge point. Here, instead of the sphere centre projection, we use the line_plane_intersection point:

// Determine whether line_plane_intersection is inside all triangle edges: 
float3 c0 = cross(line_plane_intersection – p0, p1 – p0) 
float3 c1 = cross(line_plane_intersection – p1, p2 – p1) 
float3 c2 = cross(line_plane_intersection – p2, p0 – p2)
bool inside = dot(c0, N) <= 0 && dot(c1, N) <= 0 && dot(c2, N) <= 0;

if(inside)
{
  reference_point = line_plane_intersection;
}
else
{
  // Edge 1:
  float3 point1 = ClosestPointOnLineSegment(p0, p1, line_plane_intersection);
  float3 v1 = line_plane_intersection – point1;
  float distsq = dot(v1, v1);
  float best_dist = distsq;
  reference_point = point1;

  // Edge 2:
  float3 point2 = ClosestPointOnLineSegment(p1, p2, line_plane_intersection);
  float3 v2 = line_plane_intersection – point2;
  distsq = dot(vec2, vec2);
  if(distsq < best_dist)
  {
    reference_point = point2;
    best_dist = distsq;
  }

  // Edge 3:
  float3 point3 = ClosestPointOnLineSegment(p2, p0, line_plane_intersection);
  float3 v3 = line_plane_intersection – point3;
  distsq = dot(v3, v3);
  if(distsq < best_dist)
  {
    reference_point = point3;
    best_dist = distsq;
  }
}

After this (that you placed the sphere onto the reference point), just finish it off with the sphere – triangle intersection, and you are done.

There is a corner case with this. When the capsule line is parallel to the triangle plane, the ray – plane trace will fail. You can detect this case by checking the dot product between the ray direction (CapsuleNormal) and triangle normal (N). If it’s zero, then they are parallel. When this happens, instead of relying on the tracing, a simple fix is to just take an arbitrary point (for example the first corner) from the triangle and use that as reference point. The reference point will be used the same way as the intersection point, to determine the closest point on the capsule A-B segment.

Usage

Now that you have capsule collision, you will want to use the results. You will obviously know when two primitives intersect now, but you also get the penetration normal and depth. You can use these for character sliding on surfaces.

If the capsule collides with a surface, then modify the velocity so that it will be absorbed in the direction of the impact and instead slide parallel to the surface:

Absorbing penetration force (red arrow) will make the box collider stop in place because it is perpendicular to collision plane (purple line). The collision plane is perpendicular to the penetration normal (blue arrow)
The penetration force (red arrow) is nicely absorbed and allows sliding upwards (green arrow).
collisionobject, normal, depth = SceneIntersectCapsule(capsule);
if(collisionobject)
{
  // Modify player velocity to slide on contact surface:
  float3 velocity_length = length(velocity);
  float3 velocity_normalized = velocity / velocity_length;
  float3 undesired_motion = normal * dot(velocity_normalized, normal);
  float3 desired_motion = velocity_normalized – undesired_motion;
  velocity = desired_motion * velocity_length;

  // Remove penetration (penetration epsilon added to handle infinitely small penetration):
  capsule.Translate(normal * (depth + 0.0001f));
}

The player sliding on the contact surface is a very nice and expected experience for the player. It handles collision with the wall as well as with the floor. It will automatically handle moving a slope or stairs as well, however, if there is gravity, the player will be constantly sliding down slopes and small stairs. The gravity also makes it difficult to climb the slopes and stairs, which is not very desirable. In my experience, it is best to perform the collision detection twice, once for the player motion forces, and once with only the gravity force, but in the gravity collision phase, don’t do the sliding along surface if the surface is not sloped enough. You can check how much it is sloped by comparing the dot product of the penetration normal and the “up” vector:

if(dot(normal, float3(0, 1, 0)) > 0.3f)
{
  ground_intersection = true;
  gravity_force = 0;
}

Discrete collision detection

This kind of collision detection can detect when a stationary capsule intersect with an other one or a triangle, but if the capsule is moving very fast, it can miss objects in between two game update ticks. This phenomenon is called the bullet through paper problem. To avoid this, we must do continuous collision detection. A very simple solution for this is just to do multiple steps of collision detection during one game frame. For example, let’s say that we want to divide the game frame into 5 collision sub-steps:

int ccd_max = 5; 
for(int ccd = 0; ccd < ccd_max; ++ccd)
{
  float step = velocity / ccd_max;
  capsule.Translate(step);
  collisionobject, normal, depth = SceneIntersectCapsule(capsule);
  // …
}

The more sub-steps we do, the more accurate the result will be, as well as more expensive.

As a bonus, an idea worth to try in the future is to apply one capsule collision detection step for moving spheres instead of multiple step based continuous collision detection phases. For example to implement geometry collision for a particle system.

Optimization

Some simple optimizations here can get us a long way. The most obvious one is to perform AABB (axis aligned bounding box) checks before you do a capsule collision test. AABB for a capsule is very easily generated, for example generate two AABBs for the two opposite spheres, and merge the resulting AABBs. Only check collisions between objects whose AABBs are overlapping the capsule. I also got a very prominent speed up by doing the AABB check between capsule AABB and triangle AABB. This was also very important for a large terrain mesh. In continuous collision detection fashion, you can also create a large AABB encompassing the starting and ending capsule positions (according to velocity) for all the CCD steps, so the sub-steps will only check a subset of potential collider objects.

It would be also preferable to use a BVH (bounding volume hierarchy, hierarchical bounding box acceleration structure) to reduce the number of comparisons for detecting overlapping objects/triangles.

An other thing that you probably want to return early as possible when doing the continuous collision detection sub-steps. So don’t search for the closest intersection, but return immediately if you find one. Let the gameplay code adjust position/velocity and continue with the next sub-step. It might be the case that no further collision occurs after it anyway. At least in my experience, this approach posed no visible problems.

I based the sphere collision detection approach on how the DirectXMath library does it, but modified in some places to work better for my case.

I hope you found it useful. If you notice any mistakes or have any ideas to make it better, hit the comment section below.

Tile-based optimization for post processing

One way to optimize heavy post processing shaders is to determine which parts of the screen could use a simpler version. The simplest form of this is use branching in the shader code to early exit or switch to a variant with reduced sample count or computations. This comes with a downside that even the parts where early exit occur, must allocate as many hardware resources (registers or groupshared memory) as the heaviest path. Also, branching in the shader code can be expensive when the branch condition is not uniform (for example: constant buffer values or SV_GroupID are uniform, but texture coordinates or SV_DispatchThreadID are not), because multiple threads can potentially branch differently and cause divergence in execution and additional instructions.

Instead of branching, we can consider to have multiple different shaders with different complexities and use the DispatchIndirect functionality of DX11+ graphics APIs. The idea is to first determine which parts of the screen will require what kind of complexity, this will be done per tile. A tile could be for example at 32×32 or 16×16 pixels in size, but it could depend on the kind of post process. What do I mean by complexity? Let’s take depth of field for example. Depth of field is used to focus on an object and make it sharp, while making the foreground and background blurry. The part of the screen that will appear completely sharp will use early exit shader, without doing any blurring. The other parts of the screen will instead perform a blur. For a more complex depth of field effect, such as the one described in Siggraph 2014 by Jorge Jimenez as “scatter-as-you-gather” depth of field (such a great presentation!), the blur can be separated into two kinds. A simple blur can be used where the whole tile contains similar CoC (Circle of Confusion – the amount of blur, basically), and a more expensive weighted blur will be used in the tiles where there are vastly different CoC values. With this knowledge we can design a prepass shader that classifies tiles as how complex shader they require. After that, we just DispatchIndirect multiple times back to back – one for each complexity and using different shaders. This requies a lot of setup work, so it’s a good idea to leave it for last step in the optimizations.

The implementation will consist of the following steps:

1) Tile classification.

Usually you are already gathering some per tile information for these kind of post processes. For example for the Depth of field I put the classification in the same shader that computes the min-max CoC per tile. This shader writes to 3 tile list in my case:

  • early exit tiles (when tile maximum CoC is small enough)
  • cheap tiles (maxCoC – minCoc is small enough)
  • expensive tiles (everything else)

Take a look at this example:

Focusing on the character face…
(The cute model is made by Sketchfab user woopoodle)
The face is classified as early out (blue), while parts of the geometry that is facing the camera are cheap (green). The rest is expensive (red), because the CoC range is large
Focusing on the statue in the background…
The lion head in the back is early exit (blue that’s barely visible), there are a lot of cheap tiles (green), because the CoC is capped to a maximum value, while the character silhouette is expensive (red), because it contains focused and blurred pixels as well (or just high CoC difference)

For motion blur, we can also do similar things, but using min-max of velocity magnitudes.
For screen space reflection, we could classify by reflectivity and roughness.
For tiled deferred shading, we can classify tiles for different material types. Uncharted 4 did this as you can find here: http://advances.realtimerendering.com/s2016/index.html
Possibly other techniques can benefit as well.
The tile classification shader is implemented via stream compaction, like this:

static const uint POSTPROCESS_BLOCKSIZE = 8;

static const uint TILE_STATISTICS_OFFSET_EARLYEXIT = 0;
static const uint TILE_STATISTICS_OFFSET_CHEAP = TILE_STATISTICS_OFFSET_EARLYEXIT + 4;
static const uint TILE_STATISTICS_OFFSET_EXPENSIVE = TILE_STATISTICS_OFFSET_CHEAP + 4;

RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

[numthreads(POSTPROCESS_BLOCKSIZE, POSTPROCESS_BLOCKSIZE, 1)]
void main(uint3 DTid : SV_DispatchThreadID) // this runs one thread per tile
{
  // ...
  const uint tile = (DTid.x & 0xFFFF) | ((DTid.y & 0xFFFF) << 16); // pack current 2D tile index to uint

  uint prevCount;
  if (max_coc < 0.4f)
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EARLYEXIT, 1, prevCount);
    tiles_earlyexit[prevCount] = tile;
  }
  else if (abs(max_coc - min_coc) < 0.2f)
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_CHEAP, 1, prevCount);
    tiles_cheap[prevCount] = tile;
  }
  else
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EXPENSIVE, 1, prevCount);
    tiles_expensive[prevCount] = tile;
  }
}

2) Kick indirect jobs.

If your tile size is the same as the POSTPROCESS_BLOCKSIZE that will render the post process, you could omit this step and just stream compact to the indirect argument buffers inside the classification shader itself. But in my case I am using 32×32 pixel tiles, while the thread count of the compute shaders is 8×8. So the “kick jobs” shader will compute the actual dispatch count and write indirect argument buffers. This shader will also be resposible to reset the counts for the next frame. It is using only 1 thread:

static const uint DEPTHOFFIELD_TILESIZE = 32;

static const uint INDIRECT_OFFSET_EARLYEXIT = TILE_STATISTICS_OFFSET_EXPENSIVE + 4;
static const uint INDIRECT_OFFSET_CHEAP = INDIRECT_OFFSET_EARLYEXIT + 4 * 3;
static const uint INDIRECT_OFFSET_EXPENSIVE = INDIRECT_OFFSET_CHEAP + 4 * 3;

#define sqr(a)		((a)*(a))

RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

[numthreads(1, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
  // Load statistics:
  const uint earlyexit_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EARLYEXIT);
  const uint cheap_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_CHEAP);
  const uint expensive_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EXPENSIVE);

  // Reset counters:
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EARLYEXIT, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_CHEAP, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EXPENSIVE, 0);

  // Create indirect dispatch arguments:
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE); // for all tiles, we will replicate this amount of work
  tile_statistics.Store3(INDIRECT_OFFSET_EARLYEXIT, uint3(earlyexit_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_CHEAP, uint3(cheap_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_EXPENSIVE, uint3(expensive_count * tile_replicate, 1, 1));
}

Note that the tile_statistics buffer will also be used as the indirect argument buffer for DispatchIndirect later, but using offsets into the buffer. The value tile_replicate is key to have different tile size than the threadcount of the post processing shaders (POSTPROCESS_BLOCKSIZE). Essentially we will dispatch multiple thread groups per tile to account for this difference. However, TILESIZE should be evenly divisible by POSTPROCESS_BLOCKSIZE to keep the code simple.

3) Use DispatchIndirect

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN_EARLYEXIT]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EARLYEXIT);

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN_CHEAP]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_CHEAP);

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EXPENSIVE);

Note that if you are using DX12 or Vulkan, you don’t even need to synchronize between these indirect executions because they touch mutually exclusive parts of the screen. Unfortunately, DX11 will always wait for a compute shader to finish before starting the next which is a slight inefficiency in this case.

4) Execute post process

You will need to determine which tile and which pixel are you currently shading. I will refer to tile which is the 32 pixel wide large tile taht was classified and subtile that is the 8 pixel wide small tile that corresponds to the thread group size. You can see what I mean on this drawing:

It is also called “Tileception”

So first we read from the corresponding tile list (early exit shader reads from early exit tile list, cheap shader from cheap tile list, and so on…) and unpack the tile coordinate like this:

// flattened array index to 2D array index
inline uint2 unflatten2D(uint idx, uint dim)
{
  return uint2(idx % dim, idx / dim);
}

RWStructuredBuffer<uint> tiles;

[numthreads(POSTPROCESS_BLOCKSIZE * POSTPROCESS_BLOCKSIZE, 1, 1)]
void main(uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID)
{
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint tile_idx = Gid.x / tile_replicate;
  const uint tile_packed = tiles[tile_idx];
  const uint2 tile = uint2(tile_packed & 0xFFFF, (tile_packed >> 16) & 0xFFFF);

After we got the tile, we can continue to compute the pixel we want to shade:

  const uint subtile_idx = Gid.x % tile_replicate;
  const uint2 subtile = unflatten2D(subtile_idx, DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint2 subtile_upperleft = tile * DEPTHOFFIELD_TILESIZE + subtile * POSTPROCESS_BLOCKSIZE;
  const uint2 pixel = subtile_upperleft + unflatten2D(GTid.x, POSTPROCESS_BLOCKSIZE);

Note that we are running a one dimensional kernel instead of 2 dimensional. I think this simplifies the implementation because the tile lists are also one dimensional, but we need to use the unflatten2D helper function to convert 1D array index to 2D when computing the pixel coordinate. Also note that this code adds instructions to the shader, but up until the last line, those instructions are not divergent (because they are relying on the uniform SV_GroupID semantic), so they can be considered cheap as they are not using the most precious hardware resources, the VGPR (Vector General Purpose Registers).

After we have the pixel coordinate, we can continue writing the regular post processing code as usual.

Interestingly, I have only seen performance benefit with this optimization when the loops were unrolled in the blurring shaders. Otherwise, they were not bottlenecked by register usage, but the dynamic loops were not performing very well. After that, I have experienced an improvement with the tiling optimization, about 0.3 milliseconds were saved in Depth of Field at 4k resolution on Nvidia GTX 1070 GPU.

Probably further tests need to be conducted and trying different tile sizes and classification threshold values. However, this is already a good one to have in your graphics optimization toolbox.

You can find my implementation of this in WickedEngine for the motion blur and depth of field effects.

Let me know if you spot any mistakes or have feedback in the comments!

Entity-component system

Here goes my idea of an entity-component system written in C++. I’ve been using this in my home-made game engine, Wicked Engine for exactly a year now and I am still very happy with it. The focus is on simplicity and performance, not adding many features.

This entity-component system (ECS from now on) is used to manage the game scene. That means storing game objects in memory, updating and rendering them, writing them to disk and loading them from disk to memory. The simple ECS that can handle these consists of:

  • Entity: a unique identifier (a number). Imagine it like an ID in a database.
  • Component: a data blob (a struct). This is the data that is associated with an entity. It is up to the user to create the data type, by creating a struct or a class in C++.
  • System: data transformation (a function). This is any function that operates on a number of components. The user will write simple functions that take components as inputs and produce components as outputs. It is just any C++ function, really.
  • ComponentManager: entity to component mapping container. Each entity can have one component held in it. The user can check if an entity and component exists, and query an entity’s component. This is the thing that we will implement here and give to the user as a tool to work with.

The ECS will not only be simple, but performance is a big concern. We will want to have a lot of game objects, and manage them as fast as we can. Today, memory access is often the slow part of an application. The best way to avoid fetching memory and waiting for it to arrive, is to make good use of the cache and load from cache as often as we can. By keeping data in linear arrays, we are in a very good position, because most of the time we will iterate through every component and do something with them. The CPU is good at this when data is accessed linearly and memory can be prefetched into cache ahead of time. An other nice feature is that linear arrays will lend themselves very well to multithreading; you can use parallel for loops very simply and efficiently to process a large number of data. The ComponentManager will be responsible to manage components in a linear fashion. This is also called in many places Data Oriented Design (DoD).

Let’s look at an example first, what kind of code the user of the ECS will need to write:

// Create a new unique entity:
Entity entity = CreateEntity();

// A component:
struct Position
{
  float x, y, z;
};

// ComponentManager to hold components:
ComponentManager<Position> positions;

// Check whether the entity has a position component:
positions.Contains(entity); // returns false

// Create a new Position component for the entity:
Position& position = positions.Create(entity);

// Check whether the entity has a position component:
positions.Contains(entity); // returns true now

// A system that moves every entities' position up by one:
void UpdatePositionsSystem(ComponentManager<Position>& positions)
{
  for(size_t i = 0; i < position.GetCount(); ++i)
  {
    positions[i].y += 1;
  }
}

Apart from the simple example above, the ECS will have some other functionality too, but that comes later, when we need it. Let’s implement what we have so far, one by one.

Entity

First, we have the CreateEntity(). This should give us a unique number identifier. The Entity is just a typedef for a 32-bit unsigned int:

#include <cstdint>
typedef uint32_t Entity;

In the simplest implementation for CreateEntity(), you could give a monotonically increasing number:

Entity CreateEntity()
{
  static Entity next = 0;
  return ++next;
}

However, that comes with a lot of issues if you will ever need to serialize the entities, write them to disk and load later. Just imagine starting your program, creating some entities, saving them, then close the program, start again, create some entities (starts from one) then loading the previous entities (those that also started from one). Your entities will no longer be unique in this case. We will address this later, in the Serialization part, for now you can use this, then later replace this function if you are interested in serialization.

By the way, we start from one, because we reserve zero as an invalid entity handle. This is useful, for example giving default value to entities that have not yet been created:

static const Entity INVALID_ENTITY = 0;

This is it for the Entity for now. The crux of the implementation will be after this, the ComponentManager.

ComponentManager

The ComponentManager is a container that we will implement. It will be storing components and entities in linear arrays, for the previously mentioned performance reasons. By storing both components and entities, we have the ability to query a component’s entity or an entity’s component for a specific index position inside the ComponentManager. Very useful if we just iterate through all components and want to look up into an other type of ComponentManager for a component’s entity. Speaking of lookups, sometimes we want to just look up a specific entity’s component, without iteration through the whole array, for that a hash map will be perfect. This ComponentManager can be templated based on component type, so it can look sort of like a C++ standard container:

#include <vector>
#include <unordered_map>

template<typename Component>
class ComponentManager
{
private:
  std::vector<Component> components;
  std::vector<Entity> entities;
  std::unordered_map<Entity, size_t> lookup;
};

We then start implementing the functionality for our previous example. With a Contains() method, we can check if an entity has this kind of component or not:

bool Contains(Entity entity) const
{
  return lookup.find(entity) != lookup.end();
}

Of course, first, we should be able to create components for entities with a Create() function:

Component& Create(Entity entity)
{
  // INVALID_ENTITY is not allowed!
  assert(entity != INVALID_ENTITY);

  // Only one of this component type per entity is allowed!
  assert(lookup.find(entity) == lookup.end());

  // Entity count must always be the same as the number of components!
  assert(entities.size() == components.size());
  assert(lookup.size() == components.size());

  // Update the entity lookup table:
  lookup[entity] = components.size();

  // New components are always pushed to the end:
  components.push_back(Component());

  // Also push corresponding entity:
  entities.push_back(entity);

  return components.back();
}

I like to put in a lot of asserts wherever I can, so a debug version of the program will catch any issues. For simplicity’s sake, I like the limitation that an entity can have only one components of the same kind. If the user would ever want more components, more ComponentManager instances can be used, but I never wanted to have more than one component of the same kind for an entity.

To iterate through every component (or entity, because there is the same amount of components and entities), we can query the amount of components/entities:

size_t GetCount() const { return components.size(); }

Then to iterate in a linear fashion, we can get a component reference from the linear array by a size_t indexer and the [] operator:

Component& operator[](size_t index) { return components[index]; }

All of these methods are a part of the templated ComponentManager class, so they reside in the header file. In fact, all our implementation will be in the same header file as part of the ComponentManager class. So far, this is enough to use the simple example that I written above. However, this is not enough functionality. Next, I will give some other useful functions that the ComponentManager should have and use cases for them.

For example, I often iterate through components and use the [] operator to index a component. But what if I need the Entity for the same component? It is not required that the component data struct contain an Entity reference. Instead, the ComponentManager can give the corresponding Entity back to the user:

Entity GetEntity(size_t index) const { return entities[index]; }

The use case is if we have different components for one entity, and one component wants to look at an other component’s data:

ComponentManager<Position> positions;
// ... continued from previous example, entity has a position now...

struct Mass
{
 float value;
};
ComponentManager<Mass> masses;
masses.Create(entity);

for(size_t i = 0; i < positions.GetCount(); ++i)
{
  Position& position = positions[i];
  Entity entity = positions.GetEntity(i);
  Mass* mass = masses.GetComponent(entity);
  if(mass != nullptr)
  {
    position.y += 1.0f / mass->value;
  }
}

In the above example, in order to update the position, we need to get access to the mass, but that is stored in an other component. We check if the entity has a mass, and if it has, we update the position according to the mass. The GetComponent() function also needs to be implemented now:

Component* GetComponent(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
  {
    return &components[it->second];
  }
  return nullptr;
}

As you can see, this will return the pointer to a component if it exists, nullptr otherwise. This performs a hash map lookup, that is slower than linear access indexing, but we won’t use it as frequently. Or at least, we shouldn’t! We should lay out our data that map lookups like this will be minimized. It is often better to duplicate data in an ECS and keep the cache warm.

What about deleting game objects? This can be a frequently used functionality of the ECS, so we must implement it. We can remove the last element of a linear array trivially, we just decrease the array size by one (std::vector’s pop() function) and destroy the last object. Removing from the middle of the array is a bit more tricky. We replace the element to be removed with the last element in the array, and decrease the array’s size. This is done by the Remove() function (duh):

void Remove(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
  {
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
    {
      // Swap out the dead element with the last one:
      components[index] = std::move(components.back()); // try to use move
      entities[index] = entities.back();

      // Update the lookup table:
      lookup[entities[index]] = index;
    }

    // Shrink the container:
    components.pop_back();
    entities.pop_back();
    lookup.erase(entity);
  }
}

As you can see, we also have to keep the lookup table in sync with the entity and component arrays. I also try to use C++ move semantics and avoid copying component data if possible. It will be possible if you use only default constructors and destructors, or write move constructors for your components yourself. Otherwise, copy will be used.

If you remove an element like this, it can be problematic however, if you rely on component ordering. There was at least one case where I relied on the order. However, so far we have a pretty well usable ECS already. The following sections of the blog will be about two bigger topics:

  • Scene graph hierarchy: the problem is that we rely on component ordering in the ECS hierarchy traversal
  • Serialization: the problem is that Entity uniqueness must persist after serialization/deserialization.

If you are interested in the above topics, stick around. 🙂

Scene Graph

For me, this was actually the most interesting part of implementing the ECS because this had the most question marks in my head about. In a usual pointer based scene graph tree implementation, we can just use recursion and traverse all tree nodes starting from the root node trivially. With the ECS implementation, this is done a bit differently. We have components here, stored in linear arrays, and we can’t use pointers to index components, because they can move around in the ComponentManager when they are being added and removed. So how to do it?

The basic scene graph is a tree of nodes, where each node has a spatial position, and all its children’s positions will be relative to their parent. This means, that we can have:

  • TransformComponent: position, scale, rotation, world matrix
  • HierarchyComponent: entity’s parent, parent’s inverse world matrix at the time of attaching

With these two components, and iterating components linearly from first to last, we are able to represent hierarchy in a scene. There is one other missing piece: we will rely on the order of components in the linear arrays. Because children can refer to their single parent, but parents have no knowledge of their (possibly many) children, parents must be placed before their children in the component arrays, so that when a child computes its transform world matrix, the parent is already up to date because it was computed earlier.

For example, let’s say that we have an Attach() function:

void Attach(Entity entity, Entity parent);

This function will operate on the HierarchyComponent array and add a new component at the end, that will be the child:

struct HierarchyComponent
{
  Entity parentID = INVALID_ENTITY;
  Matrix world_parent_inverse_bind;
};
ComponentManager<HierarchyComponent> hierarchy;

void Attach(Entity entity, Entity parent)
{
  hierarchy.Create(entity).parentID = parent;
  // ... to be continued
}

However, the child we are adding might have been already a parent of other children. In that case, we want to search for a component whose parent is the current entity which we are attaching and move it before it.

// ...continued

for (size_t i = 0; i < hierarchy.GetCount(); ++i)
{
  const HierarchyComponent& child = hierarchy[i];
	
  if (child.parentID == entity)
  {
    hierarchy.MoveItem(hierarchy.GetCount() - 1, i);
    break;
  }
}

But that’s not sufficient as someone pointed out in the comments, because if a whole sub-tree is reattached, not only the parent can get in the wrong place, but all children of that, so we have to carefully check every entity and move them if they are actually a parent of some component that’s before them. So replace the previous block with this more complicated one instead:

// ...continued

if (hierarchy.GetCount() > 1)
{
  for (size_t i = hierarchy.GetCount() - 1; i > 0; --i)
  {
    Entity parent_candidate_entity = hierarchy.GetEntity(i);
    const HierarchyComponent& parent_candidate = hierarchy[i];
    for (size_t j = 0; j < i; ++j)
    {
      const HierarchyComponent& child_candidate = hierarchy[j];

      if (child_candidate.parentID == parent_candidate_entity)
      {
        hierarchy.MoveItem(i, j);
        ++i; // next outer iteration will check the same index again as parent candidate, however things were moved upwards, so it will be a different entity!
        break;
      }
    }
  }
}

Explanation: We start from the end of the hierarchy (from the component we just added), and check if any previous components are its children or not. If we find a child before it, we move it before the child (but keep the ordering of all components after the child intact). We move towards the beginning of the hierarchy chain and for every component, check if there is a child before them and move if necessary.

The MoveItem() is a new method for the ComponentManager class. It will move an element to a specified index inside the array and moves all elements that are in the way by one position:

void MoveItem(size_t index_from, size_t index_to)
{
  assert(index_from < GetCount());
  assert(index_to < GetCount());
  if (index_from == index_to)
  {
    return;
  }

  // Save the moved component and entity:
  Component component = std::move(components[index_from]);
  Entity entity = entities[index_from];

  // Every other entity-component that's in the way gets moved by one and lut is kept updated:
  const int direction = index_from < index_to ? 1 : -1;
  for (size_t i = index_from; i != index_to; i += direction)
  {
    const size_t next = i + direction;
    components[i] = std::move(components[next]);
    entities[i] = entities[next];
    lookup[entities[i]] = i;
  }

  // Saved entity-component moved to the required position:
  components[index_to] = std::move(component);
  entities[index_to] = entity;
  lookup[entity] = index_to;
}

That could have been done in a different way, by allocating a temporary memory and moving all the prior elements in one go instead of one by one. The current solution instead avoids extra memory allocations.

After calling MoveItem() in the Attach() function, we are not sure anymore that references remained intact, so we query the currently added child:

HierarchyComponent& parentcomponent = *hierarchy.GetComponent(entity);

Then we continue to query the TransformComponent of the parent (create it if it doesn’t exist):

TransformComponent* transform_parent = transforms.GetComponent(parent);
if (transform_parent == nullptr)
{
  transform_parent = &transforms.Create(parent);
}
// Save the parent's inverse worldmatrix:
parentcomponent.world_parent_inverse_bind = MatrixInverse(transform_parent->worldmatrix);

This is it for the Attach() function. But sometimes, we also detach children from parents. The Detach function() could be like this:

void Scene::Component_Detach(Entity entity)
{
  const HierarchyComponent* parent = hierarchy.GetComponent(entity);

  if (parent != nullptr)
  {
    TransformComponent* transform = transforms.GetComponent(entity);
    if (transform != nullptr)
    {
      transform->ApplyTransform();
    }
    hierarchy.Remove_KeepSorted(entity);
  }
}

So we look into the hierarchy array, if the entity is in it, we detach it by applying transformation (the current position, rotation, scale is converted to world matrix) and then removing the component from the hierarchy. Again, we need to remove it while keping the ordering sorted. For that, a Remove_KeepSorted() function was implemented:

void Remove_KeepSorted(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
  {
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
    {
      // Move every component left by one that is after this element:
      for (size_t i = index + 1; i < components.size(); ++i)
      {
        components[i - 1] = std::move(components[i]);
      }
      // Move every entity left by one that is after this element:
      for (size_t i = index + 1; i < entities.size(); ++i)
      {
        entities[i - 1] = entities[i];
        lookup[entities[i - 1]] = i - 1;
      }
    }

    // Shrink the container:
    components.pop_back();
    entities.pop_back();
    lookup.erase(entity);
  }
}

So we remove an element from the middle, but instead of swapping in the last one, we move every element after this one to the left, so ordering is preserved.

So we can attach, detach components now. All of it makes sense when we iterate the hierarchy and update transforms in the right order. This is the Hierarchy Update System:

void RunHierarchyUpdateSystem(
  const ComponentManager<HierarchyComponent>& hierarchy,
  ComponentManager<TransformComponent>& transforms,
  )
{
  for (size_t i = 0; i < hierarchy.GetCount(); ++i)
  {
    const HierarchyComponent& parentcomponent = hierarchy[i];
    Entity entity = hierarchy.GetEntity(i);

    TransformComponent* transform_child = transforms.GetComponent(entity);
    TransformComponent* transform_parent = transforms.GetComponent(parentcomponent.parentID);
    if (transform_child != nullptr && transform_parent != nullptr)
    {
      transform_child->UpdateTransform_Parented(*transform_parent, parentcomponent.world_parent_inverse_bind);
    }
  }
}

We iterate the hierarchy components, but also look up from transforms, and use the parent’s inverse bind matrix when updating the transform matrix.

This concludes the basic implementation of the ECS scene graph traversal. The Attach and Detach functions are more heavy than with a pointer based approach, but the traversal itself which is the commonly executed path, should be faster, if we have a flat hierarchy (eg. one parent has many children). If the hierarchy is more deep however, the pointer based tree traversal can beat the ECS implementation. This needs more investigation from me, but I was happy with keeping only the ECS hierarchy path for everything, instead of having also a pointer based scene graph, for example skeleton bones, which usually have deep hierarchy.

A closing side note: Even though hierarchy is contained in a linear array now, multithreading is still not possible with a parallel for loop in this specific case. Because the order of the components matter, the execution order also matters. You can still find ways to parallelize, for example by doing the whole hierarchy update on a separate thread, while also doing the material update system or other kind of unrelated systems in parallel with it.

Serialization

One of my main objectives was to efficiently serialize and deserialize the whole entity component system in the engine. The ComponentManager lends itself nicely for this with its linear arrays. We can just easily dump the array contents (entity array and component array) into a file and write them to disk. Loading them is also very convenient. There is only one issue. Entity serialization needs some extra care to keep them unique. We can no longer use simple monotonically increasing IDs for Entities. Instead, let’s generate random numbers:

#include <random>

std::random_device rand_dev;
std::mt19937 generator(rand_dev());

Entity CreateEntity()
{
  std::uniform_int_distribution<int>  distr(1, INT_MAX);
  return distr(generator);
}

The above code uses C++ random generator to generate uniformly distributed random numbers between 1 and INT_MAX. That range is enough so that we are unlikely to get collisions very soon (Note: it recommended to use 64-bit keys instead, that I discovered since writing this blog). But! there is an other problem as well: What if we want to load the same entity collection twice? Imagine that we serialized a model and saved to the disk. We want to load the model twice into the scene. The Entity numbers were randomized, but if we load them twice, they will no longer be unique!

I use the following solution: have a specific serialization post-process for Entities. We combine them with a random seed that is unique per deserialization pass. So instead of simply deserializing Entity like this:

Archive archive;
// ...
archive >> entity;

We use a slightly more complicated method:

uint32_t seed = get_random_number_once_per_serialization();
// ...
archive >> entity;
if (entity != INVALID_ENTITY && seed > 0)
{
  entity = ((entity << 1) ^ seed) >> 1;
}

Of course, I just combined that into a SerializeEntity(Entity entity, uint32_t seed); function. So I just make sure to always serialize Entities with that function.

Sometimes, you don’t want to process the entity IDs when serializing, for example when you don’t save them to disk. In that case, you can leave the seed as zero and entity IDs will remain unmodified.

I also gave the ComponentManager the ability to serialize itself, by having a Serialize() function. That will also accept the Entity seed. However, we must look out that in some cases, components can hold Enity members to reference something. This is the case with HierarchyComponent as discussed before, but other cases, like a Mesh Instance holding the Mesh Entity ID are possible too. If the ComponentManager’s Serialize function is used, then we must ensure that the Component array gets serialized and the components’ Entity reference members are seeded correctly. Thus, a Serialize(Archive archive, uint32_t seed) function will be necessary for Components in this case.

Well, it was a very long write up, I hope someone finds it useful. At least I kind of documented this thing for myself, so that’s already worth it.

My implementation of the ECS can be found here. It contains some additional functionality that was out of scope for this blog. The main use case is Scene management, most of the systems and components are implemented in wiScene.

Let me know in the comments if you find mistakes or have questions, or just simply want to discuss something!

Improved normal reconstruction from depth

In a 3D renderer, we might want to read the scene normal vectors at some point, for example post processes. We can write them out using MRT – multiple render target outputs from the object rendering shaders and write the surface normals to a texture. But that normal map texture usually contains normals that have normal map perturbation applied on them, which some algorithms don’t like. Algorithms that need to ray trace/ray march from the surface in multiple directions around the normal direction are not good fit with these normal mapped textures, because there is a disconnect between what is in the normal map and what is the real surface triangle normal. SSAO for example falls into this category. We want to sample rays in the normal-oriented hemisphere from the surface and check those rays against the depth buffer. The depth buffer however, contains flat triangle geometry, that doesn’t include the normal map sufrace perturbation. Because of this, rays can go below the surface that the ray is started from, and that will result in some unwanted darkening.

We could write out the flat surface normals with MRT, that would give us the perfect solution. But we probably don’t want to have yet an other render target to write in our object shader pass, to conserve memory bandwidth.

There are ways to reconstruct the flat surface normals from the depth buffer. The idea is that we want to find the normal of the triangle that the depth buffer pixel belongs to. We can find out the slope of the triangle by first computing the pixel world position from the depth buffer, than using ddx(), ddy() and cross() to determine the normal direction:

float depth = texture_depth.SampleLevel(sampler_point_clamp, uv, 0).r;
float3 P = reconstructPosition(uv, depth, InverseViewProjection);
float3 normal = normalize(cross(ddx(P), ddy(P)));

By the way, there are multiple ways to reconstruct world position from the depth buffer. One of those ways is by using the native depth buffer and the inverse view projection matrix that was used to render that depth:

float3 reconstructPosition(in float2 uv, in float z, in float4x4 InvVP)
{
  float x = uv.x * 2.0f - 1.0f;
  float y = (1.0 - uv.y) * 2.0f - 1.0f;
  float4 position_s = float4(x, y, z, 1.0f);
  float4 position_v = mul(InvVP, position_s);
  return position_v.xyz / position_v.w;
}

This solution to compute normals is only usable in a pixel shader. In a compute shader you don’t have access to ddx() and ddy() functions. What you can do instead is to compute world positions of nearby pixels, and use them as triangle corners. If you have the triangle corners, you can use cross() to compute the triangle normal, which gives you the flat normal of the current pixel:

float2 uv0 = uv; // center
float2 uv1 = uv + float2(1, 0) / depth_dimensions; // right 
float2 uv2 = uv + float2(0, 1) / depth_dimensions; // top

float depth0 = texture_depth.SampleLevel(sampler_point_clamp, uv0, 0).r;
float depth1 = texture_depth.SampleLevel(sampler_point_clamp, uv1, 0).r;
float depth2 = texture_depth.SampleLevel(sampler_point_clamp, uv2, 0).r;

float3 P0 = reconstructPosition(uv0, depth0, InverseViewProjection);
float3 P1 = reconstructPosition(uv1, depth1, InverseViewProjection);
float3 P2 = reconstructPosition(uv2, depth2, InverseViewProjection);

float3 normal = normalize(cross(P2 - P0, P1 - P0));

At this point, you might call it a day and use your flat normals. However, you might also notice some edge cases – both of these techniques literally fail on object edges. Even though different object might be at nearby pixels, we are only naively using the depth buffer and make the assumption that nearby samples are forming actual triangles that exist in the scene.

The image above is a screenshot of the reconstructed normal buffer of the Cornell Box scene. Most notably, the edges of the boxes and the silhouette of the bunny became outlined, but that shouldn’t be the case. If we zoom in we see better:

If you use these flat normals, you can get strong artifacts with SSAO on the object edges:

Notice the left edge of the boxes and the bunny. Those normals caused havoc on those parts. Interestingly, the right edges for example don’t show the artifact. But that depends on which nearby samples you select and reconstruct the triangle corners from, you could select triangles that would fail on right edges and not fail on left edges. In more complicated scenes however, with more color variety and not so huge edges, these artifacts could remain hidden.

So what’s the solution? We should select the triangle that will be most correct for our center sample. The most correct triangle is the one whose corners are most probably on the same object as the center sample. We can approximate that by for example comparing depths of nearby samples to the center sample and picking the ones that are closest, and forming a triangle from those.

An effective way to do this is to sample the pixel neighbors in a cross pattern. That will give us the center sample, the left from it, the one above it, the one on the right and finally the one below it. From these five samples, four triangles can be formed that we could be interested in:

We will select the best triangle corner of the horizontal cross axis, and one best corner from the vertical cross axis by comparing their depth values to the center sample and selecting the closest one. For the base triangle corner, we will always use the center sample. When we decided on the best P0, P1 and P2 triangle corners, use the cross product just like before to determine the triangle normal:

float3 normal = normalize(cross(P2 - P0, P1 - P0));

Watch out however for the triangle winding order. If you assume counter clockwise order like me, then stick with it for all candidate triangles! That means, triangles in the cross pattern can be formed like this for counter clockwise ordering:

triangle 0 = P0: center, P1: right, P2: up
triangle 1 = P0: center, P1: down, P2: right
triangle 2 = P0: center, P1: up, P2: left
triangle 3 = P0: center, P1: left, P2: down

And with this technique, those edge artifacts are gone:

Now we are taking a lot of samples per pixel and also doing complicated math for all neighbors (position reconstruction) and branching in the shader, so it will introduce perfromance penalty. The good thing is that it can be optimized more in a compute shader, with group shared memory. We can store the depths and reconstructed positions for a whole 8×8 pixel tile in shared memory efficiently. We will also have an extra border of pixels stored around the tile. Additionally, we can avoid storing depths and reconstructed positions separately, and instead just store reconstructed positions in view space – the view space positions’ Z value stores linear distance, which we can use instead of depth to determine the best triangle candidate. I will demonstrate the full technique in HLSL shader code.

First, we create the group shared memory for the 8×8 pixel tile + one pixel border:

static const uint POSTPROCESS_BLOCKSIZE = 8; // 8x8 pixel tile
static const uint TILE_BORDER = 1;
static const uint TILE_SIZE = POSTPROCESS_BLOCKSIZE + TILE_BORDER * 2;
groupshared float2 tile_XY[TILE_SIZE*TILE_SIZE]; // view space position XY
groupshared float tile_Z[TILE_SIZE*TILE_SIZE]; // view space position Z

The groupshared memory is storing view space positions. I separated the Z values into a deinterleaved array, because those will be loaded more frequently than XY components, when we determine the best corner points.

The compute shader is running in a 8×8 block size:

[numthreads(POSTPROCESS_BLOCKSIZE, POSTPROCESS_BLOCKSIZE, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID, uint groupIndex : SV_GroupIndex)
{
  // ... shader body continues here...

First thing, we will preload the tile shared memory with the border region with view space positions:

const int2 tile_upperleft = Gid.xy * POSTPROCESS_BLOCKSIZE - TILE_BORDER;
for (uint t = groupIndex; t < TILE_SIZE * TILE_SIZE; t += POSTPROCESS_BLOCKSIZE * POSTPROCESS_BLOCKSIZE)
{
  const uint2 pixel = tile_upperleft + unflatten2D(t, TILE_SIZE);
  const float2 uv = (pixel + 0.5f) * xPPResolution_rcp;
  const float depth = texture_depth.SampleLevel(sampler_linear_clamp, uv,0);
  const float3 position = reconstructPosition(uv, depth, InverseProjection);
  tile_XY[t] = position.xy;
  tile_Z[t] = position.z;
}
GroupMemoryBarrierWithGroupSync();

Some things to note from the above code block:
xPPResolution_rcp is a shader constant, containing the postprocess texture resolution reciprocal. That is: 1.0f / resolution.xy
– I am using InverseProjection matrix for the reconstructPosition. That will give the view space position, instead of world space position if I were using InverseViewProjection matrix.
– The unflatten2D() function converts 1D array index to 2D array index, and looks like this:

uint2 unflatten2D(uint idx, uint2 dim)
{
  return uint2(idx % dim.x, idx / dim.x);
}

We must synchronize at the end with GroupMemoryBarrierWithGroupSync() to wait all threads to finish populating its part of the group shared memory.

Next thing is to load the depth/view space Z values in the cross pattern and perform the comparisons:

const uint cross_idx[5] = {
  flatten2D(TILE_BORDER + GTid.xy, TILE_SIZE),			// 0: center
  flatten2D(TILE_BORDER + GTid.xy + int2(1, 0), TILE_SIZE),	// 1: right
  flatten2D(TILE_BORDER + GTid.xy + int2(-1, 0), TILE_SIZE),	// 2: left
  flatten2D(TILE_BORDER + GTid.xy + int2(0, 1), TILE_SIZE),	// 3: down
  flatten2D(TILE_BORDER + GTid.xy + int2(0, -1), TILE_SIZE),	// 4: up
};

const float center_Z = tile_Z[cross_idx[0]];

const uint best_Z_horizontal = abs(tile_Z[cross_idx[1]] - center_Z) < abs(tile_Z[cross_idx[2]] - center_Z) ? 1 : 2;
const uint best_Z_vertical = abs(tile_Z[cross_idx[3]] - center_Z) < abs(tile_Z[cross_idx[4]] - center_Z) ? 3 : 4;

Here, the flatten2D() function is resposible for converting from 2D array index to 1D array index. It can be implemented like this:

uint flatten2D(uint2 coord, uint2 dim)
{
  return coord.x + coord.y * dim.x;
}

At the end of this, we will know which neighbor is the best in horizontal direction (best_Z_horizontal) and which is best in vertical direction (best_Z_vertical). The horizontal can be either 1 (right) or 2 (left), while the vertical can be 3 (down) or 4 (up).

Finally, determine the best triangle corners with branching and create the triangle normal:

float3 P1 = 0, P2 = 0;
if (best_Z_horizontal == 1 && best_Z_vertical == 4)
{
  P1 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
  P2 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
}
else if (best_Z_horizontal == 1 && best_Z_vertical == 3)
{
  P1 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);
  P2 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
}
else if (best_Z_horizontal == 2 && best_Z_vertical == 4)
{
  P1 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
  P2 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
}
else if (best_Z_horizontal == 2 && best_Z_vertical == 3)
{
  P1 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
  P2 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);
}

const float3 P0 = float3(tile_XY[cross_idx[0]], tile_Z[cross_idx[0]]);
const float3 normal = normalize(cross(P2 - P0, P1 - P0));

Of course, computing the normals in view space means that you will either need to convert them to world-space, or do the SSAO in view space. I chose to do the rest of the shader in view space instead of converting, and reuse the view space position and normal for shooting the rays in view space.

And the normal buffer result is – although not perfect – is closer to what we want. There are still some errors along the edges, but in a reduced amount, and they don’t interfere with the SSAO:

However, I’ve had a slight mistake here, because the SSAO postprocess resolution is half of the depth buffer resolution. When I match the postprocess resolution with the depth buffer resolution, the normal reconstruction is actually nearly perfect (there are a few rogue pixels here and there on sharp corners, but you have to look for them really hard):

And this is with matching resolution, but not using the improvement, the outline is still present and will mess up the SSAO:

However, I don’t have to match their resolution in practice, because the SSAO results were already good with the improved reconstruction. At the end of the day, implementing the reconstruction improvement was worth it.

Well, that’s it, I hope you found it useful. I think it was definitely useful for me at least to document this for myself for future reference. 🙂

Take a look at SSAO implementation in Wicked Engine using this technique and feel free to use it for whatever you want.

I also saw this improvement technique described in a stackoverflow answer.

This is an other technique for normal reconstruction from Yuwen Wu. It discusses and solves a corner case that was not described here.

Mistakes, typos, questions? Let me know in the comments.

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparison between the two techniques for each point follow each other immediately, but decided against it, and instead there are two big blocks: stream compaction technique, then the flat bit array technique following it. The points are in similar order for both, so comparison could be easily made by having two browser windows opened for example. It wouldn’t be so simple to do the opposite thing.)

Stream compaction

Idea is based on Johan Andresson’s 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

  • Each tile contains list of lights. Each entry is one uint that is an index into light array. So each tile contains max_lightcount_per_tile number of uints. For me it was 256 entries per tile, and decided at compile time.
    • Possible to compact even further in a way that the average light count per tile is specified (instead of max) when allocating tile buffer and have one more level of indirection. The other indirection here is introducing a “light grid” buffer/texture that would contain the offset and count into the “light index buffer”, while the “light index buffer” contains the light indices in a very compact stream (next tile list’s first entry immediately follows the last light index in the previous tile). This way one tile can contain an unbounded number of lights (well, still bounded on the global tile buffer allocation size). The construction of such structure is complicated and not worth the trouble in my experience (it uses even more atomics). You can see such technique presented in [this other blog]. I have implemented this technique, but in the end I chose to keep the hard coded max items per tile and have one less level of indirection.
  • Culling shader performs stream compaction, that is, using a light index list in groupshared memory and an atomic counter to append lights to the list. Something like this:
groupshared uint ArrayLength = 0;
groupshared uint Array[256];

void AppendLight(uint lightIndex)
{
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
    {
        Array[index] = lightIndex;
    }
}
  • Stream compaction can be scalarized with wave intrinsics so that 64x less amount of atomic operations are performed. The idea here is that we will have a per-wavefront bitmask containing set bits for all lanes that wanted to append. This is retrieved by WaveActiveBallot(IsLightCulledInCurrentThread()). This bitmask first gives us how many lanes wanted to append (countbits() function), and the first lane does an InterlockedAdd() with this value to the groupshared light counter. The rest of the lanes don’t do the atomic, so atomics are significantly reduced. The first lane’s result of the InterlockedAdd() returns the wavefront’s offset to the groupshared light index list and broadcasted to the other lanes via WaveReadLaneFirst() intrinsic function. The rest of the lanes need to just write their light index to the list, starting from the broadcasted offset, and doing a prefix sum on the ballot bitmask (WavePrefixSum()) that will give the real offset to the groupshared light index list.
  • Stream compaction is unordered because the item append order relies on one atomic counter. That is fine for lights which use additive blending where the order of operations doesn’t matter for the final output. But once we have decals or environment probes, those are not blended additively, so the undefined order of the items would result in flickering between tiles. To avoid that, we can do a bitonic sort in the compute shader. Doing a bitonic sort in the compute shader is a heavy weight operation, but we can keep track if there were any decals or probes in the tile and only do the sorting if there are. A bitonic sort that runs in a single thread group looks like this (from Gareth Thomas’s tile based particle rendering talk):
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
{
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
    {
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
        {
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
			
            if( nSwapElem < numArray && index < numArray )
            {
                if( Array[ index ] < Array[ nSwapElem ] )
                {
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
                }
            }
            GroupMemoryBarrierWithGroupSync();
        }
    }
}
  • The best culling granularity in my experience turned out to be having 16×16 pixel tiles and 16×16 threads in the culling threadgroup. A large thread group is beneficial, because we will have a large groupshared memory reservation, and the bitonic sort works best with a large thread count. Remember that even though we are only interested in the decal and probe ordering, we have to sort the whole entity array, including lights, because lights are in the same array as decals and probes. A large thread group and large groupshared memory usage means reduced occupancy of the shader, so parallel execution between thread groups is reduced.
  • We are still only in the culling shader phase, and an other difference is how the groupshared memory (also called LDS or local data share) is initialized at the beginning. The depth range min-max and the 2.5D culling mask are initialized, and the counter that is responsible to insert the light indices into the array is zeroed out. The LDS light array doesn’t need to be initialized.
  • At the end of the culling shader, we export the light list by having each thread export one light index. Also, the first thread exports a “tile statistics” uint that says how long is the list, how many decals,probes and lights are inside it.

Rendering phase:

  • We process the light list in the forward+ object shader or the tiled deferred compute shader trivially. We read the “tile statistics” uint first, and determine the decals count, probe count and light count in the tile. The following uints are indices into the global entity array that the camera can see, so just iterate through all of them until we reach the last one. Having the decal and probe count is handy, because for some objects, or particles, etc. we don’t want to apply decals, so we can just skip through them and start with the first light index instead. Also, if we process decals, and the top decal fully covers the ones under it, we can skip processing them and continue with the lights.
  • We have high divergence if we process like this inside a pixel shader (forward+), but we can help a bit by using wave intrinsics to scalarize it. It is not trivial and in the worst case it can lead to redundant processing of items. See more info in Drobot’s presentation, the section about “hierarchical container scalarization”.
  • The good thing about processing a compact light index list is that it is compact, so every light index that we process will be contributing to the tile.
Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

  • Each tile contains bit buckets. One bit bucket is a 32-bit uint, that is responsible to index up to 32 lights. So each tile contains max_lightcount_in_view/32 number of uints. That means that for the same amount of max lights per tile as stream compaction method, we use 32x less memory.
  • Culling shader simply computes which bucket can index the currently processed light, then computes which bit maps to the light index, then performs atomic OR operation to the groupshared light bucket list. The groupshared light bucket list is max_lightcount_in_view/32 number of units. The following code piece demonstrates this:
groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
{
    const uint bucket_index = entityIndex / 32;
    const uint bucket_place = entityIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
}
  • Bitmask insertion could probably be scalarized in a way that no atomic operations would be necessary. I imagine we would just do a WaveActiveBitOr() instead of InterlockedOr() on the bucket, and if it is the first lane (WaveIsFirstLane()) it would just write the bucket without using atomics. (unfortunately I can’t get sm6.0 working on my current laptop, so I haven’t tried this yet)
  • Bitmask insertion is ordered, we will always get a deterministic order of items, which is the same order as the CPU assembled the global light array. If we have decals and environment probes, we only ensure that they are culled and added to the global light list (or should we say entity list at this point) before the lights (on the CPU). The reason is, because we want to apply decals before applying lights, so that they are lighted just like the surface underneath. Environment probes can be applied before or after the lights, because they will be added to the specular result. However, ordering between locally placed probes does matter, because they are blended amongst themselves with alpha blending. So in the end, we don’t have to sort in the culling shader with the flat bit array technique. That is a big relief and greatly simplifies the shader.
  • The best culling granularity for flat bit arrays turned out to be having 16×16 pixel tiles, however we can also reduce the thread group size to be 8×8 for example to have better occupancy, because there is no sort and little LDS use compared to the compaction technique. But then some shader complexity is added, because each thread will be responsible to work on a small subgroup of the tile. Consider that with 16×16 pixels and 16×16 threads, each thread can work on a single pixel when reading the depth buffer and creating the depth bounds and maybe a 2.5D culling bitmask (Takahiro Harada’s 2012 talk in Siggraph Asia, and also my blog about it). But when reducing the thread count to 8×8, each thread must load 2×2 pixel block instead. However, this also reduces the atomic contention for computing the min-max depth bounds or the 2.5D bitmask (4x times less atomic operations for these phases), because the operations inside the subgroup are performed as non atomic. The whole tile result then will be the result of atomic operations between the subgroups, so in the end that just means less atomics were performed. We don’t have to modify the culling part of the shader however, because that doesn’t work on a thread/pixel granularity, but thread/entity granularity instead. The last thing here worth to mention is that 8×8 thread group size seems like a perfect fit for AMD GCN architecture, which has 64 threads in a wavefront, so the barriers in the compute shader will be even omitted completely. On Nvidia, the wavefront (warp) size is 32 threads, but modifying the thread size to be for example 8×4 threads would also require redesigning how subgroups are processed, which I didn’t tackle for now. It would be an interesting experiment to see how different subgroup distributions would perform. In my implementation, a thread is operating on a 2×2 pixel block, but that could be rewritten so that a thread operates on a row of 4 pixels or something else. Maybe a 8×8 pixel block should be processed first, then move on to the next 8×8 pixel block, and do this 2×2 times. Need to try these and compare. An other idea here: the 2×2 pixel block is also nice for gathering from the depth buffer, we could use one GatherRed() instruction to load 4 samples for the whole subgroup, however that will require a custom code path that won’t work if we choose an other kind of distribution.
  • We don’t have a light array counter in the flat bit aray technique, but instead all the light buckets need to be set to zero, indicating that no lights are contained yet. This is done by each thread just zeroing out one bucket (one uint) until all of them are zeroed out. Apart from that, the common things such as tile depth range are also initialized like always.
  • At the end of the culling shader, each thread exports one bucket, until all buckets were written to global memory. There is no need to have a “tile statistics” entry, because the bitfield directly corresponds to all entities inside the main camera. So the tile statistics can be replaced with just a constant buffer value that contains how many decals/envprobes/lights are inside the main camera.

Rendering phase:

  • Processing the flat bit arrays relies on the firstbitlow() instrinsic in HLSL that gives us the first set bit’s index in the uint bucket, starting from the lowest bit. If the bucket is zero, we continue with the next bucket (thus skipping over 32 non-visible lights), otherwise call firstbitlow(bucket), compute the light index from the bucket index and the bit index, then process the light. Then remove the bit and continue to get the next first low bit until the bucket is not zero. If we want to skip decals, we can do that here too: we can trivially start from the specific bucket’s specific bit that we want to start iterating on (for example if we want to leave out the decals but there are 42 decals inside the camera (first light in entity 43), we can start processing from the second bucket, and mask off the first 10 bits (42-32=10). Here is a pseudo-code how we can extract light index bits from the buckets:
// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because tile buffer is linear array:
//  g_tileCount = ScreenResoulution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for(uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
{
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while(bucket_bits != 0)
    {
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        
        Light light = LightArray[entity_index];
        // process light here...
    }
}
  • Processing these in pixel shaders for forward+ will still yield high VGPR usage because of divergent access to the buckets. However, the scalarization of the buckets themselves is really elegantly done with a single line: bucket_bits = WaveReadLaneFirst(WaveActiveBitOr(bucket_bits)); After the buckets are scalarized, loading from the global light array (entity array) is also scalar.
  • A thing to note about flat bit arrays, is that there can be “holes” in the buckets, so let’s say lightIndex 2 and lightIndex 987 are the only ones contained inside the tile, then the buckets between bucket 0 and bucket 30 are zero, but they can’t be detected in a simple way, so the buckets will be read but not processed. With an other level of indirection, this effect could be reduced, for example having some coarse tiles that are culled with stream compaction, then the finer tile bit buckets would contain indices to those coarse tile light lists. I don’t expect this to be a huge problem, so I haven’t gone there. And it would require the sorting again too. Also, as pointed out by Matt Pettineo on Twitter, the indirection doesn’t have to use stream compaction, but could be an other bitmask, where each bit represents one bucket of 32 entities.
Light culling in Bistro

Results:

I’ll compare some results on the bistro scene. The configuration is for GPU times and measured with timestamp queries:

  • Common:
    • Nvidia GTX 1050 laptop GPU
    • Bistro scene (inside + outside)
    • 1000 lights, 0 decals, 0 probes
    • Forward+ rendering
    • 4096 max lights in frustum
    • non-scalarized versions (DX11)
  • Stream compaction method:
    • 255 max lights per tile (255 lights + 1 tile statistics)
    • light culling (4k): 5.6 ms
    • light culling (1080p): 1.4 ms
    • opaque geometry (4k): 43.6 ms
    • opaque geometry (1080p): 13 ms
  • Flat bit array method:
    • 4096 max lights per tile (4096 / 32 = 128 buckets per tile)
    • light culling (4k): 2.5 ms
    • light culling (1080p): 0.64 ms
    • opaque geometry (4k): 41.7 ms
    • opaque geometry (1080p): 12.6 ms

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

An other small test I made was for the decals, and I just used a simple cornell box, then placed 100 decals into one spot, overlapping with each other, 4k resolution and covering a small part of the screen:

  • Test: 100 small overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 3.63 ms
    • opaque geometry: 3.67 ms
  • Flat bit arrays method:
    • decal culling: 1.33 ms
    • opaque geometry: 2.96 ms
that was the 100 small decals, red tile = 100 decals, black = 0 decals

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

  • Test: 100 full screen overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 17.82 ms
    • opaque geometry: 84.57 ms
  • Flat bit arrays method:
    • decal culling: 2.12 ms
    • opaque geometry: 71.48 ms

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

  • One interesting thing I noticed is that I first had one huge loop, and the decal + probe + light evaluation were all included in that single loop. Performance was horrible, especially on the Intel 620 GPU. It was something like +10 ms for opaque geometry pass on a scene with 4 big shadow casting lights, even if no decals or probes were processed at all. After that, I broken up the loop into 3 loops, one for decals, one for probes and one for lights. The performance jump was very noticable. This applies for both techniques, try to break up entities that are processed differently to achieve the best performance. This is due to the compiler can manage to do a better job reserving and reusing registers, so in the end you can have a reduced amount of registers used, even if there are more loops. This is because register lifetime can be constrained to loop scope (if they are not unrolled).
  • An other thing related to the Intel 620 GPU was the annoying fact that I noticed some small flickering of the tiles with the stream compaction technique that only happened rarely, even with a completely static scene. I could never figure it out, but it doesn’t happen with the flat bit array technique. Now I suspect that it might be related to the thread group size, because I also had bugs with tiled deferred compute shader if it was running with 16×16 thread group size and using a RW texture.
  • After getting to know the flat bit array technique, I also implemented a similar one for old-school forward rendering path. For that, the CPU can cull lights per object and provide a per draw call bitmask in a constant buffer. The bitmask is a uint4, where 1 uint represents 32 decals, 1 uint represents 32 probes, and 2 uints allow 64 lights to be evaluated for forward rendered objects. I never wanted to do this previously, because then a big constant buffer needs to be updated with light indices per draw call. An other candidate how to send this bitmask to the shader would be the instance buffer, but having it in the constant buffer means scalarized loads (and also don’t have to propagate the mask from vertex to pixel shader – less parameters to send is good to avoid parameter cache bottleneck). Unfortunately, culling efficiency is reduced if many instances are visible, because a constant buffer will be per draw call granularity instead of per instance.
  • It will be also interesting to try not using a culling compute shader, but rely on conservative rasterization (or MSAA), and draw the light proxies instead, then those would do atomic writes into the tile buffer in the pixel shader. Michal Drobot suggested that this can be faster than running the compute shader and doing software culling, because rasterization will make use of the zbuffer, and zbuffer tile acceleration (coarse conservative z-test in the hardware that supports it, not user controlled). A minor issue with this might be that not everything supports conservative rasterization or forced sample count setting for MSAA today (conservative rasterization can be approximated with MSAA).

Implementation samples:

As always, you can find my implementation for these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will updated, these will remain like this to represent the state of this blog:

Some other light culling resources:

Simple job system using standard C++

After experimenting with the entity-component system this fall, I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus using cache prefetching when iterating), and also making it easier to write parallelized code (because data dependecies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system, but generally it should perform well for any large data processing task. I wanted to remain in standard C++ (~11) realm, which I found that it is entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

  1. Interface – how do we want to use it (I hope you’re familiar with lambdas)
  2. Implementation – let’s see the code
  3. Performance – I guees this would be pointless without it
  4. Windows platform specific extension – optional, if you can still keep going
  5. Closure – all must come to an end

Interface
I think the best way to get my point across would be to just post the interface code here, because it is very short and make a few comments:

#include <functional>

// A Dispatched job will receive this as function argument:
struct JobDispatchArgs
{
	uint32_t jobIndex;
	uint32_t groupIndex;
};

namespace JobSystem
{
	// Create the internal resources such as worker threads, etc. Call it once when initializing the application.
	void Initialize();

	// Add a job to execute asynchronously. Any idle thread will execute this job.
	void Execute(const std::function<void()>& job);

	// Divide a job onto multiple jobs and execute in parallel.
	//	jobCount	: how many jobs to generate for this task.
	//	groupSize	: how many jobs to execute per thread. Jobs inside a group execute serially. It might be worth to increase for small jobs
	//	func		: receives a JobDispatchArgs as parameter
	void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job);

	// Check if any threads are working currently or not
	bool IsBusy();

	// Wait until all threads become idle
	void Wait();
}

How this would work? First, we would call JobSystem::Initialize() function somewhere in our application once (before we want to use the job system for the first time around). Probably the best place is when we initialize the application and before doing any resource loading, because the job system can also handle resource loading efficiently.
Then the simplest way we can make a job is by calling JobSystem::Execute(myTask); where myTask is just a simple function without parameters. I use it for example when initializing engine systems, like this:

JobSystem::Execute([] { /*do stuff 1*/ InitGraphics(); });
JobSystem::Execute([] { /*do stuff 2 which is not realted to stuff 1*/ InitPhysics(); });
JobSystem::Execute([] { /*do stuff 3 which is not related to the previous stuff*/ InitInput(); });
JobSystem::Execute([] { /*do stuff 4 which is not related to the previous stuff*/ InitNetwork(); });
JobSystem::Execute([] { /*you probably already get the idea*/ InitSound(); });

The above code creates 5 jobs, which are already probably utilizing 5 worker threads (if the Initialize() created that many workers, but later on that).
Then we can call JobSystem::Wait(); to simply wait for all the jobs to finish, before doing anything that depends on their completion. For example this code would utilize 3 threads:

JobSystem::Execute([] { /*do stuff 1*/ });
JobSystem::Execute([] { /*do stuff 2*/ });
JobSystem::Wait();

Where the main thread is responsible to notify two worker threads that there are some jobs, and then wait. It also means that overall we are doing extra bookkeeping work by the main thread instead of just doing the jobs one by one. But this should be fine when looking at the big picture, because some work is a good fit to be executed in parallel (if there is no dependency between them). We should only do jobs in parallel if they are a good fit, a lot of times it is still just better to do it on the main thread. Oh and by dependency, I mean if there is any variable that is being written by a job and read by an other, then there is a dependency.

The IsBusy() function is a way for the main thread to just see if any worker threads are busy. It can be used if we don’t actually want to wait for work to finish, and the main thread has other meaningful work to do. For instance, drawing a loading screen is a good example.

The last one is the Dispatch function. This is the most interesting and most useful one in my opinion. I was inspired by how compute shaders work on the GPU, and wanted really bad to use something like that on the CPU. You can think of it like the Execute function, but it actually creates multiple jobs instead of one (jobCount argument). Let’s skip the groupSize argument now and return to it shortly. The third argument is the job itself that will receive a JobDispatchArgs argument that will identify which instance of the job is currently running, and which group (hang in there). To demonsrate how it will be used, look at this loop iterating some large dataset:

const uint32_t dataSize = 1000000;
Data dataSet[dataSize];
for(int i = 0; i < dataSize; ++i)
{
	dataSet[i].Compute(i);
}

And let’s rewrite the same thing but this time using Dispatch, to offload it to multiple threads:

const uint32_t groupSize = 1;
JobSystem::Dispatch(dataSize, groupSize, [&dataSet] (JobDispatchArgs args) {
	dataSet[args.jobIndex].Compute(args.jobIndex);
});

The above code does the same thing as the loop, but now several worker threads each operate on a single element of dataSet. Here is when the groupSize argument comes into play. We actually don’t want each thread to only work on a single element. Not only because of caching, but then 1000000 jobs need to be generated that would put a significant burden on the main thread which is creating the jobs. Instead, we should specify let’s say const uint32_t groupSize = 1000; or something like that. That means, one thread will operate on a batch of 1000 elements of the dataSet. In every case that we are putting a Dispatch in our application, the groupSize must be considered. Just like we have to when we are writing compute shaders – which are really similar to this and have the same purpose: performing work over a large dataset, where data is mostly independent of other data in the set. If our dataSet has elements with dependencies, then it might not be a good candidate – unless we can think of a clever way to utilize the groupSize argument. Because the thing with groupSize, is that with groupSize = 1000, it means that 1000 sub-jobs will run serially on the same thread – then there can be dependencies inside the group. But here it starts to get out of topic, so I recommend to get experienced in compute shaders and using groupshared memory, because the same principles could be applied here. I won’t go into handling groupshared memory in this post.

Now that we see how we want to use the job system, let’s implement the internal workings of it, using standard C++ constructs.

Implementation
I will go over this mostly in the order I described the interface. But before getting into the first function which is Initialize(), define the internal state and include files, stuff like that:

#include "JobSystem.h"    // include our interface

#include <algorithm>    // std::max
#include <atomic>    // to use std::atomic<uint64_t>
#include <thread>    // to use std::thread
#include <condition_variable>    // to use std::condition_variable

namespace JobSystem
{
	uint32_t numThreads = 0;    // number of worker threads, it will be initialized in the Initialize() function
	ThreadSafeRingBuffer<std::function<void()>, 256> jobPool;    // a thread safe queue to put pending jobs onto the end (with a capacity of 256 jobs). A worker thread can grab a job from the beginning
	std::condition_variable wakeCondition;    // used in conjunction with the wakeMutex below. Worker threads just sleep when there is no job, and the main thread can wake them up
	std::mutex wakeMutex;    // used in conjunction with the wakeCondition above
	uint64_t currentLabel = 0;    // tracks the state of execution of the main thread
	std::atomic<uint64_t> finishedLabel;    // track the state of execution across background worker threads

	// ...And here will go all of the functions that we will implement
}

Mostly everything here is using standard C++, except the jobPool. It can also be done with a std::deque, but then also make it thread safe by using mutexes or some kind of synchronization primitives. Or if you are interested, here is my very simple thread safe queue (or ring buffer) implementation:

// Fixed size very simple thread safe ring buffer
template <typename T, size_t capacity>
class ThreadSafeRingBuffer
{
public:
	// Push an item to the end if there is free space
	//	Returns true if succesful
	//	Returns false if there is not enough space
	inline bool push_back(const T& item)
	{
		bool result = false;
		lock.lock();
		size_t next = (head + 1) % capacity;
		if (next != tail)
		{
			data[head] = item;
			head = next;
			result = true;
		}
		lock.unlock();
		return result;
	}

	// Get an item if there are any
	//	Returns true if succesful
	//	Returns false if there are no items
	inline bool pop_front(T& item)
	{
		bool result = false;
		lock.lock();
		if (tail != head)
		{
			item = data[tail];
			tail = (tail + 1) % capacity;
			result = true;
		}
		lock.unlock();
		return result;
	}

private:
	T data[capacity];
	size_t head = 0;
	size_t tail = 0;
	std::mutex lock; // this just works better than a spinlock here (on windows)
};

It is templated, so you can also use it for any type of data. The nice thing about it is that is very simple, only uses a fixed size memory allocation. It can’t reallocate memory, doesn’t do anything clever, so performance might be more reliable than std::deque. Also, when an element doesn’t fit because there is no more space, it returns false on push_back, or if it is empty and you want to remove an element, it returns false on pop_back. if you got false, you can just try again and maybe that time, it will be successful, because an other thread grabbed/put a job onto it.

Now we have everything we need, except the actual functions. Here is what Initialize() looks like:

void Initialize()
{
	// Initialize the worker execution state to 0:
	finishedLabel.store(0);

	// Retrieve the number of hardware threads in this system:
	auto numCores = std::thread::hardware_concurrency();

	// Calculate the actual number of worker threads we want:
	numThreads = std::max(1u, numCores);

	// Create all our worker threads while immediately starting them:
	for (uint32_t threadID = 0; threadID < numThreads; ++threadID)
	{
		std::thread worker([] {

			std::function<void()> job; // the current job for the thread, it's empty at start.

			// This is the infinite loop that a worker thread will do 
			while (true)
			{
				if (jobPool.pop_front(job)) // try to grab a job from the jobPool queue
				{
					// It found a job, execute it:
					job(); // execute job
					finishedLabel.fetch_add(1); // update worker label state
				}
				else
				{
					// no job, put thread to sleep
					std::unique_lock<std::mutex> lock(wakeMutex);
					wakeCondition.wait(lock);
				}
			}

		});
		
		// *****Here we could do platform specific thread setup...
		
		worker.detach(); // forget about this thread, let it do it's job in the infinite loop that we created above
	}
}

So, in case the comments didn’t do it justice, let me recap: First, we initialize the worker-side label state, but don’t worry about it yet, it will make sense later (hopefully). Then C++ has the neat hardware_concurrency function that can give us how many cores (incuding hyperthreaded virtual cores) our CPU has, so we create that many worker threads in the for loop. Each thread is created with an infinite loop, that it will be doing until our application exits. But most of the time, each thread will just sleep (until the main threads wakes it up), to not spin the CPU wastefully.

Let’s see the Execute function next:

// This little helper function will not let the system to be deadlocked while the main thread is waiting for something
inline void poll()
{
	wakeCondition.notify_one(); // wake one worker thread
	std::this_thread::yield(); // allow this thread to be rescheduled
}

void Execute(const std::function<void()>& job)
{
	// The main thread label state is updated:
	currentLabel += 1;

	// Try to push a new job until it is pushed successfully:
	while (!jobPool.push_back(job)) { poll(); }

	wakeCondition.notify_one(); // wake one thread
}

Sorry if you got distracted by the poll function above, I did a function for it because it will be used in more places. I will explain, but first, ket’s digest the Execute function: The main thread’s label is incremented. This means, that one job is being submitted, so the JobSystem only becomes idle when the worker label reached the same value. In the previous Initialize function, you can see that in the infinite loop, after a job is executed, the worker label state is also incremented by one. But there it is an atomic operation, because all the workers will be updating that single label.
Then we just push the new job onto the end of the jobPool. As I said, my ringbuffer could be full, so we will try to push the job until it actually succeeds. If it doesn’t succeed, then we invoke the poll function, which will wake up one worker, and yield the main thread. Doing this, we can be sure that the jobPool will eventually be drained by the workers, and the main thread can then schedule the current job.
But maybe, the job was just pushed fine the first time around, and the poll() never executed. In that case, we just wake up any one worker thread that will start working on the job immediately.

I hope I am not putting anyone to sleep yet. Let’s follow with the IsBusy and Wait functions:

bool IsBusy()
{
	// Whenever the main thread label is not reached by the workers, it indicates that some worker is still alive
	return finishedLabel.load() < currentLabel;
}

void Wait()
{
	while (IsBusy()) { poll(); }
}

You can see how we are checking the state of the main thread and the state of the workers in the IsBusy function. This is also used by the Wait function if we just want to put the main thread to idle and drain all the pending jobs.

Now, with implementing the Dispatch function, you will get full insight into the Job System’s inner workings. This is the most complicated of them all, but if you look at it in a way, it is not very different from the Execute function, just does the same thing multiple times with some little twists:

void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job)
{
	if (jobCount == 0 || groupSize == 0)
	{
		return;
	}

	// Calculate the amount of job groups to dispatch (overestimate, or "ceil"):
	const uint32_t groupCount = (jobCount + groupSize - 1) / groupSize;

	// The main thread label state is updated:
	currentLabel += groupCount;

	for (uint32_t groupIndex = 0; groupIndex < groupCount; ++groupIndex)
	{
		// For each group, generate one real job:
		auto& jobGroup = [jobCount, groupSize, job, groupIndex]() {

			// Calculate the current group's offset into the jobs:
			const uint32_t groupJobOffset = groupIndex * groupSize;
			const uint32_t groupJobEnd = std::min(groupJobOffset + groupSize, jobCount);

			JobDispatchArgs args;
			args.groupIndex = groupIndex;

			// Inside the group, loop through all job indices and execute job for each index:
			for (uint32_t i = groupJobOffset; i < groupJobEnd; ++i)
			{
				args.jobIndex = i;
				job(args);
			}
		};

		// Try to push a new job until it is pushed successfully:
		while (!jobPool.push_back(jobGroup)) { poll(); }

		wakeCondition.notify_one(); // wake one thread
	}
}

Take a deep breath, and let’s tackle this slowly. First, the jobCount (which you can think of as the “loop size”), can be a dynamic parameter, so if it is 0, just return immediately, because we don’t want to shedule 0 jobs.
Then the groupCount is calculated. Don’t confuse this with groupSize, that you specified as an argument to Dispatch. The groupCount is calculated from the jobCount and groupSize and tells us how many worker jobs (or groups) will be put onto the jobPool. There is an integer divide, but we must overestimate it, because let’s say that we want to have jobCount=3, groupSize=2. From that we want to create 2 actual worker jobs (groups), but the integer divide of jobCount/groupSize would give us only 1 group. We must instead schedule 2 groups, each with 2 jobs. But wait, that would be 4 jobs then, even though we specified jobCount=3. We will just discard the out of bounds jobs later.
Then the main thread label is updated with adding in the groupCount. That is how many workers must finish from this point for the JobSystem to become idle.
The following for loop goes over every group, creates a worker job for it with a lambda, passing in the required parameters to populate the JobDispatchArgs which our dispatched job needs to receive. That lambda is the one that will be executed by a worker thread. It contains logic to populate a JobDispatchArgs and run through all inner sub-jobs while making sure that no extra out of bounds iterations are performed. I think this might be the most complicated part if you are going to try to wrap your mind around someone else’s code. Hang in there!
After that group-worker-lambda thing, we do just like the Execute function, try to schedule the group and get the worker threads working immediately on it.

I’m not sure if I made sense of all this, but at this point the JobSystem is ready to be used actually! Shall we take a look of what kind of performance improvements we can expect right off the bat, then continue reading:

Performance
The first thing I tried, was that I started to use the JobSystem::Execute() in my Wicked Engine when initializing the main engine systems, so each system initialization would be performed in its own execute. The system initialization tasks were large independent tasks, operating on some arbitrary/different data, a good fit for the Execute function. We can make up a test case if we just Execute Spin functions that will perform infinite loop for some time and not let the CPU rest until the time passed. Let me give you some helpers for spin and a simple timer that you can use too in a small console application:

#include "JobSystem.h"

#include <iostream>
#include <chrono>
#include <string>

using namespace std;

void Spin(float milliseconds)
{
	milliseconds /= 1000.0f;
	chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
	double ms = 0;
	while (ms < milliseconds)
	{
		chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
		chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(t2 - t1);
		ms = time_span.count();
	}
}

struct timer
{
	string name;
	chrono::high_resolution_clock::time_point start;

	timer(const string& name) : name(name), start(chrono::high_resolution_clock::now()) {}
	~timer()
	{
		auto end = chrono::high_resolution_clock::now();
		cout << name << ": " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << " milliseconds" << endl;
	}
};

int main()
{
	JobSystem::Initialize();

	// Serial test
	{
		auto t = timer("Serial test: ");
		Spin(100);
		Spin(100);
		Spin(100);
		Spin(100);
	}

	// Execute test
	{
		auto t = timer("Execute() test: ");
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Wait();
	}

    return 0;
}

The Spin function just emulates a working CPU for a set amount of milliseconds, the timer struct is a scoped timer which will record how much time has passed until the destructor was called – so I just made two scopes for the two tests in the main function. The results are for me:

Serial test: : 400 milliseconds
Execute() test: : 101 milliseconds

The machine I currently ran this test on is a dual core, but has 4 hardware threads, so the results make sense. If I increase the jobs to 6, the times are as following:

Serial test: : 600 milliseconds
Execute() test: : 200 milliseconds

The reason for this is that at most 4 jobs were executing in parallel, then the last two remaining jobs took over after that, also in parallel to each other. Looks like the results look like we intended!

Now, let’s test the Dispatch performance. I came up with this quick test case:

struct Data
{
	float m[16];
	void Compute(uint32_t value)
	{
		for (int i = 0; i < 16; ++i)
		{
			m[i] += float(value + i);
		}
	}
};
uint32_t dataCount = 1000000;

// Loop test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("loop test: ");

		for (uint32_t i = 0; i < dataCount; ++i)
		{
			dataSet[i].Compute(i);
		}
	}
	delete[] dataSet;
}

// Dispatch test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("Dispatch() test: ");

		const uint32_t groupSize = 1000;
		JobSystem::Dispatch(dataCount, groupSize, [&dataSet](JobDispatchArgs args) {
			dataSet[args.jobIndex].Compute(1);
		});
		JobSystem::Wait();
	}
	delete[] dataSet;
}

Notice that I create the dataSet two times, once for each test. This is important to avoid reusing cached values for the second run of the test. And the results (optimizations disabled):

loop test: : 140 milliseconds
Dispatch() test: : 59 milliseconds

Neat. But when I enabled optimizations, I got other results:

loop test: : 28 milliseconds
Dispatch() test: : 31 milliseconds

That’s not cool, now the simple loop is faster! Then I tried increasing the groupSize to 10000:

loop test: : 28 milliseconds
Dispatch() test: : 15 milliseconds

Ok, now dispatch is nearly two times faster, still not 4 times like we would expect. After trying some other different groupSizes, I could not improve this much better. This behaviour can depend on several factors, such as cache, thread core switching, the operating system overtaking the threads, contention for the jobPool, or just the nature of the workload that is being executed, all of which needs much more investigation, possibly with a dedicated profiling tool. These days there are at least two big ones, Intel Vtune and AMD CodeXL, both of which I am unfamiliar with at the moment.

Windows platform specific extension
If we are on windows, we can try some Winapi functions to extend threading functionality. I will just name a few useful ones. You can put these in the Initialize function after the thread has been created (you can search for * to find that place). Also, I am doing an #include <Windows.h> and #include <sstream> and #include <assert.h> for this next part.

First, we must retrieve the native thread handle from the std::thread, like this:

HANDLE handle = (HANDLE)worker.native_handle();

Once we have that, we are free to pass it to the Winapi functions. For example, we can set a thread affinity mask, to put it on a dedicated hardware thread instead of having it for all possible cores and let the operating system switch them around on demand. This piece of code puts the thread we just created onto one core specified by threadID:

DWORD_PTR affinityMask = 1ull << threadID;
DWORD_PTR affinity_result = SetThreadAffinityMask(handle, affinityMask);
assert(affinity_result > 0);

So the affinityMask is a bitmask where each bit specifies a hardware core that is less than what hardware_concurrency gives us and start from thread 0. By default, the affinity mask should be 2^hardware_concurrecy() – 1 which is also returned by the SetThreadAffinityMask into the affinity_result variable.

We can increase thread priority like so:

BOOL priority_result = SetThreadPriority(handle, THREAD_PRIORITY_HIGHEST);
assert(priority_result != 0);

By default, each thread we create is THREAD_PRIORITY_DEFAULT. Modifying this could help threads not be overtaken by the operating system by lesser priority threads. I’ve found no way to increase performance with this yet, only decrease it.

I found this next one very useful. We can give a name to our thread so that it will show up in the Visual Studio debugger:

std::wstringstream wss;
wss << "JobSystem_" << threadID;
HRESULT hr = SetThreadDescription(handle, wss.str().c_str());
assert(SUCCEEDED(hr));

My general suggestion is not to just mess with platform specific extensions, unless you are sure that you are getting a benefit. In some cases, such as modifying priority can lead to great performance reduction too.

Closure
If you stayed till the end, thank you. This simple job system is ready to be used and can offer a generous speedup, but there is definitely room for improvements, especially for the Dispatch function. I am planning to give an in-depth look to Intel Vtune (they have a trial version now) and AMD CodeXL soon, and possibly follow up on this with some improvements. As always, you can take a look at the current implementation in my Wicked Engine.
The code in the engine is likely to change over time, but I also made a sample for this blog post alone which will stay like this. It contains all the code for the JobSystem and the tests I showed.

UPDATE: This job system doesn’t have the ability to run jobs from within jobs. The implementation in Wicked Engine was updated to support this with not much more added complexity.

Thank you for reading! If you find a mistake, or have suggestions, I would like to hear it!

Scalable GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provide the general steps of implementation. It is not a step-by step tutorial, but rather introducing algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):

sphinit

Smooth Particle Hydrodynamics

As a starting point, let’s get familiar with SPH. You can probably already find several papers on the subject. I chose this short paper. I would like to skip the in-depth theory behind it here and focus on implementation instead to keep it as short as possible. The basic idea is that we are approximating fluid with particles, with a finite range of affection. Each particle can influence every particle inside its range (smoothing radius). To simulate affection forces for each particle, we will compute the following for each particle: density -> pressure -> pressure-based forces -> viscosity-based forces -> constant gravitational force. After we computed these forces, we can use the combined force value and perform integration of velocity and update particle positions. In the paper we can find the following formulas that we must implement as shader code (I highlited and annotated key areas):

  • Density evaluation with Poly6 kernel smoothing function + pressure computationdensity_pressure
    • Let’s translate that into (pseudo-)code:
      particleA.density = 0; // we will compute this (float)
      
      // Precompute part of Poly6
      //  this should be constant buffer value
      //  h = smoothing radius parameter (float) (default = 1.0f)
      //  h9 = pow(h, 9)
      //  PI = 3.1415...
      const float Poly6_constant = (315.0f / (64.0f * PI * h9)); 
      
      for(all particleB that is either near particleA or is particleA)
      {
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        if(r2 < h2) // h2 = h*h
        {
          const float W = Poly6_constant * pow(h2 - r2, 3);
      
          particleA.density += particleB.mass * W;
        }
      }
      
      // avoid negative pressure by clamping density to reference value:
      particleA.density = max(p0, particleA.density);
      
      // Compute pressure:
      //  K = pressure constant parameter (float) (default = 250.0f)
      //  p0 = reference density param (float) (default = 1.0f)
      particleA.pressure = K * (particleA.density - p0);
    • particleA correspons to the particle we are evaluating, particleB is the particle which is affecting the current particle. In the formulas, particleA is the “i” iterator, particleB is the “j” iterator. I found that names like this are just better readable in code.
    • The loop iteration is placeholder now, that will be addressed later how we should implement it.
    • Note that we not only evaluate nearby particles, but the particle itself, too. This is essentially just omitting that check inside the loop which is checking if the particle is itself or not. Without this detail, the simulation will be jerkier because essentially we avoid contribution from the particle’s own mass.
    • Note that we can avoid storing the pressure, because it can be computed from the particle itself. We can compute it before computing forces, instead of loading it, which will increase computation time but reduce memory access stalls.
  • Pressure-based force computation with Spiky kernel smoothing functionpressure_force
    • Looks like this in code:
      particleA.force1 = 0; // we will compute this (float3)
      
      // Precomputed part of Spiky kernel:
      //  h6 = pow(h, 6);
      //  again, it should be coming from constant buffer
      const float Spiky_constant = (-45 / (PI * h6));
      
      for(all particleB that is near particleA but is not particleA)
      {
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        const float r = sqrt(r2);
      
        if(r > 0 && r < h) // **avoid division by zero!
        {
          const float3 rNorm = diff / r; // same as normalize(diff)
          const float W = Spiky_constant * pow(h - r, 2);
      
          particleA.force1 += (particleB.mass / particleA.mass) * 
            ((particleA.pressure + particleB.pressure) / 
             (2 * particleA.density * particleB.density)) * W * rNorm;
        }
      }
      
      particleA.force1 *= -1;
    • Very important that unlike in the previous step, we must avoid computing the current particle against itself! The simulation immediately blows up if you don’t check!
  • Viscosity-based force computation with Laplacian smoothing functionviscosity_force
    • In code:
      particleA.force2 = 0; // we will compute this (float3)
      
      // Most of it is the same as the previous snippet, 
      //  only the innermost calculations will differ
      for(all particleB that is near particleA but is not particleA)
      {
       const float3 diff = particleA.position - particleB.position;
       const float r2 = dot(diff, diff);
       const float r = sqrt(r2);
      
       if(r > 0 && r < h) // **avoid division by zero!
       {
         const float3 rNorm = diff / r;
         const float r3 = r2 * r;
         const float W = -(r3 / (2 * h3)) + (r2 / h2) + 
           (h / (2 * r)) - 1;
      
         particleA.force2 += (particleB.mass / particleA.mass) * 
           (1.0f / particleB.density) * 
           (particleB.velocity - particleA.velocity) * 
           W * rNorm;
       }
      }
      
      // e = viscosity constant (float) (default = 0.018f)
      particleA.force2 *= e;
    • All of this should be combined with the pressure-based force computations, into the same loop!
    • The viscosity constant will control how fluid our simulation is.
  • Forces integration
    • Consists of following steps:
      • Add force to velocity
      • Step position along velocity vector
      • Reset force
    • Using fixed timestep so that simulation will not blow up
    • This step must be separate from the SPH forces evaluation, because viscosity based force computation takes velocity into account, thus it cannot be modified while the forces are not computed yet!
    • They say that a code piece can say a thousand words:
      const float deltaTime = 0.016f; // fixed step
      const float3 G = float3(0, -9.8f, 0);
      particleA.velocity += deltaTime * ((particleA.force1 + particleA.force2) / particleA.density + G);
      particleA.position += deltaTime * particleA.velocity;
      particleA.force1 = 0;
      particleA.force2 = 0;
    • Of course, the particle should not be storing two distinct forces, they can be combined into one.

N-body simulation

N-body simulation stands for the simple algorithm of checking each particle against each other particle in the simulation. This is how we should implement SPH-based fluid simulation first for simplicity. This can be still done real time for a couple thousand particles on a GPU, so it is ideal for us to test out how the formulas work and find good default input values for the simulation. However, this algorithm has a complexity of O(n^2), so it will not scale well after a certain threshold. The SPH smoothing radius will give us opportunity to come up with a better algorithm to reduce complexity and make it scalable later. The first attempt to create an N-body simulation in compute shaders, would involve a double loop iterating through the particle list:

#define THREADCOUNT 256

[numthreads(THREADCOUNT, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
  Particle particleA = particleBuffer[dispatchThreadID.x];
  for(uint i = 0; i < particleCount; ++i)
  {
    Particle particleB = particleBuffer[i];
    resolveParticles(particleA, particleB);
  }
}

Well, not *really* a double loop, but the outer loop is the compute kernel itself. But there are already several problems with this. First, is that the SPH force computation needs intermediate results of particle densities, but our compute shader is running thread groups in parallel, so we can’t just combine both in the same shader, as we can’t be sure that all the densities are computed by the time we want to evaluate forces. So we are splitting the simulation into two shaders: density evaluation and force evaluation. (Pressure evaluation can be done immediately before force computation because it can be computed from density) So far we have two shaders, each doing O(n^2) particle evaluations, each loading from global memory. This just screams slow. I also experimented with using shared memory to preload particles ahead of time and reduce latency. You can skip this step if you want, because the acceleration structure later will be a complete replacement, but this is a good exercise nonetheless and increases performance of the N-body simulation:

#define THREADCOUNT 256
groupshared Particle sharedParticles[THREADCOUNT];
//...

Particle particleA = particleBuffer[dispatchThreadID.x];

uint numTiles = 1 + particleCount / THREADCOUNT;

for (uint tile = 0; tile < numTiles; ++tile)
{
  uint offset = tile * THREADCOUNT;
  sharedParticles[flattenedGroupIndex] = particleBuffer[offset + flattenedGroupIndex];
  GroupMemoryBarrierWithGroupSync();

  for(uint i = 0; i < THREADCOUNT; ++i)
  {
    Particle particleB = sharedParticles[i];
    resolveParticles(particleA, particleB);
  }
  GroupMemoryBarrierWithGroupSync();
}

Important to look out that we are doing thread sync operations inside a dynamic loop. You are not allowed to do this if the loop is potentially divergent across the thread group. To have it compile, ensure that the loop range is loaded from coherent memory such as constant buffer values or read only buffers. For example you can’t load particleCount from an unordered access view, because the loop termination depends on it and it can potentially be divergent across the thread group.

Obviously, it is a bit simplified, to be more readable. We eliminated a bunch of latency by avoiding accessing the main memory and using group shared memory instead. This is the general way to speed up N-body simulation. It should be already good (real time) for simple fluid simulations up to ~20000 particles on an average GPU. We still need to do it two times though, first for computing densities, then for forces.

So, with this implementation, our fluid simulation can be written as running two extra shaders before particle velocity integration:

  1. SPH Density evaluation
  2. SPH force computation

Pressure computation can either be in density evaluation phase and exported to memory/read back in force phase, or just computed in force phase. I chose to actually precompute it in the force phase, and store it into shared memory per “tile”.

So far we can simulate a moderate amount of particles in real time with N-body collisions (4000 particles):

sph

Dynamic Hashed Grid

So, hopefully you have the fluid simulator up and running before coming to this part. 🙂 Let’s eliminate that nasty O(n^2) complexity! The very basic idea behind SPH is that there is a smoothing radius for each particle and only particles inside the smoothing radius should be affecting the current particle. If we subdivide space to a uniform grid, with a grid cell size of smoothing radius, and sort each particle into its corresponding cell, we can perform efficient constant lookups of smaller particle lists. Let’s say, that we have a particle somewhere. We can look up which grid cell it belongs to by taking its position, and dividing it by smoothing radius:

int3 cellIndex = floor(particleA.position / h);

Then we have to evaluate all the particles inside that grid cell and all the direct neighbour cells as well. In a three dimension grid, those are 3*3*3 = 27 cells. There are some problems that come to mind immediately:

  • We don’t want to just constrain the simulation to a finite grid, but we don’t have infinite memory
  • We don’t know how many particles can be inside one cell at once. We don’t want to impose limitation, because that could affect simulation results.
  • We can’t allocate memory with GPU code, so we will be using a fixed size pre-allocated pool.
  • We will need to build this data structure entirely on the GPU.

Fortunately, all of these are solvable with the dynamic hashed grid algorithm!

The first part we should solve is that each cell can contain an unbounded number of particles (well, bounded to the maximum number of allowed particles per simulation). I took this idea from a GDC 2008 presentation by Nvidia. I needed some time to wrap my mind around it, so hopefully I can describe it so that it will be easier for you. It is called “sorted grid” and it is described like this:

sorted_grid

It can be tricky to digest, let me explain. Let’s start from left to right. We have a picture and three columns. The picture of the grid shows that we have grid cells 4, 6 and 9 containing some particles. It can be represented by the unsorted list (first column). It can be built on the GPU by iterating through the particle list and each particle writes its grid cell index into a list. But it is not yet usable to perform a constant time lookup. The second column is the sorted list. It’s the previous list, only the particle index list is now sorted by grid cell index. It is good for us, because now if we iterate through the particle index list, the corresponding grid cell indices are continuously laid out in ascending order. But we need the ability to look up an arbitrary cell in constant time and have a continuous array of particle list which it contains. We are nearly there. As the last step, we can process this list and create an “offset list” for each grid cell. We will have an array of grid cells each holding either

  • UINT32_MAX (0xFFFFFFFF) When the cell is empty
  • integer value >= 0 indicating the offset into sorted particle list

With this information, we can now say everything about a particle when we iterate through them:

  • Its corresponding grid cell
  • The grid cell’s offset into the sorted list
  • We know that the sorted list contains some number of particles in the same cell from the offset until we read a particle which is in an other cell.

As we can see, there is no need for memory allocation apart from the pre-allocation, so it is doable on the GPU. We should follows these steps:

  1. Create the unsorted listWe will be running a compute shader with the SV_DispatchThreadID indexing the particle index list, which indexes the main particle property buffer containing position, simulation properties, etc..We will have a cellIndexBuffer, which is the same size as the particle index buffer. It could also be a property of the particle structure, but this is better memory access pattern because it is only used in optional particle simulation path. Also, we can have a general purpose sorting shader more easily.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    Particle particle = particleBuffer[particleIndex];
    int3 cellIndex = floor(particle.position / h);
    uint flatCellIndex = GetFlatCellIndex(cellIndex);
    
    cellIndexBuffer[particleIndex] = flatCellIndex;
  2. Create the sorted listRun a Bitonic Merge Sort Algorithm which performs comparisons on the cellIndexBuffer but swaps elements of the particleIndexBuffer. Bitonic sort is a well researched parallel sorting algorithm, and you can pull the implementation from AMD’s particle demo on GitHub.
  3. Create the offset listThe offset list will be recreated from scratch each frame, so first ensure that it is completely cleared to invalid offsets (0xFFFFFFFF).Then, we can exploit Atomic operations to efficiently compute the cell offsets. We will dispatch a compute shader that will read one particle index from SV_DispatchThreadID semantic. We can read the corresponding grid cell index or compute it. Loading is preferable here, because we can achieve better cache coherency by not loading position from the bloated particle buffer and converting it to cell index.The Atomic Min operation works because we have high values everywhere in the offset buffer initially, so we can store for each grid cell the minimum of all the threads that had a particle corresponding to that cell.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    uint cellIndex = cellIndexBuffer[particleIndex];
    
    InterlockedMin(cellOffsetBuffer[cellIndex], dispatchThreadID.x);
  4. The structure is complete, all we have to do is iterate it!

Before going further, let’s address our remaining desire, that the grid should be “infinite”, not only constrained to a small area. This is a really simple extension, called hashing. The only thing that will change is the computation of the flat cell index. Let’s define the function

inline uint GetFlatCellIndex(in int3 cellIndex)
{
  const uint p1 = 73856093; // some large primes
  const uint p2 = 19349663;
  const uint p3 = 83492791;
  int n = p1 * cellIndex.x ^ p2*cellIndex.y ^ p3*cellIndex.z;
  n %= TOTAL_GRID_CELL_COUNT;
  return n;
}

The rest of the algorithm is unchanged, just ensure that wherever looking up a grid cell, we have to use the same function. Also, the choice of prime numbers and grid cell count should have a big impact on the efficiency of the hashing (because of hash collisions, multiple world positions can index the same physical cell), so we should tune this to our needs.

All that’s left now is iterating this structure to resolve the SPH simulation. It will be traversed two times, first for density evaluation, then for force computation. But the iteration itself is the same in both cases:

int3 cellIndex = floor(particleA.position / h);

for(int i = -1; i <= 1; ++i)
{
  for(int j = -1; j <= 1; ++j)
  {
    for(int k = -1; k <= 1; ++k)
    {
       int3 neighborIndex = cellIndex + int3(i, j, k);
       uint flatNeighborIndex = GetFlatCellIndex(neighborIndex);
       
       // look up the offset to the cell:
       uint neighborIterator = cellOffsetBuffer[flatNeighborIndex];

       // iterate through particles in the neighbour cell (if iterator offset is valid)
       while(neighborIterator != 0xFFFFFFFF && neighborIterator < particleCount)
       {
         uint particleIndexB = particleIndexBuffer[neighborIterator];
         if(cellIndexBuffer[particleIndexB] != flatNeighborIndex)
         {
           break;  // it means we stepped out of the neighbour cell list!
         }

         // Here you can load particleB and do the SPH evaluation logic

         neighborIterator++;  // iterate...
       }

    }
  }
}

This is it. It might be a good idea to eliminate the triple loop with a lookup table (but don’t unroll!). But it is better visualized like this for simplicity/code size. This will yield us a huge performance gain and we are able to simulate 100 000 particles on a low end GPU in real time ~30 FPS:

fluid100k

Apart from particle collision, at least for debug purposes, it is a good idea to implement planar collisions, just to keep the simulation in bounds to test how it can fill out a volume. Planar collision is simple, if a particle would go behind a plane, clamp its position and invert the particle velocity based on the plane reflection vector. Apart from that, we can implement rigid collisions with unsimulated particles (for example driven by animation system). But that is out of scope for today.

As with any technique, there are multiple possibilities to further optimize. I will not cover hardware level GPU optimizations here because I have not yet researched this area thoroughly for this and it is already a long article. My general direction would be first to profile for any obvious bottlenecks. Without doing so, I imagine that memory latency could be improved by using shared memory somehow for the grid based lookups. It is not trivial anymore like with N-body simulation, because the thread group can span arbitrary amount of different grid cells at once. Apart from that, we should always aim to stay in cache wherever we can, so deinterleaving certain less frequently used particle properties might be a good idea. Also, watch out for bank conflicts when writing to (shared) memory, you can do so by aligning your data to bank size, and interleave certain frequently used together properties together into a structure. Can we use wave intrinsics to speed up certain parts of the simulation? I have some ideas I still haven’t tested out.

You can see that there is more room for additional optimizations and I still haven’t talked about rendering the fluid. These are very good candidates for other blog posts in the future, so stay tuned!

As always, you are free to pull my implementation from my open source engine:

Also, my older article describes a general purpose GPU particle system which is a good candidate to fit a fluid simulator into. Or just to implement a cool large-scale particle system.

If you find any mistakes or have any questions, let me know in the comments!

UPDATE: There was a divide by zero bug, which is now fixed. In the force computation step, we must check if the distance between two particles is greater than zero, otherwise avoid computing the SPH forces between those two particles. It sometimes manifested as a flickering anomaly that looked more like a synchronization issue. Sorry for the inconvenience! I marked occurences in code samples with **

[Download sample model] for [Wicked Engine Editor]

Thoughts on Skinning and LDS

havoc

I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Instead, we can utilize the LDS the following way: At the beginning of the shader, when the vertex information is being loaded, each thread also loads one matrix from the bone array and stores it inside the LDS. We must also synchronize the group to ensure the LDS bone array has been complete. After all the memory has been read from VRAM, we continue processing the skinning: iterate through all bone indices for the vertex, and load the corresponding bone matrix from LDS, then transform the vertex and blend by the bone weights. We just eliminated a bunch of memory latency from the shader.

Consider what happens when you have a loop which iterates through the bone indices for a vertex: First you load the bone and you want to immediately use it for the skinning computation, then repeat. Loading from a VRAM buffer causes significant latency until the data is ready to be used. If we unroll the loop, the load instructions can be rearranged and padded with ALU instructions that don’t depend on those to hide latency a bit. But unrolling the loop increases register allocations (VGPR = Vector General Purpose Register, for storing unique data to the thread; buffer loads consume VGPR unless they are known to be common to the group at compile time, then they can be placed to scalar registers – SGPR) and can result in lower occupancy as we have a very limited budget of them. We also want a dynamic loop instead because maybe a vertex has fewer bones than the maximum, so processing could exit early. So having tight dynamic loop with VRAM load then immediately ALU instructions is maybe not so good. But once that loop only accesses LDS, the latency can be significantly reduced, and performance should increase.

But LDS does not come for free and it is also a limited resource, like VGPR. Let’s look at the AMD GCN architecture: We have a maximum of 64 KB of LDS for a single compute unit (CU), though HLSL only lets us use 32 KB in a shader. If a shader uses the whole 32 KB, it means that the shader can only be running two instances of itself on the CU. We have a bone data structure which is a 3×4 float matrix, 48 bytes. We can fit 682 bones into LDS and still have two instances of the compute shader operate in parallel. But most of the time we hardly have skeletons consisting of that many bones. In my experience, less than 100 bones should be enough for most cases, but we surely won’t use more than say 256 bones for a highly detailed model, either in real time apps. So say that our shader will declare an LDS bone array of 256 bones, and the thread group size is also 256, so each thread will load one bone into LDS. 256*48 bytes = 12 KB. This means that 5 instances of this shader could be running in parallel on a CU, so 5*256 = 1280 vertices processed. That is if we don’t exceed the max VGPR count of 65536 for a CU. In this case it means that a single shader must at maximum fit into the  51 VGPR limit (calculated as 65536 VGPR / 1280 threads). Most cases we will easily fit into even a 128 bone limit, so an LDS bone array size of 128 and thread group size of 128 threads will just be enough and be much easier on the GPU.

However, I can imagine a scenario, which could be worse with the LDS method, if there is a complicated skeleton, but small mesh referencing only a few of the bones. In this case when there is one skeleton for multiple meshes, maybe we should combine the meshes into a single vertex buffer and use an index buffer to separate between them, so this way a single dispatch could animate all the meshes, while they can be divided into several draw calls when needed.

It sounds like a good idea to utilize LDS in a skinning shader and it is a further potential improvement over skinning in vertex/geometry shader and doing stream out. But as with anything on the GPU this should be profiled on the target hardware which you are developing on. Right now my test cases were unfortunately maybe not the best assets there are, and a puny integrated Intel mobile GPU, but even so I could find a small performance improvement with this method.

Thank you for reading, you can find my implementation on GitHUB! Please tell me if there is any incorrect information.

UPDATE: I’ve made a simple test application with heavy vertex count skinned models, and toggleable LDS skinning: Download WickedEngineLDSSkinningTest (You will probably need Windows 10 to run it and DirectX 11)

Further reading:

Large thread groups on GCN

Easy Transparent Shadow Maps

transparent_shadows

Supporting transparencies with traditional shadow mapping is straight forward and allows for nice effects but as with anything related to rendering transparents with rasterization, there are corner cases.

The implementation is really simple once you have implemented shadow mapping for opaque objects. After the opaque shadow pass, we must render the transparents into a color buffer, but reject samples which would be occluded by opaques, so using a depth read-only depth stencil state. The transparents should be blended multiplicatively. Sorting does not matter with a multiply blend state. In bullet points:

  1. Render opaque objects into depth stencil texture from light’s point of view
  2. Bind render target for shadow color filter: R8G8B8A8_UNORM works good
  3. Clear render target to 1,1,1,0 (RGBA) color
  4. Apply depth stencil state with depth read, but no write
  5. Apply multiplicative blend state eg:
    • SrcBlend = BLEND_DEST_COLOR
    • DestBlend = BLEND_ZERO
    • BlendOp = BLEND_OP_ADD
  6. Render transparents in arbitrary order

When reading shadow maps in shading passes, we only need to multiply the lighting value with the transparent shadow map color filter if the pixel is inside the light. There is a slight problem with this approach, that you will notice immediately. Transparent objects now receive their own colored self shadow too. The simplest fix is to just disable the colored part of the shadow calculation for transparent objects. We can already produce nice effects with this, this is not a huge price to pay.

See it in action, transparents are rendered without colored self-shadows:

doesnotreceive

But they receive shadows from opaque objects just fine:

opaquereceived

There is a technique which would allow us to render colored shadows on top of transparents too. This involves keeping an additional shadow depth map for transparencies. The flow of this technique is like this (from a Blizzard presentation):

  1. Render opaque shadow map
  2. Render transparent shadow map
    • To a separate depth texture!
    • Depth writes ON
  3. Clear shadow color filter texture (like in the simple approach)
  4. Render transparent again to color filter render target
    • But use the opaque shadow map’s depth stencil
    • depth read ON
    • depth write OFF

And in the shading step, now there will be two shadow map checks, one for the opaque, one for the transparent shadow maps. Only multiply the light with the shadow filter color texture when the transparent shadow check fails.

This will eliminate false self shadows from transparent objects. But unfortunately now when a transparent receives a colored shadow, its own transparent shadow color will also contribute to itself.

What’s more interesting, are the additional effects we can achieve with transparent shadow maps:

Textured shadows, which can be used as a projector for example, just put a transparent textured geometry in front of a light:

textured_shadows

And underwater refraction caustics:

caustics

The caustics are implemented as adding additional light to the surface, so even if the surface below the water would be in the water’s shadow it would add in the caustic effect. For this I am using the transparent color filter render target’s alpha channel with an additive blend mode. The water shadow shader renders the slope of the water normal map into the alpha channel, eg. this:

float4 outputColor.a = 1 – saturate(dot(normalMap.rgb * 2 – 1, float3(0, 0, 1)));

Which could probably use a more sophisticated method, which I didn’t bother to find yet. So in the light evaluation shaders, I add the transparent color filter alpha channel to the pixelIsInShadow value multiplied by some causticStrength modulator to render the caustics. Caustics are added because there can be brighter caustics than the light brightness, because in reality they are generated when multiple rays (photons) arrive at the same location from different sources.

That’s it, I think this is a worthwhile technique to include in a game engine. Even the simplest implementation can be used for many interesting effects. Thank you for reading!

Related article:

StarCraft 2 Effects and techniques