Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such a technique we can efficiently query every light affecting any surface. But a trivial implementation leaves a lot of room for improvement. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.

sponza_culling

For the starting point I assume a tiled rendering pipeline like the one described in a GDC 2011 presentation by Johan Andersson (DICE). The following post will be applicable to the Tiled Deferred and Forward+ variants of that pipeline. A quick recap of such a renderer:

  1. Dispatch a compute shader thread group per screen-space tile
  2. Thread group calculates tile frustum in view space
  3. Each thread reads its pixel from the depth buffer and performs an atomic min-max to determine the depth bounds (see the sketch after this list)
  4. Tile frustum is truncated by depth bounds
  5. Each thread computes if a light is inside the frustum, adds the light to a per tile list if yes
  6. Compute final shading from the per tile light list
    • Deferred can just read gbuffer, loop through per tile light list in the same shader and write the light buffer texture
    • Forward+ exports light list that will be read by second geometry pass and compute shading in pixel shaders
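To make step 3 a bit more concrete, here is a small illustration of how the atomic depth min-max is commonly done; the names are illustrative and the surrounding tile setup is omitted. Because non-negative floats keep their ordering when their bit patterns are compared as unsigned integers, the depth values can be min-maxed with integer atomics:

groupshared uint uMinDepth;
groupshared uint uMaxDepth;

void AccumulateDepthBounds(in float depth, in uint groupIndex)
{
  if (groupIndex == 0)
  {
    uMinDepth = 0xFFFFFFFF;
    uMaxDepth = 0;
  }
  GroupMemoryBarrierWithGroupSync();

  InterlockedMin(uMinDepth, asuint(depth));
  InterlockedMax(uMaxDepth, asuint(depth));
  GroupMemoryBarrierWithGroupSync();

  // asfloat(uMinDepth) and asfloat(uMaxDepth) are then used to truncate the tile frustum (step 4)
}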

Our main goal is to refine the culling results by eliminating false positives, that is, to cull lights which do not contribute to the final shading.

1.) Help out frustum culling with AABB

The first problem arises when we perform the culling of lights as sphere-frustum intersection tests. Let me demonstrate this with a point light:

culling0

culling1

Blue pixels on the second picture visualize which tiles contain the sphere light source after the culling. But the results are strange: why is the blue area roughly square shaped, when it should be a circle instead? This happens because the sphere vs. frustum plane tests are not accurate enough when we test big spheres against small frustums. Let me try illustrating this problem. The following image shows a tile frustum as seen on the screen:

frustum_culling

As you can see, the sphere is completely outside the frustum, but none of the plane tests can reject it on its own, so the light is not culled away properly. We can overcome this by using axis aligned bounding boxes (AABBs) instead of frustums. It is implemented like this:

bool SphereIntersectsAABB(in Sphere sphere, in AABB aabb)
{
  // distance from the sphere center to the closest point of the AABB:
  float3 vDelta = max(0, abs(aabb.center - sphere.center) - aabb.extents);
  float fDistSq = dot(vDelta, vDelta);
  return fDistSq <= sphere.radius * sphere.radius;
}

The result:

culling2

This one seems good enough, but don’t throw out frustum culling just yet!
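Before moving on: the AABB test needs a per-tile AABB to begin with. One possible way to construct it in view space is to unproject the eight corners of the depth-truncated tile frustum and take their min-max; this sketch uses the UnprojectScreenSpaceToViewSpace helper described later in this post, and the tile corner UVs and the min/max depth values are assumed to come from the tile setup code:

void ComputeTileAABB(in float2 tileMinUV, in float2 tileMaxUV, in float minDepth, in float maxDepth,
                     out float3 aabbCenter, out float3 aabbExtents)
{
  float3 vMin = 1000000;
  float3 vMax = -1000000;

  // unproject the 8 corners of the depth-truncated tile frustum and track their min-max:
  for (uint i = 0; i < 8; ++i)
  {
    float2 uv = float2((i & 1) ? tileMaxUV.x : tileMinUV.x,
                       (i & 2) ? tileMaxUV.y : tileMinUV.y);
    float z = (i & 4) ? maxDepth : minDepth;
    float3 corner = UnprojectScreenSpaceToViewSpace(float4(uv, z, 1)).xyz;
    vMin = min(vMin, corner);
    vMax = max(vMax, corner);
  }

  aabbCenter = (vMin + vMax) * 0.5f;
  aabbExtents = (vMax - vMin) * 0.5f;
}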

2.) Depth discontinuities are the enemy

Holes in the depth buffer can greatly reduce efficiency of a tile based light selection, because the tile enclosing min-max depth AABB can get huge really fast:

big_aabb

In the image above I tried to illustrate (from a top-down point of view) that a depth discontinuity made the AABB large enough to intersect a light which the frustum culling would have rejected. This is why the AABB test should be used alongside frustum culling, the two complementing each other.

Depth discontinuities usually introduce another inefficiency as well: there might be cases when a light lies in empty space, not intersecting any geometry, but still inside the tile, so the shading will receive the light even though it does not contribute at all:

light_not_intersecting

As you can see, that light is inside the frustum and inside the AABB, but it sits in empty space between geometries; our current algorithm will still add it to the light list.

To solve this, there is a technique called 2.5D light culling, introduced by Takahiro Harada. In addition to that presentation, I would like to give an implementation for it in HLSL. The basic idea is to create two bitmasks: one for the tile and one for the light we are checking. We then AND the two masks together to determine whether the light intersects any geometry in the tile (the AND returns non-zero) or not (the AND returns zero).

bitmasks

For the sake of a simpler image, I used a 9-bit mask, but we should use a 32-bit mask which we can represent by a uint variable.

The first bitmask is created for the whole tile once. While each thread reads its corresponding pixel from the depth buffer, it already does an atomic min-max, but now it also fills in a single relevant bit in a uint and performs an atomic OR into the tile bitmask. So what is the relevant bit? The algorithm divides the tile depth range into 32 slices, represented by the 32 bits of a uint. We first determine the tile depth bounds in linear (view) space, then fill in the corresponding bit accordingly:

groupshared uint tileDepthMask; // cleared to zero by one thread at the start of the group

// ...

float minDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, minDepth, 1)).z;
float maxDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, maxDepth, 1)).z;
float realDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, pixelDepth, 1)).z;

// 32 depth slices between the tile min and max depth:
float depthRangeRecip = 32.0f / (maxDepthVS - minDepthVS);
// clamp to [0, 31] so the shift stays inside the 32-bit mask:
uint depthmaskcellindex = max(0, min(31, floor((realDepthVS - minDepthVS) * depthRangeRecip)));
InterlockedOr(tileDepthMask, 1 << depthmaskcellindex);

GroupMemoryBarrierWithGroupSync();

This code is run by every thread in the group. The function called UnprojectScreenSpaceToViewSpace does just what it says: it takes a screen coordinate point and transforms it to view space. We are only interested in the Z coordinate here, so we only need to transform the input by the inverse projection matrix and divide the result by the w component afterwards. If we were also interested in the XY coordinates, we would additionally need to remap them from the [0, 1] to the [-1, 1] range before unprojecting. The function would look like this for the common case:

float4 UnprojectScreenSpaceToViewSpace(in float4 screenPoint)
{
  // remap [0, 1] UV coordinates to [-1, 1] clip space (with flipped Y):
  float4 clipSpace = float4(float2(screenPoint.x, 1 - screenPoint.y) * 2 - 1, screenPoint.z, screenPoint.w);
  float4 viewSpace = mul(clipSpace, xInverseProjectionMatrix);
  viewSpace /= viewSpace.w;
  return viewSpace;
}

The bitmask construction code might look a bit intimidating, so let me explain what's happening. We calculate the minZ, maxZ and current pixel Z in view space and determine the depth slice size that a single bit will represent (depthRangeRecip). Then we shift a bit to the right place and add it to the group shared tile mask by means of an atomic OR operation.

The tile mask is complete, so we only need to know how to construct a light mask. That must be done inside the loop where we are culling lights. On the first try I cooked up this:

float fMin = lightPosViewSpace.z - lightRadius.r;
float fMax = lightPosViewSpace.z + lightRadius.r;

uint lightMaskcellindexSTART = max(0, min(31, floor((fMin - minDepthVS) * depthRangeRecip)));
uint lightMaskcellindexEND = max(0, min(31, floor((fMax - minDepthVS) * depthRangeRecip)));

uint lightMask = 0;
for (uint c = lightMaskcellindexSTART; c <= lightMaskcellindexEND; ++c)
{
  lightMask |= 1 << c;
}

Here we determine the start and end slices of the sphere light inside the view space depth range and push bits into the mask in a loop, to the correct places, one by one.

For example, in the mask below, lightMaskcellindexSTART is bit 10 (the 11th bit from the right) and lightMaskcellindexEND is bit 20 (the 21st bit from the right):

00000000000111111111110000000000

Of course this loop seems like a waste to do inside a shader, so I needed to come up with something better. Rethinking how a smaller bitfield could be pushed inside a bigger bit range gave me the idea to exploit the truncation performed by the bitwise shift operators:

  • First, fill the full mask like this:

    11111111111111111111111111111111

    • uint lightMask = 0xFFFFFFFF;
  • Then shift right by the spare amount, so that only the needed number of bits remains:

    00000000000000000000011111111111

    • lightMask >>= 31 - (lightMaskcellindexEND - lightMaskcellindexSTART);
  • Last, shift left by the START amount to move the mask into the correct position:

    00000000000111111111110000000000

    • lightMask <<= lightMaskcellindexSTART;
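Putting the three steps together, the loop-free light mask construction becomes:

uint lightMask = 0xFFFFFFFF;
lightMask >>= 31 - (lightMaskcellindexEND - lightMaskcellindexSTART);
lightMask <<= lightMaskcellindexSTART;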

So the resulting code replaces the loop with only a very few instructions, which is a lot better.

We have the tile mask and the light mask, so the only thing left to do is AND them to determine if a light touches something or not:

bool intersect2_5D = tileDepthMask & lightMask;

And here is the resulting comparison of culling results in a scene with high depth discontinuity, alpha tested vegetation and many point lights (red tiles have more than 50 lights):

advanced_culling

As you can see, there is a big improvement in the tile visualizer heatmap: the shading will process far fewer lights and performance will improve in these difficult cases with alpha tested geometry. The scene shown above had the following timing results (measured with DX11 timestamp queries):

  • Basic 2D culling:
    • Culling: 0.53 ms
    • Shading: 10.72 ms
  • Improved 2.5D culling:
    • Culling: 0.64 ms
    • Shading: 7.64 ms

As you can see, this is a definite improvement: while the culling shader took a bit more time to finish (0.1 ms), the object shading took 3 ms less. I made a more detailed video about this some time ago.

3.) Other light types

We surely have the need to cull other light types than point lights. Spot lights come to mind first, but there can be the desire to cull decal boxes or local environment probe volumes as well.

  • Spot lights:
    • Implement a cone culling algorithm in shader code. This can be more expensive than sphere culling and again not too accurate with frustum tests. Going further and implementing cone-AABB tests will slow down our algorithm even more.
    • Or approximate the spotlight cone by tightly fitting a sphere around it. We don't need to implement any other culling algorithm, only fit a sphere around the spotlight, which is a straightforward, easy calculation. This will result in excessive waste for thin/long cones, however. Code to fit a sphere around a cone:
      • float spotLightConeHalfAngleCos = cos(spotLightFOV * 0.5f);
        float sphereRadius = spotlightRange * 0.5f / (spotLightConeHalfAngleCos * spotLightConeHalfAngleCos);
        float3 sphereCenter = spotLightPos + spotLightDirection * sphereRadius;

      • Remember that you can precalculate this outside the shader, so the shader will only have a sphere to evaluate.
    • Another method is to do a cone-sphere test instead, which can produce much more accurate results than the previous two; check it out here.
  • Decal/Environment probe box:
    • The idea here is to get away with AABB-AABB tests. But our decals and envprobes can be arbitrarily oriented in world space, so they are OBBs! We can do something that I'd call a coarse AABB test. I transform the tile AABB from view space to world space (using the inverse view matrix of the camera) and recalculate the AABB of the resulting OBB. Then for each decal I transform the world space tile AABB by the decal's inverse world matrix and perform an AABB-AABB intersection test between the resulting AABB and a unit AABB. The results are kind of good, but we have to refine with the frustum as well. Transforming an AABB by a matrix is quite heavy though, because I multiply each corner with the matrix and keep track of the min-max at the same time (see the sketch after the image below). With two AABB transforms it gets heavy, but I have not yet found a better solution for this. Culling decals or envprobes is much less frequent than lights, so I comfort myself with that idea for the time being. The results:

decals0decals1
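As a reference, a minimal sketch of such an AABB transform could look like this; the AABB struct with center/extents matches the sphere test earlier, and the function name is made up for illustration:

AABB TransformAABB(in AABB aabb, in float4x4 mat)
{
  float3 vMin = aabb.center - aabb.extents;
  float3 vMax = aabb.center + aabb.extents;

  // transform all 8 corners and keep track of the min-max of the results:
  float3 newMin = mul(float4(vMin, 1), mat).xyz;
  float3 newMax = newMin;
  for (uint i = 1; i < 8; ++i)
  {
    float3 corner = float3(
      (i & 1) ? vMax.x : vMin.x,
      (i & 2) ? vMax.y : vMin.y,
      (i & 4) ? vMax.z : vMin.z);
    float3 transformed = mul(float4(corner, 1), mat).xyz;
    newMin = min(newMin, transformed);
    newMax = max(newMax, transformed);
  }

  AABB result;
  result.center = (newMin + newMax) * 0.5f;
  result.extents = (newMax - newMin) * 0.5f;
  return result;
}

The coarse decal test then transforms the tile AABB to world space with the camera's inverse view matrix, transforms it again with the decal's inverse world matrix, and tests the result against a unit AABB.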

4.) Gather vs. Scatter approaches

The implementation described in the above mentioned Battlefield presentation uses the gather approach, which means that each thread group iterates through every potential light. A culling shader like this works best for a couple hundred lights, but can become slow when approaching a few thousand. To support more lights, we can implement some kind of scatter approach. My suggestion is to add a coarse culling step before the regular light culling, which operates on much bigger screen tiles and dispatches the work differently: each thread processes a single light, determines which coarse tile(s) it belongs to and writes (scatters) the light index into the corresponding tile. The regular culling then reads the light lists from the coarse tiles instead of iterating through every light on the screen. We could also use an acceleration structure for the lights, an octree for example, and let the shader use that instead of coarse culling.

A scatter approach could be implemented for a different purpose as well: refining culling results. I described above that we can approximate spotlights with spheres, but another idea would be to rasterize a cone at low resolution instead, and let the pixel shader write the light into the tile corresponding to the invoked pixel.

5.) Don’t be greedy

It might be tempting to create a single uber-shader handling everything: creating frustums, reading the depth buffer and assembling depth bounds, creating the tile AABB, creating the tile bitmask, culling lights, and in the case of tiled deferred, also evaluating the shading at the end. In reality though, it could be the wrong path to take.

First, on the AMD GCN architecture for example, resources such as registers are shared across hardware execution units, and if a shader uses too many, there will be contention for them, parallel execution will be reduced, and overall performance will be bottlenecked. This is called register pressure. Our shader, which creates frustums at the beginning, already uses many registers for example; the frustums could be precalculated instead to lighten the load. The AABB calculation further reduces the available registers, and so does calculating the tile bitmask. A tiled deferred shader at the end of the culling shader can also be very complex and utilize many registers at once.

Then there is the part where we create the depth bounds with atomic operations. Atomic operations can be slow, so calculating the depth bounds in a separate shader by means of parallel reduction could be a better alternative (see the sketch below). A reduced resolution depth buffer can also be reused later, as a hierarchical-Z pyramid for instance.
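For illustration, a minimal parallel reduction for the per-tile depth bounds might look like this; it is only a sketch, assuming 16×16 tiles and illustrative resource names, with bounds checking against the screen size omitted:

Texture2D<float> depthBuffer;
RWTexture2D<float2> tileDepthBounds; // xy = min/max depth per tile

groupshared float2 reductionLDS[16 * 16];

[numthreads(16, 16, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
  float depth = depthBuffer[DTid.xy];
  reductionLDS[groupIndex] = float2(depth, depth);
  GroupMemoryBarrierWithGroupSync();

  // tree reduction: halve the number of active threads each step
  for (uint s = (16 * 16) / 2; s > 0; s >>= 1)
  {
    if (groupIndex < s)
    {
      reductionLDS[groupIndex].x = min(reductionLDS[groupIndex].x, reductionLDS[groupIndex + s].x);
      reductionLDS[groupIndex].y = max(reductionLDS[groupIndex].y, reductionLDS[groupIndex + s].y);
    }
    GroupMemoryBarrierWithGroupSync();
  }

  if (groupIndex == 0)
  {
    tileDepthBounds[Gid.xy] = reductionLDS[0];
  }
}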

Divergent branching in a shader is only a good idea if we design the shader to be highly coherent in the branches a thread group takes. A light culling setup usually works best with 16×16 or 32×32 pixel tiles, where each thread gets the minor task of culling a single light. This task is highly divergent in the path each thread takes. A light evaluation shader behaves differently, because each thread will potentially process the same array of lights in its given tile. Still, there can be cases when a light/decal calculation exits early or skips shadow calculations, etc. with per-pixel granularity instead of per-tile granularity. In that case it is already inefficient to use big thread groups, because long-iterating threads will hold back the rest of the group from exiting early and the hardware will be underutilized. So a smaller tile size should be preferred for the light evaluation (8×8 worked best in my experiments).

Seeing these problems, I propose separating the big shader into several smaller parts. Frustum precomputation could go into its own shader. Another shader could reduce the depth buffer and create the depth bounds along with the tile bitmask information. Tile AABB computation could potentially also go in there. The culling shader would then only load the depth bounds, tile AABB and bitmask from memory, perform the light culling, and export the per tile light lists into memory. The last shader is the surface shading, light evaluating shader, which is a vertex-pixel shader for Forward+ and a compute shader for tiled deferred, with a smaller block size than the culling (8×8, as proposed in the previous paragraph).

Separating the shaders opens up another optimization possibility: using Indirect Dispatch. Consider for example that the culling determined that no lights are inside a tile; it could then instruct the surface shading shader not to dispatch any work groups for that tile.

Recap:

A lot has been covered, so let’s recap quickly:

  • Frustum-sphere tests are inaccurate
  • Refine with AABB-sphere tests
  • Work against depth discontinuities with 2.5D culling bitmasks
  • Small burden on culling can be big win with shading
  • Approximate complex light types with spheres and AABBs
  • Avoid overworked ubershaders
  • Prefer smaller shaders, better fitted to specific task at hand

Thanks for reading, hope you enjoyed it! If you have any feedback, please share it with me. Were there any mistakes? Definitely share 🙂

Further reading:

Introduction to tiled deferred rendering

Reference for 2.5D culling

Tile based rendering advancements GDC 2015

Awesome Forward+ tutorial

RealTimeCollisionDetection.NET has most of the intersection code

Investigating efficient spotlight culling

Next power of two in HLSL

There are many occasions when a programmer wants to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader, and I had this piece of code responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;
while (myNumberPowerOfTwo < myNumber)
{
  myNumberPowerOfTwo <<= 1;
}

It gets the job done, but doesn't look so nice. For the not unusual case when myNumber is more than 1000, it can already take ten iterations to loop. I recently learned that HLSL has a built-in function called firstbithigh. It returns the bit index of the first non-zero bit in a 32-bit number, searching from the high order bits down to the low order bits. With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:

00000001011010011011011111000100

For that number, firstbithigh returns the location of the first non-zero bit found when searching from the left, which is bit 24 (counted from the least significant bit). The next power of two for that number would be:

00000010000000000000000000000000

The original number has seven leading zeroes, so its highest set bit is bit 24; the next power of two is a number with a single bit set one position higher, at bit 25. All we have to do is produce a number that is zero everywhere except that single bit. To do it, first write 2 in binary:

00000000000000000000000000000010

Then shift the whole thing left by the position of the highest set bit in myNumber (24) to get what we are looking for:

00000010000000000000000000000000

Nice one, but we have a few corner cases we need to eliminate. If myNumber is already a power of two, we will get a greater power of two which might not be what we are looking for. In that case we should return myNumber itself. If you subtract one from myNumber before calling firstbithigh on it, it will ensure that power of two numbers will return themselves:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber - 1);

Take 1024 for example: firstbithigh(1023) is 9, so the result is 2 << 9 = 1024. Or take 129: firstbithigh(128) is 7, so the result is 2 << 7 = 256, which works out quite well. If you try to calculate it for 0 however, there is a potential bug if myNumber is unsigned, as subtracting 1 would wrap around to 0xFFFFFFFF. If you can potentially call it for zeroes and operate on unsigned numbers, I would suggest doing it like this:

uint myNumberPowerOfTwo = 2 << firstbithigh(max(1, myNumber) - 1);

If it is signed int, then you don’t have to worry about it, as the firstbithigh function will return zero for negative numbers.

Have any questions, or noticed a mistake? Please post in the comments below!

Thank you for reading!

GPU-based particle simulation

gpuparticles0

I finally took the leap, threw out my old CPU-based particle simulation code and ventured to GPU realms with it. The old system could spawn particles on the surface of a mesh, with the starting velocity of each particle modulated by the surface normal. It kept a copy of each particle on the CPU, updated them sequentially, then uploaded them to the GPU for rendering each frame. The new system needed to keep at least the same set of features, but GPU simulation also opens up more possibilities, because we have direct access to resources like textures created by the rendering pipeline. It is also highly parallelized compared to the CPU solution, in both the emitting and the simulation phase, which means we can handle a much higher number of particles in the same amount of time. There is less data moving between the CPU and the GPU: we can get away with only a single constant buffer update and command buffer generation, and the rest of the data lives completely in VRAM. This makes simulation on a massive scale a reality.

If that got you interested, check out the video presentation of my implementation in Wicked Engine:

So, the high level flow of the GPU particle system described here is the following:

  1. Initialize resources:
    • Particle buffer with a size of maximum amount of particles [ParticleType*MAX_PARTICLE_COUNT]
    • Dead particle index buffer, with every particle marked as dead in the beginning [uint32*MAX_PARTICLE_COUNT]
    • 2 Alive particle index lists, empty at the beginning [uint32*MAX_PARTICLE_COUNT]
      • We need two of them, because the emitter writes the first one, simulation kills dead particles and writes the alive list again to draw later
    • Counter buffer:
      • alive particle count [uint32]
      • dead particle count [uint32]
      • real emit count = min(requested emit count, dead particle count) [uint32]
      • particle count after simulation (optional, I use it for sorting) [uint32]
    • Indirect argument buffer:
      • emit compute shader args [uint32*3]
      • simulation compute shader args [uint32*3]
      • draw args [uint32*4]
      • sorting compute shader arguments (optional) [uint32*3]
    • Random color texture for creating random values in the shaders
  2. Kick off particle simulation:
    • Update a constant buffer holding emitter properties:
      • Emitted particle count in current frame
      • Emitter mesh vertex, index counts
      • Starting particle size, randomness, velocity, lifespan, and any other emitter property
    • Write indirect arguments of following compute passes:
      • Emitting compute shader thread group sizes
      • Simulation compute shader thread group sizes
      • Reset draw argument buffer
    • Copy last frame simulated particle count to current frame alive counter
  3. Emitting compute shader:
    • Bind mesh vertex/index buffers, random colors texture
    • Spawn as many threads as there are particles to emit
    • Initialize a new particle on a random point on the emitter mesh surface
    • Atomically decrement the dead list counter while retrieving its previous value; this gives us the last dead list element, which we read to retrieve a free particle index into the particle buffer
    • Write the new particle to the particle buffer on this index
    • Increment alive particle count, write particle index into alive list 1
  4. Simulation compute shader:
    • Each thread reads alive list 1, and updates particle properties if particle has life > 0, then writes it into alive list 2. Increment Draw argument buffer.
    • Otherwise, kill particle by incrementing dead list counter and writing particle index to dead list
    • Write particle distance squared to camera for sorting (optional)
    • Iterate through force fields in the scene and update the particle accordingly (optional)
    • Check collisions with depth buffer and bounce off particle (optional)
    • Update AABB by atomic min-maxing particle positions for additional culling steps (optional)
  5. Sorting compute shader (optional):
    • An algorithm like bitonic sorting maps well to GPU, can sort a large amount
    • Multiple dispatches required
    • Additional constant buffer updates might be required
  6. Swap alive lists:
    • Alive list 1 is the alive list from previous frame + emitted particles in this frame.
    • In this frame we might have killed off particles in the simulation step and written the new list into Alive list 2. This will be used when drawing, and input to the next frame emitting phase.
  7. Draw alive list 1:
    • After the swap, alive list 1 should contain only the alive particle indices in the current frame.
    • Draw only the current alive list count with DrawIndirect. Indirect arguments were written by the simulation compute shader.
  8. Kick back and profit 🙂
    • Use your new additional CPU time for something cool (until you move that to the GPU as well)

Note: for adding particles, you could use append-consume structured buffers, or counters written by atomic operations in shader code. The append-consume buffers might include an additional performance optimization hidden from the user, which is GDS (global data share) on hardware that supports it: basically a small piece of fast access memory visible to every thread group, located on the GPU chip instead of in RAM. I went with the atomic counter approach and haven't tested the performance difference yet. The append-consume buffers are also not available in every API, which makes them less appealing.
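To make the atomic counter approach from step 3 a bit more concrete, a stripped-down emit shader could look something like the sketch below. The buffer names, the counter buffer byte offsets and the simplified particle initialization are all illustrative assumptions; the real emitter would spawn the particle on a random mesh surface point as described above.

struct Particle
{
  float3 position;
  float3 velocity;
  float life;
};

RWStructuredBuffer<Particle> particleBuffer;
RWStructuredBuffer<uint> deadList;
RWStructuredBuffer<uint> aliveList1;
// counter buffer layout (bytes): [0] alive count, [4] dead count, [8] real emit count, [12] post-simulation count
RWByteAddressBuffer counterBuffer;

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
  uint realEmitCount = counterBuffer.Load(8);
  if (DTid.x >= realEmitCount)
    return;

  // simplified initialization; the real emitter samples a random point on the emitter mesh surface
  Particle particle;
  particle.position = float3(0, 0, 0);
  particle.velocity = float3(0, 0, 0);
  particle.life = 1;

  // pop a free slot from the dead list (InterlockedAdd returns the value before the add)
  uint prevDeadCount;
  counterBuffer.InterlockedAdd(4, -1, prevDeadCount);
  uint newParticleIndex = deadList[prevDeadCount - 1];

  particleBuffer[newParticleIndex] = particle;

  // push the new particle index into alive list 1 for simulation/drawing
  uint prevAliveCount;
  counterBuffer.InterlockedAdd(0, 1, prevAliveCount);
  aliveList1[prevAliveCount] = newParticleIndex;
}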

gpuparticles3

The following features are new and nicely fit into the new GPU particle pipeline:

  • Sorting
    • I never bothered with particle sorting on the CPU. It was already kind of slow without it, so I got away with only sorting per-emitter, so that farther away emitters were drawn earlier. I decided to go with bitonic sorting because I could just pull that from the web; it is a bit too involved and I thought it would consume too much time to implement and debug on my own. AMD has a really nice implementation available. Sorting becomes a required step if the particles are not additively blended, because threads now write them in arbitrary order.
  • Depth buffer collisions
    • This is a very nice feature of GPU particle systems: essentially free physics simulation for particles which are on screen. It only involves reading the depth buffer in the simulation phase, checking if the particle is behind it, and if so, reading the normal buffer (or reconstructing the normal from the depth buffer) and modulating the particle velocity by reflecting it with the surface normal.
  • Force fields
    • This is completely possible with CPU particle systems as well, but now we can apply them to a much bigger simulation. In the simulation compute shader we can preload some force fields to LDS (local data share) for faster memory access.
  • Emit from skinned mesh
    • Mesh skinning is done on the GPU nowadays, so using the skinned meshes while emitting becomes trivial, with no additional cost whatsoever.
  • Async compute
    • I still haven't had a chance to try async compute, but this seems like a nice candidate for it, because the simulation could be largely decoupled from rendering and it could lead to better utilization of GPU resources. Async compute is available in the modern low level graphics APIs like DX12, Vulkan and console specific APIs. It also requires hardware support, which is only available in the latest GPUs.

gpuparticles1

Debugging

Debugging a system which lives on the GPU is harder than on the CPU, but essential. We should ideally make use of a graphics debugger, but there are also opportunities to make our life easier by creating some utilities for this purpose. The thing that helped me most is writing out some data about the simulation to the screen. For this, we need to access the data which is resident on the GPU, which works as if we were downloading something from a remote machine. Using the DirectX 11 API, we create a resource of the same type and size as the one we want to download, with D3D11_USAGE_STAGING usage, no bind flags and READ CPU access. We then issue a copy into this buffer from the one we want to download by calling ID3D11DeviceContext::CopyResource, and read the buffer contents by mapping it with READ flags. As the buffer contents will only be available when the GPU has finished rendering up to that point, we can either introduce a CPU-GPU sync point and wait in place until the operation completes, or do the mapping a few frames later. In a debugging scenario, a sync point might be sufficient and simpler to implement, but we should avoid any such behaviour in the final version of the application.
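As a rough sketch, the readback path could look like this in code; the device, the immediate context and the gpuBuffer we want to inspect are assumed to exist, and error handling is omitted:

D3D11_BUFFER_DESC desc = {};
gpuBuffer->GetDesc(&desc);
desc.Usage = D3D11_USAGE_STAGING;
desc.BindFlags = 0;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.MiscFlags = 0;

ID3D11Buffer* stagingBuffer = nullptr;
device->CreateBuffer(&desc, nullptr, &stagingBuffer);

// queue a GPU copy into the staging resource:
context->CopyResource(stagingBuffer, gpuBuffer);

// mapping right away introduces a CPU-GPU sync point, which is acceptable for debugging:
D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
{
  // mapped.pData now points to the downloaded buffer contents; copy it out and display it
  context->Unmap(stagingBuffer, 0);
}
stagingBuffer->Release();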

Drawing

Drawing billboards would seem like a nice place to use geometry shaders. Unfortunately, geometry shaders introduce inefficiencies into the graphics pipeline for various reasons: primitives need to be traversed and written to memory serially, and some architectures even go as far as writing the GS output to system memory. The option of my choice is to skip the geometry shader and do the billboard expansion in the vertex shader. For this, we must issue the draw with a triangle list topology and a vertex count of particleCount * 6, and calculate the particle index and billboard vertex index from the SV_VertexID system-value semantic. Like this:

static const float3 BILLBOARD[] = {
  float3(-1, -1, 0), // 0
  float3(1, -1, 0),  // 1
  float3(-1, 1, 0),  // 2
  float3(-1, 1, 0),  // 3
  float3(1, -1, 0),  // 4
  float3(1, 1, 0),   // 5
};

VertextoPixel main(uint fakeIndex : SV_VERTEXID)
{
  uint vertexID = fakeIndex % 6;
  uint instanceID = fakeIndex / 6;
  Particle particle = particleBuffer[aliveList[instanceID]];
  float3 quadPos = BILLBOARD[vertexID];
  // ...
}

Additionally, for better drawing performance, you should use indexed drawing with 4 vertices per quad, but that way the two index lists will be six times the size each, so bandwidth will increase for the simulation. Maybe it is still worth it, I need to compare performance results.

Conclusion:

There are many possibilities to extend this system, because compute shaders make it very flexible. I am overall happy with how this turned out. Granted, my previous particle system was quite simplistic, so porting all the features was not very hard and I didn't have to make any compromises. The new system frees up CPU resources which are more valuable for gameplay logic and other systems that are interconnected. Particles are usually completely decoupled from the rest of the engine, so they are an ideal candidate for running remotely on the GPU.

You can check the source code of my implementation of GPU-based particles in Wicked Engine:

Feel free to rip off any source code from there! Thank you for reading!

Inspiration from:

Compute-based GPU particles by Gareth Thomas

Which blend state for me?

think

If you are familiar with creating graphics applications, you are probably somewhat familiar with different blending states. If you are like me, then you were not overly confident in using them, and got some basic ones copy-pasted from the web. Maybe you got away with simple alpha blending and additive states, and heard of premultiplied alpha somewhere but didn't really care as long as it looked decent enough at the time. Surely, there is a lot of more interesting stuff waiting to be implemented. Then later you realize that something looks off with an alpha blended sprite somewhere. You correct it with a quick fix and forget about it. A week later, you want to play with some particle systems, but something is wrong there too: the blending doesn't look good anymore because of a dirty tweak you made earlier. Also, your GUI layer was displaying the wrong color the whole time, but just subtly enough not to notice. There are just so many opportunities for screwing up your blending states without noticing it immediately. Correcting the mistakes can quickly turn into a big headache. Here I want to give some practical examples and explanations of different use cases, for techniques mainly used in 3D rendering engines.

The first case is rendering alpha blended sprites on top of each other, straight to the back buffer. We need a regular alpha blending render state for that, which does this:

dst.rgb = src.rgb * src.a + dst.rgb * (1 - src.a)

Here, dst means the resulting color in the render target (which is now the backbuffer), and src is the color the pixel shader returns for the sprite. Our colors are standard 32 bit RGBA values in the [0, 1] range. With this, we have calculated a good output color which we can write as-is to the back buffer. We don't care what the alpha output is, because no further composition will be happening. Here is the corresponding state description for DirectX 11:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;
desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOp = D3D11_BLEND_OP_ADD;

The tricky part comes when we want to draw our alpha blended sprites to separate layers which will be composited later on. In that case we also have to be careful what alpha value we write out. Take a simple scenario for example, in which you render a sprite with alpha blending to a render target, and later you render your rendertarget to your backbuffer with the same alpha blending. For this, we want to accumulate alpha values, so just add them:

dst.a = src.a + dst.a

Which is equivalent to the following blend state in DirectX 11 (just append to the previous snippet):

desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;
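For reference, turning these per-render-target settings into an actual state object could look roughly like this; it is only a sketch, the device and context pointers are assumed to exist, and the remaining desc fields are left zeroed:

D3D11_BLEND_DESC blendDesc = {};
blendDesc.RenderTarget[0].BlendEnable = TRUE;
blendDesc.RenderTarget[0].SrcBlend = D3D11_BLEND_SRC_ALPHA;
blendDesc.RenderTarget[0].DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
blendDesc.RenderTarget[0].BlendOp = D3D11_BLEND_OP_ADD;
blendDesc.RenderTarget[0].SrcBlendAlpha = D3D11_BLEND_ONE;
blendDesc.RenderTarget[0].DestBlendAlpha = D3D11_BLEND_ONE;
blendDesc.RenderTarget[0].BlendOpAlpha = D3D11_BLEND_OP_ADD;
blendDesc.RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL;

ID3D11BlendState* blendState = nullptr;
device->CreateBlendState(&blendDesc, &blendState);

// bind it before drawing the sprites:
context->OMSetBlendState(blendState, nullptr, 0xFFFFFFFF);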

Accumulating alpha seems like a good fit for a 32 bit render target, as values will be clamped to one, so opacity increases with overlapping sprites but colors won't be oversaturated. Now try blending the render target layer onto the back buffer. There will be an error, which might not be obvious at first (the worst kind of error). Let me show you:

blend

On the black background you might not even notice the error at first, but on the white background it becomes apparent at once (in this case, "background" means our backbuffer). For different images, there might be different scenarios where the problem becomes apparent. This could be a challenge to overcome if you already made a lot of assets, maybe even compensated somehow for the error without addressing its source: the blend operation is not correct anymore! First, we blended the sprite onto the layer, which is still correct, but then we blend the layer with the same operation again, so alpha affects the colors twice. There is a correct solution to the problem: the premultiplied alpha blend operation, which is this:

dst.rgb = src.rgb + dst.rgb * (1 - src.a)

dst.a = src.a + dst.a

Create it in DX11 like this:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_ONE;
desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOp = D3D11_BLEND_OP_ADD;
desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

The only change is that we do not multiply with source alpha any more. Our problem is fixed now, we can keep using our regular alpha blending on sprites, but use premultiplied blending for the layers. Simple right? But do not forget about premultiplied alpha just yet, it can help us out for more problems as well.

I had multiple problems with rendering particle systems to off-screen buffers. Rendering particles off-screen is needed for soft particles, and it can also help performance if the buffer has a reduced resolution. One of the problems was the above mentioned faulty alpha blending of particles (the particles are the sprites, the layer is the off-screen buffer when mapped to the previous example). The other issue is that I also want to render additive particles to the same render target and blend the whole thing later in a single pass. This is an additive blending state:

dst.rgb = src.rgb * src.a + dst.rgb

dst.a = dst.a

Which corresponds to this state in DX11:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;
desc.DestBlend = D3D11_BLEND_ONE;
desc.BlendOp = D3D11_BLEND_OP_ADD;
desc.SrcBlendAlpha = D3D11_BLEND_ZERO;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Notice that the source doesn't contribute to alpha, but that is no problem at all, because the premultiplied layer blending will still take the layer color and just add it on top of the destination. This only works if our layer is composited with premultiplied blending; otherwise the additive contribution would just disappear when blending the layer. We can also have our particle textures themselves in a premultiplied format (where the texture colors are already multiplied by alpha), use a premultiplied blend state for them, and blend them onto the same layer: it just works. When I wasn't familiar with this, I used one render target for regular alpha blended and premultiplied-format particles and another one for additive ones. What a waste! We can see that premultiplied blending is very flexible, so keep it in mind, because it can save the day on many occasions. It also makes a huge difference in mipmap generation, see this neat article from Nvidia!

Side note: premultiplied alpha blending was also widely used because it has better blending performance. Think of it as precomputing the alpha blend factor and storing it inside the texture. The performance reasons are probably not so apparent today.

I had another problem with particle system rendering, because the particles were rendered to an HDR floating point target, which doesn't clamp the alpha values. Consider the case when a particle's alpha value is bigger than one for whatever reason and blended with regular alpha blending: the term dst.rgb * (1 - src.a) now of course produces negative values. This is easy to overcome by just saturating the alpha output of the pixel shader, done! The other problem is this: dst.a = src.a + dst.a can still result in alpha values larger than one, which only becomes a problem later on, when blending the layer with premultiplied alpha. We would need to saturate the (1 - src.a) term in the blending state, but we cannot, there is no such state. There is a D3D11_BLEND_SRC_ALPHA_SAT blend value, but there is none for its inverse. The workaround I am using is to modify the particle alpha blending state to accumulate alpha a bit differently:

dst.a = src.a + dst.a * (1 - src.a)

In DX11 terms:

desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

This accumulation method is probably not perfect, but works really well in practice:

particleblend

That's it, I think these are the three most important blending modes; most effects can be achieved with a combination of these. Just always keep an eye on your blend state creation and be very explicit about it, that is the way to avoid many bugs down the road. If you were like me and haven't paid much attention to these until now, this is the best time to reiterate on it, because tracking down the associated errors becomes a hard journey later. Thanks for reading!

Forward+ decal rendering

decals

Drawing decals in deferred renderers is quite simple, straightforward and efficient: just render boxes like you render the lights, read the gbuffer in the pixel shader, project onto the surface, then sample and blend the decal texture. The light evaluation then already computes lighting for the decaled surfaces. In traditional forward rendering pipelines, this is not so trivial. It is usually done by cutting out the geometry under the decal, creating a new mesh from it with projected texture coordinates and rendering it for all lights, additively. Apart from the obvious increase in draw call count and fillrate consumption, there is even potential for z-fighting artifacts. While moving to tile-based forward rendering (Forward+), we can surely think of something more high-tech.

We want to avoid additional geometry creation and increased draw call count while keeping the lighting computation constant. But in addition to that, with this new technique we can even trivially support modification of surface properties, creating decals which can modify surface normal, roughness, metalness, emissive, etc., or even do parallax occlusion mapping. We can even apply decals to transparent surfaces easily! This article describes the outline of the technique without source code. You can look at my implementation, however: culling and sorting shader; blending evaluation shader.

In Forward+ we have a light culling step and a light evaluation step separately. The decals will be inserted into both passes. The culling compute shader iterates through a global light array and decides for each screen space tile which lights are inside, and adds them to a global list (in the case of tiled deferred, it just adds them to a local list and evaluates lighting then and there). For adding decals to the culling, we need to extend the light descriptor structure to be able to hold decal information, and add functions to the shader to be able to cull oriented bounding boxes (OBBs). We can implement OBB culling by doing coarse AABB tests: transform the AABB of the tile by the decal OBB's inverse matrix (while keeping the min-max up to date) and test the resulting AABB against a unit AABB. This is achieved by determining the 8 corner points of the tile AABB, transforming each by the inverse decal OBB, then determining the min and max corner points of the resulting points.

tiled_decals

Rendering the decals takes place in the object shaders while we also evaluate the lighting. If the decals can modify surface parameters, like normals, it is essential that we render the decals before the lights. For that, we must have a sorted decal list. We cannot avoid sorting the decals anyway, as I have found out the hard way. Because the culling is performed in parallel, the decals can be added to the tile in arbitrary order, but we require a strict order when blending them: the order in which we placed them onto the surface. If we don't sort, it can lead to severe flickering artifacts when there are overlapping decals. Thankfully the sorting is straightforward, easily parallelized and can be done entirely in LDS (Local Data Share memory). I have gotten this piece of code from an AMD presentation (a bitonic sort implementation in LDS).

The easiest way is to sort the decals in the CS so that the bottom decal is first and the top one is last (bottom-to-top sorting). This way, we can do regular alpha blending (which is a simple lerp in the shader) easily. We can do better, though: this way we sample all of the decals, even if the bottom ones are completely covered by decals placed on top. Instead we should sort the opposite way, so that we evaluate the top ones first, and then the decals underneath, but only until the alpha accumulation reaches one. We can skip the rest. The blending equation also needs to be modified for this; the same idea is presented in the above mentioned AMD presentation for tile based particle systems. The modified blending equation looks like this (a small sketch follows below):

color = (invDestA * srcA * srcCol) + destCol

alpha = srcA + (invSrcA * destA)

This method can save us a lot of rendering time when multiple decals overlap. But it can produce different output when we have emissive decals, for example. With bottom-to-top blending, emissive decals will always be visible because their contribution is added to the light buffer, but the top-to-bottom sorting (and skipping) algorithm will skip decals which are completely covered. I think this is "better" behaviour overall, but that is subjective of course.
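To illustrate the top-to-bottom accumulation and the early-out, here is a minimal HLSL sketch; it operates on already-sampled decal colors to stay self-contained, while the real shader would sample each decal inside the loop by projecting the surface position:

// front-to-back (top-to-bottom) decal accumulation with early-out, following the equations above
float4 BlendDecalsTopToBottom(in float4 decalColors[4], in uint decalCount)
{
  float4 acc = 0; // rgb: premultiplied accumulated color, a: accumulated coverage
  for (uint i = 0; i < decalCount; ++i)
  {
    if (acc.a >= 1)
      break; // everything below this decal is fully covered, skip it
    float4 src = decalColors[i];
    acc.rgb += (1 - acc.a) * src.a * src.rgb; // color = (invDestA * srcA * srcCol) + destCol
    acc.a = src.a + (1 - src.a) * acc.a;      // alpha = srcA + (invSrcA * destA)
  }
  return acc; // composite onto the surface as: albedo = acc.rgb + albedo * (1 - acc.a)
}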

The nice thing about this technique is that we can trivially modify surface properties if we just sample all of our decals before all of the lights. Take this for example: we want to modify the normal of the surface with the decal normal map. We already have our normal vector in the object shader, so when we get to the decals, we just blend it in shader code with the decal normal texture, without the need for any packing/unpacking and tricky blending of g-buffers (à la deferred). The light evaluation which comes after it "just works" with the new surface normal without any modification at all.

Maybe you have noticed that we need to do the decal evaluation in dynamically branching code, which means we must give up the default mip-mapping support. This is because, from the compiler's standpoint, neighbouring pixels might not be evaluating the same decals, but we need those helper pixels for correct screen space derivatives. In our case, when the tile dimensions are a multiple of two pixels (I am using 16×16 tiles), we are coherent across the helper pixels, but the compiler unfortunately doesn't know that. I haven't yet found a satisfying way to overcome this problem. I experimented with linear distance/screen space size based mip selection, but found them unsatisfying for my purposes (they might be OK for a single game/camera type, though).

 

Update: Thanks to MJP, I learned a new technique for obtaining nice mip mapping results: we just need to take the derivatives of the surface world position, transform them by the decal projection matrix (but without the translation), and we have the decal texture coordinate derivatives that we can feed into Texture2D::SampleGrad, for example. An additional note: when using a texture atlas for the decals, we need to take the atlas texture coordinate transformation into consideration, so just multiply the decal derivatives by the scaling part of the atlas transformation. Cool technique!

We also need to somehow support different decal textures dynamically in the same object shader. A texture atlas comes in handy in this case, and bindless textures are also an option in newer APIs.

As we have added decal support to the tiled light array structure, the structure is probably getting bloated, which makes it less cache efficient, because most lights don't need a decal inverse matrix (for projection), texture atlas offsets, etc. For this, the decals could get their own structure and a separate array, or everything could be tightly packed into a raw buffer (ByteAddressBuffer in DX11). I need to experiment with this.

Conclusion:

This technique is a clear upgrade from traditional forward rendered decals, but comparing it with deferred decals is not a trivial matter. First, we can certainly optimize deferred decals in several ways; I have already been toying with the idea of using Rasterizer Ordered Views to optimize the blending in a similar way and eliminate overdraw. Secondly, we have increased branching and register pressure in the forward rendering pass, while rasterization of deferred decals is a much more lightweight shader which can be better parallelized when the overdraw is not so severe. In that case, we can get away with rendering many more deferred decals than tiled decals. The tile-based approach gets much better with increased overdraw because of the "skip the occluded" behaviour, as well as the reduced bandwidth cost of not having to sample a G-buffer for each decal. Forgive me for not providing actual profiling data at the moment; this article is intended to be merely a brain dump, but I hope it is also somewhat inspirational.

XBOX One and UWP rant

I have experimented with the Universal Windows Platform (UWP) for some time now. I mainly maintain the UWP compatibility for Wicked Engine, a game engine written by me in C++ and using the DirectX 11 API for rendering. It is a perfect candidate for a UWP application, because porting to new platforms need not be difficult: supporting desktop, phone, tablet, augmented reality devices and XBOX One should be trivial (at least in theory).

The game engine originally only targeted the traditional Windows desktop environment; when I found out about the Universal Windows Platform, I knew I had to try it out. It turned out to be quite easy. I just needed to replace a few calls to the original WinAPI with UWP variants, like getting the screen resolution, the window handling and keyboard input. We also need a UWP compliant application entry point; you can see an example in the Wicked Engine Demos project. All the rest remains the same: rendering code, joystick input, sound system, physics engine and the scripting interface. For phone/tablet you probably also need to handle the touch screen, but UWP comes with a nice API for it.

I have been testing the Windows Phone and UWP desktop builds, and while I don't think they are very relevant as of now, it still feels nice to have support for them. Recently I had the chance to acquire an XBOX One devkit for a short time, so I jumped on it, set up some demos in my engine and built it as a UWP app. The whole process was incredibly easy and straightforward, no change needed to be made in the code; the hardest part was launching the app on a "remote machine", which is the XBOX devkit itself. The application contained some lightweight scenes, like displaying a rotating image, a static mesh and an animated mesh, and some heavy rendering tasks, like an outdoor scene with multiple lights, shadows, instanced trees, particle systems, water, reflections, etc. The performance was abysmal in all of them. It really caught me off guard, because everything performed much better on my phone. It wasn't even rendering most of the things in real time: the complex scene ran at one frame per second, and even the image rendering could barely do 50 FPS. With a little investigation, I found out that I was creating my graphics device with the WARP rasterizer, which is a software renderer. The minimum requirement of the engine is D3D11, feature level 11.0 hardware, because it makes use of advanced rendering techniques which need support for structured buffers, unordered access resources, compute shaders, tessellation and the like from the graphics driver. It could only create an appropriate graphics device in software mode, which is very slow because the entire rendering now runs on the CPU.

But can't the XBOX One make use of D3D11 and even D3D12? The answer is: it can, but not with UWP apps. For developing high performance 3D applications, like games, you must be a registered XBOX developer, obtain the XBOX SDK (XDK) and build against the new platform that comes with it. Here's what they say about it in the "known issues" section of the UWP development article on MSDN:

“UWP on Xbox One supports DirectX 11 Feature Level 10. DirectX 12 is not supported at this time.

Xbox One, like all traditional games consoles, is a specialized piece of hardware that requires a specific SDK to access its full potential. If you are working on a game that requires access to the maximum potential of the Xbox One hardware, you can register with the ID@XBOX program to get access to that SDK, which includes DirectX 12 support.”

Apart from that, the developer also only gets access to limited system resources with the UWP platform, like a smaller amount of allowed memory allocation, reduced CPU speed and fewer threads. PIX, the graphics debugger which comes with the XDK, also didn't recognize my application, but maybe that was because of the WARP device. I could capture a frame with the Visual Studio graphics debugger, however.

This is just a really big letdown for me, because I was really interested in trying out the performance of my engine on a game console. I have no chance now, because I am not a registered developer, so there is no way of trying it out this time. Maybe I will look into the developer program, but I suspect that for a successful application they need an actual company, a business plan and probably much more.

Thanks for reading!

Skinning in a Compute Shader

skinning

Recently I have moved my mesh skinning implementation from a stream-out geometry shader to a compute shader. One reason for this was the ugly API of stream-out which I wanted to leave behind, but the more important reason was that this could come with several benefits.

First, compared to traditional skinning in a vertex shader, the render pipeline can be simplified, because we only perform skinning once for each mesh instead of once in each render pass. So when we render our animated models multiple times, for shadow maps, Z-prepass, lighting pass, etc., we use regular vertex shaders for those passes, with the vertex buffer swapped out for the pre-skinned vertex buffer. We also avoid a lot of render state setup, like binding bone matrix buffers for each render pass. But this can be done with a stream-out capable geometry shader as well.

The compute shader approach has some other nice features compared to the first point. The render pipeline of Wicked Engine requires the creation of a screen space velocity buffer. For that, we need our previous frame animated vertex positions. If we don't do skinning in a compute shader, we probably need to skin each vertex with the previous frame bone transforms in the current frame as well, to get the velocity of the vertex, which is currentPos - prevPos (if we have deinterleaved vertex buffers, we could avoid it by swapping vertex position buffers). In a compute shader this becomes quite straightforward: perform skinning only with the current frame bone matrices, but before writing out the skinned vertex to the buffer, load the previous value of the position, and that is your previous frame vertex position. Write it out to the buffer at the end.

In a compute shader, it is the developer who assigns the workload across threads, instead of relying on the default vertex shader thread invocations. Also, the vertex shader stage has strict ordering specifications, because vertices must be written out in the exact same order they arrived. A compute shader can just write into the skinned vertex buffer in any order when it is finished. That said, it is also the developer's responsibility to avoid write conflicts. Thankfully, this is quite trivial, because we are writing a linear array of data.

In compute shaders we can also make use of LDS memory to reduce memory reads. This can be implemented so that each thread in a group loads only one bone's data from main memory and stores it in LDS. The skinning computation then reads the bone data from LDS, and because each vertex now doesn't read 4 bones from VRAM but from LDS, it has the potential for a speedup. I have made a blog post about this (a small sketch follows below).
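A minimal sketch of this idea, assuming at most 256 bones and the same 256-thread group size as the full shader further below (the names and the constant buffer are illustrative):

struct Bone
{
  float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer;
cbuffer SkinningConstants
{
  uint g_boneCount;
};

groupshared float4x4 bonePoseLDS[256];

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
  // each thread loads at most one bone matrix from VRAM into LDS:
  if (groupIndex < g_boneCount)
  {
    bonePoseLDS[groupIndex] = boneBuffer[groupIndex].pose;
  }
  GroupMemoryBarrierWithGroupSync();

  // ...the skinning loop then reads bonePoseLDS[boneIndex] instead of boneBuffer[boneIndex].pose
}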

Another nice feature is the possibility to leverage async compute in the newer graphics APIs like DirectX 12, Vulkan or the Playstation 4 graphics API. I don't have experience with it, but I imagine it would be more taxing on memory, because we would probably need to double buffer the skinned vertex buffers.

Another optimization also becomes possible with this: if performance is bottlenecked by skinning in our scene, we can skip skinning distant meshes every other frame or so, a kind of level of detail technique for skinning.

The downside is that this technique comes with increased memory requirements, because we must write into global memory to provide the data up front for the following render passes. We also forgo the fast on-chip memory of the GPU (the memory used for passing vertex shader outputs to the pixel shader) for storing the skinned values.

Here is my shader implementation for skinning a mesh in a compute shader:


struct Bone
{
  float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer;

ByteAddressBuffer vertexBuffer_POS; // T-Pose pos
ByteAddressBuffer vertexBuffer_NOR; // T-Pose normal
ByteAddressBuffer vertexBuffer_WEI; // bone weights
ByteAddressBuffer vertexBuffer_BON; // bone indices

RWByteAddressBuffer streamoutBuffer_POS; // skinned pos
RWByteAddressBuffer streamoutBuffer_NOR; // skinned normal
RWByteAddressBuffer streamoutBuffer_PRE; // previous frame skinned pos

inline void Skinning(inout float4 pos, inout float4 nor, in float4 inBon, in float4 inWei)
{
  float4 p = 0;
  float3 n = 0;
  float4x4 m;
  float3x3 m3;
  float weisum = 0;

  // force loop to reduce register pressure
  // though this way we can not interleave TEX - ALU operations
  [loop]
  for (uint i = 0; ((i < 4) && (weisum < 1.0f)); ++i)
  {
    m = boneBuffer[(uint)inBon[i]].pose;
    m3 = (float3x3)m;

    p += mul(float4(pos.xyz, 1), m) * inWei[i];
    n += mul(nor.xyz, m3) * inWei[i];

    weisum += inWei[i];
  }

  bool w = any(inWei);
  pos.xyz = w ? p.xyz : pos.xyz;
  nor.xyz = w ? n : nor.xyz;
}

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
  const uint fetchAddress = DTid.x * 16; // stride is 16 bytes for each vertex buffer now...

  uint4 pos_u = vertexBuffer_POS.Load4(fetchAddress);
  uint4 nor_u = vertexBuffer_NOR.Load4(fetchAddress);
  uint4 wei_u = vertexBuffer_WEI.Load4(fetchAddress);
  uint4 bon_u = vertexBuffer_BON.Load4(fetchAddress);

  float4 pos = asfloat(pos_u);
  float4 nor = asfloat(nor_u);
  float4 wei = asfloat(wei_u);
  float4 bon = asfloat(bon_u);

  Skinning(pos, nor, bon, wei);

  pos_u = asuint(pos);
  nor_u = asuint(nor);

  // copy prev frame current pos to current frame prev pos:
  streamoutBuffer_PRE.Store4(fetchAddress, streamoutBuffer_POS.Load4(fetchAddress));

  // write out skinned props:
  streamoutBuffer_POS.Store4(fetchAddress, pos_u);
  streamoutBuffer_NOR.Store4(fetchAddress, nor_u);
}

Oh god I hate this wordpress code editor… (maybe I just can’t use it properly)

As you can see, it is quite simple code. I just call this compute shader with something like this:

Dispatch( ceil(mesh.vertices.getCount() / 256.0f), 1, 1);

These vertex buffers are not packed yet, which is quite inefficient. Of course, positions could probably be stored in 16-bit float3s (but then you must animate in local space), normals can be packed nicely into 32-bit uints, and bone weights and indices should be packed into a single buffer and into uints as well. If you are using raw buffers (ByteAddressBuffer in HLSL), you have to do the type conversion yourself. You can also use typed buffers, but performance may be diminished. You can see an example of these optimizations, with manual type conversion of compressed vertex streams, in my Wicked Engine repo.
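As an example of the normal packing mentioned above (one of many possible encodings; the helper names are made up for illustration):

// pack a normalized float3 normal into 8 bits per component of a uint:
uint PackNormal(in float3 n)
{
  uint3 u = (uint3)round(saturate(n * 0.5f + 0.5f) * 255.0f); // map [-1, 1] to [0, 255]
  return u.x | (u.y << 8) | (u.z << 16);
}

float3 UnpackNormal(in uint packed)
{
  float3 n = float3(packed & 0xFF, (packed >> 8) & 0xFF, (packed >> 16) & 0xFF);
  return n / 255.0f * 2.0f - 1.0f; // map [0, 255] back to [-1, 1]
}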

I have been using precomputed skinning in Wicked Engine for a long time now, so I can't compare with the vertex shader approach, but it is definitely not worse than the stream-out technique. I can imagine that for some titles it might not be worth it to store additional vertex buffers in VRAM and give up on-chip memory for the skinning results. However, this technique could be a candidate in optimization scenarios, because it is easy to implement and I think also easier to maintain, because we can avoid the shader permutations for skinned and non-skinned models.

Thanks for reading!