Thoughts on Skinning and LDS


I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Instead, we can utilize the LDS the following way: At the beginning of the shader, when the vertex information is being loaded, each thread also loads one matrix from the bone array and stores it inside the LDS. We must also synchronize the group to ensure the LDS bone array has been complete. After all the memory has been read from VRAM, we continue processing the skinning: iterate through all bone indices for the vertex, and load the corresponding bone matrix from LDS, then transform the vertex and blend by the bone weights. We just eliminated a bunch of memory latency from the shader.

Consider what happens when you have a loop which iterates through the bone indices for a vertex: First you load the bone and you want to immediately use it for the skinning computation, then repeat. Loading from a VRAM buffer causes significant latency until the data is ready to be used. If we unroll the loop, the load instructions can be rearranged and padded with ALU instructions that don’t depend on those to hide latency a bit. But unrolling the loop increases register allocations (VGPR = Vector General Purpose Register, for storing unique data to the thread; buffer loads consume VGPR unless they are known to be common to the group at compile time, then they can be placed to scalar registers – SGPR) and can result in lower occupancy as we have a very limited budget of them. We also want a dynamic loop instead because maybe a vertex has fewer bones than the maximum, so processing could exit early. So having tight dynamic loop with VRAM load then immediately ALU instructions is maybe not so good. But once that loop only accesses LDS, the latency can be significantly reduced, and performance should increase.

But LDS does not come for free and it is also a limited resource, like VGPR. Let’s look at the AMD GCN architecture: We have a maximum of 64 KB of LDS for a single compute unit (CU), though HLSL only lets us use 32 KB in a shader. If a shader uses the whole 32 KB, it means that the shader can only be running two instances of itself on the CU. We have a bone data structure which is a 3×4 float matrix, 48 bytes. We can fit 682 bones into LDS and still have two instances of the compute shader operate in parallel. But most of the time we hardly have skeletons consisting of that many bones. In my experience, less than 100 bones should be enough for most cases, but we surely won’t use more than say 256 bones for a highly detailed model, either in real time apps. So say that our shader will declare an LDS bone array of 256 bones, and the thread group size is also 256, so each thread will load one bone into LDS. 256*48 bytes = 12 KB. This means that 5 instances of this shader could be running in parallel on a CU, so 5*256 = 1280 vertices processed. That is if we don’t exceed the max VGPR count of 65536 for a CU. In this case it means that a single shader must at maximum fit into the  51 VGPR limit (calculated as 65536 VGPR / 1280 threads). Most cases we will easily fit into even a 128 bone limit, so an LDS bone array size of 128 and thread group size of 128 threads will just be enough and be much easier on the GPU.

However, I can imagine a scenario, which could be worse with the LDS method, if there is a complicated skeleton, but small mesh referencing only a few of the bones. In this case when there is one skeleton for multiple meshes, maybe we should combine the meshes into a single vertex buffer and use an index buffer to separate between them, so this way a single dispatch could animate all the meshes, while they can be divided into several draw calls when needed.

It sounds like a good idea to utilize LDS in a skinning shader and it is a further potential improvement over skinning in vertex/geometry shader and doing stream out. But as with anything on the GPU this should be profiled on the target hardware which you are developing on. Right now my test cases were unfortunately maybe not the best assets there are, and a puny integrated Intel mobile GPU, but even so I could find a small performance improvement with this method.

Thank you for reading, you can find my implementation on GitHUB! Please tell me if there is any incorrect information.

UPDATE: I’ve made a simple test application with heavy vertex count skinned models, and toggleable LDS skinning: Download WickedEngineLDSSkinningTest (You will probably need Windows 10 to run it and DirectX 11)

Further reading:

Large thread groups on GCN

Easy Transparent Shadow Maps


Supporting transparencies with traditional shadow mapping is straight forward and allows for nice effects but as with anything related to rendering transparents with rasterization, there are corner cases.

The implementation is really simple once you have implemented shadow mapping for opaque objects. After the opaque shadow pass, we must render the transparents into a color buffer, but reject samples which would be occluded by opaques, so using a depth read-only depth stencil state. The transparents should be blended multiplicatively. Sorting does not matter with a multiply blend state. In bullet points:

  1. Render opaque objects into depth stencil texture from light’s point of view
  2. Bind render target for shadow color filter: R8G8B8A8_UNORM works good
  3. Clear render target to 1,1,1,0 (RGBA) color
  4. Apply depth stencil state with depth read, but no write
  5. Apply multiplicative blend state eg:
    • SrcBlend = BLEND_DEST_COLOR
    • DestBlend = BLEND_ZERO
    • BlendOp = BLEND_OP_ADD
  6. Render transparents in arbitrary order

When reading shadow maps in shading passes, we only need to multiply the lighting value with the transparent shadow map color filter if the pixel is inside the light. There is a slight problem with this approach, that you will notice immediately. Transparent objects now receive their own colored self shadow too. The simplest fix is to just disable the colored part of the shadow calculation for transparent objects. We can already produce nice effects with this, this is not a huge price to pay.

See it in action, transparents are rendered without colored self-shadows:


But they receive shadows from opaque objects just fine:


There is a technique which would allow us to render colored shadows on top of transparents too. This involves keeping an additional shadow depth map for transparencies. The flow of this technique is like this (from a Blizzard presentation):

  1. Render opaque shadow map
  2. Render transparent shadow map
    • To a separate depth texture!
    • Depth writes ON
  3. Clear shadow color filter texture (like in the simple approach)
  4. Render transparent again to color filter render target
    • But use the opaque shadow map’s depth stencil
    • depth read ON
    • depth write OFF

And in the shading step, now there will be two shadow map checks, one for the opaque, one for the transparent shadow maps. Only multiply the light with the shadow filter color texture when the transparent shadow check fails.

This will eliminate false self shadows from transparent objects. But unfortunately now when a transparent receives a colored shadow, its own transparent shadow color will also contribute to itself.

What’s more interesting, are the additional effects we can achieve with transparent shadow maps:

Textured shadows, which can be used as a projector for example, just put a transparent textured geometry in front of a light:


And underwater refraction caustics:


The caustics are implemented as adding additional light to the surface, so even if the surface below the water would be in the water’s shadow it would add in the caustic effect. For this I am using the transparent color filter render target’s alpha channel with an additive blend mode. The water shadow shader renders the slope of the water normal map into the alpha channel, eg. this:

float4 outputColor.a = 1 – saturate(dot(normalMap.rgb * 2 – 1, float3(0, 0, 1)));

Which could probably use a more sophisticated method, which I didn’t bother to find yet. So in the light evaluation shaders, I add the transparent color filter alpha channel to the pixelIsInShadow value multiplied by some causticStrength modulator to render the caustics. Caustics are added because there can be brighter caustics than the light brightness, because in reality they are generated when multiple rays (photons) arrive at the same location from different sources.

That’s it, I think this is a worthwhile technique to include in a game engine. Even the simplest implementation can be used for many interesting effects. Thank you for reading!

Related article:

StarCraft 2 Effects and techniques

Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such technique we can efficiently query every light affecting any surface. But a trivial implementation has many ways to improve. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.


For the starting point I assume a tiled rendering pipeline like the one described in a GDC 2011 presentation by Johan Andersson (DICE). The following post will be applicable to the Tiled Deferred and Forward+ variants of that pipeline. A quick recap of such a renderer:

  1. Dispatch a compute shader thread group per screen-space tile
  2. Thread group calculates tile frustum in view space
  3. Each thread reads pixel from depth buffer, performs atomic min-max to determine depth bounds
  4. Tile frustum is truncated by depth bounds
  5. Each thread computes if a light is inside the frustum, adds the light to a per tile list if yes
  6. Compute final shading from the per tile light list
    • Deferred can just read gbuffer, loop through per tile light list in the same shader and write the light buffer texture
    • Forward+ exports light list that will be read by second geometry pass and compute shading in pixel shaders

Our main goal is to refine culling results with eliminating false positives, which means cull lights which do not contribute to the final shading.

1.) Help out frustum culling with AABB

The first problem arises when we perform the culling of lights as sphere-frustum intersection tests. Let me demonstrate this with a point light:



Blue pixels on the second picture visualize which tiles contain the sphere light source after the culling. But the results are strange, why is the blue area kind of sqare shaped, when it should be a circle instead? This results because the math used for sphere – frustum plane tests are not accurate enough when we test big spheres against small frustums. Let me try illustrating this problem. The following image shows a tile frustum as seen on the screen:


As you can see, the sphere is completely outside the frustum, but none of the plane tests pass completely, so the light is not culled away properly. We can overcome this, by using axis aligned boxes (AABBs) instead of frustums. It is implemented like this:

bool SphereIntersectsAABB(in Sphere sphere, in AABB aabb)


  float3 vDelta = max(0, abs( – – aabb.extents);

  float fDistSq = dot(vDelta, vDelta);

  return fDistSq <= sphere.radius * sphere.radius;


The result:


This one seems good enough, but don’t throw out frustum culling just yet!

2.) Depth discontinuities are enemy

Holes in the depth buffer can greatly reduce efficiency of a tile based light selection, because the tile enclosing min-max depth AABB can get huge really fast:


In the image above I tried to illustrate (from an above point of view) that a depth discontinuity made the AABB large enough to intersect with a light which the frustum culling would have rejected. This is why AABB should be used alongside frustum culling and complementing each other.

Depth discontinuities usually introduce an other inefficiency, because there might be cases when a light will lie in empty space, not intersecting with anything, but still inside the tile, so the shading will receive the light, but it will not contribute at all:


As you can see, that light is inside the frustum, inside the AABB, but it is in empty space, between geometries, but our current algorithm will add it to the light list.

To solve this, there is a technique called 2.5D light culling, introduced by Takahiro Harada. In addition to that presentation, I would like to give an implementation for this in HLSL. So the basic idea is to create two bitmasks, one for the tile and one for the light which we are checking. The bitmasks are used by doing a bitwise AND operation with them to determine if the light intersects any geometry (when AND returns non zero) in the tile or not (when AND returns zero).


For the sake of a simpler image, I used a 9-bit mask, but we should use a 32-bit mask which we can represent by a uint variable.

The first bitmask is created for the whole tile once. While each thread reads its corresponding pixel from the depth buffer, it does an atomic min-max already, but now it also fills in a single relevant bit in a uint and performs an atomic OR to the tile bitmask. So what is the relevant bit? The algorithm says that we divide our tile depth range into 32 pieces and a 32-bit uint variable will contain those ranges. We first determine our tile depth bounds in linear space for this, then fill in the corresponding bit accordingly:

groupshared uint tileDepthMask = 0;

// …

float minDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, minDepth, 1)).z;

float maxDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, maxDepth, 1)).z;

float realDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, pixelDepth, 1)).z;

float depthRangeRecip = 32.0f / (maxDepthVS – minDepthVS);

uint depthmaskcellindex = max(0, min(32, floor((realDepthVS – minDepthVS) * depthRangeRecip)));

InterlockedOr(tileDepthMask, 1 << depthmaskcellindex);


This code is being run by every thread in the group. The unexplained function called UnprojectScreenSpaceToViewSpace just does what is says, the input is a screen coordinate point, and transforms it to view space. We are only interested in the Z coordinate here, so we only need to transform the input with the inverse projection matrix and divide the result by the w component afterwards. Otherwise if we would be interested in XY coordinates, we would also need to transform from [0,1] to [-1,1] range before projection. The function would look like this for the common case:

float4 UnprojectScreenSpaceToViewSpace(in float4 screenPoint)


  float4 clipSpace = float4(float2(screenPoint.x, 1 – screenPoint.y) * 2 – 1, screenPoint.z, screenPoint.w);

  float4 viewSpace = mul(clipSpace, xInverseProjectionMatrix);

  viewSpace /= viewSpace.w;

  return viewSpace;


So the bitmask construction code might look a bit intimidating, so let me explain a bit better what’s happening. We calculate the minZ, maxZ and current pixel Z in view space and determine the depth slice size which a single bit will represent (depthRangeRecip). Then shift a bit to the right place and adding it to the group shared tile mask by means of an atomic OR operation.

The tile mask is complete, so we only need to know how to construct a light mask. That must be done inside the loop where we are culling lights. On the first try I cooked up this:

float fMin = lightPosViewSpace.z – lightRadius.r;

float fMax = lightPosViewSpace.z + lightRadius.r;

uint lightMaskcellindexSTART = max(0, min(32, floor((fMin – minDepthVS) * depthRangeRecip)));

uint lightMaskcellindexEND = max(0, min(32, floor((fMax – minDepthVS) * depthRangeRecip)));

uint lightMask = 0;

for (uint c = lightMaskcellindexSTART; c <= lightMaskcellindexEND; ++c)


  lightMask |= 1 << c;


Here we determine the beginning and end ranges of a sphere light inside the view space depth range and push bits into the mask in a loop to the correct places one-by one:

In this mask for example, lightMaskcellindexSTART is the 11th bit from the right, and lightMaskcellindexEND is the 21st bit from the right:


Of course this loop seems like a waste to do inside a shader, so I needed to come up with something better. Rethinking how this a smaller bitfield could be pushed inside a bigger bitrange gave me the idea to exploit the truncation by the bitwise shift operators:

  • First, fill full mask like this:


    • uint lightMask = 0xFFFFFFFF;
  • Then Shift right with spare amount to keep mask only:


    • lightMask >>= 31 – (lightMaskcellindexEND – lightMaskcellindexSTART);
  • Last, shift left with START amount to correct mask position:


    • lightMask <<= lightMaskcellindexSTART;

So the resulting code eliminated a loop for only a very few instructions which is a lot better.

We have the tile mask and the light mask, so the only thing left to do is AND them to determine if a light touches something or not:

bool intersect2_5D = tileMask & lightMask;

And the resulting comparison of culling results in a high depth discontinuity scene with alpha tested vegetation and many point lights (red tiles have more that 50 lights):


As you can see, there is a big improvement in the tile visualizer heatmap, the shading will process much less lights and performance will improve in these difficult cases with alpha tested geometry. The scene shown above had the following timing results (performed with DX11 timestamp queries):

  • Basic 2D culling:
    • Culling: 0.53 ms
    • Shading: 10.72 ms
  • Improved 2.5D culling:
    • Culling: 0.64 ms
    • Shading: 7.64 ms

As you can see this is a definite improvement because while the culling shader took a bit more time to finish (0.1 ms), the object shading  took 3 ms less. I made a more detailed video some time ago:

3.) Other light types

We surely have the need to cull other light types than point lights. Spot lights come to mind at first thought, but there can be the desire to cull decal boxes or area-aligned local environment probe volumes as well.

  • Spot lights:
    • Implement a cone culling algorithm in shader code. This can be more expensive than sphere culling and again not too accurate with frustum tests. Then going further and implementing cone – AABB tests will further slow down our algorithm.
    • Or approximate the spotlight cone with tightly fitting a sphere around it. We don’t need to implement any other culling algorithms, only fitting a sphere around the spotlight, which is a straight forward, easy calculation. But this will result in excessive waste for thin/long cones. Code to fit sphere around cone:
      • float spotLightConeHalfAngleCos = cos(spotLightFOV * 0.5f);

        float sphereRadius = spotlightRange * 0.5f / (spotLightConeHalfAngleCos *spotLightConeHalfAngleCos);

        float3 sphereCenter = spotLightPos + spotLightDirection * sphereRadius;

      • Remember that you can precalculate this outside the shader, so the shader will only have a sphere to evaluate.
    • An other method involves doing cone-sphere test instead and can produce much more accurate results than the previous two, check it out here.
  • Decal/Environment probe box:
    • The idea here is to get away with AABB-AABB tests. But our decals or envprobes can be oriented in world space arbitrarily so they are OBBs! We can do something that I’d call a coarse AABB test. I transform the tile AABB from view space to world space (using the inverse view matrix of the camera) and recalculate the AABB of the resulting OBB. Then for each decal I transform the world space tile AABB by the decal’s inverse world matrix and perform an AABB-AABB intersection test with the resulting AABB and a unit AABB. The results are kind of good, but we have to refine again with the frustum as well. Transforming an AABB with a matrix is quite heavy though, because I am multiplying each corner with a matrix and keep track of the min-max at the same time. With two AABB transforms it gets heavy, but I have not yet found a better solution for this. And culling decals or envprobes is much less frequent than lights, so I comfort myself with that idea for the time being. The results:


4.) Gather vs. Scatter approaches

The implementation described by the above mentioned Battlefield presentation deals with the gather approach, which means that each thread group iterates through every potential light. A culling shader like this works best for a couple hundred lights, but can become slow when approaching a few thousand. To support more lights, we can implement some kind of a scatter approach. My suggestion is to have a coarse culling step before the regular light culling which operates on much bigger screen tiles and dispatches the work differently. This shader is dispatched so that each thread will process a single light, and determine which tile it belongs to and write (scatter) the light index into the according tile. Then the regular culling would read the light lists from the coarse tiles as opposed to iterating through each light on the screen. We could also use an acceleration structure for the lights like an octree for example and let the shader use that instead of coarse culling.

A scatter approach could be implemented for a different purpose as well: refining culling results. I described above that we can approximate spotlights with spheres, but an other idea would be to rasterize a cone instead in low resolution and let the pixel shader write the light into the appropriate tile corresponding to the invoked pixel.

5.) Don’t be greedy

It might be tempting to create a single uber-shader handling everything, creating frustums, reading depth buffer and assembling depth bounds, creating tile AABB, creating tile bitmask, culling lights, and in the case of tiled deferred, also evaluating the shading at the end. In reality, though it could be the wrong path to take.

First, on AMD GCN architecture for example, resources such as registers are shared across hardware execution units and if a shader uses too many, there will be contention for them and parallel execution will be reduced so that overall performance will be bottlenecked. This is called register pressure. Our shader which is creating frustums at the beginning are already using many registers for example which could be precalculated instead to lighten the load. AABB calculation further reduces available registers and so does calculating the tile bitmask. A tiled deferred shader at the end of the culling shader can also be very complex and utilizing many registers at once.

Then there is the part when we are creating the depth bounds with atomic operations. Atomic operations can be slow, so calculating the depth bounds in a separate shader by means of parallel reduction could be a better alternative. A reduced resolution depth buffer can also be reused later as a hierarchical-Z pyramid for instance.

Divergent branching in a shader is only a good idea if we design the shader to be highly coherent in the branches it takes for a thread group. A light culling setup usually works best in 16×16 or 32×32 pixel tiles, and each thread gets a minor task of culling a single light. This task is highly divergent in the path each thread will take. A light evaluation shader has a different behaviour as opposed to that, because each thread will potentially process the same array of lights in its given tile. Except there could be cases when a light/decal calculation will exit early or skip shadow calculations, etc… with a pixel granularity instead of per tile granularity. In that case, it is already inefficient to utilize big thread groups because long iterating threads will hold back the rest in the group from exiting early and the hardware will be underutilized. So a smaller tile size for the light evaluation should be preferred (8×8 worked best in my experiments).

Seeing these problems I propose to separate the big shader into several smaller parts. Frustum precomputation could go into its own shader. An other shader could reduce the depth buffer and create the depth bounds with the tile bitmask information. Tile AABB computation could also go in there potentially. The culling shader would only load the depth bounds, tile AABB and bitmask from memory, then perform culling the lights, then export the per tile light lists into memory. The last shader is the surface shading, light evaluating shader, which is a vertex-pixel shader for Forward+ and a compute shader for tiled deferred with a smaller blocksize than the culling (8×8 proposed as in previous paragraph).

Separating the shaders opens up an other optimization possibility, to use Indirect Dispatch. Consider for example that the culling determined for a tile that no lights are inside it, so it would instruct the surface shading shader not to dispatch any work groups for that tile.


A lot has been covered, so let’s recap quickly:

  • Frustum-sphere tests are inaccurate
  • Refine with AABB-sphere tests
  • Work against depth discontinuities with 2.5D culling bitmasks
  • Small burden on culling can be big win with shading
  • Approximate complex light types with spheres and AABBs
  • Avoid overworked ubershaders
  • Prefer smaller shaders, better fitted to specific task at hand

Thanks for reading, hope you enjoyed it! If you have any feedback, please share with me. Were there any mistakes? Definetly share 🙂

Further reading:

Introduction to tiled deferred rendering

Reference for 2.5D culling

Tile based rendering advancements GDC 2015

Awesome Forward+ tutorial

RealTimeCollisionDetection.NET has most of the intersection code

Investigating efficient spotlight culling

Next power of two in HLSL

There are many occasions when a programmer would want to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader and I had this piece of code be responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;

while( myNumberPowerOfTwo < myNumber)


   myNumberPowerOfTwo <<= 1;


It gets the job done, but doesn’t look so nice. For not unusual cases when myNumber is more than 1000 it can already take ten cycles to loop. I recently learned that HLSL has a built in function called firstbithigh. It returns the position of the first non zero bit in a 32-bit number starting from the left to the right (from high order to low). With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:


For that number, firstbithigh returns the number of the location of the first non zero bit (from the left), which is 24. The next power of two to that number would be:


That number has seven leading zeroes. All we have to do is zero the whole number but set a single bit to non-zero. To do it, first set up 2 in binary:


Then shift the whole thing left by the position of the first bit in myNumber to get what we are looking for:


Nice one, but we have a few corner cases we need to eliminate. If myNumber is already a power of two, we will get a greater power of two which might not be what we are looking for. In that case we should return myNumber itself. If you subtract one from myNumber before calling firstbithigh on it, it will ensure that power of two numbers will return themselves:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber – 1);

Take 1024 for example, it will get the next power of two for 1023 which is 1024. Or take 129 for example, it will get the next power of two for 128 which is 256, so works out quite well. If you try to calculate it for 0 however, there is a potential bug if myNumber is unsigned, as -1 in unsigned would wrap around to 0xFFFFFFFF. If you can potentially call it for zeroes and operate on unsigned, I would suggest doing it like this:

uint myNumberPowerOfTwo = 2 << firstbithigh(max(1, myNumber) – 1);

If it is signed int, then you don’t have to worry about it, as the firstbithigh function will return zero for negative numbers.

Have any question, noticed a mistake? Please post in the comments below!

Thank you for reading!

GPU-based particle simulation


I finally took the leap and threw out my old CPU-based particle simulation code and ventured to GPU realms with it. The old system could spawn particles on the surface on a mesh with a starting velocity of each particle modulated by the surface normal. It kept a copy of each particle on CPU, updated them sequentially, then uploaded them to GPU for rendering each frame. The new system needed to keep the same set of features at a minimum, but GPU simulation also opens up more possibilities because we have direct access to resources like textures created by the rendering pipeline. It is also highly parallellized compared to the CPU solution, both the emitting and the simulation phase which means we can do a much higher amount of particles in the same amount of time. There is less data moving between the system and GPU, we can get away with only a single constant buffer update and command buffer generation, the rest of the data lives completely in VRAM. This makes simulation on a massive scale a reality.

If that got you interested, check out the video presentation of my implementation in Wicked Engine:

So, the high level flow of the GPU particle system described here is the following:

  1. Initialize resources:
    • Particle buffer with a size of maximum amount of particles [ParticleType*MAX_PARTICLE_COUNT]
    • Dead particle index buffer, with every particle marked as dead in the beginning [uint32*MAX_PARTICLE_COUNT]
    • 2 Alive particle index lists, empty at the beginning [uint32*MAX_PARTICLE_COUNT]
      • We need two of them, because the emitter writes the first one, simulation kills dead particles and writes the alive list again to draw later
    • Counter buffer:
      • alive particle count [uint32]
      • dead particle count [uint32]
      • real emit count = min(requested emit count, dead particle count) [uint32]
      • particle count after simulation (optional, I use it for sorting) [uint32]
    • Indirect argument buffer:
      • emit compute shader args [uint32*3]
      • simulation compute shader args [uint32*3]
      • draw args [uint32*4]
      • sorting compute shader arguments (optional) [uint32*3]
    • Random color texture for creating random values in the shaders
  2. Kick off particle simulation:
    • Update a constant buffer holding emitter properties:
      • Emitted particle count in current frame
      • Emitter mesh vertex, index counts
      • Starting particle size, randomness, velocity, lifespan, and any other emitter property
    • Write indirect arguments of following compute passes:
      • Emitting compute shader thread group sizes
      • Simulation compute shader thread group sizes
      • Reset draw argument buffer
    • Copy last frame simulated particle count to current frame alive counter
  3. Emitting compute shader:
    • Bind mesh vertex/index buffers, random colors texture
    • Spawn as many threads as there are particles to emit
    • Initialize a new particle on a random point on the emitter mesh surface
    • Decrement dead list counter atomically while getting last value, this is our new dead particle index, read the dead list on that location to retrieve the particle index for the particle buffer
    • Write the new particle to the particle buffer on this index
    • Increment alive particle count, write particle index into alive list 1
  4. Simulation compute shader:
    • Each thread reads alive list 1, and updates particle properties if particle has life > 0, then writes it into alive list 2. Increment Draw argument buffer.
    • Otherwise, kill particle by incrementing dead list counter and writing particle index to dead list
    • Write particle distance squared to camera for sorting (optional)
    • Iterate through force fields in the scene and update particle according (optional)
    • Check collisions with depth buffer and bounce off particle (optional)
    • Update AABB by atomic min-maxing particle positions for additional culling steps (optional)
  5. Sorting compute shader (optional):
    • An algorithm like bitonic sorting maps well to GPU, can sort a large amount
    • Multiple dispatches required
    • Additional constant buffer updates might be required
  6. Swap alive lists:
    • Alive list 1 is the alive list from previous frame + emitted particles in this frame.
    • In this frame we might have killed off particles in the simulation step and written the new list into Alive list 2. This will be used when drawing, and input to the next frame emitting phase.
  7. Draw alive list 1:
    • After the swap, alive list 1 should contain only the alive particle indices in the current frame.
    • Draw only the current alive list count with DrawIndirect. Indirect arguments were written by the simulation compute shader.
  8. Kick back and profit 🙂
    • Use your new additional CPU time for something cool (until you move that to the GPU as well)

Note: for adding particles, you could use append-consume structured buffers, or counters written by atomic operations in the shader code. The append-consume buffers might include an additional performance optimization hidden from the user, which is GDS (global data share) for the hardware that supports it. Basically it is a small piece of fast access memory visible to every thread group located on a separate chip instead of the RAM. I went with the atomic counter approach and haven’t tested performance differences yet. The append-consume buffers are not available in every API which makes them less appealing.


The following features are new and nicely fit into the new GPU particle pipeline:

  • Sorting
    • I never bothered with particle sorting on the CPU. It was already kind of slow without it, so I got away with only sorting per-emitter, so farther away emitters were drawn earlier. I decided to go with bitonic sorting because I could just pull that from the web. This is a bit too involving and though that it would consume too much time to implement on my own and debug. AMD has a really nice implementation available. Sorting becomes a required step if the particles are not additively blended because threads are now writing them in arbitrary order.
  • Depth buffer collisions
    • This is a very nice feature of GPU particle systems. This is essentially a free physics simulation for particles which are on the screen. This only involves reading the depth buffer in the simulation phase, checking if the particle is behind it, and if it is, then read the normal buffer (or reconstruct normal from depth buffer) and modulate particle velocity by reflecting it with the surface normal.
  • Force fields
    • This is completely possible with CPU particle systems as well, but now we can apply them to a much bigger simulation. In the simulation compute shader we can preload some force fields to LDS (local data share) for faster memory access.
  • Emit from skinned mesh
    • Mesh skinning is done on the GPU nowadays, so using the skinned meshes while emitting becomes trivial, with no additional cost whatsoever.
  • Async compute
    • Now I still haven’t had a chance to try any async compute, but this seems like a nice candidate for that because simulation could be very much decoupled from rendering and it could lead to better utilization of GPU resources. Async compute is available in the modern low level graphics APIs like DX12, Vulkan and console specific APIs. It also requires hardware support which is available only in the latest GPUs.



Debugging a system which is living on the GPU is harder than on the CPU but essential. We should ideally make use of a graphics debugger software, but there are also opportunities to make our life easier with creating some utilities for this purpose. The thing that helped me most is writing out some data about the simulation to the screen. For this, we need to access the data which is resident on the GPU, which we can do as if we were downloading something from a remote machine. Using the DirectX 11 API, we can do this by creating a resource of the same type and size that we want to download and creating it with D3D11_USAGE_STAGING flag, no bind flags and READ CPU access. We have to issue a copy into this buffer from the one we want to download by calling ID3D11DeviceContext::CopyResource, then read the buffer contents by mapping it with READ flags. As the buffer contents will only be available when the frame is finished with rendering until that point, we can either introduce a CPU-GPU sync point and wait in place until the operation completes or do the mapping a few frames later. In a debugging scenario, a sync point might be sufficient and simpler to implement, but we should avoid any such behaviour in the final version of the application.


Drawing billboards would be seem like a nice place to use geometry shaders. Unfortunately, geometry shaders introduce inefficiencies in the graphics pipeline, because of various reasons. Primitives need to be traversed and written to memory serially, some architectures even go as far as writing the GS output to system memory. The option of my choice is just leaving the geometry shader and doing the billboard expansion in the vertex shader. For this, we must spawn the VS with a triangle list topology, vertex count of particleCount * 6 and calculate the particle index and billboard vertex index from the SV_VertexID system-value semantic. Like this:

static const float3 BILLBOARD[] = {

  float3(-1, -1, 0), // 0

  float3(1, -1, 0), // 1

  float3(-1, 1, 0), // 2

  float3(-1, 1, 0), // 3

  float3(1, -1, 0), // 4

  float3(1, 1, 0), // 5


VertextoPixel main(uint fakeIndex : SV_VERTEXID)


  uint vertexID = fakeIndex % 6;

  uint instanceID = fakeIndex / 6;

  Particle particle = particleBuffer[aliveList[instanceID]];

  float3 quadPos = BILLBOARD[vertexID];

  // …


Additionally, for better drawing performance, you should use indexed drawing with 4 vertices per quad, but that way the two index lists will be six times the size each, so bandwidth will increase for the simulation. Maybe it is still worth it, I need to compare performance results.


There are many possibilities to extend this system, because compute shaders make it very flexible. I am overall happy with how this turned out. Provided my previous particle systems were quite simplistic, porting all the features was not very hard and I haven’t had to make any compromises. The new system frees up CPU resources which are more valuable for gameplay logic and other systems which are interconnected. Particles are usually completely decoupled from the rest of the engine so they are an ideal candidate for running it remotely on a GPU.

You can check the source code of my implementation of GPU-based particles in Wicked Engine:

Feel free to rip off any source code from there! Thank you for reading!

Inspiration from:

Compute – based GPU particles by Gareth Thomas

Which blend state for me?


If you are familiar with creating graphics applications, you are probably somewhat familiar with different blending states. If you are like me, then you were not overly confident in using them, and got some basics ones copy-pasted from the web. Maybe got away with simpe alpha blending and additive states, and heard of premultiplied alpha somewhere but didn’t really care as long as it looked decent enough at the time. Surely, there are a lot of much more interesting stuff waiting for you to be implemented. Then later you realize, that something looks off with an alpha blended sprite somewhere. You correct it with some quick fix and forget about it. A week later, you are want to be playing with some particle systems, but there is something wrong with that, the blending doesn’t look good anymore because of a dirty tweak you made earlier. Also, your GUI layer was displaying the wrong color the whole time, but just enough not to notice. There are just so many opportunities for screwing up your blending states without noticing it immediately. Correcting the mistakes can really quickly turn into a big headache. Here I want to give some practical examples and explanations of different use cases, for techniques mainly used in 3D rendering engines.

First thing is rendering alpha blended sprites on top of each other, just to the back buffer immediately. We need a regular alpha blending renderstate for that which does this:

dst.rgb = src.rgb * src.a + dst.rgb * (1 – src.a)

Here, dst means the resulting color in the rendertarget (which is now the backbuffer). Src is the color of the sprite which the pixel shader returns. Our colors are just standard 32 bit rgba in the range [0, 1] here. With this, we have successfully calculated a good output color which we can just write as-is to the back buffer. For this, we don’t care what the alpha output is, because no further composition will be happening. Here is the corresponding state description for DirectX 11:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;

desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;

desc.BlendOp = D3D11_BLEND_OP_ADD;

The tricky part comes when we want to draw our alpha blended sprites to separate layers which will be composited later on. In that case we also have to be careful what alpha value we write out. Take a simple scenario for example, in which you render a sprite with alpha blending to a render target, and later you render your rendertarget to your backbuffer with the same alpha blending. For this, we want to accumulate alpha values, so just add them:

dst.a = src.a + dst.a

Which is equivalent to the following blend state in DirectX 11 (just append to the previous snippet):

dest.SrcBlendAlpha = D3D11_BLEND_ONE;

dest.DestBlendAlpha = D3D11_BLEND_ONE;

dest.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Accumulating alpha seems like a good fit for a 32 bit render target, as values will be clamped to one, so opacity will increase with overlapping sprites, but colors won’t be over saturated. Try blending the render target layer now to the back buffer. There will be an error, which could not be obvious at first (which means the worst kind of error). Let me show you:

blendOn the black background, you might even not notice the error at first, but on the white background it becomes apparent at once (in this case, “background” means our backbuffer). For different images, there might be different scenarios where the problem becomes apparent. This could be a challenge to overcome, if you already made a lot of assets, maybe even compensated somehow for the error without addressing the source of the error: the blend operation is not correct anymore! First, we blended the sprite onto the layer, this is still correct, but we blend the layer with the same operation again, so alpha messes with the colors two times here. There is a correct solution to the problem: premultiplied alpha blend operation, which is this:

dst.rgb = src.rgb + dst.rgb * (1 – src.a)

dst.a = src.a + dst.a

Create it in DX11 like this:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_ONE;

desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;

desc.BlendOp = D3D11_BLEND_OP_ADD;

dest.SrcBlendAlpha = D3D11_BLEND_ONE;

dest.DestBlendAlpha = D3D11_BLEND_ONE;

dest.BlendOpAlpha = D3D11_BLEND_OP_ADD;

The only change is that we do not multiply with source alpha any more. Our problem is fixed now, we can keep using our regular alpha blending on sprites, but use premultiplied blending for the layers. Simple right? But do not forget about premultiplied alpha just yet, it can help us out for more problems as well.

I had multiple problems with rendering particle systems to off-screen buffers for soft particles, and can also help with performance if the buffer is of small resolution. One of the problems was the above mentioned faulty alpha blending of particles (particles are the sprites, the layer is the off-screen buffer when mapped to the previous example). The other issue is that I also want to render additive particles to the same render target and I want to blend the whole thing later in a single pass. This is an additive blending state:

dst.rgb = src.rgb * src.a + dst.rgb

dst.a = dst.a

Which corresponds to this state in DX11:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;

desc.DestBlend = D3D11_BLEND_ONE;

desc.BlendOp = D3D11_BLEND_OP_ADD;

dest.SrcBlendAlpha = D3D11_BLEND_ZERO;

dest.DestBlendAlpha = D3D11_BLEND_ONE;

dest.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Notice that src output doesn’t contribute to alpha, but that is no problem at all, because the premultiplied layer blending will still take the layer color, just adding the full destination color to it. This is only possible if our layer is configured for premultiplied blending, otherwise it would just disappear upon blending. We can also have our particle textures themselves be in premultiplied texture format (where the texture colors are already multiplied by alpha) and blend state, and blending that onto the layer, it just works. When I wasn’t familiar with this, I used a separate render target for regular alpha blended and premultiplied texture format particles and an other one for additive ones, what a waste! We can see that premultiplied blending is already very flexible, so keep that in mind because it can save the day on many occasions. It also makes a huge difference in mipmap generation, see this neat article from Nvidia!

Side note: premultiplied alpha blending was also widely used because it has better blending performance. Think of it as precomputing the alpha blend factor and storing it inside the texture. The performance reasons are probably not so apparent today.

I had an other problem with particle system rendering, because they were rendered to a HDR floating point target. That one doesn’t clamp the alpha values. Consider the case when a particle’s alpha value is bigger than one for whatever reason and blended with regular alpha blending: the term dest.rgb * (1 – src.a) is now of course producing negative values. This is easy to overcome with just saturating the alpha output of the pixel shader, done! The other problem is this: dst.a = src.a + dst.a which can still result in larger than one alpha values, but this will only be a problem later on, when blending the layer as premultiplied alpha. We would need to saturate the (1 – src.a) term in the blending state but we cannot, there is not such state. We have a blend D3D11_BLEND_SRC_ALPHA_SAT value, but there is not for the inverse of it. The workaround that I am using for this is to modify the particle alpha blending state to accumulate alpha a bit differently:

dest.a = src.a + dst.a * (1 – src.a)

In DX11 terms:

dest.SrcBlendAlpha = D3D11_BLEND_ONE;

dest.DestBlendAlpha = D3D11_BLEND_INV_SRC_ALPHA;

dest.BlendOpAlpha = D3D11_BLEND_OP_ADD;

This accumulation method is probably not perfect, but works really well in practice:


That’s it, I think these are the three most important blending modes, most effects can be achieved with a combination of these. Just always keep an eye on your blend state creation and be very explicit about it, that is the way to avoid many bugs down the road. If you were like me and haven’t paid much attention to these  until now, this is the best time to reiterate on this, because tracking down errors associated with this becomes a hard journey later. Thanks for reading!

Forward+ decal rendering

decalsDrawing decals in deferred renderers is quite simple, straight forward and efficient: Just render boxes like you render the lights, read the gbuffer in the pixel shader, project onto the surface, then sample and blend the decal texture. The light evaluation then already computes lighting for the decaled surfaces. In traditional forward rendering pipelines, this is not so trivial. It is usually done by cutting out geometry under the decal, creating a new mesh from it with projected texture coordinates and render it for all lights, additively. Apart from the obvious increased draw call count and fillrate consumption, there is even potential for z-fighting artifacts. While moving to tile-based forward rendering (Forward+), we can surely think of something more high-tech.

We want to avoid additional geometry creation, increased draw call count while keeping the lighting computation constant. But in addition to these, with this new technique we can even trivially support modification of surface properties, creating decals which can modify surface normal, roughness, metalness, emissive, etc. or even do parallax occlusion mapping. We can even apply decals to transparent surfaces easily! This article will describe the outline of the technique without source code. You can look at my implementation however, here: culling, and sorting shader; blending evaluation shader.

In forward+ we have a light culling step and a light evaluation separately. The decals will be inserted for both passes. A culling compute shader iterates through a global light array and decides for each screen space tile, which lights are inside and adds them to a global list (in case of tiled deferred, it just adds them to a local list and evaluates lighting there and then). For adding decals to the culling, we need to extend the light descriptor structure to be able to hold decal information, and add functions to the shader to be able to cull oriented bounding boxes(OBBs). We can implement OBB culling by doing coarse AABB tests. Transform the AABB of the tile by the decal OBB’s inverse matrix (while keeping min-max up to date) and test the resulting AABB against a unit AABB. This is achieved by determining the 8 corner points of the tile AABB, transforming each by the inverse decal OBB, then determining the min and max corner points of the resulting points.


Rendering the decals takes place in the object shaders while we also evaluate the lighting. If the decals can modify surface parameters, like normals, it is essential that we render the decals before the lights. For that, we must have a sorted decal list. We can not avoid sorting the decals, anyway, as I have found out the hard way. Because the culling is performed in parallel, the decals can be added to the tile in arbitrary order. But we have a strict order when blending the decals, that is the order we have placed them onto the surface. If we don’t sort, it can lead to severe flickering artifacts when there are overlapping decals. Thankfully the sorting is straight forward, easily parallellized and can be done in the LDS (Local Data Share memory) entirely. I have gotten this piece of code from an AMD presentation (a bitonic sort implementation in LDS).

The easiest way is that we sort the decals in the CS so that the bottom decal is first, and the top is last (bottom-to-top sorting). This way, we can do regular alpha blending (which is a simple lerp in the shader) easily. Though we can do better. This way we sample all of the decals, even if the bottom ones are completely covered by decals placed on top. Instead we should sort the opposite way, so that first we evaluate the top ones, and then the decals underneath but just until the alpha accumulation reaches one. We can skip the rest. The blending equation also needs to be modified for this. The same idea is presented in the above mentioned AMD presentation for tile based particle systems. The modified blending equation looks like this:

color = ( invDestA x srcA x srcCol ) + destCol

alpha = srcA + ( invSrcA x destA )

This method can save us much rendering time when multiple decals are overlapping. But this can result in different output when we have emissive decals for example. In the bottom-to-top blending, emissive decals will always be visible because the contribution is added to the light buffer, but the top-to-bottom sorting (and skip) algorithm will skip the decals which are completely covered. I think this is “better” behaviour overall but on a subjective basis of course.

The nice thing about this technique, is that we can trivially modify surface properties, if we just sample all of our decals before all of the lights. Take this for example: we want to modify normal of the surface with the decal normal map. We already have our normal vector in our object shader, so when we get to the decals, just blend it in shader code with the decal normal texture, without the need for any packing/unpacking and tricky blending of g-buffers (a’la deferred). The light evaluation which comes after it “just works” with the new surface normal without any modification at all.

Maybe you have noticed, that we need to do the decal evaluation in dynamically branching code, which means that we must leave the default mip-mapping support. This is because from the compiler’s standpoint, we might perfectly well not be evaluating the same decals in neighbouring pixels, but we need those helper pixels for correct screen space derivative coordinates. In our case when we have multiple of two pixel count tiles (I am using 16×16 tiles), we are being coherent for our helper pixels, but the compiler doesn’t know that unfortunately. I haven’t yet found a satisfying way to overcome this problem. I experimented with linear distance/screen space size based mip selection, but found them unsatisfying for my purposes (they might be ok for a single game/camera type though).


Update: Thanks to MJP, I learned a new technique for obtaining nice mip mapping results: We just need to take the derivatives of the surface world position, transform it by the decal projection matrix (but leave the translation), and we have the decal derivatives that we can feed into Texture2D::SampleGrad for example. An additional note is that when using a texture atlas for the decals, we need to take into consideration the atlas texture coordinate transformation. So, just multiply the decal derivatives by the atlas transformation’s scaling part. Cool technique!

We also need to somehow dynamically support different decal textures in the same object shader. A texture atlas comes handy in this case, or bindless textures are also an option in newer APIs.

As we have added decal support to the tiled light array structure, the structure probably is getting bloated, which means less cache efficient, because most lights probably don’t need a decal inverse matrix (for projection), texture atlas offsets, etc. For this, the decals could probably get their own structure and a different array, or just tightly pack everything in a raw buffer (byteaddressbuffer in DX11). I need to experiment with this.


This technique is a clear upgrade from the traditional forward rendered decals, but comparing it with the deferred decals is not a trivial matter. First, we can certainly optimize deferred decals in several ways. I have been already toying with the idea of using Rasterizer Ordered Views to optimize the blending in a similar way, and eliminating overdraw. Secondly, we have increased branching and register pressure in the forward rendering pass, while rasterization of deferred decals is a much more light weight shader which can be better parallellized when the overdraw is not so apparent. In that case, we can get away with rendering much more deferred decals than tiled decals. The tile-based approach gets much better with increased overdraw because of the “skip the occluded” behaviour as well as the reduced bandwidth cost of not having to sample a G-buffer for each decal. Forgive me for not providing actual profiling data at the moment, this article intends to be merely a brain dump, but I also hope somewhat inspirational.