Thoughts on Skinning and LDS


I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts the skinning: for each bone index, it loads a bone matrix from a buffer in VRAM, then multiplies the vertex position and normal by that matrix and weights the result by the corresponding vertex bone weight. Usually a vertex contains 4 bone indices and 4 corresponding weights, which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we animate, but usually only a couple of hundred bones. So should each vertex really load 4 bone matrices from random places in the bone array?

Instead, we can utilize the LDS the following way: at the beginning of the shader, when the vertex information is being loaded, each thread also loads one matrix from the bone array and stores it in the LDS. We must also synchronize the group to ensure the LDS bone array is complete before anyone reads from it. After all the memory has been read from VRAM, we continue with the skinning: iterate through all bone indices for the vertex, load the corresponding bone matrix from LDS, then transform the vertex and blend by the bone weights. We just eliminated a lot of memory latency from the shader.
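The approach above can be sketched in HLSL roughly like this. This is only a sketch: the buffer layouts, the `boneCount` constant and the `Skin()` helper are illustrative placeholders, not the actual implementation.

```hlsl
// Sketch only – names and layouts are illustrative.
struct Bone
{
	float4 rows[3]; // 3x4 matrix, 48 bytes
};

StructuredBuffer<Bone> boneBuffer : register(t0);

static const uint GROUPSIZE = 256;
groupshared Bone boneLDS[GROUPSIZE];

[numthreads(GROUPSIZE, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
	// Each thread also preloads one bone matrix into LDS
	// (assumes boneCount <= GROUPSIZE, boneCount coming from a constant buffer):
	if (groupIndex < boneCount)
	{
		boneLDS[groupIndex] = boneBuffer[groupIndex];
	}
	GroupMemoryBarrierWithGroupSync(); // wait until the LDS bone array is complete

	// ...load position, normal, boneIndices, boneWeights for vertex DTid.x...

	// The skinning loop now only reads LDS, not VRAM:
	for (uint i = 0; i < 4 && boneWeights[i] > 0; ++i)
	{
		Skin(position, normal, boneLDS[boneIndices[i]], boneWeights[i]);
	}
}
```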

Consider what happens when you have a loop which iterates through the bone indices of a vertex: first you load the bone, then you want to immediately use it for the skinning computation, then repeat. Loading from a VRAM buffer causes significant latency until the data is ready to be used. If we unroll the loop, the load instructions can be rearranged and interleaved with ALU instructions that don’t depend on them, hiding the latency a bit. But unrolling the loop increases register allocation (VGPR = Vector General Purpose Register, storing data unique to the thread; buffer loads consume VGPRs unless they are known at compile time to be common to the whole group, in which case they can be placed into scalar registers – SGPRs) and can result in lower occupancy, as we have a very limited budget of them. We also want a dynamic loop instead, because a vertex may have fewer bones than the maximum, so processing could exit early. So a tight dynamic loop with a VRAM load followed immediately by dependent ALU instructions is not so good. But once that loop only accesses LDS, the latency is significantly reduced, and performance should increase.

But LDS does not come for free; it is also a limited resource, like VGPRs. Let’s look at the AMD GCN architecture: we have a maximum of 64 KB of LDS on a single compute unit (CU), though HLSL only lets us use 32 KB in a shader. If a shader uses the whole 32 KB, only two instances of it can be running on the CU at once. Our bone data structure is a 3×4 float matrix, 48 bytes. We can fit 682 bones into LDS and still have two instances of the compute shader operating in parallel. But we hardly ever have skeletons with that many bones. In my experience, fewer than 100 bones is enough for most cases, and even a highly detailed model in a real-time app won’t use more than, say, 256 bones. So say our shader declares an LDS bone array of 256 bones, and the thread group size is also 256, so each thread loads one bone into LDS. 256 × 48 bytes = 12 KB. This means that 5 instances of this shader could be running in parallel on a CU, so 5 × 256 = 1280 vertices processed. That is, if we don’t exceed the maximum VGPR count of 65536 for a CU; in this case a single thread must fit into a 51-VGPR limit (65536 VGPRs / 1280 threads). In most cases we will easily fit into even a 128-bone limit, so an LDS bone array of 128 bones and a thread group size of 128 threads will be enough and much easier on the GPU.

However, I can imagine a scenario which could be worse with the LDS method: a complicated skeleton, but a small mesh referencing only a few of the bones. In such a case, where one skeleton drives multiple meshes, maybe we should combine the meshes into a single vertex buffer and use an index buffer to separate them; this way a single dispatch could animate all the meshes, while they can still be divided into several draw calls when needed.

It sounds like a good idea to utilize LDS in a skinning shader, and it is a further potential improvement over skinning in a vertex/geometry shader with stream out. But as with anything on the GPU, this should be profiled on the hardware you are targeting. My test cases were unfortunately not the best assets, run on a puny integrated Intel mobile GPU, but even so I could measure a small performance improvement with this method.

Thank you for reading, you can find my implementation on GitHub! Please tell me if there is any incorrect information.

UPDATE: I’ve made a simple test application with heavy vertex count skinned models and toggleable LDS skinning: Download WickedEngineLDSSkinningTest (you will probably need Windows 10 and DirectX 11 to run it)

Further reading:

Large thread groups on GCN

Easy Transparent Shadow Maps


Supporting transparencies with traditional shadow mapping is straightforward and allows for nice effects, but as with anything related to rendering transparents with rasterization, there are corner cases.

A little sneak peek of what you can achieve with this:

The implementation is really simple once you have shadow mapping for opaque objects. After the opaque shadow pass, we render the transparents into a color buffer, but reject samples which would be occluded by opaques, using a depth read-only depth stencil state. The transparents should be blended multiplicatively; sorting does not matter with a multiply blend state. In bullet points:

  1. Render opaque objects into depth stencil texture from light’s point of view
  2. Bind render target for shadow color filter: R11G11B10 works well
  3. Clear render target to 1,1,1,0 (RGBA) color
  4. Apply depth stencil state with depth read, but no write
  5. Apply multiplicative blend state eg:
    • SrcBlend = BLEND_ZERO
    • DestBlend = BLEND_SRC_COLOR
    • BlendOp = BLEND_OP_ADD
  6. Render transparents in arbitrary order

When reading shadow maps in the shading passes, we only need to multiply the lighting value with the transparent shadow map color filter if the pixel is inside the light. There is a slight problem with this approach that you will notice immediately: transparent objects now receive their own colored self-shadow too. The simplest fix is to just disable the colored part of the shadow calculation for transparent objects. We can already produce nice effects with this, so it is not a huge price to pay.
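In the shading pass, applying the filter boils down to one extra texture read; something along these lines (the texture, sampler and `ShadowTest` helper names are made up for illustration):

```hlsl
// shadow: result of the regular opaque shadow map test, in [0,1]
// (ShadowTest is a placeholder for your usual shadow map comparison)
float shadow = ShadowTest(shadowMap_opaque, shadowPos);

// Tint the light by the transparent color filter, only where the surface is lit:
float3 shadowColor = shadow * shadowMap_transparentColor.SampleLevel(sampler_linear_clamp, shadowPos.xy, 0).rgb;

lightContribution *= shadowColor;
```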

See it in action, transparents are rendered without colored self-shadows:


But they receive shadows from opaque objects just fine:


There is a technique which would allow us to render colored shadows on top of transparents too. This involves keeping an additional shadow depth map for transparencies. The flow of this technique is like this (from a Blizzard presentation):

  1. Render opaque shadow map
  2. Render transparent shadow map
    • To a separate depth texture!
    • Depth writes ON
  3. Clear shadow color filter texture (like in the simple approach)
  4. Render transparent again to color filter render target
    • But use the opaque shadow map’s depth stencil
    • depth read ON
    • depth write OFF

And in the shading step, now there will be two shadow map checks, one for the opaque, one for the transparent shadow maps. Only multiply the light with the shadow filter color texture when the transparent shadow check fails.
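The shading logic of this approach could look roughly like this, assuming regular (non-reversed) Z and illustrative resource names:

```hlsl
float3 lightColor = light.color;
float surfaceDepth = shadowPos.z; // surface depth from the light's point of view

if (surfaceDepth > shadowMap_opaque.SampleLevel(sampler_point_clamp, shadowPos.xy, 0).r)
{
	lightColor = 0; // opaque shadow check failed: fully shadowed
}
else if (surfaceDepth > shadowMap_transparent.SampleLevel(sampler_point_clamp, shadowPos.xy, 0).r)
{
	// Transparent shadow check failed: a transparent caster lies between the
	// light and this surface, so tint the light by the color filter:
	lightColor *= shadowColorFilter.SampleLevel(sampler_linear_clamp, shadowPos.xy, 0).rgb;
}
```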

In my experience you can also have the transparent depth in the color texture’s alpha channel, to refine the results. (See UPDATE section at the end).

This will eliminate false self shadows from transparent objects. But unfortunately now when a transparent receives a colored shadow, its own transparent shadow color will also contribute to itself.

What’s more interesting, are the additional effects we can achieve with transparent shadow maps:

Textured shadows, which can be used as a projector for example, just put a transparent textured geometry in front of a light:


And underwater refraction caustics:


UPDATE: You can use the alpha channel of the transparent shadow map as a secondary depth buffer. This way, tinting volumetric light becomes possible, because anything between the light and the transparent object must not be tinted (rejected based on the secondary depth). Since the multiplicative color part doesn’t care about alpha at all (the color is only modulated by the alpha as a transparency factor), this is safe to use with a MAX blend operator (or MIN if you don’t use reversed Z):

SrcBlendAlpha = BLEND_ONE;
DestBlendAlpha = BLEND_ONE;
BlendOpAlpha = BLEND_OP_MAX;
Also use a texture format with an alpha channel, such as R16G16B16A16_FLOAT.

The secondary depth buffer will support one layer of transparency (it should be the closest transparent layer to the light), so it’s not perfect, but things rarely are in real time rendering!

That’s it, I think this is a worthwhile technique to include in a game engine. This can be used for many interesting effects. Thank you for reading!

My transparent shadow map rendering pixel shader can be found here.

Related article:
StarCraft 2 Effects and techniques

Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such a technique we can efficiently query every light affecting any surface. But a trivial implementation leaves much room for improvement. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.


For the starting point I assume a tiled rendering pipeline like the one described in a GDC 2011 presentation by Johan Andersson (DICE). The following post will be applicable to the Tiled Deferred and Forward+ variants of that pipeline. A quick recap of such a renderer:

  1. Dispatch a compute shader thread group per screen-space tile
  2. Thread group calculates tile frustum in view space
  3. Each thread reads pixel from depth buffer, performs atomic min-max to determine depth bounds
  4. Tile frustum is truncated by depth bounds
  5. Each thread computes if a light is inside the frustum, adds the light to a per tile list if yes
  6. Compute final shading from the per tile light list
    • Deferred can just read gbuffer, loop through per tile light list in the same shader and write the light buffer texture
    • Forward+ exports light list that will be read by second geometry pass and compute shading in pixel shaders

Our main goal is to refine the culling results by eliminating false positives, meaning we cull lights which do not contribute to the final shading.

1.) Help out frustum culling with AABB

The first problem arises when we perform the culling of lights as sphere-frustum intersection tests. Let me demonstrate this with a point light:



Blue pixels on the second picture visualize which tiles contain the sphere light source after the culling. But the results are strange: why is the blue area square shaped, when it should be a circle? This happens because the sphere–frustum plane tests are not accurate enough when we test big spheres against small frustums. Let me try illustrating this problem. The following image shows a tile frustum as seen on the screen:


As you can see, the sphere is completely outside the frustum, but no single plane test can reject it, so the light is not culled away properly. We can overcome this by using axis aligned bounding boxes (AABBs) instead of frustums. It is implemented like this:

bool SphereIntersectsAABB(in Sphere sphere, in AABB aabb)
{
  // Distance from the sphere center to the closest point of the AABB:
  float3 vDelta = max(0, abs(aabb.center - sphere.center) - aabb.extents);
  float fDistSq = dot(vDelta, vDelta);
  return fDistSq <= sphere.radius * sphere.radius;
}


The result:


This one seems good enough, but don’t throw out frustum culling just yet!

2.) Depth discontinuities are the enemy

Holes in the depth buffer can greatly reduce the efficiency of tile based light selection, because the tile’s enclosing min-max depth AABB can get huge really fast:


In the image above I tried to illustrate (from an above point of view) that a depth discontinuity made the AABB large enough to intersect with a light which the frustum culling would have rejected. This is why AABB culling should be used alongside frustum culling, the two complementing each other.

Depth discontinuities usually introduce another inefficiency: there might be cases when a light lies in empty space, not intersecting with anything, but still inside the tile, so the shading will receive the light even though it does not contribute at all:


As you can see, that light is inside the frustum and inside the AABB, but it is in empty space between geometries; our current algorithm will still add it to the light list.

To solve this, there is a technique called 2.5D light culling, introduced by Takahiro Harada. In addition to that presentation, I would like to give an HLSL implementation. The basic idea is to create two bitmasks: one for the tile, and one for the light we are checking. A bitwise AND of the two determines whether the light intersects any geometry in the tile (AND returns non-zero) or not (AND returns zero).


For the sake of a simpler image, I used a 9-bit mask, but we should use a 32-bit mask, which we can represent with a uint variable.

The first bitmask is created for the whole tile, once. While each thread reads its corresponding pixel from the depth buffer, it already does the atomic min-max, but now it also fills in a single relevant bit of a uint and performs an atomic OR into the tile bitmask. So what is the relevant bit? The algorithm divides the tile depth range into 32 slices, each represented by one bit of a 32-bit uint. We first determine the tile depth bounds in linear (view) space, then fill in the corresponding bit:

groupshared uint tileDepthMask; // note: groupshared variables can't have initializers; clear this to 0 from one thread, then barrier

// …

float minDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, minDepth, 1)).z;
float maxDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, maxDepth, 1)).z;
float realDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, pixelDepth, 1)).z;

float depthRangeRecip = 32.0f / (maxDepthVS - minDepthVS);
uint depthmaskcellindex = max(0, min(31, floor((realDepthVS - minDepthVS) * depthRangeRecip))); // clamp to 31 so the shift below never overflows
InterlockedOr(tileDepthMask, 1 << depthmaskcellindex);


This code is run by every thread in the group. The function called UnprojectScreenSpaceToViewSpace does just what it says: the input is a screen-space point, and it transforms it to view space. We are only interested in the Z coordinate here, so we only need to transform the input with the inverse projection matrix and divide the result by the w component afterwards. If we were also interested in the XY coordinates, we would need to transform them from [0,1] to [-1,1] range before the unprojection. The function would look like this for the common case:

float4 UnprojectScreenSpaceToViewSpace(in float4 screenPoint)
{
  float4 clipSpace = float4(float2(screenPoint.x, 1 - screenPoint.y) * 2 - 1, screenPoint.z, screenPoint.w);
  float4 viewSpace = mul(clipSpace, xInverseProjectionMatrix);
  viewSpace /= viewSpace.w;
  return viewSpace;
}


The bitmask construction code might look a bit intimidating, so let me explain what’s happening. We calculate the minZ, maxZ and current pixel Z in view space, and determine the depth slice size which a single bit will represent (depthRangeRecip). Then we shift a bit to the right place and add it to the group shared tile mask by means of an atomic OR.

The tile mask is complete, so we only need to know how to construct a light mask. That must be done inside the loop where we are culling lights. On the first try I cooked up this:

float fMin = lightPosViewSpace.z - lightRadius;
float fMax = lightPosViewSpace.z + lightRadius;

uint lightMaskcellindexSTART = max(0, min(31, floor((fMin - minDepthVS) * depthRangeRecip)));
uint lightMaskcellindexEND = max(0, min(31, floor((fMax - minDepthVS) * depthRangeRecip)));

uint lightMask = 0;
for (uint c = lightMaskcellindexSTART; c <= lightMaskcellindexEND; ++c)
{
  lightMask |= 1 << c;
}

Here we determine the beginning and end of the sphere light’s extent inside the view space depth range, and push bits into the mask in a loop, to the correct places one-by-one:

In this mask for example, lightMaskcellindexSTART is the 11th bit from the right, and lightMaskcellindexEND is the 21st bit from the right:


Of course this loop seems like a waste inside a shader, so I needed to come up with something better. Rethinking how a smaller bitfield could be pushed inside a bigger bit range gave me the idea to exploit the truncation done by the bitwise shift operators:

  • First, fill the full mask:
    • uint lightMask = 0xFFFFFFFF;
  • Then shift right by the spare amount, to keep only the needed number of bits:
    • lightMask >>= 31 - (lightMaskcellindexEND - lightMaskcellindexSTART);
  • Last, shift left by the START amount to move the mask into the correct position:
    • lightMask <<= lightMaskcellindexSTART;

So the resulting code replaced a loop with only a very few instructions, which is a lot better.
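Put together, the branchless construction above could be wrapped in a small helper like this (a sketch, the function name is mine):

```hlsl
// Returns a 32-bit mask with bits [start, end] set, without a loop.
// Assumes 0 <= start <= end <= 31.
uint ConstructLightMask(uint start, uint end)
{
	uint lightMask = 0xFFFFFFFF;
	lightMask >>= 31 - (end - start); // keep only (end - start + 1) bits
	lightMask <<= start;              // move them into position
	return lightMask;
}
```

For example, start = 10 and end = 20 sets bits 10 through 20 (counted from the lowest bit), matching the mask illustrated earlier.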

We have the tile mask and the light mask, so the only thing left to do is to AND them to determine if the light touches something or not:

bool intersect2_5D = tileMask & lightMask;

And here is the comparison of culling results in a scene with high depth discontinuity, alpha tested vegetation and many point lights (red tiles have more than 50 lights):


As you can see, there is a big improvement in the tile visualizer heatmap; the shading will process far fewer lights, and performance will improve in these difficult cases with alpha tested geometry. The scene shown above had the following timing results (measured with DX11 timestamp queries):

  • Basic 2D culling:
    • Culling: 0.53 ms
    • Shading: 10.72 ms
  • Improved 2.5D culling:
    • Culling: 0.64 ms
    • Shading: 7.64 ms

As you can see, this is a definite improvement: while the culling shader took a bit more time to finish (0.11 ms), the object shading took about 3 ms less. I made a more detailed video some time ago:

3.) Other light types

We surely have the need to cull other light types than point lights. Spot lights come to mind first, but there can also be the desire to cull decal boxes or local environment probe volumes as well.

  • Spot lights:
    • Implement a cone culling algorithm in shader code. This can be more expensive than sphere culling, and again not too accurate with frustum tests. Going further and implementing cone–AABB tests will slow our algorithm down even more.
    • Or approximate the spotlight cone with a tightly fitting sphere around it. We don’t need to implement any other culling algorithm, only fit a sphere around the spotlight, which is a straightforward, easy calculation. But this will result in excessive waste for thin/long cones. Code to fit a sphere around a cone:
      • float spotLightConeHalfAngleCos = cos(spotLightFOV * 0.5f);

        float sphereRadius = spotLightRange * 0.5f / (spotLightConeHalfAngleCos * spotLightConeHalfAngleCos);

        float3 sphereCenter = spotLightPos + spotLightDirection * sphereRadius;

      • Remember that you can precalculate this outside the shader, so the shader will only have a sphere to evaluate.
    • Another method is to do a cone-sphere test instead, which can produce much more accurate results than the previous two; check it out here.
  • Decal/Environment probe box:
    • The idea here is to get away with AABB–AABB tests. But our decals or envprobes can be oriented arbitrarily in world space, so they are OBBs! We can do something that I’d call a coarse AABB test: transform the tile AABB from view space to world space (using the inverse view matrix of the camera) and recalculate the AABB of the resulting OBB. Then for each decal, transform the world space tile AABB by the decal’s inverse world matrix and perform an AABB–AABB intersection test between the resulting AABB and a unit AABB. The results are kind of good, but we have to refine with the frustum as well. Transforming an AABB with a matrix is quite heavy though, because I multiply each corner with the matrix while keeping track of the min-max at the same time. With two AABB transforms it gets heavy, but I have not yet found a better solution. Culling decals or envprobes is much less frequent than lights, so I comfort myself with that for the time being. The results:
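The AABB transformation mentioned above could be sketched like this (the struct layout and helper name are mine, not the actual implementation):

```hlsl
struct AABB
{
	float3 center;
	float3 extents;
};

// Transform an AABB by a matrix and return the AABB of the resulting OBB.
// Walking all 8 corners is the heavy part mentioned above.
AABB TransformAABB(in AABB aabb, in float4x4 mat)
{
	float3 corner0 = aabb.center - aabb.extents;
	float3 size = aabb.extents * 2;
	float3 vmin = 1e30;
	float3 vmax = -1e30;
	for (uint i = 0; i < 8; ++i)
	{
		// Enumerate the 8 corners from the 3 bits of i:
		float3 corner = corner0 + size * float3(i & 1, (i >> 1) & 1, (i >> 2) & 1);
		float3 t = mul(float4(corner, 1), mat).xyz;
		vmin = min(vmin, t);
		vmax = max(vmax, t);
	}
	AABB result;
	result.center = (vmin + vmax) * 0.5f;
	result.extents = (vmax - vmin) * 0.5f;
	return result;
}
```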


4.) Gather vs. Scatter approaches

The implementation described by the above-mentioned Battlefield presentation uses a gather approach, which means that each thread group iterates through every potential light. A culling shader like this works best for a couple hundred lights, but can become slow when approaching a few thousand. To support more lights, we can implement some kind of scatter approach. My suggestion is a coarse culling step before the regular light culling, which operates on much bigger screen tiles and dispatches the work differently: each thread processes a single light, determines which tiles it belongs to, and writes (scatters) the light index into the according tiles. The regular culling then reads the light lists from the coarse tiles instead of iterating through every light on the screen. We could also use an acceleration structure for the lights, an octree for example, and let the shader use that instead of coarse culling.
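A coarse scatter pass along those lines might be dispatched with one thread per light; all resource names, constants and the ComputeCoarseTileRange helper below are hypothetical:

```hlsl
StructuredBuffer<Sphere> lightBuffer : register(t0);
RWStructuredBuffer<uint> coarseTileLightCounts : register(u0); // one counter per coarse tile
RWStructuredBuffer<uint> coarseTileLightLists : register(u1);  // COARSE_TILE_CAPACITY entries per tile

[numthreads(64, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
	if (DTid.x >= lightCount)
		return;

	Sphere light = lightBuffer[DTid.x];

	// Project the light's bounding sphere to the screen and determine which
	// coarse tiles it overlaps (placeholder helper):
	uint2 tileMin, tileMax;
	ComputeCoarseTileRange(light, tileMin, tileMax);

	// Scatter the light index into every overlapped coarse tile:
	for (uint y = tileMin.y; y <= tileMax.y; ++y)
	{
		for (uint x = tileMin.x; x <= tileMax.x; ++x)
		{
			uint tile = y * coarseTileCountX + x;
			uint offset;
			InterlockedAdd(coarseTileLightCounts[tile], 1, offset);
			if (offset < COARSE_TILE_CAPACITY)
			{
				coarseTileLightLists[tile * COARSE_TILE_CAPACITY + offset] = DTid.x;
			}
		}
	}
}
```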

A scatter approach could also be implemented for a different purpose: refining culling results. I described above that we can approximate spotlights with spheres, but another idea is to rasterize a cone at low resolution instead, and let the pixel shader write the light into the tile corresponding to the invoked pixel.

5.) Don’t be greedy

It might be tempting to create a single uber-shader handling everything: creating frustums, reading the depth buffer and assembling depth bounds, creating the tile AABB, creating the tile bitmask, culling lights, and in the case of tiled deferred, also evaluating the shading at the end. In reality, though, it could be the wrong path to take.

First, on the AMD GCN architecture for example, resources such as registers are shared across hardware execution units, and if a shader uses too many, there will be contention for them, parallel execution will be reduced, and overall performance will be bottlenecked. This is called register pressure. The part of our shader which creates frustums at the beginning already uses many registers, for example, and could be precalculated instead to lighten the load. The AABB calculation further reduces available registers, and so does calculating the tile bitmask. A tiled deferred shading pass at the end of the culling shader can also be very complex, utilizing many registers at once.

Then there is the part where we create the depth bounds with atomic operations. Atomics can be slow, so calculating the depth bounds in a separate shader by means of parallel reduction could be a better alternative. A reduced resolution depth buffer can also be reused later, as a hierarchical-Z pyramid for instance.

Divergent branching in a shader is only a good idea if we design the shader to be highly coherent in the branches a thread group takes. A light culling setup usually works best with 16×16 or 32×32 pixel tiles, where each thread gets the minor task of culling a single light; this task is highly divergent in the path each thread takes. A light evaluation shader behaves differently, because each thread will potentially process the same array of lights in its tile, except that a light/decal calculation can exit early or skip shadow calculations, etc., at pixel granularity instead of per-tile granularity. In that case it is inefficient to use big thread groups, because long-iterating threads will hold back the rest of the group from exiting early, and the hardware will be underutilized. So a smaller tile size should be preferred for the light evaluation (8×8 worked best in my experiments).

Seeing these problems, I propose to separate the big shader into several smaller parts. Frustum precomputation could go into its own shader. Another shader could reduce the depth buffer and create the depth bounds together with the tile bitmask information; tile AABB computation could potentially go in there too. The culling shader would only load the depth bounds, tile AABB and bitmask from memory, perform the light culling, then export the per-tile light lists to memory. The last shader is the surface shading (light evaluation) shader, which is a vertex/pixel shader for Forward+ and a compute shader for tiled deferred, with a smaller block size than the culling (8×8, as proposed in the previous paragraph).

Separating the shaders opens up another optimization possibility: Indirect Dispatch. Consider for example that the culling determined that no lights are inside a tile, so it can instruct the surface shading shader not to dispatch any work groups for that tile.
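One possible way to set this up (a sketch with made-up names): at the end of the culling shader, one thread per group appends its tile to a compacted list and bumps the indirect argument, and the shading pass is then launched with DispatchIndirect over only the non-empty tiles:

```hlsl
RWBuffer<uint> shadingDispatchArgs : register(u0);     // {groupCountX, 1, 1} for DispatchIndirect
RWStructuredBuffer<uint2> nonEmptyTileList : register(u1);

// Called at the end of the culling shader:
void AppendNonEmptyTile(uint groupIndex, uint2 tileID, uint tileLightCount)
{
	if (groupIndex == 0 && tileLightCount > 0)
	{
		uint offset;
		InterlockedAdd(shadingDispatchArgs[0], 1, offset); // becomes ThreadGroupCountX
		nonEmptyTileList[offset] = tileID; // shading thread groups look up their tile here
	}
}
```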


A lot has been covered, so let’s recap quickly:

  • Frustum-sphere tests are inaccurate
  • Refine with AABB-sphere tests
  • Work against depth discontinuities with 2.5D culling bitmasks
  • A small burden on culling can be a big win in shading
  • Approximate complex light types with spheres and AABBs
  • Avoid overworked ubershaders
  • Prefer smaller shaders, better fitted to the specific task at hand

Thanks for reading, hope you enjoyed it! If you have any feedback, please share it with me. Were there any mistakes? Definitely share 🙂

Further reading:

Introduction to tiled deferred rendering

Reference for 2.5D culling

Tile based rendering advancements GDC 2015

Awesome Forward+ tutorial

RealTimeCollisionDetection.NET has most of the intersection code

Investigating efficient spotlight culling

Next power of two in HLSL

There are many occasions when a programmer wants to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader, and I had this piece of code responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;
while (myNumberPowerOfTwo < myNumber)
{
   myNumberPowerOfTwo <<= 1;
}


It gets the job done, but doesn’t look so nice. For the not unusual case when myNumber is more than 1000, it can already take ten iterations. I recently learned that HLSL has a built-in function called firstbithigh. It returns the position of the first non-zero bit of a 32-bit number, searching from the highest-order bit downward (the returned position is counted from the low-order end). With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:


For that number, firstbithigh returns the location of the first non-zero bit (from the left), which is 24. The next power of two for that number would be:


That number has seven leading zeroes. All we have to do is zero out the whole number but set a single bit. To do it, first write 2 in binary:


Then shift the whole thing left by the position of the first set bit of myNumber to get what we are looking for:


Nice one, but there are a few corner cases we need to eliminate. If myNumber is already a power of two, we will get the next greater power of two, which might not be what we want; in that case we should return myNumber itself. If you subtract one from myNumber before calling firstbithigh, power of two inputs will return themselves:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber - 1);

Take 1024 for example: firstbithigh is called on 1023, and the result is 1024. Or take 129: firstbithigh is called on 128, and the result is 256, which is indeed the next power of two for 129 – works out quite well. If you try to calculate it for 0, however, there is a potential bug if myNumber is unsigned, as 0 – 1 would wrap around to 0xFFFFFFFF. If you may call it with zero and operate on unsigned numbers, I suggest doing it like this:

uint myNumberPowerOfTwo = 2 << firstbithigh(max(1, myNumber) - 1);

If it is signed int, then you don’t have to worry about it, as the firstbithigh function will return zero for negative numbers.

Have any question, noticed a mistake? Please post in the comments below!

Thank you for reading!

GPU-based particle simulation


I finally took the leap, threw out my old CPU-based particle simulation code, and ventured into GPU realms. The old system could spawn particles on the surface of a mesh, with each particle’s starting velocity modulated by the surface normal. It kept a copy of each particle on the CPU, updated them sequentially, then uploaded them to the GPU for rendering each frame. The new system needed to keep at least the same feature set, but GPU simulation also opens up more possibilities, because we have direct access to resources like textures created by the rendering pipeline. Both the emitting and the simulation phase are also highly parallelized compared to the CPU solution, which means we can process a much higher number of particles in the same amount of time. There is less data moving between the system and the GPU; we can get away with a single constant buffer update and command buffer generation, while the rest of the data lives entirely in VRAM. This makes simulation on a massive scale a reality.

If that got you interested, check out the video presentation of my implementation in Wicked Engine:

So, the high level flow of the GPU particle system described here is the following:

  1. Initialize resources:
    • Particle buffer with a size of maximum amount of particles [ParticleType*MAX_PARTICLE_COUNT]
    • Dead particle index buffer, with every particle marked as dead in the beginning [uint32*MAX_PARTICLE_COUNT]
    • 2 Alive particle index lists, empty at the beginning [uint32*MAX_PARTICLE_COUNT]
      • We need two of them, because the emitter writes the first one, simulation kills dead particles and writes the alive list again to draw later
    • Counter buffer:
      • alive particle count [uint32]
      • dead particle count [uint32]
      • real emit count = min(requested emit count, dead particle count) [uint32]
      • particle count after simulation (optional, I use it for sorting) [uint32]
    • Indirect argument buffer:
      • emit compute shader args [uint32*3]
      • simulation compute shader args [uint32*3]
      • draw args [uint32*4]
      • sorting compute shader arguments (optional) [uint32*3]
    • Random color texture for creating random values in the shaders
  2. Kick off particle simulation:
    • Update a constant buffer holding emitter properties:
      • Emitted particle count in current frame
      • Emitter mesh vertex, index counts
      • Starting particle size, randomness, velocity, lifespan, and any other emitter property
    • Write indirect arguments of following compute passes:
      • Emitting compute shader thread group sizes
      • Simulation compute shader thread group sizes
      • Reset draw argument buffer
    • Copy last frame simulated particle count to current frame alive counter
  3. Emitting compute shader:
    • Bind mesh vertex/index buffers, random colors texture
    • Spawn as many threads as there are particles to emit
    • Initialize a new particle on a random point on the emitter mesh surface
    • Atomically decrement the dead list counter while reading its previous value; this gives us the index of the last dead entry. Read the dead list at that location to retrieve a free particle index for the particle buffer
    • Write the new particle to the particle buffer on this index
    • Increment alive particle count, write particle index into alive list 1
  4. Simulation compute shader:
    • Each thread reads alive list 1, and updates particle properties if particle has life > 0, then writes it into alive list 2. Increment Draw argument buffer.
    • Otherwise, kill particle by incrementing dead list counter and writing particle index to dead list
    • Write particle distance squared to camera for sorting (optional)
    • Iterate through force fields in the scene and update the particle accordingly (optional)
    • Check collisions with depth buffer and bounce off particle (optional)
    • Update AABB by atomic min-maxing particle positions for additional culling steps (optional)
  5. Sorting compute shader (optional):
    • An algorithm like bitonic sorting maps well to the GPU and can sort large amounts of data
    • Multiple dispatches required
    • Additional constant buffer updates might be required
  6. Swap alive lists:
    • Alive list 1 is the alive list from previous frame + emitted particles in this frame.
    • In this frame we might have killed off particles in the simulation step and written the new list into Alive list 2. This will be used when drawing, and input to the next frame emitting phase.
  7. Draw alive list 1:
    • After the swap, alive list 1 should contain only the alive particle indices in the current frame.
    • Draw only the current alive list count with DrawIndirect. Indirect arguments were written by the simulation compute shader.
  8. Kick back and profit 🙂
    • Use your new additional CPU time for something cool (until you move that to the GPU as well)
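
The list bookkeeping in steps 3, 4 and 6 above can be sketched on the CPU. This is a single-threaded simulation of the counter logic only (all names here are mine, not the engine's), with the GPU atomics replaced by plain container operations:

```cpp
#include <cassert>
#include <cstdint>
#include <algorithm>
#include <vector>

// Single-threaded sketch of the GPU list/counter logic (hypothetical names).
struct ParticleLists {
    std::vector<uint32_t> deadList;   // indices of free particle slots
    std::vector<uint32_t> aliveList1; // written by emit
    std::vector<uint32_t> aliveList2; // written by simulate
};

ParticleLists init(uint32_t maxParticles) {
    ParticleLists lists;
    // At the beginning, every particle is marked as dead:
    for (uint32_t i = 0; i < maxParticles; ++i)
        lists.deadList.push_back(i);
    return lists;
}

void emit(ParticleLists& l, uint32_t requestedCount) {
    // real emit count = min(requested emit count, dead particle count)
    uint32_t realCount =
        std::min<uint32_t>(requestedCount, (uint32_t)l.deadList.size());
    for (uint32_t i = 0; i < realCount; ++i) {
        // "decrement dead counter, read the dead list at that slot":
        uint32_t particleIndex = l.deadList.back();
        l.deadList.pop_back();
        l.aliveList1.push_back(particleIndex); // append to alive list 1
    }
}

// isAlive stands in for the per-particle "life > 0" test in the shader.
template <class IsAlive>
void simulate(ParticleLists& l, IsAlive isAlive) {
    l.aliveList2.clear();
    for (uint32_t idx : l.aliveList1) {
        if (isAlive(idx))
            l.aliveList2.push_back(idx); // survives, will be drawn
        else
            l.deadList.push_back(idx);   // recycle the slot
    }
    std::swap(l.aliveList1, l.aliveList2); // step 6: swap alive lists
}
```

On the GPU, the `push_back`/`pop_back` calls become atomic increments and decrements on the counter buffer, performed concurrently by many threads.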

Note: for adding particles, you could use append-consume structured buffers, or counters written by atomic operations in the shader code. The append-consume buffers might include an additional performance optimization hidden from the user, which is GDS (global data share) on hardware that supports it. Basically, it is a small piece of fast-access memory visible to every thread group, located on the GPU chip instead of in RAM. I went with the atomic counter approach and haven’t tested the performance difference yet. The append-consume buffers are not available in every API, which makes them less appealing.


The following features are new and nicely fit into the new GPU particle pipeline:

  • Sorting
    • I never bothered with particle sorting on the CPU. It was already kind of slow without it, so I got away with only sorting per-emitter, so that farther away emitters were drawn earlier. I decided to go with bitonic sorting because I could just pull that from the web. It is a bit too involved, and I thought it would consume too much time to implement and debug on my own. AMD has a really nice implementation available. Sorting becomes a required step if the particles are not additively blended, because threads are now writing them in arbitrary order.
  • Depth buffer collisions
    • This is a very nice feature of GPU particle systems: essentially free physics simulation for particles which are on the screen. It only involves reading the depth buffer in the simulation phase, checking if the particle is behind it, and if it is, reading the normal buffer (or reconstructing the normal from the depth buffer) and modulating the particle velocity by reflecting it with the surface normal.
  • Force fields
    • This is completely possible with CPU particle systems as well, but now we can apply them to a much bigger simulation. In the simulation compute shader we can preload some force fields to LDS (local data share) for faster memory access.
  • Emit from skinned mesh
    • Mesh skinning is done on the GPU nowadays, so using the skinned meshes while emitting becomes trivial, with no additional cost whatsoever.
  • Async compute
    • Now I still haven’t had a chance to try any async compute, but this seems like a nice candidate for that because simulation could be very much decoupled from rendering and it could lead to better utilization of GPU resources. Async compute is available in the modern low level graphics APIs like DX12, Vulkan and console specific APIs. It also requires hardware support which is available only in the latest GPUs.
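
The bounce response mentioned under depth buffer collisions boils down to reflecting the velocity about the surface normal. A minimal sketch (the restitution factor is my own addition, to dampen the bounce):

```cpp
#include <cassert>
#include <cmath>

struct float3 { float x, y, z; };

float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
float3 operator-(float3 a, float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 operator*(float3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }

// Reflect the incoming velocity v about the unit surface normal n,
// scaled by a hypothetical restitution factor.
float3 bounce(float3 v, float3 n, float restitution) {
    float3 reflected = v - n * (2.0f * dot(v, n));
    return reflected * restitution;
}
```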



Debugging a system which lives on the GPU is harder than on the CPU, but essential. We should ideally make use of a graphics debugger, but there are also opportunities to make our life easier by creating some utilities for this purpose. The thing that helped me most is writing out some data about the simulation to the screen. For this, we need to access the data which is resident on the GPU, which works much like downloading something from a remote machine. Using the DirectX 11 API, we create a resource of the same type and size as the one we want to download, with D3D11_USAGE_STAGING usage, no bind flags and READ CPU access. We then issue a copy into this buffer from the one we want to download by calling ID3D11DeviceContext::CopyResource, and read the contents by mapping the buffer with READ flags. As the buffer contents will only be available once the GPU has finished rendering up to that point, we can either introduce a CPU-GPU sync point and wait in place until the operation completes, or do the mapping a few frames later. In a debugging scenario, a sync point might be sufficient and simpler to implement, but we should avoid any such behaviour in the final version of the application.


Drawing billboards would seem like a nice place to use geometry shaders. Unfortunately, geometry shaders introduce inefficiencies into the graphics pipeline for various reasons: primitives need to be traversed and written to memory serially, and some architectures even go as far as writing the GS output to system memory. The option of my choice is just leaving out the geometry shader and doing the billboard expansion in the vertex shader. For this, we must spawn the VS with a triangle list topology and a vertex count of particleCount * 6, then calculate the particle index and billboard vertex index from the SV_VertexID system-value semantic. Like this:

static const float3 BILLBOARD[] = {
	float3(-1, -1, 0), // 0
	float3(1, -1, 0),  // 1
	float3(-1, 1, 0),  // 2
	float3(-1, 1, 0),  // 3
	float3(1, -1, 0),  // 4
	float3(1, 1, 0),   // 5
};

VertextoPixel main(uint fakeIndex : SV_VERTEXID)
{
	uint vertexID = fakeIndex % 6;
	uint instanceID = fakeIndex / 6;
	Particle particle = particleBuffer[aliveList[instanceID]];
	float3 quadPos = BILLBOARD[vertexID];
	// …
}
Additionally, for better drawing performance, you should use indexed drawing with 4 vertices per quad, but that way the two index lists will be six times the size each, so bandwidth will increase for the simulation. Maybe it is still worth it, I need to compare performance results.


There are many possibilities to extend this system, because compute shaders make it very flexible. I am overall happy with how this turned out. Granted, my previous particle system was quite simplistic, so porting all the features was not very hard, and I didn’t have to make any compromises. The new system frees up CPU resources, which are more valuable for gameplay logic and other systems that are interconnected. Particles are usually completely decoupled from the rest of the engine, so they are an ideal candidate for running remotely on the GPU.

You can check the source code of my implementation of GPU-based particles in Wicked Engine:

Feel free to rip off any source code from there! Thank you for reading!

Inspiration from:

Compute – based GPU particles by Gareth Thomas

Which blend state for me?


If you are familiar with creating graphics applications, you are probably somewhat familiar with different blending states. If you are like me, then you were not overly confident in using them, and got some basic ones copy-pasted from the web. Maybe you got away with simple alpha blending and additive states, and heard of premultiplied alpha somewhere but didn’t really care as long as it looked decent enough at the time. Surely, there is a lot of much more interesting stuff waiting to be implemented. Then later you realize that something looks off with an alpha blended sprite somewhere. You correct it with some quick fix and forget about it. A week later, you want to play with some particle systems, but something is wrong: the blending doesn’t look good anymore because of a dirty tweak you made earlier. Also, your GUI layer was displaying the wrong color the whole time, but just subtly enough not to notice. There are just so many opportunities to screw up your blending states without noticing immediately. Correcting the mistakes can quickly turn into a big headache. Here I want to give some practical examples and explanations of different use cases, for techniques mainly used in 3D rendering engines.

First thing is rendering alpha blended sprites on top of each other, just to the back buffer immediately. We need a regular alpha blending renderstate for that which does this:

dst.rgb = src.rgb * src.a + dst.rgb * (1 – src.a)

Here, dst means the resulting color in the rendertarget (which is now the backbuffer). Src is the color of the sprite which the pixel shader returns. Our colors are just standard 32 bit rgba in the range [0, 1] here. With this, we have successfully calculated a good output color which we can just write as-is to the back buffer. For this, we don’t care what the alpha output is, because no further composition will be happening. Here is the corresponding state description for DirectX 11:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;

desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;

desc.BlendOp = D3D11_BLEND_OP_ADD;

The tricky part comes when we want to draw our alpha blended sprites to separate layers which will be composited later on. In that case we also have to be careful what alpha value we write out. Take a simple scenario for example, in which you render a sprite with alpha blending to a render target, and later you render your rendertarget to your backbuffer with the same alpha blending. For this, we want to accumulate alpha values, so just add them:

dst.a = src.a + dst.a

Which is equivalent to the following blend state in DirectX 11 (just append to the previous snippet):

desc.SrcBlendAlpha = D3D11_BLEND_ONE;

desc.DestBlendAlpha = D3D11_BLEND_ONE;

desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Accumulating alpha seems like a good fit for a 32 bit render target, as values will be clamped to one, so opacity will increase with overlapping sprites, but colors won’t be oversaturated. Try blending the render target layer to the back buffer now. There will be an error which might not be obvious at first (which makes it the worst kind of error). Let me show you:

On the black background, you might not even notice the error at first, but on the white background it becomes apparent at once (in this case, “background” means our backbuffer). For different images, there might be different scenarios where the problem becomes apparent. This could be a challenge to overcome if you already made a lot of assets, and maybe even compensated somehow for the error without addressing its source: the blend operation is not correct anymore! First we blended the sprite onto the layer, which is still correct, but then we blend the layer with the same operation again, so alpha messes with the colors twice. There is a correct solution to the problem: the premultiplied alpha blend operation, which is this:

dst.rgb = src.rgb + dst.rgb * (1 – src.a)

dst.a = src.a + dst.a

Create it in DX11 like this:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_ONE;

desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;

desc.BlendOp = D3D11_BLEND_OP_ADD;

desc.SrcBlendAlpha = D3D11_BLEND_ONE;

desc.DestBlendAlpha = D3D11_BLEND_ONE;

desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

The only change is that we do not multiply with source alpha any more. Our problem is fixed now, we can keep using our regular alpha blending on sprites, but use premultiplied blending for the layers. Simple right? But do not forget about premultiplied alpha just yet, it can help us out for more problems as well.
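
The error and the fix can be checked with plain arithmetic: composite a sprite onto an empty layer, then the layer onto a background, and compare against blending the sprite straight onto the background. With straight alpha used for both steps, alpha multiplies the color twice; with premultiplied blending for the layer step, the result matches. A single-channel sketch (my own illustration, not engine code):

```cpp
#include <cassert>
#include <cmath>

// One color channel plus alpha is enough to show the problem.
struct Color { float c, a; };

// dst.rgb = src.rgb * src.a + dst.rgb * (1 - src.a); dst.a = src.a + dst.a
Color blendStraight(Color src, Color dst) {
    return { src.c * src.a + dst.c * (1.0f - src.a), src.a + dst.a };
}

// dst.rgb = src.rgb + dst.rgb * (1 - src.a); dst.a = src.a + dst.a
Color blendPremultiplied(Color src, Color dst) {
    return { src.c + dst.c * (1.0f - src.a), src.a + dst.a };
}
```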

Rendering particle systems to off-screen buffers is needed for soft particles, and it can also help performance if the buffer is of a small resolution; I had multiple problems with it. One of the problems was the above mentioned faulty alpha blending of particles (the particles are the sprites, and the layer is the off-screen buffer, mapped to the previous example). The other issue is that I also want to render additive particles to the same render target and blend the whole thing later in a single pass. This is an additive blending state:

dst.rgb = src.rgb * src.a + dst.rgb

dst.a = dst.a

Which corresponds to this state in DX11:


desc.BlendEnable = TRUE;

desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;

desc.DestBlend = D3D11_BLEND_ONE;

desc.BlendOp = D3D11_BLEND_OP_ADD;

desc.SrcBlendAlpha = D3D11_BLEND_ZERO;

desc.DestBlendAlpha = D3D11_BLEND_ONE;

desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Notice that the src output doesn’t contribute to alpha, but that is no problem at all, because the premultiplied layer blending will still take the layer color, just adding the full destination color to it. This is only possible if our layer is configured for premultiplied blending; otherwise it would just disappear upon blending. We can also have our particle textures themselves be in a premultiplied texture format (where the texture colors are already multiplied by alpha) with the corresponding blend state, and blending that onto the layer just works. When I wasn’t familiar with this, I used a separate render target for regular alpha blended and premultiplied texture format particles and another one for additive ones, what a waste! We can see that premultiplied blending is already very flexible, so keep it in mind, because it can save the day on many occasions. It also makes a huge difference in mipmap generation, see this neat article from Nvidia!

Side note: premultiplied alpha blending was also widely used because it has better blending performance. Think of it as precomputing the alpha blend factor and storing it inside the texture. The performance reasons are probably not so apparent today.

I had another problem with particle system rendering, because they were rendered to an HDR floating point target, which doesn’t clamp the alpha values. Consider the case when a particle’s alpha value is bigger than one for whatever reason and is blended with regular alpha blending: the term dst.rgb * (1 – src.a) now of course produces negative values. This is easy to overcome by just saturating the alpha output of the pixel shader, done! The other problem is this: dst.a = src.a + dst.a can still result in larger than one alpha values, but this will only be a problem later on, when blending the layer as premultiplied alpha. We would need to saturate the (1 – src.a) term in the blending state, but we cannot; there is no such state. There is a D3D11_BLEND_SRC_ALPHA_SAT blend value, but no inverse counterpart of it. The workaround that I am using is to modify the particle alpha blending state to accumulate alpha a bit differently:

dst.a = src.a + dst.a * (1 – src.a)

In DX11 terms:

desc.SrcBlendAlpha = D3D11_BLEND_ONE;

desc.DestBlendAlpha = D3D11_BLEND_INV_SRC_ALPHA;

desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

This accumulation method is probably not perfect, but it works really well in practice.
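
A quick numeric check of this accumulation: for any inputs in [0, 1] the result stays in [0, 1], so the (1 – src.a) term of the later premultiplied layer blend never goes negative, and repeated blending still converges toward full opacity. A sketch:

```cpp
#include <cassert>

// dst.a = src.a + dst.a * (1 - src.a)
float accumulateAlpha(float srcA, float dstA) {
    return srcA + dstA * (1.0f - srcA);
}
```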


That’s it, I think these are the three most important blending modes; most effects can be achieved with a combination of these. Just always keep an eye on your blend state creation and be very explicit about it, that is the way to avoid many bugs down the road. If you were like me and haven’t paid much attention to these until now, this is the best time to revisit them, because tracking down the associated errors becomes a hard journey later. Thanks for reading!

Forward+ decal rendering

Drawing decals in deferred renderers is quite simple, straightforward and efficient: just render boxes like you render the lights, read the gbuffer in the pixel shader, project onto the surface, then sample and blend the decal texture. The light evaluation then already computes lighting for the decaled surfaces. In traditional forward rendering pipelines, this is not so trivial. It is usually done by cutting out the geometry under the decal, creating a new mesh from it with projected texture coordinates and rendering it for all lights, additively. Apart from the obvious increased draw call count and fillrate consumption, there is even potential for z-fighting artifacts. While moving to tile-based forward rendering (Forward+), we can surely think of something more high-tech.

We want to avoid additional geometry creation and an increased draw call count, while keeping the lighting computation constant. In addition, with this new technique we can even trivially support modification of surface properties, creating decals which can modify surface normal, roughness, metalness, emissive, etc., or even do parallax occlusion mapping. We can even apply decals to transparent surfaces easily! This article will describe the outline of the technique without source code. You can look at my implementation however, here: culling and sorting shader; blending evaluation shader.

In Forward+ we have a light culling step and a light evaluation step separately. The decals will be inserted into both passes. A culling compute shader iterates through a global light array and decides for each screen space tile which lights are inside, and adds them to a global list (in case of tiled deferred, it just adds them to a local list and evaluates lighting there and then). For adding decals to the culling, we need to extend the light descriptor structure to be able to hold decal information, and add functions to the shader to cull oriented bounding boxes (OBBs). We can implement OBB culling by doing coarse AABB tests: transform the AABB of the tile by the decal OBB’s inverse matrix (while keeping min-max up to date) and test the resulting AABB against a unit AABB. This is achieved by determining the 8 corner points of the tile AABB, transforming each by the inverse decal OBB, then determining the min and max corner points of the resulting points.
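
The coarse test described above can be sketched as follows. For illustration I reduce the decal transform to translation plus per-axis half extents; a real implementation would apply the decal OBB's full inverse 4×4 matrix, but the corner-transform-then-min-max structure is the same:

```cpp
#include <cassert>
#include <algorithm>

struct float3 { float x, y, z; };
struct AABB { float3 mn, mx; };

// Decal transform simplified to translation + half extents for this sketch.
struct Decal { float3 center, halfExtents; };

// Equivalent of transforming a point by the decal's inverse matrix:
// the decal volume maps to the unit AABB [-1, 1] in its local space.
float3 toDecalSpace(float3 p, const Decal& d) {
    return { (p.x - d.center.x) / d.halfExtents.x,
             (p.y - d.center.y) / d.halfExtents.y,
             (p.z - d.center.z) / d.halfExtents.z };
}

bool decalIntersectsTile(const Decal& d, const AABB& tile) {
    float3 mn = {  1e30f,  1e30f,  1e30f };
    float3 mx = { -1e30f, -1e30f, -1e30f };
    // Transform the 8 corners of the tile AABB, keeping min-max up to date:
    for (int i = 0; i < 8; ++i) {
        float3 corner = { (i & 1) ? tile.mx.x : tile.mn.x,
                          (i & 2) ? tile.mx.y : tile.mn.y,
                          (i & 4) ? tile.mx.z : tile.mn.z };
        float3 p = toDecalSpace(corner, d);
        mn = { std::min(mn.x, p.x), std::min(mn.y, p.y), std::min(mn.z, p.z) };
        mx = { std::max(mx.x, p.x), std::max(mx.y, p.y), std::max(mx.z, p.z) };
    }
    // Test the resulting AABB against the unit AABB:
    return mn.x <= 1 && mx.x >= -1 &&
           mn.y <= 1 && mx.y >= -1 &&
           mn.z <= 1 && mx.z >= -1;
}
```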


Rendering the decals takes place in the object shaders, where we also evaluate the lighting. If the decals can modify surface parameters, like normals, it is essential that we render the decals before the lights. For that, we must have a sorted decal list. We cannot avoid sorting the decals anyway, as I have found out the hard way. Because the culling is performed in parallel, the decals can be added to the tile in arbitrary order, but we have a strict order when blending the decals: the order we placed them onto the surface. If we don’t sort, it can lead to severe flickering artifacts when there are overlapping decals. Thankfully the sorting is straightforward, easily parallelized and can be done entirely in LDS (Local Data Share memory). I have taken this piece of code from an AMD presentation (a bitonic sort implementation in LDS).

The easiest way is to sort the decals in the CS so that the bottom decal is first and the top is last (bottom-to-top sorting). This way we can do regular alpha blending (which is a simple lerp in the shader) easily. Though we can do better: this way we sample all of the decals, even if the bottom ones are completely covered by decals placed on top. Instead we should sort the opposite way, so that we evaluate the top ones first, and then the decals underneath, but only until the alpha accumulation reaches one. We can skip the rest. The blending equation also needs to be modified for this. The same idea is presented in the above mentioned AMD presentation for tile based particle systems. The modified blending equation looks like this:

color = ( invDestA * srcA * srcCol ) + destCol

alpha = srcA + ( invSrcA * destA )

This method can save us much rendering time when multiple decals are overlapping. But this can result in different output when we have emissive decals for example. In the bottom-to-top blending, emissive decals will always be visible because the contribution is added to the light buffer, but the top-to-bottom sorting (and skip) algorithm will skip the decals which are completely covered. I think this is “better” behaviour overall but on a subjective basis of course.
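
This top-to-bottom accumulation can be verified against ordinary bottom-to-top blending: both produce the same result, but the front-to-back form lets us stop as soon as accumulated alpha reaches one. A single-channel sketch (my own illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Layer { float color, alpha; };

// Bottom-to-top: regular alpha blending (a simple lerp per layer).
float blendBackToFront(const std::vector<Layer>& bottomToTop, float background) {
    float dst = background;
    for (const Layer& l : bottomToTop)
        dst = l.color * l.alpha + dst * (1.0f - l.alpha);
    return dst;
}

// Top-to-bottom with the modified equations:
//   color = (invDestA * srcA * srcCol) + destCol
//   alpha = srcA + (invSrcA * destA)
// and an early out once alpha accumulation reaches one.
float blendFrontToBack(const std::vector<Layer>& bottomToTop, float background) {
    float color = 0.0f, alpha = 0.0f;
    for (int i = (int)bottomToTop.size() - 1; i >= 0; --i) {
        const Layer& l = bottomToTop[i];
        color = (1.0f - alpha) * l.alpha * l.color + color;
        alpha = l.alpha + (1.0f - l.alpha) * alpha;
        if (alpha >= 1.0f)
            break; // layers underneath are fully covered, skip them
    }
    return color + (1.0f - alpha) * background;
}
```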

The nice thing about this technique is that we can trivially modify surface properties if we just sample all of our decals before all of the lights. Take this for example: we want to modify the normal of the surface with the decal normal map. We already have our normal vector in our object shader, so when we get to the decals, we just blend it in shader code with the decal normal texture, without the need for any packing/unpacking and tricky blending of g-buffers (à la deferred). The light evaluation which comes after it “just works” with the new surface normal without any modification at all.

Maybe you have noticed that we need to do the decal evaluation in dynamically branching code, which means we must give up the default mip-mapping support. This is because from the compiler’s standpoint, we might perfectly well not be evaluating the same decals in neighbouring pixels, but we need those helper pixels for correct screen space derivatives. In our case, where the tile dimensions are multiples of two (I am using 16×16 tiles), we are coherent across our helper pixels, but the compiler doesn’t know that, unfortunately. I haven’t yet found a satisfying way to overcome this problem. I experimented with linear distance/screen space size based mip selection, but found them unsatisfying for my purposes (they might be ok for a single game/camera type though).


Update: Thanks to MJP, I learned a new technique for obtaining nice mip mapping results: we just need to take the derivatives of the surface world position, transform them by the decal projection matrix (but leave out the translation), and we have the decal derivatives that we can feed into Texture2D::SampleGrad, for example. An additional note: when using a texture atlas for the decals, we need to take the atlas texture coordinate transformation into consideration, so just multiply the decal derivatives by the scaling part of the atlas transformation. Cool technique!

We also need to somehow dynamically support different decal textures in the same object shader. A texture atlas comes in handy in this case, and bindless textures are also an option in newer APIs.

As we have added decal support to the tiled light array structure, the structure is probably getting bloated, which makes it less cache efficient, because most lights don’t need a decal inverse matrix (for projection), texture atlas offsets, etc. For this, the decals could probably get their own structure and a different array, or we could just tightly pack everything in a raw buffer (ByteAddressBuffer in DX11). I need to experiment with this.


This technique is a clear upgrade from traditional forward rendered decals, but comparing it with deferred decals is not a trivial matter. First, we can certainly optimize deferred decals in several ways. I have already been toying with the idea of using Rasterizer Ordered Views to optimize the blending in a similar way and eliminate overdraw. Secondly, we have increased branching and register pressure in the forward rendering pass, while rasterization of deferred decals is a much more lightweight shader which can be better parallelized when the overdraw is not so apparent. In that case, we can get away with rendering many more deferred decals than tiled decals. The tile-based approach gets much better with increased overdraw, because of the “skip the occluded” behaviour as well as the reduced bandwidth cost of not having to sample a G-buffer for each decal. Forgive me for not providing actual profiling data at the moment; this article intends to be merely a brain dump, but I also hope somewhat inspirational.

Skinning in a Compute Shader


Recently I moved my mesh skinning implementation from a streamout geometry shader to a compute shader. One reason for this was the ugly API for the streamout which I wanted to leave behind, but the more important reason was that this could come with several benefits.

First, compared to traditional skinning in a vertex shader, the render pipeline can be simplified, because we only perform skinning once for each mesh instead of in each render pass. So when we render our animated models multiple times, for shadow maps, Z-prepass, lighting pass, etc., we are using regular vertex shaders for those passes with the vertex buffer swapped out for the pre-skinned vertex buffer. We also avoid a lot of render state setup, like binding bone matrix buffers for each render pass. But this can be done in a geometry shader with stream out capabilities as well.

The compute shader approach has some other nice features compared to the first point. The render pipeline of Wicked Engine requires the creation of a screen space velocity buffer. For that, we need our previous frame animated vertex positions. If we don’t do it in a compute shader, we probably need to skin each vertex with the previous frame bone transforms in the current frame to get the velocity of the vertex, which is currentPos – prevPos (if we have deinterleaved vertex buffers, we could avoid it by swapping vertex position buffers). In a compute shader, however, this becomes quite straightforward. Perform skinning only with the current frame bone matrices, but before writing out the skinned vertex to the buffer, load the previous value of the position: that is your previous frame vertex position. Then write everything out to the buffers at the end.

In a compute shader, it is the developer who assigns the workload across threads, instead of relying on the default vertex shader thread invocations. Also, the vertex shader stage has strict ordering specifications, because vertices must be written out in the exact same order they arrived. A compute shader can just write into the skinned vertex buffer in any order when it is finished. That said, it is also the developer’s responsibility to avoid write conflicts. Thankfully, that is quite trivial here, because we are writing a linear array of data.

In compute shaders we can also make use of LDS memory to reduce memory reads. This can be implemented so that each thread in a group loads one bone from main memory and stores it in LDS. Then the skinning computation just reads the bone data from LDS; because each vertex now reads its 4 bones from LDS instead of VRAM, this has the potential for a speedup. I have written a blog post about this.

Another nice feature is the possibility to leverage async compute in the newer graphics APIs like DirectX 12, Vulkan or the PlayStation 4 graphics API. I don’t have experience with it yet, but I imagine it would be more taxing on memory, because we would probably need to double buffer the skinned vertex buffers.

Another optimization also becomes possible: if performance is bottlenecked by skinning in our scene, we can skip skinning meshes in the distance every other frame or so, a kind of level of detail technique for skinning.

The downside is that this technique comes with increased memory requirements, because we must write into global memory to provide the data up front for following render passes. We also avoid the fast on-chip memory of the GPU (memory for vertex shader to pixel shader parameters) for storing the skinned values.

Here is my shader implementation for skinning a mesh in a compute shader:

struct Bone
{
	float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer;

ByteAddressBuffer vertexBuffer_POS; // T-Pose pos
ByteAddressBuffer vertexBuffer_NOR; // T-Pose normal
ByteAddressBuffer vertexBuffer_WEI; // bone weights
ByteAddressBuffer vertexBuffer_BON; // bone indices
RWByteAddressBuffer streamoutBuffer_POS; // skinned pos
RWByteAddressBuffer streamoutBuffer_NOR; // skinned normal
RWByteAddressBuffer streamoutBuffer_PRE; // previous frame skinned pos

inline void Skinning(inout float4 pos, inout float4 nor, in float4 inBon, in float4 inWei)
{
	float4 p = 0;
	float3 n = 0;
	float4x4 m;
	float3x3 m3;
	float weisum = 0;

	// force loop to reduce register pressure
	// though this way we can not interleave TEX - ALU operations
	[loop]
	for (uint i = 0; ((i < 4) && (weisum < 1.0f)); ++i)
	{
		m = boneBuffer[(uint)inBon[i]].pose;
		m3 = (float3x3)m;

		p += mul(float4(pos.xyz, 1), m) * inWei[i];
		n += mul(nor.xyz, m3) * inWei[i];

		weisum += inWei[i];
	}

	bool w = any(inWei);
	pos.xyz = w ? p.xyz : pos.xyz;
	nor.xyz = w ? n : nor.xyz;
}

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
	const uint fetchAddress = DTid.x * 16; // stride is 16 bytes for each vertex buffer now...

	uint4 pos_u = vertexBuffer_POS.Load4(fetchAddress);
	uint4 nor_u = vertexBuffer_NOR.Load4(fetchAddress);
	uint4 wei_u = vertexBuffer_WEI.Load4(fetchAddress);
	uint4 bon_u = vertexBuffer_BON.Load4(fetchAddress);

	float4 pos = asfloat(pos_u);
	float4 nor = asfloat(nor_u);
	float4 wei = asfloat(wei_u);
	float4 bon = asfloat(bon_u);

	Skinning(pos, nor, bon, wei);

	pos_u = asuint(pos);
	nor_u = asuint(nor);

	// copy prev frame current pos to current frame prev pos
	streamoutBuffer_PRE.Store4(fetchAddress, streamoutBuffer_POS.Load4(fetchAddress));

	// write out skinned props:
	streamoutBuffer_POS.Store4(fetchAddress, pos_u);
	streamoutBuffer_NOR.Store4(fetchAddress, nor_u);
}

Oh god I hate this wordpress code editor… (maybe I just can’t use it properly)

As you can see, quite simple code, I just call this compute shader with something like this:

Dispatch( ceil(mesh.vertices.getCount() / 256.0f), 1, 1);
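
The float ceil works, but the thread group count can also be computed with an integer round-up division, which avoids the float conversion entirely (a common pattern; the function name here is my own):

```cpp
#include <cassert>
#include <cstdint>

// How many groups of groupSize threads are needed to cover vertexCount vertices.
uint32_t dispatchGroupCount(uint32_t vertexCount, uint32_t groupSize) {
    return (vertexCount + groupSize - 1) / groupSize;
}
```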

These vertex buffers are not packed yet as of now, which is quite inefficient. Of course, positions could probably be stored as 16-bit float3s (but then you must animate in local space), normals can be packed nicely into 32-bit uints, and bone weights and indices should be combined into a single buffer and packed into uints as well. If you are using raw buffers (ByteAddressBuffer in HLSL), then you have to do the type conversion yourself. You can also use typed buffers, but performance may be diminished. You can see an example of these optimizations with manual type conversion of compressed vertex streams in my Wicked Engine repo.
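
As a taste of the packing mentioned above, a unit normal in [-1, 1] can be squeezed into a 32-bit uint with 8 bits per component (leaving one byte spare). This is a sketch of one possible scheme, not necessarily the one used in the engine:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map each component from [-1, 1] to an 8-bit integer and pack into one uint.
uint32_t packNormal(float x, float y, float z) {
    auto toByte = [](float v) -> uint32_t {
        return (uint32_t)std::lround((v * 0.5f + 0.5f) * 255.0f);
    };
    return toByte(x) | (toByte(y) << 8) | (toByte(z) << 16);
}

void unpackNormal(uint32_t p, float& x, float& y, float& z) {
    auto fromByte = [](uint32_t b) -> float {
        return (float)b / 255.0f * 2.0f - 1.0f;
    };
    x = fromByte(p & 0xFF);
    y = fromByte((p >> 8) & 0xFF);
    z = fromByte((p >> 16) & 0xFF);
}
```

The round trip loses about 1/255 of precision per component, which is generally fine for normals; 10-10-10-2 formats give a bit more precision in the same 32 bits.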

I have been using precomputed skinning in Wicked Engine for a long time now, so I can’t compare with the vertex shader approach, but it is definitely not worse than the streamout technique. I can imagine that for some titles, it might not be worth it to store additional vertex buffers in VRAM and give up on-chip memory for skinning results. However, this technique could be a candidate in optimization scenarios, because it is easy to implement, and I think also easier to maintain, because we can avoid the shader permutations for skinned and non-skinned models.

Thanks for reading!

Area Lights


I am trying to get back into blogging. I thought writing about implementing area light rendering might help me with that.

If you are interested in the full source code, pull my implementation from the Wicked Engine lighting shader. I won’t post it here, because I’d rather just talk about it.

A 2014 Siggraph presentation from Frostbite caught my attention for showcasing their research on real time area light rendering. When learning graphics programming from various tutorials, there are explanations for punctual light source rendering, like point, spot and directional lights. Even most games get away with using these simplistic light sources.

For rendering area lights, we need much more complicated lighting equations, and our shaders have higher performance requirements. Luckily, the above mentioned presentation came with a paper containing all the shaders for the diffuse lighting equations for spherical, disc, rectangular and tube light sources.

The code for specular lighting for these types of lights was not included in that paper, but it mentioned the “representative point method”. What this technique essentially does is keep the specular calculation but change the light vector. The light vector used to be the vector pointing from the light position to the surface position. But for our lights, we are not interested in the reflection between the light’s center and the surface, but between the light “mesh” and the surface.

Representative point method

If we modify the light vector to point from the surface to the point on the light mesh closest to the reflection vector, then we can keep using our specular BRDF equation and we will get a nice result; the specular highlight will take the shape of the light mesh (or somewhere close to it). It is important to note that this is not a physically accurate model, but it looks nice and is still performant in real time.

My first intuition was that we could simply trace the mesh with the reflection ray. Then our light vector (L) is the vector from the surface point (P) to the intersection point (I), so L = I - P. The problem is: what if there is no intersection? Then we won’t have a light vector to feed into our specular BRDF. This way we only get hard cutoff reflections, and surface roughness won’t work because the reflections can’t be properly “blurred” at the edges where there is no trace hit.

The correct approach is to find the closest point on the mesh to the reflection ray. If the trace succeeds, then our closest point is the hit point; if not, then we have to “rotate” the light vector to successfully trace the mesh. We don’t actually rotate, we just find the closest point (C), and so our new light vector is: L = C - P.

See the image below (V = view vector, R = reflection vector):

[Image: representative point method]

For all our four different light types, we have to come up with the code to find the closest point.

  • Sphere light:
    • This one is simple: first calculate the real reflection vector (R) and the old light vector (L). Additional symbols: surface normal vector (N) and view vector (V).

      R = reflect(V, N);

      centerToRay = dot(L, R) * R - L;

      closestPoint = L + centerToRay * saturate(lightRadius / length(centerToRay));

  • Disc light:
    • The idea is, first trace the disc plane with the reflection vector, then just calculate the closest point to the sphere from the plane intersection point like for the sphere light type. Tracing a plane is trivial:

      distanceToPlane = dot(planeNormal, planeOrigin - rayOrigin) / dot(planeNormal, rayDirection);

      planeIntersectionPoint = rayOrigin + rayDirection * distanceToPlane;

  • Rectangle light:
    • Now this is a bit more complicated. The algorithm I use consists of two paths: the first path is when the reflection ray can trace the rectangle; the second is when the trace didn’t succeed. In that case, we need to find the intersection with the plane of the rectangle, then find the closest point to the plane intersection point on one of the four edges of the rectangle.
    • For tracing the rectangle, I trace the two triangles that make up the rect and take the correct intersection if it exists. Tracing a triangle involves tracing the triangle plane, then deciding if we are inside the triangle. A, B, C are the triangle corner points.

      planeNormal = normalize(cross(B - A, C - B));

      planeOrigin = A;

      t = Trace_Plane(rayOrigin, rayDirection, planeOrigin, planeNormal);

      p = rayOrigin + rayDirection * t;

      N1 = normalize(cross(B - A, p - B));

      N2 = normalize(cross(C - B, p - C));

      N3 = normalize(cross(A - C, p - A));

      d0 = dot(N1, N2);

      d1 = dot(N2, N3);

      intersects = (d0 > 0.99) AND (d1 > 0.99);

    • The other algorithm we need finds the closest point on a line segment to a point. A and B are the line segment endpoints, C is the point on the plane.

      AB = B - A;

      t = dot(C - A, AB) / dot(AB, AB);

      closestPointOnSegment = A + saturate(t) * AB;

  • Tube light:
    • First, we should calculate the closest point on the tube line segment to R. Then just place a sphere at that point and do as we did for the sphere light (that is, calculate the closest point on the sphere to the reflection ray R). Every algorithm needed has already been described at this point, so all that remains is to put them together.
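The closest-point helpers above can be sketched on the CPU for testing; here is a C++ version using a minimal float3 type (the operator overloads and function names are mine, mirroring the HLSL snippets in the text):

```cpp
#include <cmath>
#include <algorithm>

// Minimal CPU-side stand-in for HLSL vectors.
struct float3 { float x, y, z; };
static float3 operator+(float3 a, float3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static float3 operator-(float3 a, float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float3 operator*(float3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float length(float3 a) { return std::sqrt(dot(a, a)); }
static float saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Sphere light: closest point on the sphere to the reflection ray.
// L points from the surface to the light center; R is the reflection direction.
float3 ClosestPointOnSphere(float3 L, float3 R, float lightRadius)
{
    float3 centerToRay = R * dot(L, R) - L;
    return L + centerToRay * saturate(lightRadius / length(centerToRay));
}

// Ray vs. plane: returns the signed distance t along the ray.
float TracePlane(float3 rayOrigin, float3 rayDirection, float3 planeOrigin, float3 planeNormal)
{
    return dot(planeNormal, planeOrigin - rayOrigin) / dot(planeNormal, rayDirection);
}

// Closest point on segment AB to point C.
float3 ClosestPointOnSegment(float3 A, float3 B, float3 C)
{
    float3 AB = B - A;
    float t = dot(C - A, AB) / dot(AB, AB);
    return A + AB * saturate(t);
}
```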

So what do you do when you have the closest point on the light surface? You have to convert it to the new light vector: newLightVector = closestPoint - surfacePos.

When you have your new light vector, you can feed it to the specular BRDF function and in the end you will get a nice specular highlight!


With the regular shadow mapping techniques, we can do shadows for area lights as well. The results are again not accurate, but they get the job done. In Wicked Engine, I am only doing regular cube map shadows for area lights, like I would for point lights. I can’t say I am happy with them, especially for long tube lights. In another engine however, I have been experimenting with dual paraboloid shadow mapping for point lights. I recommend a single paraboloid shadow map in the light facing direction for the disc and rectangle area lights. These are in my opinion better than regular perspective shadow maps, which distort heavily at high fields of view (these light types would require nearly 180 degrees of FOV).

For the sphere and tube light types I still recommend cubemap shadows.
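The paraboloid mapping itself is simple; below is a hypothetical C++ sketch of mapping a light-space direction to paraboloid shadow map UVs, assuming the light faces +z. This is the standard d.xy / (1 + d.z) paraboloid formula, not code from Wicked Engine:

```cpp
#include <cmath>

// Map a light-space direction d (with d.z >= 0 for a single paraboloid)
// onto the paraboloid: uv = d.xy / (1 + d.z), remapped from [-1, 1]
// to [0, 1] texture space.
void ParaboloidMapUV(float dx, float dy, float dz, float& u, float& v)
{
    float len = std::sqrt(dx * dx + dy * dy + dz * dz);
    dx /= len; dy /= len; dz /= len; // normalize the direction

    float px = dx / (1.0f + dz);
    float py = dy / (1.0f + dz);
    u = px * 0.5f + 0.5f;
    v = py * 0.5f + 0.5f;
}
```

The same mapping runs in the shadow map rendering vertex shader and again when sampling the map at lighting time.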

Original sources:

Voxel-based Global Illumination


There are several use cases for a voxel data structure. One interesting application is using it to calculate global illumination, and there are a couple of techniques for that, too. I have chosen the voxel cone tracing approach because I found it the most flexible one for dynamic scenes; CryEngine, for example, uses light propagation volumes instead, with a sparse voxel octree which has a smaller memory footprint. The cone tracing technique works best with a regular voxel grid, because we perform ray-marching against the data, like with screen space reflections for example. A regular voxel grid consumes more memory, but it is faster to create (voxelize) and more cache efficient to traverse (ray-march).

So let’s break this technique down into pieces. I have to disclose this at the beginning: we can do everything in this technique in real time if we do it all on the GPU. First, we have our scene made of polygonal meshes. We need to convert it to a voxel representation: a 3D texture which holds the direct illumination of the voxelized geometries in each pixel. There is an optional step here which I describe later. Once we have this, we can pre-integrate it by creating a mipmap chain for the resource. This is essential for cone tracing, because we want to ray-march the texture with quadrilinear interpolation (sampling a 3D texture with min-mag-mip-linear filtering). We can then retrieve the bounced direct illumination in a final screen space cone tracing pass. The additional step in the middle is relevant if we want more bounces, because we can dispatch additional cone tracing compute shader passes for the whole structure (not in screen space).

The nice thing about this technique is that we can retrieve all sorts of effects. We have “free” ambient occlusion by default when doing this cone tracing, light bouncing, but we can retrieve reflections, refractions and shadows as well from this voxel structure with additional ray march steps. We can have a configurable amount of light bounces. Cone tracing code can be shared between the bouncing and querying shader and different types of rays as well. The entire thing remains fully on the GPU, the CPU is only responsible for command buffer generation.

Following this, I will describe the above steps in more detail. I will be using the DirectX 11 graphics API, but any modern API will probably do the job. You will definitely need a recent GPU for the most efficient implementation. This technique is targeted at PCs or the most recent consoles (Playstation 4 or Xbox One). It most likely can not run on mobile or handheld devices because of their limited hardware.

I think this is an advanced topic and I’d like to aim for experienced graphics programmers, so I won’t present code samples for the more trivial parts, but the whole implementation is available to anyone in Wicked Engine.

Part 1: Voxelization on the GPU

The most involved part is definitely the first one, the voxelization step. It involves making use of advanced graphics API features like geometry shaders, abandoning the output merger and writing into resources “by hand”. We can also make use of new hardware features like conservative rasterization and rasterizer ordered views, but we can implement their functionality in the shaders as well.

The main trick to running this in real time is parallelizing the process well. For that, we will exploit the fixed function rasterization hardware, and we will get a pixel shader invocation for each voxel which will be rendered. We also only do a single render pass for every object.

We need to integrate the following pipeline to our scene rendering algorithm:

1.) Vertex shader

The voxelizing vertex shader needs to transform vertices into world space and pass the attributes through to the geometry shader stage. Or just do a pass-through and transform to world space in the GS; it doesn’t matter.

2.) Geometry shader

This will be responsible for selecting the best facing axis of each triangle received from the vertex shader. This is important because we want to voxelize each triangle once, on the axis along which it is most visible, otherwise we would get seams and bad looking results.

// select the greatest component of the face normal
// (input is the array of three triangle vertices)
float3 facenormal = abs(input[0].nor + input[1].nor + input[2].nor);
uint maxi = facenormal[1] > facenormal[0] ? 1 : 0;
maxi = facenormal[2] > facenormal[maxi] ? 2 : maxi;

After we have determined the dominant axis, we need to project onto it orthogonally by swizzling the position’s xyz components, then setting the z component to 1 and scaling to clip space.

for (uint i = 0; i < 3; ++i)
{
 // voxel space pos:
 output[i].pos = float4((input[i].pos.xyz - g_xWorld_VoxelRadianceDataCenter) / g_xWorld_VoxelRadianceDataSize, 1);

 // Project onto dominant axis:
 if (maxi == 0)
 output[i].pos.xyz = output[i].pos.zyx;
 else if (maxi == 1)
 output[i].pos.xyz = output[i].pos.xzy;

 // projected pos:
 output[i].pos.xy /= g_xWorld_VoxelRadianceDataRes;
 output[i].pos.z = 1;

 output[i].N = input[i].nor;
 output[i].tex = input[i].tex;
 output[i].P = input[i].pos.xyz;
 output[i].instanceColor = input[i].instanceColor;
}

At the end, we could also expand our triangle a bit to be more conservative to avoid gaps. We could also just be setting a conservative rasterizer state if we have hardware support for it and avoid the expansion here.

// Conservative Rasterization setup:
float2 side0N = normalize(output[1].pos.xy - output[0].pos.xy);
float2 side1N = normalize(output[2].pos.xy - output[1].pos.xy);
float2 side2N = normalize(output[0].pos.xy - output[2].pos.xy);
const float texelSize = 1.0f / g_xWorld_VoxelRadianceDataRes;
output[0].pos.xy += normalize(-side0N + side2N) * texelSize;
output[1].pos.xy += normalize(side0N - side1N) * texelSize;
output[2].pos.xy += normalize(side1N - side2N) * texelSize;

It is important to pass the vertices’ world position to the pixel shader, because we will use that directly to index into our voxel grid data structure and write into it. We will also need texture coords and normals for correct diffuse color and lighting.

3.) Pixel shader

After the geometry shader, the rasterizer unit schedules pixel shader invocations for our voxels, so in the pixel shader we determine the color of the voxel and write it into our data structure. We probably need to sample the base texture of the surface and evaluate the direct lighting which affects the fragment (the voxel). While evaluating the lighting, use a forward rendering approach: iterate through the nearby lights for the fragment and do the light calculations for the diffuse part only. Leave the specular out of it, because we don’t care about the view-dependent part now; we want to be able to query lighting from any direction later anyway. I recommend using a simplified lighting model, but try to keep it somewhat consistent with your main lighting model, which is probably a physically based model (at least it is for me and you should also have one :P), and account for the energy loss caused by leaving out the specularity.

When you have calculated the color of the voxel, write it out by using the following trick: I didn’t bind a render target for the render pass, but I have set an Unordered Access View by calling OMSetRenderTargetsAndUnorderedAccessViews(). So the shader returns nothing, but we write into our voxel grid in the shader code. My voxel grid is a RWStructuredBuffer here to be able to support atomic operations easily, but later it will be converted to a 3D texture for easier filtering and better cache utilization. The structured buffer is a linear array of VoxelType of size gridDimensions X*Y*Z. VoxelType is a structure holding a 32-bit uint for the voxel color (packed HDR color with 0-255 RGB, an emissive multiplier in 7 bits, and the last bit indicating whether the voxel is empty or not). The structure also contains a normal vector packed into a uint. Our interpolated 3D world position comes in handy when determining the write position into the buffer: just truncate and flatten the interpolated world position which you received from the geometry shader. For writing the results, you must use atomic max operations on the voxel uints. You could write to a texture here without atomic operations by using rasterizer ordered views, but they don’t support volume resources, so a multi-pass approach would be necessary for the individual slices of the texture.
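The truncate-and-flatten step can be illustrated with a small C++ sketch; the grid parameter names here are assumptions standing in for the engine’s constant buffer values:

```cpp
#include <cmath>
#include <cstdint>

// Turn a world position into a flat index for the linear voxel buffer:
// truncate to a voxel coordinate relative to the grid center, then
// flatten as x + y * res + z * res * res.
int32_t WorldToVoxelIndex(float px, float py, float pz,
                          float gridCenterX, float gridCenterY, float gridCenterZ,
                          float voxelSize, int32_t res)
{
    // world -> grid-local voxel coordinate, truncated to integers
    int32_t x = (int32_t)std::floor((px - gridCenterX) / voxelSize) + res / 2;
    int32_t y = (int32_t)std::floor((py - gridCenterY) / voxelSize) + res / 2;
    int32_t z = (int32_t)std::floor((pz - gridCenterZ) / voxelSize) + res / 2;

    // positions outside the grid don't belong to any voxel
    if (x < 0 || x >= res || y < 0 || y >= res || z < 0 || z >= res)
        return -1;

    return x + y * res + z * res * res;
}
```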

An additional note: If you have generated shadow maps, you can use them in your lighting calculations here to get more proper illumination when cone tracing. If you don’t have shadow maps, you can even use the voxel grid to retrieve (soft) shadow information for the scene later.


If you got this far, you have just voxelized the scene. You should write a debugger to visualize the results. I am using a naive approach which is maybe a bit slow, but gets the job done: I issue a Draw() command with a vertex count of voxel grid dimensions X*Y*Z, read my voxel grid in the vertex shader indexed by SV_VertexID, then expand to a cube in the geometry shader if the voxel color is not empty (greater than zero). The pixel shader outputs the voxel color for each covered screen pixel.

Part 2: Filtering the data

We voxelized our scene into a linear array of voxels with nicely packed data. The packed data helped in the voxelization process, but it is no good for cone tracing; we need a texture which we can filter and sample. I have a compute shader which unpacks the voxel data, copies it into a 3D texture with RGBA16 format for HDR colors, and finally also clears the packed voxel data by filling it with zeroes. A nice effect would be not just writing the target texture, but interpolating with the old values, so that abrupt changes in lighting or moving objects don’t cause much flickering. But then we have to account for the camera moving and offsetting the voxel grid. We could lerp intelligently with a nice algorithm, but I found that the easiest method, “do not lerp when the voxel grid got offset”, was good enough for me.

Then we generate a mip chain for the 3D texture. DX11 can do this automatically for us by calling GenerateMips() on the device context, but we can also do it in shaders if we want better quality than the default box filter. I experimented with gaussian filtering, but I couldn’t write one fast enough to be worthwhile, so I am using the default filter.
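For clarity, one mip step of the default box filter amounts to averaging each 2x2x2 block of source voxels; a single-channel C++ sketch (my own illustration, the real filtering happens on the GPU):

```cpp
#include <vector>
#include <cstddef>

// One 3D mip step with a box filter: each destination voxel averages the
// 2x2x2 source block. `src` is a res^3 volume of floats (one channel for
// brevity); `res` must be even.
std::vector<float> BoxFilterMip(const std::vector<float>& src, size_t res)
{
    const size_t dstRes = res / 2;
    std::vector<float> dst(dstRes * dstRes * dstRes);
    for (size_t z = 0; z < dstRes; ++z)
    for (size_t y = 0; y < dstRes; ++y)
    for (size_t x = 0; x < dstRes; ++x)
    {
        float sum = 0;
        for (size_t dz = 0; dz < 2; ++dz)
        for (size_t dy = 0; dy < 2; ++dy)
        for (size_t dx = 0; dx < 2; ++dx)
        {
            const size_t sx = 2 * x + dx, sy = 2 * y + dy, sz = 2 * z + dz;
            sum += src[sx + sy * res + sz * res * res];
        }
        dst[x + y * dstRes + z * dstRes * dstRes] = sum / 8.0f;
    }
    return dst;
}
```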

But what about the normals that we saved in the voxelization process? They are only needed when doing multiple light bounces, or in more advanced voxel algorithms like anisotropic voxelization.


Part 3: Cone tracing

We have the voxel scene ready for our needs, so let’s query it for information. To gather the global illumination for the scene, we have to run the cone tracing in screen space once for every pixel. This can happen in the forward rendering object shaders, against the gbuffer in a deferred renderer when rendering a full screen quad, or in a compute shader. In forward rendering, we may lose some performance because of worse thread utilization if we have many small triangles. A Z-prepass is an absolute must-have if we are doing this in forward rendering: we don’t want to shade a pixel multiple times because this is a heavy computation.

For diffuse light bounces, we need the pixel’s surface normal and world position at minimum. From the world position, calculate the voxel grid coordinate, then shoot rays in the direction of the normal and around the normal in a hemisphere. The rays should not start at the surface voxel, but at the next voxel along the ray, so we don’t accumulate the current surface’s own lighting. Begin ray-marching: at each step, sample the voxels at increasing mip levels, accumulate color and alpha, and when alpha reaches 1, exit the march. Do this for each ray, and in the end divide the accumulated result by the number of rays. Now you have light bounce information and ambient occlusion information as well; just add it to your diffuse light buffer.

Assembling the hemisphere: you can create a hemisphere on a surface by using a static array of precomputed randomized positions on a sphere and the surface normal. First, if you do a reflect(surfaceNormal, randomPointOnSphere), you get a random point on a sphere with variance added by the normal vector. This helps with banding, as the discrete precomputed points get modulated by the surface normal. We still have a sphere, but we want only the upper half of it, so check if a point goes below the “horizon” and force it to the other direction if it does:

bool belowHorizon = dot(surfaceNormal, coneDirection) < 0;

coneDirection = belowHorizon ? -coneDirection : coneDirection;
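In C++ form, the same horizon check might look like this (a sketch for illustration, not engine code):

```cpp
// Force a sampled direction into the hemisphere around the normal:
// flip the direction if it points below the horizon. n and d are 3-component
// vectors; d is modified in place.
void ForceIntoHemisphere(const float n[3], float d[3])
{
    float below = n[0] * d[0] + n[1] * d[1] + n[2] * d[2];
    if (below < 0.0f)
    {
        d[0] = -d[0]; d[1] = -d[1]; d[2] = -d[2];
    }
}
```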

Avoid self-occlusion: So far, the method of my choice to avoid self occlusion is to start the cone tracing with offset from the surface by the normal direction and also the cone direction. If I don’t do this, then the cone starts off the surface and immediately samples its own voxel, so each surface would get its own contribution from the GI, which is not good. But if we start further off, then that means close by surfaces will not contribute to each other’s GI and there will be a visible disconnect in lighting. I imagine it would help to use anisotropic voxels, which means store a unique voxel for a few directions and only sample the voxels facing the opposite direction to the cone. This of course would require much additional memory to store.

Accumulating alpha: The correct way to accumulate alpha is a bit different to regular alpha blending:

float3 color = 0;

float alpha = 0;

// …

// And inside the cone tracing loop:

float4 voxel = SampleVoxels();

float a = 1 - alpha;

color += a * voxel.rgb;

alpha += a * voxel.a;

As you can see, this is more like front-to-back blending. This is important, because otherwise we would get a black staircase artefact at the edges of voxels, where the unfilled (black) regions with zero alpha would bleed into the result very aggressively.
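A tiny single-channel C++ model of this accumulation (my own illustration) shows the point: an empty voxel in front contributes nothing instead of darkening what is behind it:

```cpp
#include <utility>
#include <vector>

// Front-to-back accumulation, single channel for brevity.
// Each sample is a (color, alpha) pair in march order.
float AccumulateFrontToBack(const std::vector<std::pair<float, float>>& samples)
{
    float color = 0, alpha = 0;
    for (const auto& s : samples) // s.first = voxel color, s.second = voxel alpha
    {
        float a = 1 - alpha;
        color += a * s.first;
        alpha += a * s.second;
        if (alpha >= 1)
            break; // fully occluded, stop marching
    }
    return color;
}
```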

Stepping the cone: when we step along the ray in voxel-size increments (ray-marching) in world space, we can retrieve the diameter of the cone at a given position by calculating this:

float coneDiameter = 2 * tan(coneHalfAngle) * distanceFromConeOrigin;

Then we can retrieve the correct mip level to sample from the 3D texture by doing:

float mip = log2(coneDiameter / voxelSize);
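These two formulas combine into a small helper; here is a C++ sketch (the clamp to mip 0 near the cone origin is my addition):

```cpp
#include <algorithm>
#include <cmath>

// For a cone with the given half angle, at marching distance t from the
// cone origin, select the mip level of the voxel texture to sample:
// diameter = 2 * tan(halfAngle) * t, mip = log2(diameter / voxelSize).
float ConeMipLevel(float coneHalfAngle, float distanceFromConeOrigin, float voxelSize)
{
    float coneDiameter = 2.0f * std::tan(coneHalfAngle) * distanceFromConeOrigin;
    return std::max(0.0f, std::log2(coneDiameter / voxelSize));
}
```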

With this, we have a single light bounce for our scene. But much better results can be achieved with at least a single secondary light bounce. Read on for that.


Part 4: Additional light bounces

This is a simple step if you are familiar with compute shaders and you have wrapped the cone tracing function to be reusable. After we have filtered our voxel grid, we spawn a thread in a compute shader for each voxel (better: just for the non-empty voxels), unpack its normal vector, and do the cone tracing like in the previous step, but for each voxel instead of each pixel on the screen. By the way, this needs to write into an additional 3D texture, because we are sampling the filtered one in this pass, so mind the additional memory footprint.

Part 5: Cone traced reflections

To trace reflections with cone tracing, use the same technique, but the stepping along mip levels should take the surface roughness into account. For rough surfaces, the cone should approach the diffuse cone tracing size; for smooth surfaces, keep the mip level increase to a minimum. Just experiment with it until you get results which you like. Or go physically based; it will be much cooler and would probably make for a nice paper.

Maybe the voxel grid resolution which is used for the diffuse GI is not fine enough for reflections; you will probably want a finer voxelization for them. Maybe using separate voxel data for diffuse and specular reflections is a good idea, with some update frequency optimizations. You could, for example, update the diffuse voxels in even frames and the specular voxels in odd frames, or something like that.

You probably want this as a fallback to screen space reflections, if they are available.


Part 6: Consider optimizations

The technique, at the current stage, will only work on very fast GPUs. But there are already some games using tech like this (Rise of the Tomb Raider uses voxel AO), or parts of it, even on consoles (The Tomorrow Children). This is possible with some aggressive optimization techniques. Sparse voxel octrees can reduce memory requirements, and voxel cascades can bring up framerates with clever update frequency changes. And of course, do not re-voxelize anything that is not necessary, e.g. static objects (however, it can be difficult to separate them, because dynamic lights should also force re-voxelization of static objects if they intersect).

And as always, you can see my source code at my GitHub! Points of interest:

Thank you for reading!