Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the flat bit array light culling technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with it. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparisons between the two techniques for each point follow each other immediately, but decided against it; instead there are two big blocks: the stream compaction technique, then the flat bit array technique following it. The points are in a similar order for both, so comparisons can easily be made by having two browser windows open, for example. It wouldn’t be so simple to do the opposite.)

Stream compaction

Idea is based on Johan Andersson’s 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

  • Each tile contains a list of lights. Each entry is one uint that is an index into the light array. So each tile contains max_lightcount_per_tile uints. For me it was 256 entries per tile, decided at compile time.
    • Possible to compact even further in a way that the average light count per tile is specified (instead of max) when allocating tile buffer and have one more level of indirection. The other indirection here is introducing a “light grid” buffer/texture that would contain the offset and count into the “light index buffer”, while the “light index buffer” contains the light indices in a very compact stream (next tile list’s first entry immediately follows the last light index in the previous tile). This way one tile can contain an unbounded number of lights (well, still bounded on the global tile buffer allocation size). The construction of such structure is complicated and not worth the trouble in my experience (it uses even more atomics). You can see such technique presented in [this other blog]. I have implemented this technique, but in the end I chose to keep the hard coded max items per tile and have one less level of indirection.
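To make the light grid indirection concrete, here is a CPU-side C++ sketch of building the “light grid” + “light index buffer” pair described above (ignoring the GPU atomics that make the shader version painful; names are mine, not from the technique):

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of the "light grid + light index buffer" indirection: the grid
// stores (offset, count) per tile, while the index buffer stores all tile
// lists back to back. In the compute shader version the offsets come from
// atomics instead.
struct LightGridCell { uint32_t offset, count; };

void BuildLightGrid(const std::vector<std::vector<uint32_t>>& lightsPerTile,
                    std::vector<LightGridCell>& grid,
                    std::vector<uint32_t>& lightIndexBuffer)
{
    grid.clear();
    lightIndexBuffer.clear();
    for (const auto& tileLights : lightsPerTile)
    {
        // each tile's list starts right where the previous one ended
        grid.push_back({ (uint32_t)lightIndexBuffer.size(), (uint32_t)tileLights.size() });
        lightIndexBuffer.insert(lightIndexBuffer.end(), tileLights.begin(), tileLights.end());
    }
}
```

The per-tile lists become variable length (unbounded except for the global allocation), at the price of the extra indirection.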
  • Culling shader performs stream compaction, that is, using a light index list in groupshared memory and an atomic counter to append lights to the list. Something like this:
groupshared uint ArrayLength; // zeroed by the first thread at the start of the shader (groupshared variables can't have initializers)
groupshared uint Array[256];

void AppendLight(uint lightIndex)
{
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
    {
        Array[index] = lightIndex;
    }
}
  • Stream compaction can be scalarized with wave intrinsics so that far fewer atomic operations are performed (one per wavefront instead of one per lane – 64× fewer on a 64-wide wavefront). The idea here is that we build a per-wavefront bitmask containing set bits for all lanes that wanted to append. This is retrieved by WaveActiveBallot(IsLightCulledInCurrentThread()). The bitmask first tells us how many lanes wanted to append (the countbits() function), and the first lane does a single InterlockedAdd() with this value on the groupshared light counter. The rest of the lanes skip the atomic, so atomics are significantly reduced. The first lane’s InterlockedAdd() result is the wavefront’s offset into the groupshared light index list and is broadcast to the other lanes via the WaveReadLaneFirst() intrinsic. Each appending lane then just writes its light index to the list, starting from the broadcast offset, plus a prefix sum over the ballot bitmask (WavePrefixSum()) that gives the lane’s own offset within the wavefront.
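To make the scalarization idea concrete, here is a plain C++ simulation of what the wave intrinsics do for a single 64-lane wavefront (the ballot and prefix sum are emulated with popcounts; “groupshared” state is just a regular struct here – a sketch, not shader code):

```cpp
#include <cstdint>

// Simulation of the wave-scalarized append: one "atomic" add per wavefront,
// then each appending lane computes its slot from a prefix sum over the
// ballot mask (WavePrefixSum equivalent).
static int popcount64(uint64_t v) { int c = 0; while (v) { v &= v - 1; ++c; } return c; }

struct Simulated {
    uint32_t arrayLength = 0;  // the groupshared atomic counter
    uint32_t array[256] = {};  // the groupshared light index list
};

// ballot: bit i set means lane i wants to append lightIndex[i]
void WaveAppend(Simulated& s, const uint64_t ballot, const uint32_t lightIndex[64])
{
    const uint32_t appendCount = (uint32_t)popcount64(ballot); // countbits()
    const uint32_t waveOffset = s.arrayLength; // first lane's InterlockedAdd result
    s.arrayLength += appendCount;              // the single atomic for the whole wave
    for (uint32_t lane = 0; lane < 64; ++lane)
    {
        if (ballot & (1ull << lane))
        {
            // prefix sum: number of appending lanes below this one
            const uint32_t prefix = (uint32_t)popcount64(ballot & ((1ull << lane) - 1));
            const uint32_t index = waveOffset + prefix;
            if (index < 256)
                s.array[index] = lightIndex[lane];
        }
    }
}
```

Every appending lane lands in a unique slot, with only one add on the shared counter.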
  • Stream compaction is unordered, because the item append order relies on one atomic counter. That is fine for lights, which use additive blending, where the order of operations doesn’t matter for the final output. But once we have decals or environment probes, those are not blended additively, so the undefined order of the items would result in flickering between tiles. To avoid that, we can do a bitonic sort in the compute shader. A bitonic sort in a compute shader is a heavyweight operation, but we can keep track of whether there were any decals or probes in the tile and only do the sorting if there are. A bitonic sort that runs in a single thread group looks like this (from Gareth Thomas’s tile based particle rendering talk):
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
{
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
    {
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
        {
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
            if( nSwapElem < numArray && index < numArray )
            {
                if( Array[ index ] < Array[ nSwapElem ] )
                {
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
                }
            }
            GroupMemoryBarrierWithGroupSync(); // threads must sync between compare-exchange steps
        }
    }
}
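Since the network above is easy to get wrong, here is a straight C++ port I use to sanity-check it: every “thread” of a step runs serially, which is equivalent because the compare pairs within a step are disjoint. Note that with the `<` comparator as written, the result comes out in descending order:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// C++ port of the single-thread-group bitonic sort for verification.
// Each iteration of the inner "t" loop plays the role of one shader thread;
// the serial loop stands in for the group barrier between steps.
static int firstbithigh(uint32_t v) { int r = -1; while (v) { v >>= 1; ++r; } return r; }

void BitonicSort(std::vector<uint32_t>& Array)
{
    const uint32_t numArray = (uint32_t)Array.size();
    const uint32_t numArrayPowerOfTwo = 2u << firstbithigh(numArray - 1);

    for (uint32_t nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize *= 2)
    {
        for (uint32_t nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize >>= 1)
        {
            for (uint32_t t = 0; t < numArrayPowerOfTwo / 2; ++t) // one "thread" each
            {
                const uint32_t index_low = t & (nMergeSubSize - 1);
                const uint32_t index_high = 2 * (t - index_low);
                const uint32_t index = index_high + index_low;

                const uint32_t nSwapElem = nMergeSubSize == nMergeSize >> 1
                    ? index_high + (2 * nMergeSubSize - 1) - index_low
                    : index_high + nMergeSubSize + index_low;

                if (nSwapElem < numArray && index < numArray && Array[index] < Array[nSwapElem])
                    std::swap(Array[index], Array[nSwapElem]);
            }
        }
    }
}
```

The out-of-range guards also make non-power-of-two array lengths work, because the missing elements behave like values that never win the comparison.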
  • The best culling granularity in my experience turned out to be having 16×16 pixel tiles and 16×16 threads in the culling threadgroup. A large thread group is beneficial, because we will have a large groupshared memory reservation, and the bitonic sort works best with a large thread count. Remember that even though we are only interested in the decal and probe ordering, we have to sort the whole entity array, including lights, because lights are in the same array as decals and probes. A large thread group and large groupshared memory usage means reduced occupancy of the shader, so parallel execution between thread groups is reduced.
  • We are still in the culling shader phase, and another difference is how the groupshared memory (also called LDS or local data share) is initialized at the beginning. The depth range min-max and the 2.5D culling mask are initialized, and the counter responsible for inserting the light indices into the array is zeroed out. The LDS light array itself doesn’t need to be initialized.
  • At the end of the culling shader, we export the light list by having each thread export one light index. Also, the first thread exports a “tile statistics” uint that says how long the list is and how many decals, probes and lights are inside it.

Rendering phase:

  • We process the light list in the forward+ object shader or the tiled deferred compute shader trivially. We read the “tile statistics” uint first, and determine the decal count, probe count and light count in the tile. The following uints are indices into the global entity array that the camera can see, so we just iterate through all of them until we reach the last one. Having the decal and probe counts is handy, because for some objects, particles, etc. we don’t want to apply decals, so we can just skip over them and start with the first light index instead. Also, if we process decals and the top decal fully covers the ones under it, we can skip processing those and continue with the lights.
  • We have high divergence if we process like this inside a pixel shader (forward+), but we can help a bit by using wave intrinsics to scalarize it. It is not trivial and in the worst case it can lead to redundant processing of items. See more info in Drobot’s presentation, the section about “hierarchical container scalarization”.
  • The good thing about processing a compact light index list is that it is compact, so every light index that we process will be contributing to the tile.
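The exact bit layout of the “tile statistics” uint is an implementation choice; as an illustration only (this particular packing is my assumption, not prescribed by the technique), the three counts could be packed like this:

```cpp
#include <cstdint>

// Hypothetical packing of the per-tile statistics uint: 8 bits for the decal
// count, 8 bits for the probe count, 16 bits for the light count. The real
// layout is up to the implementation.
uint32_t PackTileStatistics(uint32_t decalCount, uint32_t probeCount, uint32_t lightCount)
{
    return (decalCount & 0xFF) | ((probeCount & 0xFF) << 8) | ((lightCount & 0xFFFF) << 16);
}
uint32_t UnpackDecalCount(uint32_t stats) { return stats & 0xFF; }
uint32_t UnpackProbeCount(uint32_t stats) { return (stats >> 8) & 0xFF; }
uint32_t UnpackLightCount(uint32_t stats) { return stats >> 16; }
```

With such a layout, skipping decals just means starting the list iteration at offset UnpackDecalCount(stats).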
Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

  • Each tile contains bit buckets. One bit bucket is a 32-bit uint that is responsible for indexing up to 32 lights. So each tile contains max_lightcount_in_view/32 uints. That means that for the same max light count per tile as the stream compaction method, we use 32× less memory.
  • Culling shader simply computes which bucket can index the currently processed light, then computes which bit maps to the light index, then performs an atomic OR operation on the groupshared light bucket list. The groupshared light bucket list is max_lightcount_in_view/32 uints. The following code piece demonstrates this:
groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
{
    const uint bucket_index = lightIndex / 32;
    const uint bucket_place = lightIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
}
  • Bitmask insertion could probably be scalarized in a way that no atomic operations would be necessary. I imagine we would just do a WaveActiveBitOr() instead of InterlockedOr() on the bucket, and if it is the first lane (WaveIsFirstLane()) it would just write the bucket without using atomics. (unfortunately I can’t get sm6.0 working on my current laptop, so I haven’t tried this yet)
  • Bitmask insertion is ordered: we will always get a deterministic order of items, which is the same order in which the CPU assembled the global light array. If we have decals and environment probes, we only need to ensure that they are culled and added to the global light list (or should we say entity list at this point) before the lights (on the CPU). The reason is that we want to apply decals before applying lights, so that they are lit just like the surface underneath. Environment probes can be applied before or after the lights, because they are added to the specular result. However, ordering between locally placed probes does matter, because they are blended amongst themselves with alpha blending. So in the end, we don’t have to sort in the culling shader with the flat bit array technique. That is a big relief and greatly simplifies the shader.
  • The best culling granularity for flat bit arrays also turned out to be 16×16 pixel tiles; however, we can reduce the thread group size to 8×8, for example, for better occupancy, because there is no sort and little LDS use compared to the compaction technique. But then some shader complexity is added, because each thread becomes responsible for a small subgroup of the tile. Consider that with 16×16 pixels and 16×16 threads, each thread can work on a single pixel when reading the depth buffer and creating the depth bounds and maybe a 2.5D culling bitmask (Takahiro Harada’s 2012 Siggraph Asia talk, and also my blog about it). When reducing the thread count to 8×8, each thread must load a 2×2 pixel block instead. However, this also reduces the atomic contention when computing the min-max depth bounds or the 2.5D bitmask (4× fewer atomic operations for these phases), because the operations inside a subgroup are performed non-atomically. The whole tile result is then produced by atomic operations between the subgroups, so in the end fewer atomics are performed overall. We don’t have to modify the culling part of the shader, however, because that doesn’t work at thread/pixel granularity, but thread/entity granularity instead. The last thing worth mentioning here is that an 8×8 thread group seems like a perfect fit for the AMD GCN architecture, which has 64 threads in a wavefront, so the barriers in the compute shader can even be omitted completely. On Nvidia, the wavefront (warp) size is 32 threads, but modifying the thread group to, for example, 8×4 threads would also require redesigning how subgroups are processed, which I didn’t tackle for now. It would be an interesting experiment to see how different subgroup distributions perform. In my implementation, a thread operates on a 2×2 pixel block, but that could be rewritten so that a thread operates on a row of 4 pixels or something else.
Maybe an 8×8 pixel block should be processed first, then move on to the next 8×8 pixel block, and do this 2×2 times. I need to try these and compare. Another idea here: the 2×2 pixel block is also nice for gathering from the depth buffer; we could use one GatherRed() instruction to load 4 samples for the whole subgroup, however that would require a custom code path that won’t work if we choose another kind of distribution.
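To pin down the subgroup mapping, here is a tiny C++ check of the distribution I use: an 8×8 thread group covering a 16×16 tile, each thread loading a 2×2 pixel block (this is just one possible mapping, as discussed above):

```cpp
#include <cstdint>

// Maps a thread of an 8x8 group to one pixel of its 2x2 block inside a
// 16x16 tile. dx, dy in [0, 2) select the pixel within the block.
struct Pixel { uint32_t x, y; };

Pixel SubgroupPixel(uint32_t threadX, uint32_t threadY, uint32_t dx, uint32_t dy)
{
    return Pixel{ threadX * 2 + dx, threadY * 2 + dy };
}
```

A quick exhaustive loop confirms that the 64 threads cover all 256 pixels of the tile exactly once, which is the property any alternative distribution (rows of 4, interleaved 8×8 blocks, …) must also keep.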
  • We don’t have a light array counter in the flat bit array technique; instead, all the light buckets need to be set to zero, indicating that no lights are contained yet. This is done by each thread zeroing out one bucket (one uint) until all of them are zeroed out. Apart from that, the common things such as the tile depth range are initialized like always.
  • At the end of the culling shader, each thread exports one bucket, until all buckets have been written to global memory. There is no need for a “tile statistics” entry, because the bitfield directly corresponds to all entities inside the main camera. So the tile statistics can be replaced with just a constant buffer value that contains how many decals/envprobes/lights are inside the main camera.
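The deterministic-ordering claim from above is easy to check on the CPU. Here is a small C++ sketch of the bucket append and extraction (firstbitlow() is emulated portably; in HLSL it is the intrinsic): no matter in which order indices get OR-ed into the buckets, iteration always returns them in ascending index order.

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of the bit-bucket append + extraction round trip.
static uint32_t firstbitlow(uint32_t v) { uint32_t i = 0; while (!(v & 1)) { v >>= 1; ++i; } return i; }

struct Tile { uint32_t buckets[4096 / 32] = {}; }; // 4096 max entities in view

void AppendEntity(Tile& t, uint32_t entityIndex)
{
    // InterlockedOr in the shader; order of appends doesn't matter
    t.buckets[entityIndex / 32] |= 1u << (entityIndex % 32);
}

std::vector<uint32_t> ExtractEntities(const Tile& t)
{
    std::vector<uint32_t> result;
    for (uint32_t bucket = 0; bucket < 4096 / 32; ++bucket)
    {
        uint32_t bits = t.buckets[bucket];
        while (bits != 0)
        {
            const uint32_t bit = firstbitlow(bits);
            bits ^= 1u << bit; // remove the bit
            result.push_back(bucket * 32 + bit);
        }
    }
    return result;
}
```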

Rendering phase:

  • Processing the flat bit arrays relies on the firstbitlow() intrinsic in HLSL that gives us the first set bit’s index in the uint bucket, starting from the lowest bit. If the bucket is zero, we continue with the next bucket (thus skipping over 32 non-visible lights); otherwise we call firstbitlow(bucket), compute the light index from the bucket index and the bit index, then process the light. Then we remove the bit and continue getting the next lowest set bit until the bucket becomes zero. If we want to skip decals, we can do that here too: we can trivially start from the specific bucket’s specific bit that we want to start iterating on. For example, if we want to leave out the decals and there are 42 decals inside the camera (the first light is entity index 42), we can start processing from the second bucket and mask off its first 10 bits (42 − 32 = 10). Here is pseudo-code for how we can extract light index bits from the buckets:
// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because the tile buffer is a linear array:
//  g_tileCount = ScreenResolution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for(uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
{
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while(bucket_bits != 0)
    {
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        Light light = LightArray[entity_index];
        // process light here...
    }
}
  • Processing these in pixel shaders for forward+ will still yield high VGPR usage because of divergent access to the buckets. However, the scalarization of the buckets themselves is really elegantly done with a single line: bucket_bits = WaveReadLaneFirst(WaveActiveBitOr(bucket_bits)); After the buckets are scalarized, loading from the global light array (entity array) is also scalar.
  • A thing to note about flat bit arrays is that there can be “holes” in the buckets. Let’s say lightIndex 2 and lightIndex 987 are the only ones contained inside the tile; then the buckets between bucket 0 and bucket 30 are zero, but they can’t be detected in a simple way, so those buckets will be read but not processed. With another level of indirection, this effect could be reduced, for example having some coarse tiles that are culled with stream compaction, where the finer tile bit buckets would contain indices into those coarse tile light lists. I don’t expect this to be a huge problem, so I haven’t gone there, and it would require the sorting again too. Also, as pointed out by Matt Pettineo on Twitter, the indirection doesn’t have to use stream compaction, but could be another bitmask, where each bit represents one bucket of 32 entities.
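The decal-skipping arithmetic mentioned in the processing step can be sketched like this, using the example numbers from the text (42 decals → the first light is entity index 42 → start at bucket 1 and mask off its low 10 bits):

```cpp
#include <cstdint>

// Given the number of decals at the front of the entity list, compute which
// bucket to start iterating from and a mask that clears the remaining decal
// bits inside that first bucket.
void SkipDecals(uint32_t decalCount, uint32_t& startBucket, uint32_t& firstBucketMask)
{
    startBucket = decalCount / 32;
    const uint32_t bit = decalCount % 32;        // 42 % 32 = 10
    firstBucketMask = ~0u << bit;                // zero out the low `bit` bits
}
```

The mask is AND-ed onto the first bucket before the firstbitlow() loop; all later buckets are processed unmasked.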
Light culling in Bistro


I’ll compare some results on the Bistro scene. The timings are GPU times measured with timestamp queries, with the following configuration:

  • Common:
    • Nvidia GTX 1050 laptop GPU
    • Bistro scene (inside + outside)
    • 1000 lights, 0 decals, 0 probes
    • Forward+ rendering
    • 4096 max lights in frustum
    • non-scalarized versions (DX11)
  • Stream compaction method:
    • 255 max lights per tile (255 lights + 1 tile statistics)
    • light culling (4k): 5.6 ms
    • light culling (1080p): 1.4 ms
    • opaque geometry (4k): 43.6 ms
    • opaque geometry (1080p): 13 ms
  • Flat bit array method:
    • 4096 max lights per tile (4096 / 32 = 128 buckets per tile)
    • light culling (4k): 2.5 ms
    • light culling (1080p): 0.64 ms
    • opaque geometry (4k): 41.7 ms
    • opaque geometry (1080p): 12.6 ms

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

Another small test I made was for decals: I used a simple Cornell box, placed 100 decals into one spot overlapping each other, at 4k resolution and covering a small part of the screen:

  • Test: 100 small overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 3.63 ms
    • opaque geometry: 3.67 ms
  • Flat bit arrays method:
    • decal culling: 1.33 ms
    • opaque geometry: 2.96 ms
(Tile heatmap for the 100 small decals: red tile = 100 decals, black = 0 decals)

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

  • Test: 100 full screen overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 17.82 ms
    • opaque geometry: 84.57 ms
  • Flat bit arrays method:
    • decal culling: 2.12 ms
    • opaque geometry: 71.48 ms

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

  • One interesting thing I noticed: at first I had one huge loop, and the decal + probe + light evaluation were all included in that single loop. Performance was horrible, especially on the Intel 620 GPU – something like +10 ms for the opaque geometry pass on a scene with 4 big shadow casting lights, even if no decals or probes were processed at all. After that, I broke up the loop into 3 loops: one for decals, one for probes and one for lights. The performance jump was very noticeable. This applies to both techniques: try to break up entities that are processed differently to achieve the best performance. The reason is that the compiler can do a better job of reserving and reusing registers, so in the end you can have fewer registers in use even though there are more loops, because register lifetime can be constrained to loop scope (if the loops are not unrolled).
  • Another thing related to the Intel 620 GPU was the annoying fact that I noticed some small flickering of the tiles with the stream compaction technique, which only happened rarely, even with a completely static scene. I could never figure it out, but it doesn’t happen with the flat bit array technique. Now I suspect that it might be related to the thread group size, because I also had bugs with the tiled deferred compute shader when it was running with a 16×16 thread group size and using a RW texture.
  • After getting to know the flat bit array technique, I also implemented a similar one for the old-school forward rendering path. For that, the CPU can cull lights per object and provide a per draw call bitmask in a constant buffer. The bitmask is a uint4, where 1 uint represents 32 decals, 1 uint represents 32 probes, and 2 uints allow 64 lights to be evaluated for forward rendered objects. I never wanted to do this previously, because then a big constant buffer needs to be updated with light indices per draw call. Another candidate for sending this bitmask to the shader would be the instance buffer, but having it in the constant buffer means scalarized loads (and we also don’t have to propagate the mask from the vertex to the pixel shader – fewer parameters to send is good for avoiding a parameter cache bottleneck). Unfortunately, culling efficiency is reduced if many instances are visible, because a constant buffer has per draw call granularity instead of per instance.
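A CPU-side sketch of that per-draw-call uint4 bitmask could look like this (the split — 32 decals, 32 probes, 64 lights — follows the text; the struct and function names are mine):

```cpp
#include <cstdint>

// Per draw call entity mask for old-school forward rendering, uploaded in a
// constant buffer: x = 32 decals, y = 32 probes, z + w = 64 lights.
struct ForwardEntityMask { uint32_t x = 0, y = 0, z = 0, w = 0; };

void AddDecal(ForwardEntityMask& m, uint32_t index) { m.x |= 1u << index; } // index < 32
void AddProbe(ForwardEntityMask& m, uint32_t index) { m.y |= 1u << index; } // index < 32
void AddLight(ForwardEntityMask& m, uint32_t index) // index < 64
{
    if (index < 32) m.z |= 1u << index;
    else            m.w |= 1u << (index - 32);
}
```

The shader then iterates these four uints with the same firstbitlow() loop as the tile buckets.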
  • It will also be interesting to try not using a culling compute shader, but relying on conservative rasterization (or MSAA) and drawing the light proxies instead; those would do atomic writes into the tile buffer in the pixel shader. Michal Drobot suggested that this can be faster than running the compute shader and doing software culling, because rasterization will make use of the zbuffer, and zbuffer tile acceleration (coarse conservative z-test in hardware that supports it, not user controlled). A minor issue with this might be that not everything supports conservative rasterization or a forced sample count setting for MSAA today (conservative rasterization can be approximated with MSAA).

Implementation samples:

As always, you can find my implementation of these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will be updated, these will remain as they are to represent the state of this blog:

Some other light culling resources:

Simple job system using standard C++

After experimenting with the entity-component system this fall (sorry, there is no blog for that yet), I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus making use of cache prefetching when iterating), and to make it easier to write parallelized code (because data dependencies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system, but generally it should perform well for any large data processing task. I wanted to remain in the standard C++ (~11) realm, which I found to be entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

  1. Interface – how do we want to use it (I hope you’re familiar with lambdas)
  2. Implementation – let’s see the code
  3. Performance – I guess this would be pointless without it
  4. Windows platform specific extension – optional, if you can still keep going
  5. Closure – all must come to an end

I think the best way to get my point across is to just post the interface code here, because it is very short, and then make a few comments:

#include <functional>

// A Dispatched job will receive this as function argument:
struct JobDispatchArgs
{
	uint32_t jobIndex;
	uint32_t groupIndex;
};

namespace JobSystem
{
	// Create the internal resources such as worker threads, etc. Call it once when initializing the application.
	void Initialize();

	// Add a job to execute asynchronously. Any idle thread will execute this job.
	void Execute(const std::function<void()>& job);

	// Divide a job onto multiple jobs and execute in parallel.
	//	jobCount	: how many jobs to generate for this task.
	//	groupSize	: how many jobs to execute per thread. Jobs inside a group execute serially. It might be worth increasing for small jobs
	//	job		: receives a JobDispatchArgs as parameter
	void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job);

	// Check if any threads are working currently or not
	bool IsBusy();

	// Wait until all threads become idle
	void Wait();
}

How would this work? First, we would call the JobSystem::Initialize() function somewhere in our application, once (before we use the job system for the first time). Probably the best place is when we initialize the application and before doing any resource loading, because the job system can also handle resource loading efficiently.
Then the simplest way we can make a job is by calling JobSystem::Execute(myTask); where myTask is just a simple function without parameters. I use it for example when initializing engine systems, like this:

JobSystem::Execute([] { /*do stuff 1*/ InitGraphics(); });
JobSystem::Execute([] { /*do stuff 2 which is not related to stuff 1*/ InitPhysics(); });
JobSystem::Execute([] { /*do stuff 3 which is not related to the previous stuff*/ InitInput(); });
JobSystem::Execute([] { /*do stuff 4 which is not related to the previous stuff*/ InitNetwork(); });
JobSystem::Execute([] { /*you probably already get the idea*/ InitSound(); });

The above code creates 5 jobs, which are probably already utilizing 5 worker threads (if Initialize() created that many workers – but more on that later).
Then we can call JobSystem::Wait(); to simply wait for all the jobs to finish, before doing anything that depends on their completion. For example this code would utilize 3 threads:

JobSystem::Execute([] { /*do stuff 1*/ });
JobSystem::Execute([] { /*do stuff 2*/ });

The main thread is responsible for notifying two worker threads that there are some jobs, and then it waits. It also means that overall we are doing extra bookkeeping work on the main thread instead of just doing the jobs one by one. But this should be fine when looking at the big picture, because some work is a good fit for being executed in parallel (if there is no dependency between the jobs). We should only do jobs in parallel if they are a good fit; a lot of the time it is still just better to do the work on the main thread. Oh, and by dependency I mean that if there is any variable that is written by one job and read by another, then there is a dependency.

The IsBusy() function is a way for the main thread to just see if any worker threads are busy. It can be used if we don’t actually want to wait for work to finish, and the main thread has other meaningful work to do. For instance, drawing a loading screen is a good example.

The last one is the Dispatch function. This is the most interesting and most useful one in my opinion. I was inspired by how compute shaders work on the GPU, and really badly wanted to use something like that on the CPU. You can think of it like the Execute function, but it actually creates multiple jobs instead of one (the jobCount argument). Let’s skip the groupSize argument for now and return to it shortly. The third argument is the job itself, which will receive a JobDispatchArgs argument that identifies which instance of the job is currently running, and which group (hang in there). To demonstrate how it will be used, look at this loop iterating over some large dataset:

const uint32_t dataSize = 1000000;
Data dataSet[dataSize];
for (int i = 0; i < dataSize; ++i)
{
	// process dataSet[i] here...
}

And let’s rewrite the same thing but this time using Dispatch, to offload it to multiple threads:

const uint32_t groupSize = 1;
JobSystem::Dispatch(dataSize, groupSize, [&dataSet] (JobDispatchArgs args) {
	// process dataSet[args.jobIndex] here...
});
The above code does the same thing as the loop, but now several worker threads each operate on a single element of dataSet. Here is where the groupSize argument comes into play. We actually don’t want each thread to only work on a single element. Not only because of caching, but because then 1000000 jobs would need to be generated, which would put a significant burden on the main thread creating the jobs. Instead, we should specify, let’s say, const uint32_t groupSize = 1000; or something like that. That means one thread will operate on a batch of 1000 elements of the dataSet. Every time we put a Dispatch in our application, the groupSize must be considered – just like when we are writing compute shaders, which are really similar to this and have the same purpose: performing work over a large dataset, where data is mostly independent of other data in the set. If our dataSet has elements with dependencies, then it might not be a good candidate – unless we can think of a clever way to utilize the groupSize argument. The thing with groupSize is that with groupSize = 1000, the 1000 sub-jobs will run serially on the same thread – so there can be dependencies inside a group. But here it starts to get off topic, so I recommend getting experienced with compute shaders and groupshared memory, because the same principles can be applied here. I won’t go into handling groupshared memory in this post.
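One detail worth spelling out: the number of actual jobs (groups) that get handed to worker threads is the dispatch count divided by groupSize, rounded up – the same arithmetic as computing thread group counts for a compute shader dispatch:

```cpp
#include <cstdint>

// Number of worker-thread jobs (groups) generated by a Dispatch call,
// rounded up so the last partial group is not lost.
uint32_t DispatchGroupCount(uint32_t jobCount, uint32_t groupSize)
{
    return (jobCount + groupSize - 1) / groupSize;
}
```

So with dataSize = 1000000 and groupSize = 1000, the main thread only enqueues 1000 jobs instead of a million.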

Now that we see how we want to use the job system, let’s implement the internal workings of it, using standard C++ constructs.

I will go over this mostly in the order I described the interface. But before getting into the first function which is Initialize(), define the internal state and include files, stuff like that:

#include "JobSystem.h"    // include our interface

#include <algorithm&gt;    // std::max
#include <atomic&gt;    // to use std::atomic<uint64_t&gt;
#include <thread&gt;    // to use std::thread
#include <condition_variable&gt;    // to use std::condition_variable

namespace JobSystem
	uint32_t numThreads = 0;    // number of worker threads, it will be initialized in the Initialize() function
	ThreadSafeRingBuffer<std::function<void()&gt;, 256&gt; jobPool;    // a thread safe queue to put pending jobs onto the end (with a capacity of 256 jobs). A worker thread can grab a job from the beginning
	std::condition_variable wakeCondition;    // used in conjunction with the wakeMutex below. Worker threads just sleep when there is no job, and the main thread can wake them up
	std::mutex wakeMutex;    // used in conjunction with the wakeCondition above
	uint64_t currentLabel = 0;    // tracks the state of execution of the main thread
	std::atomic<uint64_t&gt; finishedLabel;    // track the state of execution across background worker threads

	// ...And here will go all of the functions that we will implement

Mostly everything here is standard C++, except the jobPool. It could also be done with a std::deque, but then it would also have to be made thread safe by using mutexes or some other synchronization primitives. Or if you are interested, here is my very simple thread safe queue (or ring buffer) implementation:

// Fixed size very simple thread safe ring buffer
template <typename T, size_t capacity>
class ThreadSafeRingBuffer
{
public:
	// Push an item to the end if there is free space
	//	Returns true if successful
	//	Returns false if there is not enough space
	inline bool push_back(const T& item)
	{
		bool result = false;
		lock.lock();
		size_t next = (head + 1) % capacity;
		if (next != tail)
		{
			data[head] = item;
			head = next;
			result = true;
		}
		lock.unlock();
		return result;
	}

	// Get an item if there are any
	//	Returns true if successful
	//	Returns false if there are no items
	inline bool pop_front(T& item)
	{
		bool result = false;
		lock.lock();
		if (tail != head)
		{
			item = data[tail];
			tail = (tail + 1) % capacity;
			result = true;
		}
		lock.unlock();
		return result;
	}

private:
	T data[capacity];
	size_t head = 0;
	size_t tail = 0;
	std::mutex lock; // this just works better than a spinlock here (on windows)
};

It is templated, so you can use it for any type of data. The nice thing about it is that it is very simple and only uses a fixed size memory allocation. It can’t reallocate memory and doesn’t do anything clever, so performance might be more reliable than std::deque. Also, when an element doesn’t fit because there is no more space, push_back returns false; and if the buffer is empty and you want to remove an element, pop_front returns false. If you got false, you can just try again, and maybe that time it will be successful, because another thread grabbed/put a job onto it in the meantime.

Now we have everything we need, except the actual functions. Here is what Initialize() looks like:

void Initialize()
{
	// Initialize the worker execution state to 0:
	finishedLabel.store(0);

	// Retrieve the number of hardware threads in this system:
	auto numCores = std::thread::hardware_concurrency();

	// Calculate the actual number of worker threads we want:
	numThreads = std::max(1u, numCores);

	// Create all our worker threads while immediately starting them:
	for (uint32_t threadID = 0; threadID < numThreads; ++threadID)
	{
		std::thread worker([] {

			std::function<void()> job; // the current job for the thread, it's empty at start

			// This is the infinite loop that a worker thread will do
			while (true)
			{
				if (jobPool.pop_front(job)) // try to grab a job from the jobPool queue
				{
					// It found a job, execute it:
					job(); // execute job
					finishedLabel.fetch_add(1); // update worker label state
				}
				else
				{
					// no job, put thread to sleep
					std::unique_lock<std::mutex> lock(wakeMutex);
					wakeCondition.wait(lock);
				}
			}

		});

		// *****Here we could do platform specific thread setup...

		worker.detach(); // forget about this thread, let it do its job in the infinite loop that we created above
	}
}

So, in case the comments didn’t do it justice, let me recap: First, we initialize the worker-side label state, but don’t worry about it yet, it will make sense later (hopefully). Then C++ has the neat hardware_concurrency function that can tell us how many cores (including hyperthreaded virtual cores) our CPU has, so we create that many worker threads in the for loop. Each thread runs an infinite loop that it will keep doing until our application exits. But most of the time, each thread will just sleep (until the main thread wakes it up), so as to not spin the CPU wastefully.

Let’s see the Execute function next:

// This little helper function will not let the system be deadlocked while the main thread is waiting for something
inline void poll()
{
	wakeCondition.notify_one(); // wake one worker thread
	std::this_thread::yield(); // allow this thread to be rescheduled
}

void Execute(const std::function<void()>& job)
{
	// The main thread label state is updated:
	currentLabel += 1;

	// Try to push a new job until it is pushed successfully:
	while (!jobPool.push_back(job)) { poll(); }

	wakeCondition.notify_one(); // wake one thread
}

Sorry if you got distracted by the poll function above; I made it a separate function because it will be used in more places. I will explain it, but first, let’s digest the Execute function: The main thread’s label is incremented. This means that one job is being submitted, so the JobSystem only becomes idle when the worker label reaches the same value. In the previous Initialize function, you can see that in the infinite loop, after a job is executed, the worker label state is also incremented by one. But there it is an atomic operation, because all the workers will be updating that single label.
Then we just push the new job onto the end of the jobPool. As I said, my ringbuffer could be full, so we will try to push the job until it actually succeeds. If it doesn’t succeed, then we invoke the poll function, which will wake up one worker, and yield the main thread. Doing this, we can be sure that the jobPool will eventually be drained by the workers, and the main thread can then schedule the current job.
But maybe, the job was just pushed fine the first time around, and the poll() never executed. In that case, we just wake up any one worker thread that will start working on the job immediately.

I hope I am not putting anyone to sleep yet. Let’s follow with the IsBusy and Wait functions:

bool IsBusy()
{
	// Whenever the main thread label is not reached by the workers, it indicates that some worker is still alive
	return finishedLabel.load() < currentLabel;
}

void Wait()
{
	while (IsBusy()) { poll(); }
}

You can see how we are checking the state of the main thread and the state of the workers in the IsBusy function. This is also used by the Wait function if we just want to put the main thread to idle and drain all the pending jobs.

Now, by implementing the Dispatch function, you will get full insight into the Job System’s inner workings. This is the most complicated of them all, but viewed in a certain way it is not very different from the Execute function; it just does the same thing multiple times with some little twists:

void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job)
{
	if (jobCount == 0 || groupSize == 0)
	{
		return;
	}

	// Calculate the amount of job groups to dispatch (overestimate, or "ceil"):
	const uint32_t groupCount = (jobCount + groupSize - 1) / groupSize;

	// The main thread label state is updated:
	currentLabel += groupCount;

	for (uint32_t groupIndex = 0; groupIndex < groupCount; ++groupIndex)
	{
		// For each group, generate one real job:
		auto jobGroup = [jobCount, groupSize, job, groupIndex]() {

			// Calculate the current group's offset into the jobs:
			const uint32_t groupJobOffset = groupIndex * groupSize;
			const uint32_t groupJobEnd = std::min(groupJobOffset + groupSize, jobCount);

			JobDispatchArgs args;
			args.groupIndex = groupIndex;

			// Inside the group, loop through all job indices and execute job for each index:
			for (uint32_t i = groupJobOffset; i < groupJobEnd; ++i)
			{
				args.jobIndex = i;
				job(args);
			}
		};

		// Try to push a new job until it is pushed successfully:
		while (!jobPool.push_back(jobGroup)) { poll(); }

		wakeCondition.notify_one(); // wake one thread
	}
}

Take a deep breath, and let’s tackle this slowly. First, the jobCount (which you can think of as the “loop size”) can be a dynamic parameter, so if it is 0, just return immediately, because we don’t want to schedule 0 jobs.
Then the groupCount is calculated. Don’t confuse this with groupSize, that you specified as an argument to Dispatch. The groupCount is calculated from the jobCount and groupSize and tells us how many worker jobs (or groups) will be put onto the jobPool. There is an integer divide, but we must overestimate it, because let’s say that we want to have jobCount=3, groupSize=2. From that we want to create 2 actual worker jobs (groups), but the integer divide of jobCount/groupSize would give us only 1 group. We must instead schedule 2 groups, each with 2 jobs. But wait, that would be 4 jobs then, even though we specified jobCount=3. We will just discard the out of bounds jobs later.
Then the main thread label is updated with adding in the groupCount. That is how many workers must finish from this point for the JobSystem to become idle.
The following for loop goes over every group, creates a worker job for it with a lambda, passing in the required parameters to populate the JobDispatchArgs which our dispatched job needs to receive. That lambda is the one that will be executed by a worker thread. It contains logic to populate a JobDispatchArgs and run through all inner sub-jobs while making sure that no extra out of bounds iterations are performed. I think this might be the most complicated part if you are going to try to wrap your mind around someone else’s code. Hang in there!
After that group-worker-lambda thing, we do just like the Execute function, try to schedule the group and get the worker threads working immediately on it.

I’m not sure if I made sense of all this, but at this point the JobSystem is actually ready to be used! Let’s take a look at what kind of performance improvements we can expect right off the bat, then continue reading:

The first thing I tried was using JobSystem::Execute() in my Wicked Engine when initializing the main engine systems, so each system initialization would be performed in its own Execute. The system initialization tasks were large independent tasks, operating on some arbitrary/different data, a good fit for the Execute function. We can make up a test case by just Executing Spin functions, which loop for a set amount of time and don’t let the CPU rest until that time has passed. Let me give you some helpers for Spin and a simple timer that you can use too in a small console application:

#include "JobSystem.h"

#include <iostream>
#include <chrono>
#include <string>

using namespace std;

void Spin(float milliseconds)
{
	milliseconds /= 1000.0f; // convert to seconds to match the duration below
	chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
	double ms = 0;
	while (ms < milliseconds)
	{
		chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
		chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(t2 - t1);
		ms = time_span.count();
	}
}

struct timer
{
	string name;
	chrono::high_resolution_clock::time_point start;

	timer(const string& name) : name(name), start(chrono::high_resolution_clock::now()) {}
	~timer()
	{
		auto end = chrono::high_resolution_clock::now();
		cout << name << ": " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << " milliseconds" << endl;
	}
};

int main()
{
	JobSystem::Initialize();

	// Serial test
	{
		auto t = timer("Serial test");
		Spin(100);
		Spin(100);
		Spin(100);
		Spin(100);
	}

	// Execute test
	{
		auto t = timer("Execute() test");
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Wait();
	}

	return 0;
}

The Spin function just emulates a working CPU for a set amount of milliseconds; the timer struct is a scoped timer which will record how much time has passed until its destructor is called, so I just made two scopes for the two tests in the main function. For me, the results are:

Serial test: 400 milliseconds
Execute() test: 101 milliseconds

The machine I currently ran this test on is a dual core, but has 4 hardware threads, so the results make sense. If I increase the jobs to 6, the times are as following:

Serial test: 600 milliseconds
Execute() test: 200 milliseconds

The reason for this is that at most 4 jobs were executing in parallel, then the last two remaining jobs took over after that, also in parallel to each other. The results look like what we intended!

Now, let’s test the Dispatch performance. I came up with this quick test case:

struct Data
{
	float m[16];
	void Compute(uint32_t value)
	{
		for (int i = 0; i < 16; ++i)
		{
			m[i] += float(value + i);
		}
	}
};
uint32_t dataCount = 1000000;

// Loop test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("loop test");

		for (uint32_t i = 0; i < dataCount; ++i)
		{
			dataSet[i].Compute(i);
		}
	}
	delete[] dataSet;
}

// Dispatch test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("Dispatch() test");

		const uint32_t groupSize = 1000;
		JobSystem::Dispatch(dataCount, groupSize, [&dataSet](JobDispatchArgs args) {
			dataSet[args.jobIndex].Compute(args.jobIndex);
		});
		JobSystem::Wait();
	}
	delete[] dataSet;
}

Notice that I create the dataSet two times, once for each test. This is important to avoid reusing cached values for the second run of the test. And the results (optimizations disabled):

loop test: 140 milliseconds
Dispatch() test: 59 milliseconds

Neat. But when I enabled optimizations, I got other results:

loop test: 28 milliseconds
Dispatch() test: 31 milliseconds

That’s not cool, now the simple loop is faster! Then I tried increasing the groupSize to 10000:

loop test: 28 milliseconds
Dispatch() test: 15 milliseconds

Ok, now dispatch is nearly two times faster, but still not 4 times like we would expect. After trying several other groupSizes, I could not improve on this much. This behaviour can depend on several factors, such as caching, threads switching cores, the operating system preempting the threads, contention for the jobPool, or just the nature of the workload being executed, all of which need much more investigation, possibly with a dedicated profiling tool. These days there are at least two big ones, Intel VTune and AMD CodeXL, both of which I am unfamiliar with at the moment.

Windows platform specific extension
If we are on Windows, we can try some Winapi functions to extend the threading functionality. I will just name a few useful ones. You can put these in the Initialize function after the thread has been created (you can search for * to find that place). Also, I am doing an #include <Windows.h> and #include <sstream> and #include <assert.h> for this next part.

First, we must retrieve the native thread handle from the std::thread, like this:

HANDLE handle = (HANDLE)worker.native_handle();

Once we have that, we are free to pass it to the Winapi functions. For example, we can set a thread affinity mask, to put it on a dedicated hardware thread instead of having it for all possible cores and let the operating system switch them around on demand. This piece of code puts the thread we just created onto one core specified by threadID:

DWORD_PTR affinityMask = 1ull << threadID;
DWORD_PTR affinity_result = SetThreadAffinityMask(handle, affinityMask);
assert(affinity_result > 0);

So the affinityMask is a bitmask in which each set bit selects one hardware thread; valid bit indices start from thread 0 and go up to hardware_concurrency() – 1. By default, the affinity mask should be 2^hardware_concurrency() – 1 (all bits set), and SetThreadAffinityMask returns the thread’s previous affinity mask into the affinity_result variable.

We can increase thread priority like so:

BOOL priority_result = SetThreadPriority(handle, THREAD_PRIORITY_HIGHEST);
assert(priority_result != 0);

By default, each thread we create has THREAD_PRIORITY_NORMAL priority. Modifying this could help our threads avoid being preempted in favour of lower priority threads. I’ve found no way to increase performance with this yet, only to decrease it.

I found this next one very useful. We can give a name to our thread so that it will show up in the Visual Studio debugger:

std::wstringstream wss;
wss << "JobSystem_" << threadID;
HRESULT hr = SetThreadDescription(handle, wss.str().c_str());

My general suggestion is not to mess with platform specific extensions unless you are sure that you are getting a benefit. In some cases, such as modifying priority, it can lead to a great performance reduction too.

If you stayed till the end, thank you. This simple job system is ready to be used and can offer a generous speedup, but there is definitely room for improvements, especially for the Dispatch function. I am planning to give an in-depth look to Intel Vtune (they have a trial version now) and AMD CodeXL soon, and possibly follow up on this with some improvements. As always, you can take a look at the current implementation in my Wicked Engine.
The code in the engine is likely to change over time, but I also made a sample for this blog post alone which will stay like this. It contains all the code for the JobSystem and the tests I showed.

Thank you for reading! If you find a mistake, or have suggestions, I would like to hear it!

Scalable GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provides the general steps of implementation. It is not a step-by-step tutorial, but rather introduces algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, and the dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):


Smooth Particle Hydrodynamics

As a starting point, let’s get familiar with SPH. You can probably already find several papers on the subject. I chose this short paper. I would like to skip the in-depth theory here and focus on implementation instead, to keep it as short as possible. The basic idea is that we are approximating fluid with particles, each with a finite range of influence. Each particle can influence every particle inside its range (smoothing radius). To simulate these influence forces, we will compute the following for each particle: density -> pressure -> pressure-based forces -> viscosity-based forces -> constant gravitational force. After we have computed these forces, we can use the combined force value to perform integration of velocity and update particle positions. In the paper we can find the following formulas that we must implement as shader code (I highlighted and annotated key areas):

  • Density evaluation with Poly6 kernel smoothing function + pressure computation
    • Let’s translate that into (pseudo-)code:
      particleA.density = 0; // we will compute this (float)
      // Precompute part of Poly6
      //  this should be constant buffer value
      //  h = smoothing radius parameter (float) (default = 1.0f)
      //  h9 = pow(h, 9)
      //  PI = 3.1415...
      const float Poly6_constant = (315.0f / (64.0f * PI * h9)); 
      for(all particleB that is either near particleA or is particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        if(r2 < h2) // h2 = h*h
          const float W = Poly6_constant * pow(h2 - r2, 3);
          particleA.density += particleB.mass * W;
      // avoid negative pressure by clamping density to reference value:
      particleA.density = max(p0, particleA.density);
      // Compute pressure:
      //  K = pressure constant parameter (float) (default = 250.0f)
      //  p0 = reference density param (float) (default = 1.0f)
      particleA.pressure = K * (particleA.density - p0);
    • particleA corresponds to the particle we are evaluating, particleB is the particle which is affecting the current one. In the formulas, particleA is the “i” iterator, particleB is the “j” iterator. I found that names like this are just more readable in code.
    • The loop iteration is a placeholder for now; how we should implement it will be addressed later.
    • Note that we evaluate not only nearby particles, but the particle itself too. This essentially just means omitting the check inside the loop for whether the particle is itself or not. Without this detail, the simulation will be jerkier, because we would lose the contribution of the particle’s own mass.
    • Note that we can avoid storing the pressure, because it can be computed from the particle itself. We can compute it before computing forces, instead of loading it, which will increase computation time but reduce memory access stalls.
  • Pressure-based force computation with Spiky kernel smoothing function
    • Looks like this in code:
      particleA.force1 = 0; // we will compute this (float3)
      // Precomputed part of Spiky kernel:
      //  h6 = pow(h, 6);
      //  again, it should be coming from constant buffer
      const float Spiky_constant = (-45 / (PI * h6));
      for(all particleB that is near particleA but is not particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        const float r = sqrt(r2);
        if(r > 0 && r < h) // **avoid division by zero!
          const float3 rNorm = diff / r; // same as normalize(diff)
          const float W = Spiky_constant * pow(h - r, 2);
          particleA.force1 += (particleB.mass / particleA.mass) * 
            ((particleA.pressure + particleB.pressure) / 
             (2 * particleA.density * particleB.density)) * W * rNorm;
      particleA.force1 *= -1;
    • Very important that unlike in the previous step, we must avoid computing the current particle against itself! The simulation immediately blows up if you don’t check!
  • Viscosity-based force computation with Laplacian smoothing function
    • In code:
      particleA.force2 = 0; // we will compute this (float3)
      // Most of it is the same as the previous snippet, 
      //  only the innermost calculations will differ
      for(all particleB that is near particleA but is not particleA)
       const float3 diff = particleA.position - particleB.position;
       const float r2 = dot(diff, diff);
       const float r = sqrt(r2);
       if(r > 0 && r < h) // **avoid division by zero!
         const float3 rNorm = diff / r;
         const float r3 = r2 * r;
         const float W = -(r3 / (2 * h3)) + (r2 / h2) + 
           (h / (2 * r)) - 1;
         particleA.force2 += (particleB.mass / particleA.mass) * 
           (1.0f / particleB.density) * 
           (particleB.velocity - particleA.velocity) * 
           W * rNorm;
      // e = viscosity constant (float) (default = 0.018f)
      particleA.force2 *= e;
    • All of this should be combined with the pressure-based force computations, into the same loop!
    • The viscosity constant will control how fluid our simulation is.
  • Forces integration
    • Consists of following steps:
      • Add force to velocity
      • Step position along velocity vector
      • Reset force
    • Using fixed timestep so that simulation will not blow up
    • This step must be separate from the SPH forces evaluation, because the viscosity based force computation takes velocity into account, so velocity must not be modified while the forces have not all been computed yet!
    • They say that a code piece can say a thousand words:
      const float deltaTime = 0.016f; // fixed step
      const float3 G = float3(0, -9.8f, 0);
      particleA.velocity += deltaTime * ((particleA.force1 + particleA.force2) / particleA.density + G);
      particleA.position += deltaTime * particleA.velocity;
      particleA.force1 = 0;
      particleA.force2 = 0;
    • Of course, the particle should not store two distinct forces; they can be combined into one.

N-body simulation

N-body simulation stands for the simple algorithm of checking each particle against every other particle in the simulation. This is how we should implement the SPH-based fluid simulation first, for simplicity. It can still be done in real time for a couple thousand particles on a GPU, so it is ideal for testing how the formulas work and finding good default input values for the simulation. However, this algorithm has a complexity of O(n^2), so it will not scale well past a certain threshold. The SPH smoothing radius will give us the opportunity to come up with a better algorithm later that reduces complexity and makes it scalable. A first attempt at an N-body simulation in compute shaders involves a double loop iterating through the particle list:

#define THREADCOUNT 256

[numthreads(THREADCOUNT, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
  Particle particleA = particleBuffer[dispatchThreadID.x];
  for(uint i = 0; i < particleCount; ++i)
  {
    Particle particleB = particleBuffer[i];
    resolveParticles(particleA, particleB);
  }
}

Well, not *really* a double loop, because the outer loop is the compute kernel itself. But there are already several problems with this. The first is that the SPH force computation needs the intermediate results of particle densities, but our compute shader is running thread groups in parallel, so we can’t just combine both in the same shader, as we can’t be sure that all the densities are computed by the time we want to evaluate forces. So we are splitting the simulation into two shaders: density evaluation and force evaluation. (Pressure evaluation can be done immediately before force computation, because it can be computed from density.) So far we have two shaders, each doing O(n^2) particle evaluations, each loading from global memory. This just screams slow. I also experimented with using shared memory to preload particles ahead of time and reduce latency. You can skip this step if you want, because the acceleration structure later will be a complete replacement, but this is a good exercise nonetheless and it increases the performance of the N-body simulation:

#define THREADCOUNT 256
groupshared Particle sharedParticles[THREADCOUNT];

Particle particleA = particleBuffer[dispatchThreadID.x];

uint numTiles = 1 + particleCount / THREADCOUNT;

for (uint tile = 0; tile < numTiles; ++tile)
{
  uint offset = tile * THREADCOUNT;
  sharedParticles[flattenedGroupIndex] = particleBuffer[offset + flattenedGroupIndex];
  GroupMemoryBarrierWithGroupSync(); // wait until the whole group finished loading

  for(uint i = 0; i < THREADCOUNT; ++i)
  {
    Particle particleB = sharedParticles[i];
    resolveParticles(particleA, particleB);
  }
  GroupMemoryBarrierWithGroupSync(); // wait before the next tile overwrites shared memory
}

Watch out: we are doing thread sync operations inside a dynamic loop. You are not allowed to do this if the loop is potentially divergent across the thread group. To have it compile, ensure that the loop range is loaded from coherent memory, such as constant buffer values or read only buffers. For example, you can’t load particleCount from an unordered access view, because the loop termination depends on it and it could potentially be divergent across the thread group.

Obviously, it is a bit simplified, to be more readable. We eliminated a bunch of latency by avoiding accessing the main memory and using group shared memory instead. This is the general way to speed up N-body simulation. It should be already good (real time) for simple fluid simulations up to ~20000 particles on an average GPU. We still need to do it two times though, first for computing densities, then for forces.

So, with this implementation, our fluid simulation can be written as running two extra shaders before particle velocity integration:

  1. SPH Density evaluation
  2. SPH force computation

Pressure computation can either be in density evaluation phase and exported to memory/read back in force phase, or just computed in force phase. I chose to actually precompute it in the force phase, and store it into shared memory per “tile”.

So far we can simulate a moderate amount of particles in real time with N-body collisions (4000 particles):


Dynamic Hashed Grid

So, hopefully you have the fluid simulator up and running before coming to this part. :) Let’s eliminate that nasty O(n^2) complexity! The very basic idea behind SPH is that there is a smoothing radius for each particle, and only particles inside the smoothing radius should affect the current particle. If we subdivide space into a uniform grid with a cell size of the smoothing radius and sort each particle into its corresponding cell, we can perform efficient constant time lookups of smaller particle lists. Let’s say that we have a particle somewhere. We can look up which grid cell it belongs to by taking its position and dividing it by the smoothing radius:

int3 cellIndex = floor(particleA.position / h);

Then we have to evaluate all the particles inside that grid cell and all the direct neighbour cells as well. In a three-dimensional grid, those are 3*3*3 = 27 cells. There are some problems that come to mind immediately:

  • We don’t want to just constrain the simulation to a finite grid, but we don’t have infinite memory
  • We don’t know how many particles can be inside one cell at once. We don’t want to impose limitation, because that could affect simulation results.
  • We can’t allocate memory with GPU code, so we will be using a fixed size pre-allocated pool.
  • We will need to build this data structure entirely on the GPU.

Fortunately, all of these are solvable with the dynamic hashed grid algorithm!

The first problem we should solve is that each cell can contain an unbounded number of particles (well, still bounded by the maximum number of allowed particles per simulation). I took this idea from a GDC 2008 presentation by Nvidia. I needed some time to wrap my mind around it, so hopefully I can describe it in a way that will be easier for you. It is called a “sorted grid” and it is described like this:


It can be tricky to digest, so let me explain. Let’s start from left to right. We have a picture and three columns. The picture of the grid shows that grid cells 4, 6 and 9 contain some particles. This can be represented by the unsorted list (first column). It can be built on the GPU by iterating through the particle list, and each particle writes its grid cell index into a list. But it is not yet usable to perform a constant time lookup. The second column is the sorted list. It’s the previous list, only the particle index list is now sorted by grid cell index. This is good for us, because now if we iterate through the particle index list, the corresponding grid cell indices are laid out contiguously in ascending order. But we need the ability to look up an arbitrary cell in constant time and get a contiguous array of the particles it contains. We are nearly there. As the last step, we can process this list and create an “offset list” for the grid cells. We will have an array of grid cells, each holding either

  • UINT32_MAX (0xFFFFFFFF) When the cell is empty
  • integer value >= 0 indicating the offset into sorted particle list

With this information, we can now say everything about a particle when we iterate through them:

  • Its corresponding grid cell
  • The grid cell’s offset into the sorted list
  • We know that the sorted list contains some number of particles in the same cell, from the offset until we read a particle which is in another cell.

As we can see, there is no need for memory allocation apart from the pre-allocation, so it is doable on the GPU. We should follow these steps:

  1. Create the unsorted list. We will be running a compute shader with SV_DispatchThreadID indexing the particle index list, which indexes the main particle property buffer containing position, simulation properties, etc. We will have a cellIndexBuffer, which is the same size as the particle index buffer. It could also be a property of the particle structure, but a separate buffer gives a better memory access pattern, because it is only used in the optional particle simulation path. Also, this way we can have a general purpose sorting shader more easily.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    Particle particle = particleBuffer[particleIndex];
    int3 cellIndex = floor(particle.position / h);
    uint flatCellIndex = GetFlatCellIndex(cellIndex);
    cellIndexBuffer[particleIndex] = flatCellIndex;
  2. Create the sorted list. Run a bitonic merge sort, which performs comparisons on the cellIndexBuffer but swaps elements of the particleIndexBuffer. Bitonic sort is a well researched parallel sorting algorithm, and you can pull an implementation from AMD’s particle demo on GitHub.
  3. Create the offset list. The offset list will be recreated from scratch each frame, so first ensure that it is completely cleared to invalid offsets (0xFFFFFFFF). Then we can exploit atomic operations to efficiently compute the cell offsets. We will dispatch a compute shader that reads one particle index from the SV_DispatchThreadID semantic. We can read the corresponding grid cell index or compute it. Loading is preferable here, because we achieve better cache coherency by not loading the position from the bloated particle buffer and converting it to a cell index. The atomic min operation works because the offset buffer initially contains high values everywhere, so for each grid cell we store the minimum over all the threads that had a particle corresponding to that cell.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    uint cellIndex = cellIndexBuffer[particleIndex];
    InterlockedMin(cellOffsetBuffer[cellIndex], dispatchThreadID.x);
  4. The structure is complete, all we have to do is iterate it!

Before going further, let’s address our remaining desire: the grid should be “infinite”, not constrained to a small area. This is a really simple extension called hashing. The only thing that changes is the computation of the flat cell index. Let’s define the function:

inline uint GetFlatCellIndex(in int3 cellIndex)
{
  const uint p1 = 73856093; // some large primes
  const uint p2 = 19349663;
  const uint p3 = 83492791;
  uint n = p1 * cellIndex.x ^ p2 * cellIndex.y ^ p3 * cellIndex.z;
  n %= TOTAL_CELL_COUNT; // wrap into however many cells the offset buffer was allocated with
  return n;
}

The rest of the algorithm is unchanged; just ensure that we use the same function wherever we look up a grid cell. Also, the choice of prime numbers and grid cell count can have a big impact on the efficiency of the hashing (because of hash collisions, multiple world positions can map to the same physical cell), so we should tune these to our needs.

All that’s left now is iterating this structure to resolve the SPH simulation. It will be traversed two times, first for density evaluation, then for force computation. But the iteration itself is the same in both cases:

int3 cellIndex = floor(particleA.position / h);

for (int i = -1; i <= 1; ++i)
{
  for (int j = -1; j <= 1; ++j)
  {
    for (int k = -1; k <= 1; ++k)
    {
      int3 neighborIndex = cellIndex + int3(i, j, k);
      uint flatNeighborIndex = GetFlatCellIndex(neighborIndex);
      // look up the offset to the cell:
      uint neighborIterator = cellOffsetBuffer[flatNeighborIndex];

      // iterate through particles in the neighbour cell (if iterator offset is valid)
      while (neighborIterator != 0xFFFFFFFF && neighborIterator < particleCount)
      {
        uint particleIndexB = particleIndexBuffer[neighborIterator];
        if (cellIndexBuffer[particleIndexB] != flatNeighborIndex)
        {
          break;  // it means we stepped out of the neighbour cell list!
        }

        // Here you can load particleB and do the SPH evaluation logic

        neighborIterator++;  // iterate...
      }
    }
  }
}
This is it. It might be a good idea to eliminate the triple loop with a lookup table (but don’t unroll it!), but it is better visualized like this for simplicity/code size. This yields a huge performance gain: we are able to simulate 100,000 particles on a low end GPU in real time at ~30 FPS:
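For reference, the lookup table variant mentioned above might be sketched like this (the offsets baked into a static array, the loop kept dynamic):

```hlsl
// Sketch: the 27 neighbour offsets as a flat table; keep the loop dynamic (no [unroll]).
static const int3 NEIGHBOR_OFFSETS[27] = {
    int3(-1, -1, -1), int3(-1, -1, 0), int3(-1, -1, 1),
    int3(-1,  0, -1), int3(-1,  0, 0), int3(-1,  0, 1),
    int3(-1,  1, -1), int3(-1,  1, 0), int3(-1,  1, 1),
    int3( 0, -1, -1), int3( 0, -1, 0), int3( 0, -1, 1),
    int3( 0,  0, -1), int3( 0,  0, 0), int3( 0,  0, 1),
    int3( 0,  1, -1), int3( 0,  1, 0), int3( 0,  1, 1),
    int3( 1, -1, -1), int3( 1, -1, 0), int3( 1, -1, 1),
    int3( 1,  0, -1), int3( 1,  0, 0), int3( 1,  0, 1),
    int3( 1,  1, -1), int3( 1,  1, 0), int3( 1,  1, 1),
};

for (uint n = 0; n < 27; ++n)
{
    uint flatNeighborIndex = GetFlatCellIndex(cellIndex + NEIGHBOR_OFFSETS[n]);
    // ...same neighbour cell iteration as before...
}
```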


Apart from particle collision, it is a good idea to implement planar collisions, at least for debug purposes, just to keep the simulation in bounds and test how it can fill out a volume. Planar collision is simple: if a particle would go behind a plane, clamp its position and reflect its velocity about the plane normal. Beyond that, we can implement rigid collisions with unsimulated particles (for example driven by the animation system), but that is out of scope for today.
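A minimal sketch of such a planar collision response, assuming the plane is given as a unit normal plus distance (plane.xyz, plane.w) and elasticity is an illustrative damping parameter:

```hlsl
// Sketch: keep a particle on the positive side of a plane.
float dist = dot(particle.position, plane.xyz) + plane.w;
if (dist < 0)
{
    // clamp the position back onto the plane:
    particle.position -= plane.xyz * dist;
    // reflect the velocity about the plane normal, with some energy loss:
    particle.velocity = reflect(particle.velocity, plane.xyz) * elasticity;
}
```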

As with any technique, there are multiple possibilities to further optimize. I will not cover hardware level GPU optimizations here because I have not yet researched this area thoroughly and this is already a long article. My general direction would be to first profile for any obvious bottlenecks. Without doing so, I imagine that memory latency could be improved by somehow using shared memory for the grid based lookups. It is not trivial anymore like with an N-body simulation, because a thread group can span an arbitrary number of different grid cells at once. Apart from that, we should always aim to stay in cache wherever we can, so deinterleaving certain less frequently used particle properties might be a good idea. Also, watch out for bank conflicts when writing to (shared) memory; you can avoid them by aligning your data to the bank size and interleaving properties that are frequently used together into a structure. Can we use wave intrinsics to speed up certain parts of the simulation? I have some ideas I still haven’t tested out.

You can see that there is more room for additional optimizations and I still haven’t talked about rendering the fluid. These are very good candidates for other blog posts in the future, so stay tuned!

As always, you are free to pull my implementation from my open source engine:

Also, my older article describes a general purpose GPU particle system which is a good candidate to fit a fluid simulator into. Or just to implement a cool large-scale particle system.

If you find any mistakes or have any questions, let me know in the comments!

UPDATE: There was a divide by zero bug, which is now fixed. In the force computation step, we must check if the distance between two particles is greater than zero, otherwise avoid computing the SPH forces between those two particles. It sometimes manifested as a flickering anomaly that looked more like a synchronization issue. Sorry for the inconvenience! I marked the occurrences in the code samples with **

[Download sample model] for [Wicked Engine Editor]

Thoughts on Skinning and LDS


I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Instead, we can utilize the LDS the following way: at the beginning of the shader, when the vertex information is being loaded, each thread also loads one matrix from the bone array and stores it in LDS. We must also synchronize the group to ensure the LDS bone array is complete. After all the memory has been read from VRAM, we continue with the skinning: iterate through all bone indices for the vertex, load the corresponding bone matrix from LDS, then transform the vertex and blend by the bone weights. We just eliminated a bunch of memory latency from the shader.
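A minimal sketch of the idea (the buffer names, the boneCount constant and the float4x3 bone layout are assumptions for illustration):

```hlsl
StructuredBuffer<float4x3> boneBuffer : register(t0);
groupshared float4x3 boneLDS[256];

[numthreads(256, 1, 1)]
void SkinningCS(uint3 DTid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // each thread loads one bone into LDS while the vertex data is being loaded:
    if (groupIndex < boneCount)
    {
        boneLDS[groupIndex] = boneBuffer[groupIndex];
    }
    GroupMemoryBarrierWithGroupSync();

    // ...load position, normal, boneIndices, boneWeights for vertex DTid.x...

    float3 skinned = 0;
    for (uint i = 0; i < 4; ++i)
    {
        // the matrix load now hits LDS instead of VRAM:
        skinned += mul(float4(position, 1), boneLDS[boneIndices[i]]) * boneWeights[i];
    }

    // ...write the skinned vertex back...
}
```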

Consider what happens when you have a loop which iterates through the bone indices of a vertex: first you load the bone, you want to immediately use it for the skinning computation, then repeat. Loading from a VRAM buffer causes significant latency until the data is ready to be used. If we unroll the loop, the load instructions can be rearranged and padded with independent ALU instructions to hide latency a bit. But unrolling the loop increases register allocation (VGPR = Vector General Purpose Register, storing data unique to the thread; buffer loads consume VGPRs unless they are known at compile time to be common to the whole group, in which case they can be placed in scalar registers, SGPRs) and can result in lower occupancy, as we have a very limited budget of them. We also want a dynamic loop instead, because a vertex may have fewer bones than the maximum, so processing could exit early. So a tight dynamic loop with a VRAM load followed immediately by dependent ALU instructions is maybe not so good. But once that loop only accesses LDS, the latency can be significantly reduced, and performance should increase.

But LDS does not come for free; it is also a limited resource, like VGPRs. Let’s look at the AMD GCN architecture: we have a maximum of 64 KB of LDS for a single compute unit (CU), though HLSL only lets us use 32 KB in a shader. If a shader uses the whole 32 KB, only two instances of it can be running on the CU. Our bone data structure is a 3×4 float matrix, 48 bytes. We could fit 682 bones into LDS and still have two instances of the compute shader operate in parallel. But we hardly ever have skeletons consisting of that many bones. In my experience, fewer than 100 bones is enough for most cases, and in real time apps we surely won’t use more than, say, 256 bones even for a highly detailed model. So say that our shader declares an LDS bone array of 256 bones, and the thread group size is also 256, so each thread loads one bone into LDS. 256 * 48 bytes = 12 KB. This means that 5 instances of this shader could be running in parallel on a CU, so 5 * 256 = 1280 vertices processed. That is, if we don’t exceed the max VGPR count of 65536 for a CU; in this case a single shader must fit into a 51 VGPR limit (calculated as 65536 VGPR / 1280 threads). In most cases we will easily fit into even a 128 bone limit, so an LDS bone array of 128 bones and a thread group size of 128 threads will be enough and much easier on the GPU.

However, I can imagine a scenario which could be worse with the LDS method: a complicated skeleton, but a small mesh referencing only a few of the bones. In the case when there is one skeleton for multiple meshes, maybe we should combine the meshes into a single vertex buffer and use an index buffer to separate them, so that a single dispatch can animate all the meshes, while they can still be divided into several draw calls when needed.

It sounds like a good idea to utilize LDS in a skinning shader, and it is a further potential improvement over skinning in a vertex/geometry shader with stream out. But as with anything on the GPU, this should be profiled on the target hardware you are developing on. My test cases were unfortunately maybe not the best assets there are, run on a puny integrated Intel mobile GPU, but even so I could measure a small performance improvement with this method.

Thank you for reading, you can find my implementation on GitHub! Please tell me if there is any incorrect information.

UPDATE: I’ve made a simple test application with heavy vertex count skinned models, and toggleable LDS skinning: Download WickedEngineLDSSkinningTest (You will probably need Windows 10 to run it and DirectX 11)

Further reading:

Large thread groups on GCN

Easy Transparent Shadow Maps


Supporting transparencies with traditional shadow mapping is straightforward and allows for nice effects, but as with anything related to rendering transparents with rasterization, there are corner cases.

The implementation is really simple once you have shadow mapping for opaque objects. After the opaque shadow pass, we must render the transparents into a color buffer, but reject samples which would be occluded by opaques, so we use a depth read-only depth stencil state. The transparents should be blended multiplicatively; sorting does not matter with a multiply blend state. In bullet points:

  1. Render opaque objects into depth stencil texture from light’s point of view
  2. Bind render target for shadow color filter: R8G8B8A8_UNORM works well
  3. Clear render target to 1,1,1,0 (RGBA) color
  4. Apply depth stencil state with depth read, but no write
  5. Apply multiplicative blend state, e.g.:
    • SrcBlend = BLEND_DEST_COLOR
    • DestBlend = BLEND_ZERO
    • BlendOp = BLEND_OP_ADD
  6. Render transparents in arbitrary order

When reading shadow maps in the shading passes, we only need to multiply the lighting value with the transparent shadow map color filter if the pixel is inside the light’s range. There is a slight problem with this approach that you will notice immediately: transparent objects now receive their own colored self shadow too. The simplest fix is to just disable the colored part of the shadow calculation for transparent objects. We can already produce nice effects with this, so it is not a huge price to pay.
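In the light evaluation, the extra cost is a single texture sample and a multiply; a sketch, with illustrative resource names:

```hlsl
// Sketch: applying the transparent shadow color filter on top of the regular shadow test.
float shadow = shadowMap.SampleCmpLevelZero(shadowCmpSampler, shadowUV, surfaceDepth);
float3 shadowColor = transparentShadowFilter.Sample(linearSampler, shadowUV).rgb;
// opaque occluders darken, the color filter tints whatever passes through transparents:
lighting *= shadow * shadowColor;
```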

See it in action, transparents are rendered without colored self-shadows:


But they receive shadows from opaque objects just fine:


There is a technique which would allow us to render colored shadows on top of transparents too. This involves keeping an additional shadow depth map for transparencies. The flow of this technique is like this (from a Blizzard presentation):

  1. Render opaque shadow map
  2. Render transparent shadow map
    • To a separate depth texture!
    • Depth writes ON
  3. Clear shadow color filter texture (like in the simple approach)
  4. Render transparent again to color filter render target
    • But use the opaque shadow map’s depth stencil
    • depth read ON
    • depth write OFF

And in the shading step, now there will be two shadow map checks, one for the opaque, one for the transparent shadow maps. Only multiply the light with the shadow filter color texture when the transparent shadow check fails.

This will eliminate false self shadows from transparent objects. But unfortunately, now when a transparent receives a colored shadow, its own transparent shadow color will also contribute to itself.
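The shading step of this extended technique might be sketched like this (resource names are illustrative):

```hlsl
// Sketch: two shadow map checks; the color filter only applies when the
// transparent shadow test fails (the surface is behind a transparent caster).
float opaqueShadow = opaqueShadowMap.SampleCmpLevelZero(shadowCmpSampler, shadowUV, surfaceDepth);
float transparentShadow = transparentShadowMap.SampleCmpLevelZero(shadowCmpSampler, shadowUV, surfaceDepth);
float3 shadowColor = 1;
if (transparentShadow < 1)
{
    shadowColor = transparentShadowFilter.Sample(linearSampler, shadowUV).rgb;
}
lighting *= opaqueShadow * shadowColor;
```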

What’s more interesting, are the additional effects we can achieve with transparent shadow maps:

Textured shadows, which can be used as a projector for example, just put a transparent textured geometry in front of a light:


And underwater refraction caustics:


The caustics are implemented as adding additional light to the surface, so even if the surface below the water is in the water’s shadow, the caustic effect is still added in. For this I am using the transparent color filter render target’s alpha channel with an additive blend mode. The water shadow shader renders the slope of the water normal map into the alpha channel, e.g.:

outputColor.a = 1 - saturate(dot(normalMap.rgb * 2 - 1, float3(0, 0, 1)));

This could probably use a more sophisticated method, which I haven’t bothered to find yet. In the light evaluation shaders, I then add the transparent color filter alpha channel, multiplied by some causticStrength modulator, to the pixelIsInShadow value to render the caustics. Caustics are added (not multiplied) because they can be brighter than the light itself: in reality they are generated when multiple rays (photons) arrive at the same location from different directions.

That’s it, I think this is a worthwhile technique to include in a game engine. Even the simplest implementation can be used for many interesting effects. Thank you for reading!

Related article:

StarCraft 2 Effects and techniques

Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such a technique we can efficiently query every light affecting any surface. But a trivial implementation leaves much room for improvement. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.


For the starting point I assume a tiled rendering pipeline like the one described in a GDC 2011 presentation by Johan Andersson (DICE). The following post will be applicable to the Tiled Deferred and Forward+ variants of that pipeline. A quick recap of such a renderer:

  1. Dispatch a compute shader thread group per screen-space tile
  2. Thread group calculates tile frustum in view space
  3. Each thread reads pixel from depth buffer, performs atomic min-max to determine depth bounds
  4. Tile frustum is truncated by depth bounds
  5. Each thread computes if a light is inside the frustum, adds the light to a per tile list if yes
  6. Compute final shading from the per tile light list
    • Deferred can just read gbuffer, loop through per tile light list in the same shader and write the light buffer texture
    • Forward+ exports light list that will be read by second geometry pass and compute shading in pixel shaders

Our main goal is to refine the culling results by eliminating false positives, that is, to cull lights which do not contribute to the final shading.

1.) Help out frustum culling with AABB

The first problem arises when we perform the culling of lights as sphere-frustum intersection tests. Let me demonstrate this with a point light:



Blue pixels on the second picture visualize which tiles contain the sphere light source after the culling. But the results are strange: why is the blue area square shaped, when it should be a circle instead? This happens because the math used for sphere-frustum plane tests is not accurate enough when we test big spheres against small frustums. Let me try illustrating this problem. The following image shows a tile frustum as seen on the screen:


As you can see, the sphere is completely outside the frustum, but none of the plane tests pass completely, so the light is not culled away properly. We can overcome this, by using axis aligned boxes (AABBs) instead of frustums. It is implemented like this:

bool SphereIntersectsAABB(in Sphere sphere, in AABB aabb)
{
  float3 vDelta = max(0, abs(aabb.center - sphere.center) - aabb.extents);
  float fDistSq = dot(vDelta, vDelta);
  return fDistSq <= sphere.radius * sphere.radius;
}


The result:


This one seems good enough, but don’t throw out frustum culling just yet!

2.) Depth discontinuities are the enemy

Holes in the depth buffer can greatly reduce efficiency of a tile based light selection, because the tile enclosing min-max depth AABB can get huge really fast:


In the image above I tried to illustrate (from a top-down view) that a depth discontinuity made the AABB large enough to intersect a light which the frustum culling would have rejected. This is why the AABB test should be used alongside frustum culling, the two complementing each other.
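So the per-light culling condition simply becomes the conjunction of the two tests (SphereIntersectsFrustum stands in for the usual plane-based test here); something like:

```hlsl
// Sketch: a light must pass both the frustum and the AABB test to be added.
if (SphereIntersectsFrustum(sphere, tileFrustum) && SphereIntersectsAABB(sphere, tileAABB))
{
    // append the light index to the tile's light list...
}
```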

Depth discontinuities usually introduce another inefficiency: there can be cases when a light lies in empty space, not intersecting with anything, but still inside the tile, so the shading will receive a light that does not contribute at all:


As you can see, that light is inside the frustum and inside the AABB, but it sits in empty space between geometries; our current algorithm will still add it to the light list.

To solve this, there is a technique called 2.5D light culling, introduced by Takahiro Harada. In addition to that presentation, I would like to give an HLSL implementation. The basic idea is to create two bitmasks: one for the tile and one for the light which we are checking. The masks are combined with a bitwise AND to determine whether the light intersects any geometry in the tile (AND returns non zero) or not (AND returns zero).


For the sake of a simpler image, I used a 9-bit mask, but we should use a 32-bit mask which we can represent by a uint variable.

The first bitmask is created for the whole tile once. While each thread reads its corresponding pixel from the depth buffer, it does an atomic min-max already, but now it also fills in a single relevant bit in a uint and performs an atomic OR to the tile bitmask. So what is the relevant bit? The algorithm says that we divide our tile depth range into 32 pieces and a 32-bit uint variable will contain those ranges. We first determine our tile depth bounds in linear space for this, then fill in the corresponding bit accordingly:

groupshared uint tileDepthMask; // groupshared can't have an initializer; clear it from one thread and sync first

// ...

float minDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, minDepth, 1)).z;
float maxDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, maxDepth, 1)).z;
float realDepthVS = UnprojectScreenSpaceToViewSpace(float4(0, 0, pixelDepth, 1)).z;

float depthRangeRecip = 32.0f / (maxDepthVS - minDepthVS);
// clamp to 31 so the shift below stays in range:
uint depthmaskcellindex = max(0, min(31, floor((realDepthVS - minDepthVS) * depthRangeRecip)));
InterlockedOr(tileDepthMask, 1u << depthmaskcellindex);


This code is run by every thread in the group. The unexplained function UnprojectScreenSpaceToViewSpace just does what it says: the input is a screen coordinate point, and it transforms it to view space. We are only interested in the Z coordinate here, so we only need to transform the input with the inverse projection matrix and divide the result by the w component afterwards. If we were also interested in the XY coordinates, we would additionally need to transform from [0,1] to [-1,1] range before unprojection. The function looks like this for the common case:

float4 UnprojectScreenSpaceToViewSpace(in float4 screenPoint)
{
  float4 clipSpace = float4(float2(screenPoint.x, 1 - screenPoint.y) * 2 - 1, screenPoint.z, screenPoint.w);
  float4 viewSpace = mul(clipSpace, xInverseProjectionMatrix);
  viewSpace /= viewSpace.w;
  return viewSpace;
}


The bitmask construction code might look a bit intimidating, so let me explain what’s happening. We calculate the minZ, maxZ and current pixel Z in view space and determine the depth slice size a single bit will represent (depthRangeRecip). Then we shift a bit to the right place and add it to the group shared tile mask by means of an atomic OR operation.

The tile mask is complete, so we only need to know how to construct a light mask. That must be done inside the loop where we are culling lights. On the first try I cooked up this:

float fMin = lightPosViewSpace.z - lightRadius;
float fMax = lightPosViewSpace.z + lightRadius;

uint lightMaskcellindexSTART = max(0, min(31, floor((fMin - minDepthVS) * depthRangeRecip)));
uint lightMaskcellindexEND = max(0, min(31, floor((fMax - minDepthVS) * depthRangeRecip)));

uint lightMask = 0;
for (uint c = lightMaskcellindexSTART; c <= lightMaskcellindexEND; ++c)
{
  lightMask |= 1u << c;
}


Here we determine the beginning and end ranges of the sphere light inside the view space depth range, and push bits into the mask in a loop, to the correct places one by one:

In this mask for example, lightMaskcellindexSTART is the 11th bit from the right, and lightMaskcellindexEND is the 21st bit from the right:


Of course this loop seems like a waste inside a shader, so I needed to come up with something better. Rethinking how a smaller bitfield could be pushed inside a bigger bit range gave me the idea to exploit the truncation done by the bitwise shift operators:

  • First, fill the full mask:
    • uint lightMask = 0xFFFFFFFF;
  • Then shift right by the spare amount, to keep only the mask width:
    • lightMask >>= 31 - (lightMaskcellindexEND - lightMaskcellindexSTART);
  • Last, shift left by the START amount, to move the mask into position:
    • lightMask <<= lightMaskcellindexSTART;

So the resulting code replaces the loop with only a few instructions, which is a lot better.

We have the tile mask and the light mask, so the only thing left to do is AND them to determine if a light touches something or not:

bool intersect2_5D = tileMask & lightMask;

And here is the resulting comparison of culling results in a high depth discontinuity scene with alpha tested vegetation and many point lights (red tiles have more than 50 lights):


As you can see, there is a big improvement in the tile visualizer heatmap; the shading will process far fewer lights and performance will improve in these difficult cases with alpha tested geometry. The scene shown above had the following timing results (measured with DX11 timestamp queries):

  • Basic 2D culling:
    • Culling: 0.53 ms
    • Shading: 10.72 ms
  • Improved 2.5D culling:
    • Culling: 0.64 ms
    • Shading: 7.64 ms

As you can see, this is a definite improvement: while the culling shader took a bit more time to finish (about 0.1 ms), the object shading took about 3 ms less. I made a more detailed video some time ago:

3.) Other light types

We surely have the need to cull other light types than point lights. Spot lights come to mind at first thought, but there can be the desire to cull decal boxes or area-aligned local environment probe volumes as well.

  • Spot lights:
    • Implement a cone culling algorithm in shader code. This can be more expensive than sphere culling and again not too accurate with frustum tests. Going further and implementing cone-AABB tests will further slow down our algorithm.
    • Or approximate the spotlight cone with tightly fitting a sphere around it. We don’t need to implement any other culling algorithms, only fitting a sphere around the spotlight, which is a straight forward, easy calculation. But this will result in excessive waste for thin/long cones. Code to fit sphere around cone:
      • float spotLightConeHalfAngleCos = cos(spotLightFOV * 0.5f);

        float sphereRadius = spotLightRange * 0.5f / (spotLightConeHalfAngleCos * spotLightConeHalfAngleCos);

        float3 sphereCenter = spotLightPos + spotLightDirection * sphereRadius;

      • Remember that you can precalculate this outside the shader, so the shader only has a sphere to evaluate.
    • Another method is doing a cone-sphere test instead, which can produce much more accurate results than the previous two; check it out here.
  • Decal/Environment probe box:
    • The idea here is to get away with AABB-AABB tests. But our decals or envprobes can be oriented arbitrarily in world space, so they are OBBs! We can do something that I’d call a coarse AABB test. I transform the tile AABB from view space to world space (using the inverse view matrix of the camera) and recalculate the AABB of the resulting OBB. Then for each decal, I transform the world space tile AABB by the decal’s inverse world matrix and perform an AABB-AABB intersection test between the resulting AABB and a unit AABB. The results are kind of good, but we have to refine with the frustum as well. Transforming an AABB with a matrix is quite heavy though, because I multiply each corner with the matrix while keeping track of the min-max. With two AABB transforms it gets heavy, but I have not yet found a better solution. And culling decals or envprobes is much less frequent than lights, so I comfort myself with that for the time being. The results:


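For reference, a sketch of the AABB transform described above, assuming the AABB struct stores center and extents like in the intersection test earlier:

```hlsl
// Sketch: transform an AABB by a matrix by tracking the min/max of its 8 corners.
AABB TransformAABB(in AABB aabb, in float4x4 mat)
{
    float3 vmin = aabb.center - aabb.extents;
    float3 vmax = aabb.center + aabb.extents;
    float3 newMin = 1e30;
    float3 newMax = -1e30;
    for (uint i = 0; i < 8; ++i)
    {
        // select each of the 8 corners from the bit pattern of i:
        float3 corner = float3(
            (i & 1) ? vmax.x : vmin.x,
            (i & 2) ? vmax.y : vmin.y,
            (i & 4) ? vmax.z : vmin.z);
        float3 t = mul(float4(corner, 1), mat).xyz;
        newMin = min(newMin, t);
        newMax = max(newMax, t);
    }
    AABB result;
    result.center = (newMin + newMax) * 0.5;
    result.extents = (newMax - newMin) * 0.5;
    return result;
}
```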
4.) Gather vs. Scatter approaches

The implementation described by the above mentioned Battlefield presentation uses the gather approach, which means that each thread group iterates through every potential light. A culling shader like this works best for a couple hundred lights, but can become slow when approaching a few thousand. To support more lights, we can implement some kind of scatter approach. My suggestion is to have a coarse culling step before the regular light culling which operates on much bigger screen tiles and dispatches the work differently: each thread processes a single light, determines which tiles it belongs to, and writes (scatters) the light index into those tiles. The regular culling would then read the light lists from the coarse tiles instead of iterating through every light on the screen. We could also use an acceleration structure for the lights, an octree for example, and let the shader use that instead of coarse culling.
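Such a coarse scatter pass might be sketched like this (LoadLightBoundingSphere, ComputeCoarseTileRange and the coarse tile buffers are assumptions for illustration; the point is one thread per light and atomic appends into tiles):

```hlsl
// Sketch: scatter-style coarse culling, one thread per light.
[numthreads(64, 1, 1)]
void CoarseCullCS(uint3 DTid : SV_DispatchThreadID)
{
    if (DTid.x >= lightCount)
        return;

    Sphere sphere = LoadLightBoundingSphere(DTid.x);

    // project the light's bounds to a rectangle of coarse screen tiles:
    uint2 tileMin, tileMax;
    ComputeCoarseTileRange(sphere, tileMin, tileMax);

    for (uint y = tileMin.y; y <= tileMax.y; ++y)
    {
        for (uint x = tileMin.x; x <= tileMax.x; ++x)
        {
            uint tile = y * coarseTileCountX + x;
            uint slot;
            // reserve a slot in the coarse tile's list and scatter the light index:
            InterlockedAdd(coarseTileLightCounts[tile], 1, slot);
            coarseTileLightLists[tile * MAX_LIGHTS_PER_COARSE_TILE + slot] = DTid.x;
        }
    }
}
```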

A scatter approach could also be used for a different purpose: refining the culling results. I described above that we can approximate spotlights with spheres, but another idea would be to rasterize a cone in low resolution instead, and let the pixel shader write the light into the tile corresponding to the invoked pixel.

5.) Don’t be greedy

It might be tempting to create a single uber-shader handling everything: creating frustums, reading the depth buffer and assembling depth bounds, creating the tile AABB, creating the tile bitmask, culling lights, and in the case of tiled deferred, also evaluating the shading at the end. In reality, though, it could be the wrong path to take.

First, on the AMD GCN architecture for example, resources such as registers are shared across hardware execution units; if a shader uses too many, there will be contention for them, parallel execution will be reduced, and overall performance will be bottlenecked. This is called register pressure. A shader which creates frustums at the beginning is already using many registers, for example, which could be precalculated instead to lighten the load. AABB calculation further reduces the available registers, and so does calculating the tile bitmask. A tiled deferred shader at the end of the culling shader can also be very complex and utilize many registers at once.

Then there is the part where we create the depth bounds with atomic operations. Atomics can be slow, so calculating the depth bounds in a separate shader by means of parallel reduction could be a better alternative. A reduced resolution depth buffer can also be reused later, as a hierarchical-Z pyramid for instance.
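A parallel reduction for the depth bounds might be sketched like this for a 16×16 tile (depthBuffer and tileDepthBounds are illustrative resources):

```hlsl
groupshared float2 depthMinMax[256];

[numthreads(16, 16, 1)]
void DepthBoundsCS(uint3 DTid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex, uint3 Gid : SV_GroupID)
{
    float depth = depthBuffer[DTid.xy];
    depthMinMax[groupIndex] = float2(depth, depth);
    GroupMemoryBarrierWithGroupSync();

    // halve the number of active threads each step, merging pairs of min/max:
    for (uint s = 128; s > 0; s >>= 1)
    {
        if (groupIndex < s)
        {
            depthMinMax[groupIndex] = float2(
                min(depthMinMax[groupIndex].x, depthMinMax[groupIndex + s].x),
                max(depthMinMax[groupIndex].y, depthMinMax[groupIndex + s].y));
        }
        GroupMemoryBarrierWithGroupSync();
    }

    if (groupIndex == 0)
    {
        // one min/max pair written per tile, no atomics involved:
        tileDepthBounds[Gid.xy] = depthMinMax[0];
    }
}
```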

Divergent branching in a shader is only a good idea if we design the shader to be highly coherent in the branches a thread group takes. A light culling setup usually works best with 16×16 or 32×32 pixel tiles, where each thread gets the minor task of culling a single light; this task is highly divergent in the path each thread takes. A light evaluation shader behaves differently, because each thread will potentially process the same array of lights in its tile. Except there can be cases when a light/decal calculation exits early or skips shadow calculations, etc. with pixel granularity instead of per tile granularity. In that case it is already inefficient to use big thread groups, because long iterating threads hold back the rest of the group from exiting early and the hardware is underutilized. So a smaller tile size should be preferred for the light evaluation (8×8 worked best in my experiments).

Seeing these problems, I propose to separate the big shader into several smaller parts. Frustum precomputation could go into its own shader. Another shader could reduce the depth buffer and create the depth bounds along with the tile bitmask information; tile AABB computation could potentially go in there too. The culling shader would only load the depth bounds, tile AABB and bitmask from memory, perform the light culling, then export the per tile light lists to memory. The last shader is the surface shading, light evaluating shader, which is a vertex-pixel shader for Forward+ and a compute shader for tiled deferred, with a smaller block size than the culling (8×8, as proposed in the previous paragraph).

Separating the shaders opens up another optimization possibility: using Indirect Dispatch. Consider for example that the culling determined that no lights are inside a tile; it could then instruct the surface shading shader not to dispatch any work groups for that tile.


A lot has been covered, so let’s recap quickly:

  • Frustum-sphere tests are inaccurate
  • Refine with AABB-sphere tests
  • Work against depth discontinuities with 2.5D culling bitmasks
  • Small burden on culling can be big win with shading
  • Approximate complex light types with spheres and AABBs
  • Avoid overworked ubershaders
  • Prefer smaller shaders, better fitted to specific task at hand

Thanks for reading, hope you enjoyed it! If you have any feedback, please share with me. Were there any mistakes? Definitely share :)

Further reading:

Introduction to tiled deferred rendering

Reference for 2.5D culling

Tile based rendering advancements GDC 2015

Awesome Forward+ tutorial

RealTimeCollisionDetection.NET has most of the intersection code

Investigating efficient spotlight culling

Next power of two in HLSL

There are many occasions when a programmer wants to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader, and I had this piece of code responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;
while (myNumberPowerOfTwo < myNumber)
{
  myNumberPowerOfTwo <<= 1;
}


It gets the job done, but doesn’t look so nice. For the not unusual case when myNumber is more than 1000, it can already take ten iterations of the loop. I recently learned that HLSL has a built in function called firstbithigh. It returns the position of the first non zero bit in a 32-bit number, searching from the high order bits towards the low ones. With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:


For that number, firstbithigh returns the location of the first non zero bit (searching from the left), which is 24. The next power of two for that number would be:


That number has seven leading zeroes. All we have to do is zero the whole number but set a single bit to non-zero. To do it, first set up 2 in binary:


Then shift the whole thing left by the position of the first bit in myNumber to get what we are looking for:


Nice one, but we have a few corner cases to eliminate. If myNumber is already a power of two, we will get the next greater power of two, which might not be what we are looking for; in that case we should return myNumber itself. If you subtract one from myNumber before calling firstbithigh on it, power of two inputs will return themselves:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber - 1);

Take 1024 for example: it will get the next power of two for 1023, which is 1024. Or take 129: it will get the next power of two for 128, which is 256, so it works out quite well. If you try to calculate it for 0, however, there is a potential bug if myNumber is unsigned, as 0 - 1 wraps around to 0xFFFFFFFF. If you can potentially call it for zeroes and operate on unsigned numbers, I would suggest doing it like this:

uint myNumberPowerOfTwo = 2 << firstbithigh(max(1, myNumber) - 1);

If it is signed int, then you don’t have to worry about it, as the firstbithigh function will return zero for negative numbers.

Have any question, noticed a mistake? Please post in the comments below!

Thank you for reading!