Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparison between the two techniques for each point follow each other immediately, but decided against it, and instead there are two big blocks: stream compaction technique, then the flat bit array technique following it. The points are in similar order for both, so comparison could be easily made by having two browser windows opened for example. It wouldn’t be so simple to do the opposite thing.)

Stream compaction

Idea is based on Johan Andresson’s 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

groupshared uint ArrayLength = 0;
groupshared uint Array[256];

void AppendLight(uint lightIndex)
{
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
    {
        Array[index] = lightIndex;
    }
}
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
{
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
    {
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
        {
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
			
            if( nSwapElem < numArray && index < numArray )
            {
                if( Array[ index ] < Array[ nSwapElem ] )
                {
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
                }
            }
            GroupMemoryBarrierWithGroupSync();
        }
    }
}

Rendering phase:

Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
{
    const uint bucket_index = entityIndex / 32;
    const uint bucket_place = entityIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
}

Rendering phase:

// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because tile buffer is linear array:
//  g_tileCount = ScreenResoulution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for(uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
{
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while(bucket_bits != 0)
    {
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        
        Light light = LightArray[entity_index];
        // process light here...
    }
}
Light culling in Bistro

Results:

I’ll compare some results on the bistro scene. The configuration is for GPU times and measured with timestamp queries:

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

An other small test I made was for the decals, and I just used a simple cornell box, then placed 100 decals into one spot, overlapping with each other, 4k resolution and covering a small part of the screen:

that was the 100 small decals, red tile = 100 decals, black = 0 decals

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

Implementation samples:

As always, you can find my implementation for these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will updated, these will remain like this to represent the state of this blog:

Some other light culling resources:

turanszkij Avatar

Posted by

5 responses to “Thoughts on light culling: stream compaction vs flat bit arrays”

  1. I tried the flat bit array method after reading your post– worked really well–

    I’m doing the culling/sorting on CPU, and passing in a couple uint params as instance data to vertex shader(also a path that uses constant buffer, but that doesn’t jive as well with multidraw).

    I scalarized the uints using driver extension(I’m on d3d11) to grab first active lane, that was a solid performance improvement.

  2. why you are saying “It will be also interesting to try not using a culling compute shader, but rely on conservative rasterization (or MSAA), and draw the light proxies instead”, is that just the traditional deferred shading method? are the light proxies drawn with the resolution of the tile texture? (in most situations it means reducing the resolution right?) thanks!

    1. Hi, in this case I mean draw the light proxies in the tile resolution texture and only do the atomic to store the light index (not do the actual shading like in deferred). Then in either forward or tiled deferred pass do the light calculation just like in the current solution.

      1. i see. thanks

  3. Having workgroups the size of subgoups is not a good idea at all – they should be larger; For nvidia, each SM can process some integer number of workgroups at a time, on my 1050ti, its maximum of 32 per SM. Launching shader with size 32, will make each SM run 1024 invocations at a time, however the maximum possible is 2048. So, the size 64 per workgroup is a better minimum than 32 for nvidia. Something a bit larger probably would be better… Other things that limit amount of threads per SM are shared memory size and register count. Of course, better occupancy doesn’t necessarily means better performance…

Leave a Reply

Discover more from Wicked Engine

Subscribe now to keep reading and get access to the full archive.

Continue reading