Wicked Engine Net

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparison between the two techniques for each point follow each other immediately, but decided against it, and instead there are two big blocks: stream compaction technique, then the flat bit array technique following it. The points are in similar order for both, so comparison could be easily made by having two browser windows opened for example. It wouldn’t be so simple to do the opposite thing.)

Stream compaction

Idea is based on Johan Andresson’s 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

groupshared uint ArrayLength = 0;
groupshared uint Array[256];

void AppendLight(uint lightIndex)
{
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
    {
        Array[index] = lightIndex;
    }
}
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
{
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
    {
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
        {
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
			
            if( nSwapElem < numArray && index < numArray )
            {
                if( Array[ index ] < Array[ nSwapElem ] )
                {
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
                }
            }
            GroupMemoryBarrierWithGroupSync();
        }
    }
}

Rendering phase:

Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
{
    const uint bucket_index = entityIndex / 32;
    const uint bucket_place = entityIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
}

Rendering phase:

// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because tile buffer is linear array:
//  g_tileCount = ScreenResoulution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for(uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
{
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while(bucket_bits != 0)
    {
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        
        Light light = LightArray[entity_index];
        // process light here...
    }
}
Light culling in Bistro

Results:

I’ll compare some results on the bistro scene. The configuration is for GPU times and measured with timestamp queries:

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

An other small test I made was for the decals, and I just used a simple cornell box, then placed 100 decals into one spot, overlapping with each other, 4k resolution and covering a small part of the screen:

that was the 100 small decals, red tile = 100 decals, black = 0 decals

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

Implementation samples:

As always, you can find my implementation for these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will updated, these will remain like this to represent the state of this blog:

Some other light culling resources: