Tile-based optimization for post processing

One way to optimize heavy post processing shaders is to determine which parts of the screen could use a simpler version. The simplest form of this is use branching in the shader code to early exit or switch to a variant with reduced sample count or computations. This comes with a downside that even the parts where early exit occur, must allocate as many hardware resources (registers or groupshared memory) as the heaviest path. Also, branching in the shader code can be expensive when the branch condition is not uniform (for example: constant buffer values or SV_GroupID are uniform, but texture coordinates or SV_DispatchThreadID are not), because multiple threads can potentially branch differently and cause divergence in execution and additional instructions.

Read More »

Improved normal reconstruction from depth

In a 3D renderer, we might want to read the scene normal vectors at some point, for example post processes. We can write them out using MRT – multiple render target outputs from the object rendering shaders and write the surface normals to a texture. But that normal map texture usually contains normals that have normal map perturbation applied on them, which some algorithms don’t like. Algorithms that need to ray trace/ray march from the surface in multiple directions around the normal direction are not good fit with these normal mapped textures, because there is a disconnect between what is in the normal map and what is the real surface triangle normal.

Read More »

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

Read More »

Simple job system using standard C++

After experimenting with the entity-component system this fall, I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus using cache prefetching when iterating), and also making it easier to write parallelized code (because data dependecies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system, but generally it should perform well for any large data processing task. I wanted to remain in standard C++ (~11) realm, which I found that it is entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

Read More »

GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provide the general steps of implementation. It is not a step-by step tutorial, but rather introducing algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):

sphinit

Read More »

Thoughts on Skinning and LDS

havoc

I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Read More »

Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such technique we can efficiently query every light affecting any surface. But a trivial implementation has many ways to improve. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.

Read More »

Next power of two in HLSL

There are many occasions when a programmer would want to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader and I had this piece of code be responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;

while( myNumberPowerOfTwo < myNumber)

{

   myNumberPowerOfTwo <<= 1;

}

It gets the job done, but doesn’t look so nice. For not unusual cases when myNumber is more than 1000 it can already take ten cycles to loop. I recently learned that HLSL has a built in function called firstbithigh. It returns the position of the first non zero bit in a 32-bit number starting from the left to the right (from high order to low). With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:

Read More »