Improved normal reconstruction from depth

In a 3D renderer, we might want to read the scene normal vectors at some point, for example post processes. We can write them out using MRT – multiple render target outputs from the object rendering shaders and write the surface normals to a texture. But that normal map texture usually contains normals that have normal map perturbation applied on them, which some algorithms don’t like. Algorithms that need to ray trace/ray march from the surface in multiple directions around the normal direction are not good fit with these normal mapped textures, because there is a disconnect between what is in the normal map and what is the real surface triangle normal.

Read More »

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

Read More »

Simple job system using standard C++

After experimenting with the entity-component system this fall, I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus using cache prefetching when iterating), and also making it easier to write parallelized code (because data dependecies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system, but generally it should perform well for any large data processing task. I wanted to remain in standard C++ (~11) realm, which I found that it is entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

Read More »

GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provide the general steps of implementation. It is not a step-by step tutorial, but rather introducing algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):

sphinit

Read More »

Thoughts on Skinning and LDS

havoc

I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Read More »

Optimizing tile-based light culling

Tile-based lighting techniques like Forward+ and Tiled Deferred rendering are widely used these days. With the help of such technique we can efficiently query every light affecting any surface. But a trivial implementation has many ways to improve. The biggest goal is to refine the culling results as much as possible to help reduce the shading cost. There are some clever algorithms I want to show here which are relatively easy to implement but can greatly increase performance.

Read More »

Next power of two in HLSL

There are many occasions when a programmer would want to calculate the next power of two for a given number. For me it was a bitonic sorting algorithm operating in a compute shader and I had this piece of code be responsible for calculating the next power of two of a number:

uint myNumberPowerOfTwo = 1;

while( myNumberPowerOfTwo < myNumber)

{

   myNumberPowerOfTwo <<= 1;

}

It gets the job done, but doesn’t look so nice. For not unusual cases when myNumber is more than 1000 it can already take ten cycles to loop. I recently learned that HLSL has a built in function called firstbithigh. It returns the position of the first non zero bit in a 32-bit number starting from the left to the right (from high order to low). With its help, we can rewrite the algorithm as follows:

uint myNumberPowerOfTwo = 2 << firstbithigh(myNumber);

It does the same thing, so how does it work? Take a random number and write it in binary:

Read More »

GPU-based particle simulation

gpuparticles0

I finally took the leap and threw out my old CPU-based particle simulation code and ventured to GPU realms with it. The old system could spawn particles on the surface on a mesh with a starting velocity of each particle modulated by the surface normal. It kept a copy of each particle on CPU, updated them sequentially, then uploaded them to GPU for rendering each frame. The new system needed to keep the same set of features at a minimum, but GPU simulation also opens up more possibilities because we have direct access to resources like textures created by the rendering pipeline. It is also highly parallellized compared to the CPU solution, both the emitting and the simulation phase which means we can do a much higher amount of particles in the same amount of time. There is less data moving between the system and GPU, we can get away with only a single constant buffer update and command buffer generation, the rest of the data lives completely in VRAM. This makes simulation on a massive scale a reality.

Read More »

Which blend state for me?

think

If you are familiar with creating graphics applications, you are probably somewhat familiar with different blending states. If you are like me, then you were not overly confident in using them, and got some basics ones copy-pasted from the web. Maybe got away with simpe alpha blending and additive states, and heard of premultiplied alpha somewhere but didn’t really care as long as it looked decent enough at the time. Surely, there are a lot of much more interesting stuff waiting for you to be implemented. Then later you realize, that something looks off with an alpha blended sprite somewhere. You correct it with some quick fix and forget about it. A week later, you are want to be playing with some particle systems, but there is something wrong with that, the blending doesn’t look good anymore because of a dirty tweak you made earlier. Also, your GUI layer was displaying the wrong color the whole time, but just enough not to notice. There are just so many opportunities for screwing up your blending states without noticing it immediately. Correcting the mistakes can really quickly turn into a big headache. Here I want to give some practical examples and explanations of different use cases, for techniques mainly used in 3D rendering engines.

Read More »