GPU-based particle simulation

gpuparticles0

I finally took the leap, threw out my old CPU-based particle simulation code and ventured into GPU realms with it. The old system could spawn particles on the surface of a mesh, with the starting velocity of each particle modulated by the surface normal. It kept a copy of each particle on the CPU, updated them sequentially, then uploaded them to the GPU for rendering each frame. The new system needed to keep at least the same set of features, but GPU simulation also opens up more possibilities, because we have direct access to resources like textures created by the rendering pipeline. Both the emitting and the simulation phases are also highly parallelized compared to the CPU solution, which means we can handle a much higher number of particles in the same amount of time. There is less data moving between the system and the GPU: we can get away with only a single constant buffer update and command buffer generation, and the rest of the data lives entirely in VRAM. This makes simulation on a massive scale a reality.

If that got you interested, check out the video presentation of my implementation in Wicked Engine:

So, the high level flow of the GPU particle system described here is the following:

  1. Initialize resources:
    • Particle buffer with a size of maximum amount of particles [ParticleType*MAX_PARTICLE_COUNT]
    • Dead particle index buffer, with every particle marked as dead in the beginning [uint32*MAX_PARTICLE_COUNT]
    • 2 Alive particle index lists, empty at the beginning [uint32*MAX_PARTICLE_COUNT]
      • We need two of them, because the emitter writes the first one, simulation kills dead particles and writes the alive list again to draw later
    • Counter buffer:
      • alive particle count [uint32]
      • dead particle count [uint32]
      • real emit count = min(requested emit count, dead particle count) [uint32]
      • particle count after simulation (optional, I use it for sorting) [uint32]
    • Indirect argument buffer:
      • emit compute shader args [uint32*3]
      • simulation compute shader args [uint32*3]
      • draw args [uint32*4]
      • sorting compute shader arguments (optional) [uint32*3]
    • Random color texture for creating random values in the shaders
  2. Kick off particle simulation:
    • Update a constant buffer holding emitter properties:
      • Emitted particle count in current frame
      • Emitter mesh vertex, index counts
      • Starting particle size, randomness, velocity, lifespan, and any other emitter property
    • Write indirect arguments of following compute passes:
      • Emitting compute shader thread group sizes
      • Simulation compute shader thread group sizes
      • Reset draw argument buffer
    • Copy last frame simulated particle count to current frame alive counter
  3. Emitting compute shader:
    • Bind mesh vertex/index buffers, random colors texture
    • Spawn as many threads as there are particles to emit
    • Initialize a new particle on a random point on the emitter mesh surface
    • Decrement dead list counter atomically while getting last value, this is our new dead particle index, read the dead list on that location to retrieve the particle index for the particle buffer
    • Write the new particle to the particle buffer on this index
    • Increment alive particle count, write particle index into alive list 1
  4. Simulation compute shader:
    • Each thread reads alive list 1, and updates particle properties if particle has life > 0, then writes it into alive list 2. Increment Draw argument buffer.
    • Otherwise, kill particle by incrementing dead list counter and writing particle index to dead list
    • Write particle distance squared to camera for sorting (optional)
    • Iterate through force fields in the scene and update the particle accordingly (optional)
    • Check collisions with depth buffer and bounce off particle (optional)
    • Update AABB by atomic min-maxing particle positions for additional culling steps (optional)
  5. Sorting compute shader (optional):
    • An algorithm like bitonic sorting maps well to GPU, can sort a large amount
    • Multiple dispatches required
    • Additional constant buffer updates might be required
  6. Swap alive lists:
    • Alive list 1 is the alive list from previous frame + emitted particles in this frame.
    • In this frame we might have killed off particles in the simulation step and written the new list into Alive list 2. This will be used when drawing, and input to the next frame emitting phase.
  7. Draw alive list 1:
    • After the swap, alive list 1 should contain only the alive particle indices in the current frame.
    • Draw only the current alive list count with DrawIndirect. Indirect arguments were written by the simulation compute shader.
  8. Kick back and profit 🙂
    • Use your new additional CPU time for something cool (until you move that to the GPU as well)

Note: for adding particles, you could use append-consume structured buffers, or counters written by atomic operations in the shader code. The append-consume buffers might include an additional performance optimization hidden from the user, which is GDS (global data share) on hardware that supports it. It is basically a small piece of fast-access on-chip memory visible to every thread group, so counter traffic doesn't have to go through RAM. I went with the atomic counter approach and haven't tested the performance difference yet. The append-consume buffers are not available in every API, which makes them less appealing.
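To make the atomic counter approach a bit more concrete, here is a minimal HLSL sketch of how the dead list pop and alive list push could look inside the emit kernel. The buffer names, the counter byte offsets and the Particle struct are only for illustration, they are not the exact ones from Wicked Engine:

// Hypothetical resource layout; names, offsets and the Particle struct are illustrative.
struct Particle
{
    float3 position;
    float3 velocity;
    float  life;
};

RWStructuredBuffer<Particle> particleBuffer : register(u0);
RWStructuredBuffer<uint>     deadList       : register(u1);
RWStructuredBuffer<uint>     aliveList1     : register(u2);
RWByteAddressBuffer          counterBuffer  : register(u3); // [0]=alive, [4]=dead, [8]=realEmitCount

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // Only emit as many particles as the real emit count allows:
    uint realEmitCount = counterBuffer.Load(8);
    if (DTid.x >= realEmitCount)
        return;

    // Pop an index from the dead list: atomically decrement the dead counter,
    // the returned pre-decrement value minus one is the dead list slot to read.
    uint deadCount;
    counterBuffer.InterlockedAdd(4, -1, deadCount);
    uint newParticleIndex = deadList[deadCount - 1];

    // Initialize the new particle (emitter mesh surface sampling omitted here):
    Particle p;
    p.position = float3(0, 0, 0);
    p.velocity = float3(0, 1, 0);
    p.life = 1.0f;
    particleBuffer[newParticleIndex] = p;

    // Push the particle onto alive list 1 by bumping the alive counter:
    uint aliveCount;
    counterBuffer.InterlockedAdd(0, 1, aliveCount);
    aliveList1[aliveCount] = newParticleIndex;
}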

gpuparticles3

The following features are new and nicely fit into the new GPU particle pipeline:

  • Sorting
    • I never bothered with particle sorting on the CPU. It was already kind of slow without it, so I got away with only sorting per-emitter, so that farther away emitters were drawn earlier. I decided to go with bitonic sorting because I could just pull an implementation from the web; it is a bit too involved, and I figured it would take too much time to implement and debug on my own. AMD has a really nice implementation available. Sorting becomes a required step if the particles are not additively blended, because threads now write them in arbitrary order.
  • Depth buffer collisions
    • This is a very nice feature of GPU particle systems, essentially free physics simulation for particles which are on the screen. It only involves reading the depth buffer in the simulation phase, checking whether the particle is behind it, and if so, reading the normal buffer (or reconstructing the normal from the depth buffer) and modulating the particle velocity by reflecting it with the surface normal (see the sketch after this list).
  • Force fields
    • This is completely possible with CPU particle systems as well, but now we can apply them to a much bigger simulation. In the simulation compute shader we can preload some force fields to LDS (local data share) for faster memory access.
  • Emit from skinned mesh
    • Mesh skinning is done on the GPU nowadays, so using the skinned meshes while emitting becomes trivial, with no additional cost whatsoever.
  • Async compute
    • I still haven't had a chance to try any async compute, but this seems like a nice candidate for it, because the simulation could very much be decoupled from rendering and it could lead to better utilization of GPU resources. Async compute is available in the modern low level graphics APIs like DX12, Vulkan and console specific APIs. It also requires hardware support, which is only available in the latest GPUs.
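Here is a minimal HLSL sketch of the depth buffer collision idea from the list above, as it could look inside the simulation kernel. The resource bindings, camera constants and the normal buffer usage are assumptions for illustration, and the depth comparison depends on your depth convention:

// Hypothetical bindings and constants for illustration:
Texture2D<float>  depthBuffer  : register(t0);
Texture2D<float3> normalBuffer : register(t1);

cbuffer CameraCB : register(b0)
{
    float4x4 viewProjection;
    float2   screenResolution;
};

// Called from the simulation kernel for one particle:
void CollideWithDepthBuffer(inout float3 position, inout float3 velocity, float elasticity)
{
    // Project the particle into clip space, then to screen coordinates:
    float4 clipPos = mul(float4(position, 1), viewProjection);
    if (clipPos.w <= 0)
        return; // behind the camera, nothing to test

    float3 ndc = clipPos.xyz / clipPos.w;
    float2 uv = ndc.xy * float2(0.5f, -0.5f) + 0.5f;
    if (any(uv < 0) || any(uv > 1))
        return; // off screen, no depth information available

    uint2 pixel = (uint2)(uv * screenResolution);
    float sceneDepth = depthBuffer[pixel];

    // If the particle is behind the depth buffer surface (flip the test for reversed depth):
    if (ndc.z > sceneDepth)
    {
        // Read (or reconstruct) the surface normal and bounce the velocity off it:
        float3 surfaceNormal = normalize(normalBuffer[pixel] * 2 - 1);
        velocity = reflect(velocity, surfaceNormal) * elasticity;
    }
}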

gpuparticles1

Debugging

Debugging a system which lives on the GPU is harder than on the CPU, but essential. We should ideally make use of a graphics debugger, but there are also opportunities to make our life easier by creating some utilities for this purpose. The thing that helped me most is writing out some data about the simulation to the screen. For this, we need to access the data which is resident on the GPU, which we can do as if we were downloading something from a remote machine. Using the DirectX 11 API, we create a resource of the same type and size as the one we want to download, with D3D11_USAGE_STAGING usage, no bind flags and READ CPU access. We then issue a copy into this buffer from the one we want to download by calling ID3D11DeviceContext::CopyResource, and read the contents by mapping it with the READ flag. As the buffer contents will only be available when the GPU has finished rendering up to that point, we can either introduce a CPU-GPU sync point and wait in place until the operation completes, or do the mapping a few frames later. In a debugging scenario, a sync point might be sufficient and simpler to implement, but we should avoid any such behaviour in the final version of the application.
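As an illustration, a debug-only readback with the D3D11 API could look roughly like this. It is a minimal sketch with error handling omitted and a blocking sync point, which is fine for debugging but not for shipping code; device, context and gpuBuffer are assumed to exist already:

// Debug-only readback of a GPU buffer (e.g. the particle counter buffer).
D3D11_BUFFER_DESC desc = {};
gpuBuffer->GetDesc(&desc);
desc.Usage = D3D11_USAGE_STAGING;            // CPU-readable staging copy
desc.BindFlags = 0;                          // no GPU bind points needed
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.MiscFlags = 0;

ID3D11Buffer* stagingBuffer = nullptr;
device->CreateBuffer(&desc, nullptr, &stagingBuffer);

// Queue a GPU copy into the staging resource:
context->CopyResource(stagingBuffer, gpuBuffer);

// Map with D3D11_MAP_READ; this stalls the CPU until the GPU has finished the copy.
D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
{
    // Interpret mapped.pData according to the buffer layout, e.g. four uint counters:
    const uint32_t* counters = reinterpret_cast<const uint32_t*>(mapped.pData);
    // ... print / display counters[0..3] here ...
    context->Unmap(stagingBuffer, 0);
}
stagingBuffer->Release();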

Drawing

Drawing billboards would seem like a nice place to use geometry shaders. Unfortunately, geometry shaders introduce inefficiencies into the graphics pipeline for various reasons: primitives need to be traversed and written to memory serially, and some architectures even go as far as writing the GS output to system memory. The option of my choice is just leaving out the geometry shader and doing the billboard expansion in the vertex shader. For this, we must spawn the VS with a triangle list topology and a vertex count of particleCount * 6, and calculate the particle index and billboard vertex index from the SV_VertexID system-value semantic. Like this:

static const float3 BILLBOARD[] = {
    float3(-1, -1, 0), // 0
    float3(1, -1, 0),  // 1
    float3(-1, 1, 0),  // 2
    float3(-1, 1, 0),  // 3
    float3(1, -1, 0),  // 4
    float3(1, 1, 0),   // 5
};

VertextoPixel main(uint fakeIndex : SV_VERTEXID)
{
    uint vertexID = fakeIndex % 6;
    uint instanceID = fakeIndex / 6;
    Particle particle = particleBuffer[aliveList[instanceID]];
    float3 quadPos = BILLBOARD[vertexID];
    // …
}

Additionally, for better drawing performance, you should use indexed drawing with 4 vertices per quad, but that way the two index lists will be six times the size each, so bandwidth will increase for the simulation. Maybe it is still worth it, I need to compare performance results.

Conclusion:

There are many possibilities to extend this system, because compute shaders make it very flexible. I am overall happy with how this turned out. Given that my previous particle system was quite simplistic, porting all the features was not very hard and I didn't have to make any compromises. The new system frees up CPU resources, which are more valuable for gameplay logic and other systems that are interconnected. Particles are usually completely decoupled from the rest of the engine, so they are an ideal candidate for running remotely on the GPU.

You can check the source code of my implementation of GPU-based particles in Wicked Engine:

Feel free to rip off any source code from there! Thank you for reading!

Inspiration from:

Compute-based GPU particles by Gareth Thomas

Which blend state for me?

think

If you are familiar with creating graphics applications, you are probably somewhat familiar with the different blending states. If you are like me, then you were not overly confident in using them, and got some basic ones copy-pasted from the web. Maybe you got away with simple alpha blending and additive states, heard of premultiplied alpha somewhere, but didn't really care as long as it looked decent enough at the time. Surely, there is a lot of much more interesting stuff waiting to be implemented. Then later you realize that something looks off with an alpha blended sprite somewhere. You correct it with a quick fix and forget about it. A week later, you want to play with some particle systems, but something is wrong there: the blending doesn't look good anymore because of a dirty tweak you made earlier. Also, your GUI layer has been displaying the wrong color the whole time, but just subtly enough not to notice. There are just so many opportunities to screw up your blending states without noticing immediately, and correcting the mistakes can quickly turn into a big headache. Here I want to give some practical examples and explanations of different use cases, for techniques mainly used in 3D rendering engines.

First thing is rendering alpha blended sprites on top of each other, just to the back buffer immediately. We need a regular alpha blending renderstate for that which does this:

dst.rgb = src.rgb * src.a + dst.rgb * (1 - src.a)

Here, dst means the resulting color in the rendertarget (which is now the backbuffer). Src is the color of the sprite which the pixel shader returns. Our colors are just standard 32 bit rgba in the range [0, 1] here. With this, we have successfully calculated a good output color which we can just write as-is to the back buffer. For this, we don’t care what the alpha output is, because no further composition will be happening. Here is the corresponding state description for DirectX 11:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;
desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOp = D3D11_BLEND_OP_ADD;

The tricky part comes when we want to draw our alpha blended sprites to separate layers which will be composited later on. In that case we also have to be careful what alpha value we write out. Take a simple scenario for example, in which you render a sprite with alpha blending to a render target, and later you render your rendertarget to your backbuffer with the same alpha blending. For this, we want to accumulate alpha values, so just add them:

dst.a = src.a + dst.a

Which is equivalent to the following blend state in DirectX 11 (just append to the previous snippet):

desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Accumulating alpha seems like a good fit for a 32-bit render target, as values will be clamped to one, so opacity will increase with overlapping sprites, but colors won't be oversaturated. Now try blending the render target layer onto the back buffer. There will be an error, which might not be obvious at first (which is the worst kind of error). Let me show you:

blend

On the black background you might not even notice the error at first, but on the white background it becomes apparent at once (in this case, “background” means our backbuffer). For different images, the problem may show up in different scenarios. This could be a challenge to overcome if you have already made a lot of assets, and maybe even compensated for the error somehow without addressing its source: the blend operation is not correct anymore! First we blended the sprite onto the layer, which is still correct, but then we blend the layer with the same operation again, so alpha messes with the colors twice. There is a correct solution to the problem: the premultiplied alpha blend operation, which is this:

dst.rgb = src.rgb + dst.rgb * (1 - src.a)

dst.a = src.a + dst.a

Create it in DX11 like this:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_ONE;
desc.DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOp = D3D11_BLEND_OP_ADD;
desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

The only change is that we do not multiply with the source alpha any more. Our problem is fixed: we can keep using regular alpha blending on the sprites, but use premultiplied blending for the layers. Simple, right? But do not forget about premultiplied alpha just yet, it can help us out with more problems as well.

I had multiple problems with rendering particle systems to off-screen buffers. Off-screen rendering is used for soft particles, and it can also help with performance if the buffer is of a smaller resolution. One of the problems was the above mentioned faulty alpha blending of particles (the particles are the sprites, the layer is the off-screen buffer when mapped to the previous example). The other issue is that I also want to render additive particles to the same render target and blend the whole thing later in a single pass. This is an additive blending state:

dst.rgb = src.rgb * src.a + dst.rgb

dst.a = dst.a

Which corresponds to this state in DX11:

D3D11_RENDER_TARGET_BLEND_DESC desc;
desc.BlendEnable = TRUE;
desc.SrcBlend = D3D11_BLEND_SRC_ALPHA;
desc.DestBlend = D3D11_BLEND_ONE;
desc.BlendOp = D3D11_BLEND_OP_ADD;
desc.SrcBlendAlpha = D3D11_BLEND_ZERO;
desc.DestBlendAlpha = D3D11_BLEND_ONE;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

Notice that the src output doesn't contribute to alpha, but that is no problem at all, because the premultiplied layer blending will still take the layer color and just add the full destination color to it. This is only possible if our layer is configured for premultiplied blending, otherwise the additive particles would just disappear upon blending. We can also have our particle textures themselves in a premultiplied texture format (where the texture colors are already multiplied by alpha) with the corresponding blend state, and blending that onto the layer just works. When I wasn't familiar with this, I used one render target for regular alpha blended and premultiplied texture format particles and another one for additive ones, what a waste! We can see that premultiplied blending is already very flexible, so keep it in mind, because it can save the day on many occasions. It also makes a huge difference in mipmap generation, see this neat article from Nvidia!

Side note: premultiplied alpha blending was also widely used because it has better blending performance. Think of it as precomputing the alpha blend factor and storing it inside the texture. The performance reasons are probably not so apparent today.

I had another problem with particle system rendering, because they were rendered to an HDR floating point target, which doesn't clamp the alpha values. Consider the case when a particle's alpha value is bigger than one for whatever reason and it is blended with regular alpha blending: the term dst.rgb * (1 - src.a) is now of course producing negative values. This is easy to overcome by just saturating the alpha output of the pixel shader, done! The other problem is this: dst.a = src.a + dst.a can still result in larger than one alpha values, but this only becomes a problem later on, when blending the layer as premultiplied alpha. We would need to saturate the (1 - src.a) term in the blending state, but we cannot, there is no such state. There is a D3D11_BLEND_SRC_ALPHA_SAT blend value, but there is no inverse counterpart of it. The workaround that I am using is to modify the particle alpha blending state to accumulate alpha a bit differently:

dst.a = src.a + dst.a * (1 - src.a)

In DX11 terms:

desc.SrcBlendAlpha = D3D11_BLEND_ONE;
desc.DestBlendAlpha = D3D11_BLEND_INV_SRC_ALPHA;
desc.BlendOpAlpha = D3D11_BLEND_OP_ADD;

This accumulation method is probably not perfect, but works really well in practice:

particleblend

That's it, I think these are the three most important blending modes; most effects can be achieved with a combination of these. Just always keep an eye on your blend state creation and be very explicit about it, that is the way to avoid many bugs down the road. If you were like me and haven't paid much attention to these until now, this is the best time to revisit them, because tracking down the associated errors becomes a hard journey later. Thanks for reading!

Forward+ decal rendering

decals

Drawing decals in deferred renderers is quite simple, straightforward and efficient: just render boxes like you render the lights, read the gbuffer in the pixel shader, project onto the surface, then sample and blend the decal texture. The light evaluation then already computes lighting for the decaled surfaces. In traditional forward rendering pipelines, this is not so trivial. It is usually done by cutting out the geometry under the decal, creating a new mesh from it with projected texture coordinates and rendering it for all lights, additively. Apart from the obvious increased draw call count and fillrate consumption, there is even potential for z-fighting artifacts. While moving to tile-based forward rendering (Forward+), we can surely think of something more high-tech.

We want to avoid additional geometry creation and an increased draw call count, while keeping the lighting computation constant. But in addition, with this new technique we can even trivially support modification of surface properties, creating decals which can modify the surface normal, roughness, metalness, emissive, etc., or even do parallax occlusion mapping. We can also apply decals to transparent surfaces easily! This article will describe the outline of the technique without source code. You can look at my implementation however, here: culling and sorting shader; blending evaluation shader.

In Forward+ we have a light culling step and a light evaluation step separately. The decals will be inserted into both passes. A culling compute shader iterates through a global light array and decides for each screen space tile which lights are inside, and adds them to a global list (in the case of tiled deferred, it just adds them to a local list and evaluates lighting there and then). For adding decals to the culling, we need to extend the light descriptor structure to be able to hold decal information, and add functions to the shader to cull oriented bounding boxes (OBBs). We can implement OBB culling with a coarse AABB test: transform the AABB of the tile by the decal OBB's inverse matrix (while keeping min-max up to date) and test the resulting AABB against a unit AABB. This is achieved by determining the 8 corner points of the tile AABB, transforming each by the inverse decal OBB, then determining the min and max corner points of the resulting points.
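A minimal HLSL sketch of that coarse test could look like the following; the tile AABB representation, the matrix convention and the unit extent are simplified assumptions, not the exact Wicked Engine code:

// Simplified structures for illustration:
struct AABB
{
    float3 center;
    float3 extents; // half size
};

// Test a screen tile's AABB against a decal OBB by transforming the tile corners
// into the decal's local (unit box) space and doing an AABB-vs-AABB test there.
bool IntersectDecalOBB(AABB tileAABB, float4x4 decalWorldToUnit)
{
    float3 minCorner = 1000000;
    float3 maxCorner = -1000000;

    [unroll]
    for (uint i = 0; i < 8; ++i)
    {
        // Reconstruct the i-th corner of the tile AABB:
        float3 cornerSign = float3((i & 1) ? 1 : -1, (i & 2) ? 1 : -1, (i & 4) ? 1 : -1);
        float3 corner = tileAABB.center + cornerSign * tileAABB.extents;

        // Transform into decal space and keep the min-max up to date:
        float3 local = mul(float4(corner, 1), decalWorldToUnit).xyz;
        minCorner = min(minCorner, local);
        maxCorner = max(maxCorner, local);
    }

    // Assuming the decal occupies the [-1, 1] unit box in its own space;
    // adjust the extent to whatever convention your decal matrix uses.
    const float unitExtent = 1.0f;
    return all(minCorner <= unitExtent) && all(maxCorner >= -unitExtent);
}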

tiled_decals

Rendering the decals takes place in the object shaders, where we also evaluate the lighting. If the decals can modify surface parameters, like normals, it is essential that we render the decals before the lights. For that, we must have a sorted decal list. We cannot avoid sorting the decals anyway, as I have found out the hard way. Because the culling is performed in parallel, the decals can be added to the tile in arbitrary order, but we have a strict order when blending the decals, which is the order we placed them onto the surface. If we don't sort, it can lead to severe flickering artifacts when there are overlapping decals. Thankfully the sorting is straightforward, easily parallelized and can be done entirely in LDS (Local Data Share memory). I have gotten this piece of code from an AMD presentation (a bitonic sort implementation in LDS).

The easiest way is to sort the decals in the CS so that the bottom decal is first and the top is last (bottom-to-top sorting). This way, we can do regular alpha blending (which is a simple lerp in the shader) easily. Though we can do better: this way we sample all of the decals, even if the bottom ones are completely covered by decals placed on top. Instead we should sort the opposite way, so that first we evaluate the top ones, then the decals underneath, but only until the alpha accumulation reaches one. We can skip the rest. The blending equation also needs to be modified for this. The same idea is presented in the above mentioned AMD presentation for tile based particle systems. The modified blending equation looks like this:

color = (invDestA * srcA * srcCol) + destCol

alpha = srcA + (invSrcA * destA)
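In shader code, the top-to-bottom accumulation with the early exit could look like this minimal sketch; the decal projection and sampling are reduced to a hypothetical SampleDecal placeholder:

// Hypothetical helper: projects the surface point into decal 'decalIndex' and samples
// its texture; returns rgb color and opacity in a (zero outside the decal box).
// The real projection and atlas sampling are omitted here.
float4 SampleDecal(uint decalIndex, float3 worldPos)
{
    return 0; // placeholder
}

float4 BlendDecals(uint decalCount, float3 worldPos)
{
    float4 accum = 0; // rgb = accumulated color, a = coverage

    // Decals are sorted top-to-bottom, so the first samples are the ones on top;
    // stop as soon as the coverage is saturated:
    for (uint i = 0; i < decalCount && accum.a < 1; ++i)
    {
        float4 decal = SampleDecal(i, worldPos);

        // color = (invDestA * srcA * srcCol) + destCol
        accum.rgb += (1 - accum.a) * decal.a * decal.rgb;
        // alpha = srcA + (invSrcA * destA)
        accum.a = decal.a + (1 - decal.a) * accum.a;
    }
    return accum;
}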

This method can save us much rendering time when multiple decals are overlapping. But this can result in different output when we have emissive decals for example. In the bottom-to-top blending, emissive decals will always be visible because the contribution is added to the light buffer, but the top-to-bottom sorting (and skip) algorithm will skip the decals which are completely covered. I think this is “better” behaviour overall but on a subjective basis of course.

The nice thing about this technique is that we can trivially modify surface properties if we just sample all of our decals before all of the lights. Take this for example: we want to modify the normal of the surface with the decal normal map. We already have our normal vector in our object shader, so when we get to the decals, just blend it in shader code with the decal normal texture, without the need for any packing/unpacking and tricky blending of g-buffers (à la deferred). The light evaluation which comes after it "just works" with the new surface normal without any modification at all.

Maybe you have noticed that we need to do the decal evaluation in dynamically branching code, which means that we must give up default mip-mapping support. This is because, from the compiler's standpoint, we might well not be evaluating the same decals in neighbouring pixels, but we need those helper pixels for correct screen space derivatives. In our case, when the tiles are a multiple of two pixels in size (I am using 16×16 tiles), we are coherent for our helper pixels, but the compiler unfortunately doesn't know that. I haven't yet found a satisfying way to overcome this problem. I experimented with linear distance/screen space size based mip selection, but found them unsatisfying for my purposes (they might be OK for a single game/camera type though).

 

Update: Thanks to MJP, I learned a new technique for obtaining nice mip mapping results: we just need to take the derivatives of the surface world position, transform them by the decal projection matrix (leaving out the translation), and we have the decal derivatives that we can feed into Texture2D::SampleGrad, for example. An additional note is that when using a texture atlas for the decals, we need to take the atlas texture coordinate transformation into consideration. So, just multiply the decal derivatives by the atlas transformation's scaling part. Cool technique!
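A rough HLSL sketch of that idea could look like this; the names are illustrative and decalProjection stands for the world-to-decal-space matrix whose rotation/scale part is applied to the derivatives:

// Mip mapping for decals inside dynamically branching code, following the idea above.
// P is the surface world position, decalProjection transforms world space to decal texture space,
// atlasScale is the scaling part of the atlas transformation.
float4 SampleDecalWithDerivatives(Texture2D decalAtlas, SamplerState samplerLinear,
                                  float4x4 decalProjection, float2 atlasScale,
                                  float3 P, float2 decalUV)
{
    // World-space derivatives of the surface position:
    float3 dPdx = ddx_coarse(P);
    float3 dPdy = ddy_coarse(P);

    // Rotate/scale them into decal space (translation is irrelevant for derivatives),
    // then apply the atlas scaling so the gradients match the atlas UVs:
    float2 decalDX = mul(dPdx, (float3x3)decalProjection).xy * atlasScale;
    float2 decalDY = mul(dPdy, (float3x3)decalProjection).xy * atlasScale;

    return decalAtlas.SampleGrad(samplerLinear, decalUV, decalDX, decalDY);
}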

We also need to somehow dynamically support different decal textures in the same object shader. A texture atlas comes in handy in this case, or bindless textures are also an option in newer APIs.

As we have added decal support to the tiled light array structure, the structure is probably getting bloated, which makes it less cache efficient, because most lights don't need a decal inverse matrix (for projection), texture atlas offsets, etc. For this, the decals could probably get their own structure and a different array, or everything could be tightly packed in a raw buffer (ByteAddressBuffer in DX11). I need to experiment with this.

Conclusion:

This technique is a clear upgrade from traditional forward rendered decals, but comparing it with deferred decals is not a trivial matter. First, we can certainly optimize deferred decals in several ways; I have already been toying with the idea of using Rasterizer Ordered Views to optimize the blending in a similar way and eliminate overdraw. Secondly, we have increased branching and register pressure in the forward rendering pass, while rasterization of deferred decals is a much more lightweight shader which can be better parallelized when the overdraw is not so apparent. In that case, we can get away with rendering many more deferred decals than tiled decals. The tile-based approach gets much better with increased overdraw, because of the "skip the occluded" behaviour as well as the reduced bandwidth cost of not having to sample a G-buffer for each decal. Forgive me for not providing actual profiling data at the moment; this article intends to be merely a brain dump, but I also hope somewhat inspirational.

XBOX One and UWP rant

I have experimented with the Universal Windows Platform (UWP) for some time now. I mainly maintain the UWP compatibility for Wicked Engine, a game engine written by me in C++, using the DirectX 11 API for rendering. It is a perfect candidate for a UWP application, because no difficult porting is needed for new platforms; supporting desktop, phone, tablet, augmented reality devices and Xbox One should be trivial (at least in theory).

The game engine originally only targeted the traditional Windows desktop environment; when I found out about the Universal Windows Platform, I knew I had to try it out. It turned out to be quite easy. I just needed to replace a few calls to the original WinAPI with UWP variants, like getting the screen resolution, the window handling and keyboard input. We also need a UWP compliant application entry point, you can see an example in the Wicked Engine Demos project. All the rest remains the same, like rendering code, joystick input, sound system, physics engine and the scripting interface. Only for phone/tablet you probably need to handle the touch screen, but UWP comes with a nice API for it.

I have been testing the Windows Phone and UWP desktop builds, and while I don't think they are very relevant as of now, it still feels nice to have support for them. Recently, I had the chance to acquire an Xbox One devkit for a short time, so I jumped on it, set up some demos in my engine and built it as a UWP app. The whole process was incredibly easy and straightforward, no change needed to be made in the code; the hardest part was launching the app on a "remote machine", which is the Xbox devkit itself. The application contained some lightweight scenes, like displaying a rotating image, a static mesh and an animated mesh, and some heavy rendering tasks, like an outdoor scene with multiple lights, shadows, instanced trees, particle systems, water, reflections, etc. The performance was abysmal in all of them. It really caught me off guard, because everything performed much better on my phone. It wasn't even rendering most of the things in real time: one frame per second for the complex scene, and even the image rendering could barely do 50 FPS. With a little investigation, I found out that I was creating my graphics device with the WARP rasterizer, which is a software renderer. The minimum requirement of the engine is D3D11, feature level 11.0 hardware, because it makes use of advanced rendering techniques which need support for structured buffers, unordered access resources, compute shaders, tessellation and the like from the graphics driver. It could only create an appropriate graphics device in software mode, which is very slow, because then the entire rendering runs on the CPU.

But can’t the XBOX One make use of D3D11 and even D3D12? The answer is, it can, but not with UWP apps. For developing high performance 3D applications, like games, you must be a registered XBOX developer, obtain the XBOX SDK (XDK) and build against the new platform that comes with it. Here’s what they say about it in the “known issues” of UWP development article on MSDN:

“UWP on Xbox One supports DirectX 11 Feature Level 10. DirectX 12 is not supported at this time.

Xbox One, like all traditional games consoles, is a specialized piece of hardware that requires a specific SDK to access its full potential. If you are working on a game that requires access to the maximum potential of the Xbox One hardware, you can register with the ID@XBOX program to get access to that SDK, which includes DirectX 12 support.”

Apart from that, the developer also only gets access to limited system resources with the UWP platform, like a smaller amount of allowed memory allocation, reduced CPU speed and fewer threads. PIX, the graphics debugger which comes with the XDK, also didn't recognize my application, but maybe that was because of the WARP device. I could capture a frame with the Visual Studio graphics debugger, however.

This is just a really big letdown for me, because I was really interested in trying out the performance of my engine on a game console. I have no chance now, because I am not a registered developer, so there is no way of trying it out this time. Maybe I will look into the developer program, but I suspect that for a successful application they need an actual company, a business plan and probably much more.

Thanks for reading!

Skinning in a Compute Shader

skinning

Recently I moved my mesh skinning implementation from a stream-out geometry shader to a compute shader. One reason for this was the ugly stream-out API, which I wanted to leave behind, but the more important reason was that this could come with several benefits.

First, compared to traditional skinning in a vertex shader, the render pipeline can be simplified, because we only perform skinning once for each mesh instead of in each render pass. So when we render our animated models multiple times, for shadow maps, Z-prepass, lighting pass, etc., we use regular vertex shaders for those passes, with the vertex buffer swapped out for the pre-skinned vertex buffer. We also avoid a lot of render state setup, like binding bone matrix buffers for each render pass. But this can be done with a stream-out geometry shader as well.

The compute shader approach has some other nice features compared to the first point. The render pipeline of Wicked Engine requires the creation of a screen space velocity buffer. For that, we need our previous frame animated vertex positions. If we don't do it in a compute shader, we probably need to skin each vertex with the previous frame's bone transforms in the current frame to get the velocity of the vertex, which is currentPos - prevPos (if we have deinterleaved vertex buffers, we could avoid this by swapping vertex position buffers). In a compute shader, however, this becomes quite straightforward: perform skinning only with the current frame's bone matrices, but before writing out the skinned vertex to the buffer, load the previous value of the position, and that is your previous frame vertex position. Then write it out to the buffer at the end.

In a compute shader, it is the developer who assigns the workload across threads, instead of relying on the default vertex shader thread invocations. Also, the vertex shader stage has strict ordering specifications, because vertices must be written out in the exact order they arrived. A compute shader can just write into the skinned vertex buffer in any order when it is finished. That said, it is also the developer's responsibility to avoid write conflicts. Thankfully, that is quite trivial here, because we are writing a linear array of data.

In compute shaders we can also make use of LDS memory to reduce memory reads. This can be implemented so that each thread in a group loads one bone from main memory and stores it in LDS. The skinning computation then reads the bone data from LDS, and because each vertex now reads its 4 bones from LDS instead of VRAM, it has the potential for a speedup. I have written a blog post about this.
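As a minimal sketch, the LDS preload could look like this, assuming at most 256 bones per mesh so that a single 256-thread group can load them all (this is an assumption for illustration, not a general solution):

struct Bone
{
    float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer : register(t0);

static const uint SKINNING_BLOCKSIZE = 256;
groupshared float4x4 lds_bonePoses[SKINNING_BLOCKSIZE];

[numthreads(SKINNING_BLOCKSIZE, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // Each thread loads one bone matrix from VRAM into LDS
    // (assuming boneCount <= 256; out-of-range structured buffer reads return zero on D3D11):
    lds_bonePoses[groupIndex] = boneBuffer[groupIndex].pose;
    GroupMemoryBarrierWithGroupSync();

    // ... load the vertex data for DTid.x as in the full shader below ...
    // The skinning loop then reads lds_bonePoses[boneIndex] instead of boneBuffer[boneIndex],
    // so the four bone fetches per vertex hit LDS rather than main memory.
}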

Another nice feature is the possibility to leverage async compute in the newer graphics APIs like DirectX 12, Vulkan or the PlayStation 4 graphics API. I don't have experience with it yet, but I imagine it would be more taxing on memory, because we would probably need to double buffer the skinned vertex buffers.

Another optimization also becomes possible with this: if the performance is bottlenecked by skinning in our scene, we can skip skinning distant meshes every other frame or so, as a kind of level of detail technique for skinning.

The downside is that this technique comes with increased memory requirements, because we must write into global memory to provide the data up front for the following render passes. We also give up the fast on-chip memory of the GPU (the memory used for vertex shader to pixel shader parameters) for storing the skinned values.

Here is my shader implementation for skinning a mesh in a compute shader:


struct Bone
{
    float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer;

ByteAddressBuffer vertexBuffer_POS; // T-Pose pos
ByteAddressBuffer vertexBuffer_NOR; // T-Pose normal
ByteAddressBuffer vertexBuffer_WEI; // bone weights
ByteAddressBuffer vertexBuffer_BON; // bone indices

RWByteAddressBuffer streamoutBuffer_POS; // skinned pos
RWByteAddressBuffer streamoutBuffer_NOR; // skinned normal
RWByteAddressBuffer streamoutBuffer_PRE; // previous frame skinned pos

inline void Skinning(inout float4 pos, inout float4 nor, in float4 inBon, in float4 inWei)
{
    float4 p = 0, pp = 0;
    float3 n = 0;
    float4x4 m;
    float3x3 m3;
    float weisum = 0;

    // force loop to reduce register pressure
    // though this way we can not interleave TEX - ALU operations
    [loop]
    for (uint i = 0; ((i < 4) && (weisum < 1.0f)); ++i)
    {
        m = boneBuffer[(uint)inBon[i]].pose;
        m3 = (float3x3)m;

        p += mul(float4(pos.xyz, 1), m) * inWei[i];
        n += mul(nor.xyz, m3) * inWei[i];

        weisum += inWei[i];
    }

    bool w = any(inWei);
    pos.xyz = w ? p.xyz : pos.xyz;
    nor.xyz = w ? n : nor.xyz;
}

[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    const uint fetchAddress = DTid.x * 16; // stride is 16 bytes for each vertex buffer now...

    uint4 pos_u = vertexBuffer_POS.Load4(fetchAddress);
    uint4 nor_u = vertexBuffer_NOR.Load4(fetchAddress);
    uint4 wei_u = vertexBuffer_WEI.Load4(fetchAddress);
    uint4 bon_u = vertexBuffer_BON.Load4(fetchAddress);

    float4 pos = asfloat(pos_u);
    float4 nor = asfloat(nor_u);
    float4 wei = asfloat(wei_u);
    float4 bon = asfloat(bon_u);

    Skinning(pos, nor, bon, wei);

    pos_u = asuint(pos);
    nor_u = asuint(nor);

    // copy prev frame current pos to current frame prev pos
    streamoutBuffer_PRE.Store4(fetchAddress, streamoutBuffer_POS.Load4(fetchAddress));

    // write out skinned props:
    streamoutBuffer_POS.Store4(fetchAddress, pos_u);
    streamoutBuffer_NOR.Store4(fetchAddress, nor_u);
}

Oh god I hate this wordpress code editor… (maybe I just can’t use it properly)

As you can see, quite simple code, I just call this compute shader with something like this:

Dispatch( ceil(mesh.vertices.getCount() / 256.0f), 1, 1);

These vertex buffers are not packed yet, which is quite inefficient. Of course, positions could probably be stored as 16-bit float3s (but then you must animate in local space), normals can be packed nicely into 32-bit uints, and bone weights and indices should be packed into a single buffer and into uints as well. If you are using raw buffers (ByteAddressBuffer in HLSL), then you have to do the type conversion yourself. You can also use typed buffers, but performance may be diminished. You can see an example of these optimizations, with manual type conversion of compressed vertex streams, in my Wicked Engine repo.
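For example, packing a normal into a single 32-bit uint with 8 bits per component could look like this sketch; it is just the straightforward 8-8-8 variant, more precise encodings exist:

// Pack a unit normal into a 32-bit uint (8 bits per component, one byte spare).
uint PackNormal(float3 n)
{
    // Map [-1, 1] to [0, 255]:
    uint3 u = (uint3)round(saturate(n * 0.5f + 0.5f) * 255.0f);
    return u.x | (u.y << 8) | (u.z << 16);
}

float3 UnpackNormal(uint packed)
{
    float3 n;
    n.x = (float)(packed & 0xFF);
    n.y = (float)((packed >> 8) & 0xFF);
    n.z = (float)((packed >> 16) & 0xFF);
    // Map back from [0, 255] to [-1, 1] and renormalize to fight quantization:
    return normalize(n / 255.0f * 2.0f - 1.0f);
}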

I have been using precomputed skinning in Wicked Engine for a long time now, so I can't compare with the vertex shader approach, but it is definitely not worse than the stream-out technique. I can imagine that for some titles it might not be worth it to store additional vertex buffers in VRAM and give up on-chip memory for the skinning results. However, this technique could be a candidate in optimization scenarios, because it is easy to implement, and I think it is also easier to maintain, because we can avoid the shader permutations for skinned and non-skinned models.

Thanks for reading!

Area Lights

arealight

I am trying to get back into blogging. I thought writing about implementing area light rendering might help me with that.

If you are interested in the full source code, pull my implementation from the Wicked Engine lighting shader. I won't post it here, because I'd rather just talk about it.

A 2014 SIGGRAPH presentation from Frostbite caught my attention for showcasing their research on real time area light rendering. When learning graphics programming from various tutorials, there are explanations for punctual light source rendering, like point, spot and directional lights. Even most games get away with using these simplistic light sources.

For rendering area lights, we need much more complicated lighting equations, and they place higher performance requirements on our shaders. Luckily, the above mentioned presentation came with a paper containing all the shaders for the diffuse light equations for spherical, disc, rectangular and tube light sources.

The code for specular lighting for these types of lights was not included in that paper, but it mentioned the "representative point method". What this technique essentially does is keep the specular calculation but change the light vector. The light vector was the vector pointing from the light position to the surface position; but for our lights, we are not interested in the reflection between the light's center and the surface, but between the light "mesh" and the surface.

Representative point method

If we modify the light vector to point from the surface to the closest point on the light mesh to the reflection vector, then we can keep using our specular BRDF equation and we will get a nice result; the specular highlight will be in the shape of the light mesh (or somewhere close to it). It is important to note that this is not a physically accurate model, but it is something nice looking which is still performant in real time.

My first intuition was just that we could trace the mesh with the reflection ray. Then our light vector (L) is the vector from the surface point (P) to the intersection point (I), so L=I-P. The problem is, what if there is no intersection? Then we won’t have a light vector to feed into our specular brdf. This way we are only getting hard cutoff reflections, and surface roughness won’t work because the reflections can’t be properly “blurred” on the edges where there is no trace hit.

The correct approach is that we have to find the closest point on the mesh to the reflection ray. If the trace succeeds, then our closest point is the hit point; if not, then we have to "rotate" the light vector to successfully trace the mesh. We don't actually rotate, we just find the closest point (C), and so our new light vector is: L=C-P.

See the image below (V = view vector, R = reflection vector):

representative_point

For all four of our different light types, we have to come up with the code to find the closest point.

  • Sphere light:
    • This one is simple: first, calculate the real reflection vector (R) and the old light vector (L). Additional symbols: surface normal vector (N) and view vector (V).

      R = reflect(V, N);

      centerToRay = dot(L, R) * R - L;

      closestPoint = L + centerToRay * saturate(lightRadius / length(centerToRay));

  • Disc light:
    • The idea is, first trace the disc plane with the reflection vector, then just calculate the closest point to the sphere from the plane intersection point like for the sphere light type. Tracing a plane is trivial:

      distanceToPlane = dot(planeNormal, (planeOrigin - rayOrigin) / dot(planeNormal, rayDirection));

      planeIntersectionPoint = rayOrigin + rayDirection * distanceToPlane;

  • Rectangle light:
    • Now this is a bit more complicated. The algorithm I use consists of two paths: The first path is when the reflection ray could trace the rectangle. The second path is, when the trace didn’t succeed. In that case, we need to find the intersection with the plane of the rectangle, then find the closest point on one of the four edges of the rectangle to the plane intersection point.
    • For tracing the rectangle, I trace the two triangles that make up the rect and take the correct intersection if it exists. Tracing a triangle involves tracing the triangle plane, then deciding if we are inside the triangle. A, B, C are the triangle corner points.

      planeNormal = normalize(cross(B - A, C - B));

      planeOrigin = A;

      t = Trace_Plane(rayOrigin, rayDirection, planeOrigin, planeNormal);

      p = rayOrigin + rayDirection * t;

      N1 = normalize(cross(B - A, p - B));

      N2 = normalize(cross(C - B, p - C));

      N3 = normalize(cross(A - C, p - A));

      d0 = dot(N1, N2);

      d1 = dot(N2, N3);

      intersects = (d0 > 0.99) AND (d1 > 0.99);

    • The other algorithm is finding the closest point on line segment from point. A and B are the line segment endpoints. C is the point on the plane.

      AB = B - A;

      t = dot(C - A, AB) / dot(AB, AB);

      closestPointOnSegment = A + saturate(t) * AB;

  • Tube light:
    • First, we should calculate the closest point on the tube line segment to R. Then just place a sphere at that point and do as we did for the sphere light (that is, calculate the closest point on the sphere to the reflection ray R). Every piece has already been described up to this point, so all that needs to be done is to put them together; see the sketch after this list.
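Putting those pieces together for the tube light, a sketch could look like this; P is the surface position, R the reflection direction, A and B the tube endpoints, and the closed form for the closest point on the segment to the ray is one common formulation, not necessarily the exact code in Wicked Engine:

// Representative point for a tube light, combining the segment and sphere cases above.
float3 TubeLightVector(float3 P, float3 R, float3 A, float3 B, float radius)
{
    // Work relative to the surface point:
    float3 L0 = A - P;
    float3 L1 = B - P;
    float3 Ld = L1 - L0;

    // Closest point on the segment to the reflection ray (one common closed form):
    float RdotLd = dot(R, Ld);
    float t = (dot(R, L0) * RdotLd - dot(L0, Ld)) / (dot(Ld, Ld) - RdotLd * RdotLd);
    float3 L = L0 + saturate(t) * Ld;

    // Now treat that point as a sphere of the tube radius, exactly like the sphere light:
    float3 centerToRay = dot(L, R) * R - L;
    float3 closestPoint = L + centerToRay * saturate(radius / length(centerToRay));

    // Since we worked relative to P, this is already the new (unnormalized) light vector:
    return closestPoint;
}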

So what to do when you have the closest point on the light surface? You have to convert it to the new light vector: newLightVector = closestPoint - surfacePos.

When you have your new light vector, you can feed it to the specular brdf function and in the end you will get a nice specular highlight!

Shadows

With regular shadow mapping techniques, we can do shadows for area lights as well. The results are again not accurate, but they get the job done. In Wicked Engine, I am only doing regular cube map shadows for area lights, like I would for point lights. I can not say I am happy with them, especially for long tube lights for example. In another engine, however, I have been experimenting with dual paraboloid shadow mapping for point lights. I recommend a single paraboloid shadow map for the disc and rectangle area lights, oriented in the light facing direction. These are better in my opinion than regular perspective shadow maps, because those distort very much for high fields of view (these light types would require a FOV of nearly 180 degrees).

For the sphere and tube light types I still recommend cubemap shadows.

Original sources:

Voxel-based Global Illumination

vxgi0

People are always asking me about the voxel global illumination technique in Wicked Engine, so I thought writing a blog about it would be a good idea.

There are several use cases of a voxel data structure. One interesting application is using it to calculate global illumination. There are a couple of techniques for that, too. I have chosen the voxel cone tracing approach, because I found it the most flexible one for dynamic scenes, but CryEngine for example, uses Light propagation volumes instead with a sparse voxel octree which has smaller memory footprint. The cone tracing technique works best with a regular voxel grid because we perform ray-marching against the data like with screen space reflections for example. A regular voxel grid consumes more memory, but it is faster to create (voxelize), and more cache efficient to traverse (ray-march).

So let's break this technique down into pieces. I have to disclose this at the beginning: we can do everything in this technique in real time if we do everything on the GPU. First, we have our scene model with polygonal meshes. We need to convert it to a voxel representation: the voxel structure is a 3D texture which holds the direct illumination of the voxelized geometries in each texel. There is an optional step here which I describe later. Once we have this, we can pre-integrate it by creating a mipmap chain for the resource. This is essential for cone tracing, because we want to ray-march the texture with quadrilinear interpolation (sampling a 3D texture with min-mag-mip-linear filtering). We can then retrieve the bounced direct illumination in a final screen space cone tracing pass. The additional step in the middle is relevant if we want more bounces, because we can dispatch additional cone tracing compute shader passes for the whole structure (not in screen space).

The nice thing about this technique is that we can retrieve all sorts of effects. We get "free" ambient occlusion by default when doing the cone tracing for light bouncing, but we can also retrieve reflections, refractions and shadows from this voxel structure with additional ray march steps. We can have a configurable amount of light bounces. Cone tracing code can be shared between the bouncing and querying shaders and between different types of rays as well. The entire thing remains fully on the GPU; the CPU is only responsible for command buffer generation.

Following this, I will describe the above steps in more detail. I will be using the DirectX 11 graphics API, but any modern API will probably do the job. You will definitely need a recent GPU for the most efficient implementation. This technique is targeted at PCs or the most recent consoles (PlayStation 4 or Xbox One); it most likely can not run on mobile or handheld devices because of their limited hardware.

I think this is an advanced topic and I’d like to aim for experienced graphics programmers, so I won’t present code samples for the more trivial parts, but the whole implementation is available to anyone in Wicked Engine.

Part 1: Voxelization on the GPU

The most involved part is definitely the first one, the voxelization step. It involves making use of advanced graphics API features like geometry shaders, abandoning the output merger and writing into resources "by hand". We can also make use of new hardware features like conservative rasterization and rasterizer ordered views, but we can implement these in the shaders as well.

The main trick to being able to run this in real time is that we need to parallelize the process well. For that, we will exploit the fixed function rasterization hardware, and we will get a pixel shader invocation for each voxel which will be rendered. We also do only a single render pass for every object.

We need to integrate the following pipeline to our scene rendering algorithm:

1.) Vertex shader

The voxelizing vertex shader needs to transform vertices into world space and pass through the attributes to the geometry shader stage. Or just do a pass through and transform to world space in the GS, doesn’t matter.

2.) Geometry shader

This will be responsible to select the best facing axis of each triangle received from the vertex shader. This is important because we want to voxelize each triangle once, on the axis it is best visible, otherwise we would get seams and bad looking results.

// select the greatest component of the face normal
// (input[3] is the input array of three vertices)
float3 facenormal = abs(input[0].nor + input[1].nor + input[2].nor);
uint maxi = facenormal[1] > facenormal[0] ? 1 : 0;
maxi = facenormal[2] > facenormal[maxi] ? 2 : maxi;

After we determined the dominant axis, we need to project to it orthogonally by swizzling the position’s xyz components, then setting the z component to 1 and scaling it to clip space.

for (uint i = 0; i < 3; ++i)
{
    // voxel space pos:
    output[i].pos = float4((input[i].pos.xyz - g_xWorld_VoxelRadianceDataCenter) / g_xWorld_VoxelRadianceDataSize, 1);

    // Project onto dominant axis:
    if (maxi == 0)
    {
        output[i].pos.xyz = output[i].pos.zyx;
    }
    else if (maxi == 1)
    {
        output[i].pos.xyz = output[i].pos.xzy;
    }

    // projected pos:
    output[i].pos.xy /= g_xWorld_VoxelRadianceDataRes;
    output[i].pos.z = 1;

    output[i].N = input[i].nor;
    output[i].tex = input[i].tex;
    output[i].P = input[i].pos.xyz;
    output[i].instanceColor = input[i].instanceColor;
}

At the end, we could also expand our triangle a bit to be more conservative, to avoid gaps. Or we could just set a conservative rasterization state if we have hardware support for it, and avoid the expansion here.

// Conservative Rasterization setup:
 float2 side0N = normalize(output[1].pos.xy - output[0].pos.xy);
 float2 side1N = normalize(output[2].pos.xy - output[1].pos.xy);
 float2 side2N = normalize(output[0].pos.xy - output[2].pos.xy);
 const float texelSize = 1.0f / g_xWorld_VoxelRadianceDataRes;
 output[0].pos.xy += normalize(-side0N + side2N)*texelSize;
 output[1].pos.xy += normalize(side0N - side1N)*texelSize;
 output[2].pos.xy += normalize(side1N - side2N)*texelSize;

It is important to pass the vertices' world position to the pixel shader, because we will use that directly to index into our voxel grid data structure and write into it. We will also need texture coordinates and normals for correct diffuse color and lighting.

3.) Pixel shader

After the geometry shader, the rasterizer unit schedules pixel shader invocations for our voxels, so in the pixel shader we determine the color of the voxel and write it into our data structure. We probably need to sample the base texture of the surface and evaluate the direct lighting which affects the fragment (the voxel). While evaluating the lighting, use a forward rendering approach: iterate through the nearby lights for the fragment and do the light calculations for the diffuse part only. Leave the specular out of it, because we don't care about the view-dependent part now; we want to be able to query lighting from any direction later anyway. I recommend using a simplified lighting model, but try to keep it somewhat consistent with your main lighting model, which is probably a physically based model (at least it is for me, and you should also have one :P), and account for the energy loss caused by leaving out the specularity.

When you have calculated the color of the voxel, write it out using the following trick: I didn't bind a render target for the render pass, but I set an Unordered Access View by calling OMSetRenderTargetsAndUnorderedAccessViews(). So the shader returns nothing, but we write into our voxel grid in the shader code. My voxel grid is an RWStructuredBuffer here to easily support atomic operations; later it will be converted to a 3D texture for easier filtering and better cache utilization. The structured buffer is a linear array of VoxelType of size gridDimensions X*Y*Z. VoxelType is a structure holding a 32-bit uint for the voxel color (packed HDR color with 0-255 RGB, an emissive multiplier in 7 bits, and the last bit indicating whether the voxel is empty or not). The structure also contains a normal vector packed into a uint. Our interpolated 3D world position comes in handy when determining the write position into the buffer: just truncate and flatten the interpolated world position which you received from the geometry shader. For writing the results, you must use atomic max operations on the voxel uints. You could write to a texture here without atomic operations by using rasterizer ordered views, but they don't support volume resources, so a multi pass approach would be necessary for the individual slices of the texture.
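A minimal sketch of that write in the voxelization pixel shader could look like this; the grid constants, the VoxelType layout and the simple 8:8:8:8 packing are simplified stand-ins for illustration:

struct VoxelType
{
    uint colorMask;  // packed color + emissive + filled bit
    uint normalMask; // packed normal
};
RWStructuredBuffer<VoxelType> voxelGrid : register(u1);

cbuffer VoxelCB : register(b0)
{
    float3 g_VoxelGridCenter;
    float  g_VoxelSize;    // world-space size of one voxel
    uint   g_VoxelGridRes; // resolution per axis
};

// Simple 8:8:8:8 packing for illustration only; the real shader packs HDR color,
// an emissive multiplier and the "filled" flag instead.
uint PackVoxelColor(float4 color)
{
    uint4 c = (uint4)round(saturate(color) * 255.0f);
    return c.r | (c.g << 8) | (c.b << 16) | (c.a << 24);
}

void WriteVoxel(float3 worldPos, float4 litColor)
{
    // World position -> voxel grid coordinate:
    float3 gridPos = (worldPos - g_VoxelGridCenter) / g_VoxelSize + g_VoxelGridRes * 0.5f;
    uint3 voxel = (uint3)floor(gridPos);
    if (any(voxel >= g_VoxelGridRes))
        return; // outside the grid

    // Flatten the 3D coordinate into the linear structured buffer:
    uint index = voxel.x + voxel.y * g_VoxelGridRes + voxel.z * g_VoxelGridRes * g_VoxelGridRes;

    // Atomic max so concurrent fragments writing the same voxel resolve deterministically:
    InterlockedMax(voxelGrid[index].colorMask, PackVoxelColor(litColor));
    // The packed normal is written the same way into normalMask (omitted here).
}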

An additional note: If you have generated shadow maps, you can use them in your lighting calculations here to get more proper illumination when cone tracing. If you don’t have shadow maps, you can even use the voxel grid to retrieve (soft) shadow information for the scene later.

voxelGI_GIF

If you got this far, you have just voxelized the scene. You should write a debugger to visualize the results. I am using a naive approach which is maybe a bit slow, but gets the job done: I issue a Draw() command with a vertex count of the voxel grid dimensions X*Y*Z, read my voxel grid in the vertex shader indexed by SV_VertexID, then expand to a cube in the geometry shader if the voxel is not empty (color greater than zero). The pixel shader outputs the voxel color for each covered screen pixel.

Part 2: Filtering the data

We voxelized our scene into a linear array of voxels with nicely packed data. The packed data helped in the voxelization process, but it is no good for cone tracing; we need a texture which we can filter and sample. I have a compute shader which unpacks the voxel data, copies it into a 3D texture with an RGBA16 format for HDR colors, and finally also clears the packed voxel data by filling it with zeroes. A nice touch is not just writing the target texture, but interpolating with the old values, so that abrupt changes in lighting or moving objects don't cause much flickering. But we have to account for the moving camera offsetting the voxel grid. We could lerp intelligently with a nice algorithm, but I found that the easiest method, "do not lerp when the voxel grid got offset", was good enough for me.
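A sketch of that unpack-and-clear compute pass could look like this, reusing the simplified packing from the previous sketch (the real shader unpacks HDR color and the emissive multiplier, and may also blend with the previous texture contents):

struct VoxelType
{
    uint colorMask;
    uint normalMask;
};
RWStructuredBuffer<VoxelType> voxelGrid : register(u0); // packed data from the voxelization pass
RWTexture3D<float4> voxelTexture        : register(u1); // RGBA16F 3D texture used for cone tracing

cbuffer VoxelCB : register(b0)
{
    uint g_VoxelGridRes;
};

// Inverse of the simple 8:8:8:8 illustration packing from the voxelization sketch:
float4 UnpackVoxelColor(uint c)
{
    return float4(c & 0xFF, (c >> 8) & 0xFF, (c >> 16) & 0xFF, (c >> 24) & 0xFF) / 255.0f;
}

[numthreads(8, 8, 8)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    uint index = DTid.x + DTid.y * g_VoxelGridRes + DTid.z * g_VoxelGridRes * g_VoxelGridRes;

    // Unpack into the filterable 3D texture:
    voxelTexture[DTid] = UnpackVoxelColor(voxelGrid[index].colorMask);

    // Clear the packed buffer so the next frame's voxelization starts from empty:
    voxelGrid[index].colorMask = 0;
    voxelGrid[index].normalMask = 0;
}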

Then we generate a mip chain for the 3D texture. DX11 can do this automatically for us by calling GenerateMips() on the device context, but we can also do it in shaders if we want better quality than the default box filter. I experimented with gaussian filtering, but I couldn't make it fast enough to be worthwhile, so I am using the default filter.

But what about the normals which we saved in the voxelization process? They are only needed when doing multiple light bounces, or in more advanced voxel algorithms like anisotropic voxelization.

vxao

Part 3: Cone tracing

We have the voxel scene ready for our needs, so let's query it for information. To gather the global illumination for the scene, we have to run the cone tracing in screen space, once for every pixel on the screen. This can happen in the forward rendering object shaders, against the gbuffer in a deferred renderer when rendering a full screen quad, or in a compute shader. In forward rendering, we may lose some performance because of worse thread utilization if we have many small triangles. A Z-prepass is an absolute must-have if we are doing this in forward rendering; we don't want to shade a pixel multiple times, because this is a heavy computation.

For diffuse light bounces, we need the pixel’s surface normal and world position at minimum. From the world position, calculate the voxel grid coordinate, then shoot rays in the direction of the normal and around the normal in a hemisphere. But the ray should not start at the surface voxel, but the next voxel along the ray, so we don’t accumulate the current surface’s lighting. Begin ray marching, and each step sample your voxel from increasing mip levels, accumulate color and alpha and when alpha reaches 1, exit and divide the distance travelled. Do this for each ray, and in the end divide the accumulated result with the number of rays as well. Now you have light bounce information and ambient occlusion information as well, just add it to your diffuse light buffer.

Assembling the hemisphere: You can create a hemisphere on a surface by using a static array of precomputed randomized positions on a sphere and the surface normal. First, if you do a reflect(surfaceNormal, randomPointOnSphere), you get a random point on a sphere with variance added by the normal vector. This helps with banding as discrete precomputed points get modulated by surface normal. We still have a sphere, but we want the upper half of it, so check if a point goes below the “horizon” and force it to go to the other direction if it does:

bool belowHorizon = dot(surfaceNormal, randomPointOnSphere) < 0;

coneDirection = belowHorizon ? -coneDirection : coneDirection;

Avoid self-occlusion: So far, the method of my choice to avoid self occlusion is to start the cone tracing offset from the surface, along the normal direction and also the cone direction. If I don't do this, then the cone starts right at the surface and immediately samples its own voxel, so each surface would get its own contribution from the GI, which is not good. But if we start further off, then close by surfaces will not contribute to each other's GI and there will be a visible disconnect in lighting. I imagine it would help to use anisotropic voxels, which means storing a unique voxel for a few directions and only sampling the voxels facing opposite to the cone. This of course would require much additional memory.

Accumulating alpha: The correct way to accumulate alpha is a bit different from regular alpha blending:

float4 color = 0;
float alpha = 0;

// ...
// Inside the cone tracing loop:
float4 voxel = SampleVoxels().rgba;
float a = 1 - alpha;
color.rgb += a * voxel.rgb;
alpha += a * voxel.a;

As you can see, this is more like a front-to back blending. This is important, because otherwise we would receive a black staircase artefact on the edge of voxels, where the unfilled (black) regions with zero alpha would bleed into the result very aggressively.

Stepping the cone: When we step along the ray in voxel-size increments (ray-marching) in world space, we can retrieve the diameter of the cone for a given position by calculating this:

float coneDiameter = 2 * tan(coneHalfAngle) * distanceFromConeOrigin;

Then we can retrieve the correct mip level to sample from the 3D texture by doing:

float mip = log2(coneDiameter / voxelSize);
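Tying the accumulation and the mip selection together, a single diffuse cone could be marched roughly like this; the grid constants, the step size and the starting offset are assumptions to tune:

Texture3D<float4> voxelTexture : register(t0);
SamplerState sampler_linear_clamp : register(s0);

// March a single cone through the voxel texture. P is the ray start (already offset
// off the surface), coneDirection the cone axis, coneHalfAngle the aperture.
// voxelSize, gridCenter and gridRes describe the voxel grid; maxDistance limits the march.
float4 ConeTrace(float3 P, float3 coneDirection, float coneHalfAngle,
    float voxelSize, float3 gridCenter, float gridRes, float maxDistance)
{
    float4 color = 0;
    float alpha = 0;
    float dist = voxelSize; // start one voxel away to reduce self-sampling

    while (dist < maxDistance && alpha < 1)
    {
        // Cone diameter and the matching mip level at the current distance:
        float diameter = max(voxelSize, 2 * tan(coneHalfAngle) * dist);
        float mip = log2(diameter / voxelSize);

        // World position -> [0, 1] texture coordinate inside the voxel grid:
        float3 tc = (P + coneDirection * dist - gridCenter) / (voxelSize * gridRes) + 0.5f;
        if (any(tc < 0) || any(tc > 1))
            break; // left the grid

        float4 voxel = voxelTexture.SampleLevel(sampler_linear_clamp, tc, mip);

        // Front-to-back accumulation as described above:
        float a = 1 - alpha;
        color.rgb += a * voxel.rgb;
        alpha += a * voxel.a;

        // Step proportionally to the cone diameter, so far away steps get coarser:
        dist += diameter * 0.5f;
    }

    return float4(color.rgb, alpha);
}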

With this, we have a single light bounce for our scene. But much better results can be achieved with at least a single secondary light bounce. Read on for that.

vxgi

Part 4: Additional light bounces

This is a simple step if you are familiar with compute shaders and you have wrapped the cone tracing function to be reusable. When we have filtered our voxel grid, we spawn a thread in a compute shader for each voxel (better: just for the non-empty voxels), unpack its normal vector and do the cone tracing like in the previous step, but instead of for each pixel on the screen, we do it for each voxel. By the way, this needs to write into an additional 3D texture, because we are sampling the filtered one in this pass, so mind the additional memory footprint.

Part 5: Cone traced reflections

To trace reflections with cone tracing, use the same technique, but the stepping along mip levels should take the surface roughness into account. For rough surfaces, the cone should approach the diffuse cone tracing size; for smooth surfaces, keep the mip level increase to a minimum. Just experiment with it until you get results which you like. Or go physically based, and it will be much cooler and would probably make for a nice paper.

Maybe the voxel grid resolution which is used for the diffuse GI is not fine enough for reflections; you will probably want to use a finer voxelization for them. Maybe using separate voxel data for diffuse and specular reflections is a good idea, with some update frequency optimizations. You could, for example, update the diffuse voxels in even frames and the specular voxels in odd frames, or something like that.

You probably want this as a fallback to screen space reflections, if they are available.

vxgi1

Part 6: Consider optimizations

The technique, at the current stage, will only work on very fast GPUs. But there are already some games using tech like this (Rise of the Tomb Raider uses voxel AO), or parts of it, even on consoles (The Tomorrow Children). This is possible with some aggressive optimization techniques. Sparse voxel octrees can reduce memory requirements, and voxel cascades can bring up framerates with clever update frequency changes. And of course, do not re-voxelize anything that is not necessary, e.g. static objects (however, it can be difficult to separate them, because dynamic lights should also force re-voxelization of static objects if they intersect).

And as always, you can see my source code at my GitHub! Points of interest:

Thank you for reading!