Graphics API abstraction

Wicked Engine can handle today’s advanced rendering effects, with multiple graphics APIs (DX11, DX12 and Vulkan at the time of writing this). The key to enable this is to use a good graphics abstraction, so these complicated algorithms only need to be written once.

A common interface was developed that all engine code can use to declare rendering intentions without knowing which API they will be executed with. The interface can be divided into two main categories:

  • resources
  • device

Let’s review them below.


Resources

These are simple data structures without any functionality. They could be really simple, even plain old data (POD) types, for example descriptor structures or enums.

For example all the possible cull modes are described as an enum:
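A minimal sketch of what that enum might look like (the enumerator names here are assumptions in the spirit of the engine's naming, not copied from the source):

```cpp
// a sketch of such an enum; the exact names are assumptions
enum CULL_MODE
{
	CULL_MODE_NONE,		// no triangles are culled
	CULL_MODE_FRONT,	// front-facing triangles are culled
	CULL_MODE_BACK,		// back-facing triangles are culled
};
```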


A texture is described as a struct:

struct TextureDesc
{
	uint32_t Width = 0;
	uint32_t Height = 0;
	uint32_t MipLevels = 1;
	// ...
};

I am personally a fan of providing default initial values for everything, which makes these types no longer strictly POD, but much more convenient to use. My favourite property of POD types still holds though: none of these contain any custom-written constructors, destructors, or move and copy definitions; all of them are compiler-generated defaults.

Some resources are responsible of holding actual GPU data too. Examples for these are: Texture, GPUBuffer, Sampler, PipelineState, etc. A Texture resource looks something like this:

struct Texture
{
	TextureDesc	desc;
	std::shared_ptr<void> internal_state;
};

First, the TextureDesc is what describes the texture and is visible to the application/engine. It is stored here for convenience, but it is not necessarily what the actual GPU uses for the resource; that is determined by the implementation, which the internal_state pointer points to, and which exists only after the resource has actually been “created” (more about this shortly). The internal_state is a void pointer, which means the implementation is arbitrary and will be determined when the resource is “created”. The shared_ptr was chosen for these reasons:

  • Unlike a void* pointer, the shared_ptr knows how to delete the underlying object (because it stores additional metadata).
  • The copy, move functionality are a given, so we avoid writing a lot of constructors, destructors, etc. manually.
  • The reference counting will delete the object when it is no longer used. This makes it very easy to use on the engine side; for example, an std::vector can simply store multiple textures, and deleting a texture is achieved by removing it from the container, without worrying about how and when it will be destroyed (the requirements for destroying GPU resources differ greatly between graphics APIs). The object will also be deleted when going out of scope or when a new object is constructed in its place.
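The first bullet, that shared_ptr (unlike a raw void*) remembers how to destroy the concrete object, can be demonstrated in isolation. This is a standalone illustration, not engine code:

```cpp
#include <memory>

static int destroyed = 0;

// stands in for an implementation specific object such as Texture_DX11
struct Texture_Impl
{
	~Texture_Impl() { ++destroyed; }
};

// returns how many Texture_Impl objects were destroyed after the
// shared_ptr<void> holding one goes out of scope
int test_type_erased_deletion()
{
	{
		// the deleter for Texture_Impl is captured at make_shared time,
		// even though the stored pointer type is erased to void
		std::shared_ptr<void> internal_state = std::make_shared<Texture_Impl>();
	} // Texture_Impl::~Texture_Impl runs here, despite the static type being void
	return destroyed;
}
```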

The implementation defined structure that the internal_state is pointing to can be anything, without needing to change the common interface. For example, the DX11 implementation of Texture looks like this:

struct Texture_DX11
{
	ComPtr<ID3D11Resource> resource;
	ComPtr<ID3D11ShaderResourceView> srv;
	// ...
};

The DX11 implementation is quite simple, internally it uses its own reference counting objects, so it doesn’t even have a destructor (everything is destroyed automatically). A DX11 CreateTexture() function would start like this:

bool CreateTexture(const TextureDesc* pDesc, const SubresourceData* pInitialData, Texture* pTexture) const
{
	pTexture->internal_state = std::make_shared<Texture_DX11>();
	// ...
}

Assigning to the internal_state both deletes the previous implementation specific object if it existed before, and creates a new one.
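That replace-on-assign behaviour is plain shared_ptr semantics; a standalone illustration (not engine code):

```cpp
#include <memory>

static int alive = 0;

// stands in for an implementation specific object
struct Impl
{
	Impl() { ++alive; }
	~Impl() { --alive; }
};

// recreating a resource: the new implementation object is constructed by
// make_shared, then the assignment releases the previous one, so exactly
// one Impl remains alive afterwards
int recreate_and_count()
{
	std::shared_ptr<void> internal_state = std::make_shared<Impl>();
	internal_state = std::make_shared<Impl>(); // old Impl destroyed here
	return alive;
}
```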

The Vulkan implementation at the same time can be quite different and more complicated:

struct Texture_Vulkan
{
	std::shared_ptr<GraphicsDevice_Vulkan::AllocationHandler> allocationhandler;
	VkImage resource = VK_NULL_HANDLE;
	VkImageView srv = VK_NULL_HANDLE;
	// ...

	~Texture_Vulkan()
	{
		// ...
		allocationhandler->destroyer_images.push_back(std::make_pair(std::make_pair(resource, allocation), framecount));
		allocationhandler->destroyer_imageviews.push_back(std::make_pair(srv, framecount));
	}
};

This is just a little snippet of everything the Vulkan implementation keeps on the Texture object, but the main reason I wanted to show it is that Vulkan doesn’t have any notion of reference counting, and there are much stricter rules regarding when a resource can be destroyed. A custom allocator is used, and the destructor adds the Vulkan resources to a deferred removal queue, so that the resources won’t be destroyed while they are still in use by the GPU. The point is, these implementation defined objects can be very flexible and as complex as the API requires.
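The deferred-removal idea can be sketched independently of Vulkan: retired handles are tagged with the frame in which they were released, and destroyed only once the GPU is known to have finished that frame. In this standalone sketch plain integers stand in for VkImage handles, and the frame-latency constant is an assumption:

```cpp
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// how many frames the GPU may lag behind the CPU (assumed value)
constexpr uint64_t BUFFERCOUNT = 2;

struct DestroyQueue
{
	// (handle, frame in which it was retired)
	std::deque<std::pair<int, uint64_t>> destroyer_images;
	std::vector<int> destroyed; // stands in for actual vkDestroyImage calls

	void retire(int handle, uint64_t framecount)
	{
		destroyer_images.push_back({handle, framecount});
	}

	// called once per frame: destroy everything the GPU can no longer be using
	void update(uint64_t framecount)
	{
		while (!destroyer_images.empty() &&
			destroyer_images.front().second + BUFFERCOUNT < framecount)
		{
			destroyed.push_back(destroyer_images.front().first);
			destroyer_images.pop_front();
		}
	}
};
```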


Device

The purpose of resources is holding GPU data, but they don’t provide any functionality by themselves. The graphics device interface is responsible for that. This is the most complex part of the abstraction and the one that contains all API specific implementations. I chose to use inheritance for this. The base class is called GraphicsDevice, mostly without implementations – except for some helper functions. This interface can be implemented by an API specific class, so there are GraphicsDevice_DX11, GraphicsDevice_DX12 and GraphicsDevice_Vulkan classes. My personal preference is to use the minimal number of code files, so each API specific class is implemented as one header and cpp pair (but this is not a requirement). This makes API specific files quite large, but all functionality can be found in one place, so there is no question where to find e.g. DX12 code, and no other part of the engine depends on it (I use the Ctrl+M+O hotkey in Visual Studio very frequently to collapse functions). Next, I’ll explain what kind of features the GraphicsDevice provides.

Creating resources

The resources must first be created, which means that their GPU-side data gets allocated. Functions like CreateTexture(), CreateBuffer(), CreatePipelineState(), and others require a descriptor as their input (for example TextureDesc), and they produce a resource (for example Texture). The descriptor parameters used for creation are copied into the resource to indicate the most recent creation parameters. If the resource was already created, it will be recreated with the new descriptor parameters (this could mean that the previous contents are destroyed and simply recreated, but a more optimal path can be taken where it makes sense, for example resizing swapchains). The flow of the creation functions resembles the simplicity of DX11, so a function can initialize the resource with initial data if the initData parameter is not null. Also, after the function returns, the resource can be considered ready to use.

An example of creating a texture:

TextureDesc desc;
desc.Format = FORMAT_R10G10B10A2_UNORM;
desc.Width = 3840;
desc.Height = 2160;
desc.SampleCount = 8;

Texture texture;
device->CreateTexture(&desc, nullptr, &texture);

The implementation of DX11 maps to this behaviour trivially, but DX12 and Vulkan will ensure by correct synchronization that the resource is ready by the time the GPU is using it. At the time of writing this, the synchronization is performed at the next submit, by letting the GPU wait until previous copy operations are finished.

As an improvement, it would be a good idea to expose some wait mechanism to let the user defer these synchronizations, because in case of bindless resources the API layer doesn’t have any knowledge of exactly when a resource can be accessed from a shader, and the next submit can be a bit too early and prevent potential asynchronous copies.


Subresources

Buffer and Texture resources can have so-called subresources, which are different views onto the resource. For example, a texture with multiple MIP levels could have multiple subresources to view each mip level separately, or a single subresource to view the entire mipchain for mip sampling. For buffers, different subresources can be created too that view different regions of the buffer. Read-only and read-write subresources are also differentiated. When creating a Texture or Buffer whose descriptor has the appropriate flags, the implementation will always create default subresources that view the entire resource, because this is the most common way of accessing these on the GPU. If required, more specific subresources can be created using the GraphicsDevice::CreateSubresource() functions. Subresources are only meaningful together with the main resource allocation, so they are identified by simple integer numbers in this interface. These are the benefits of identifying subresources with numbers:

  • The subresource lifetime is bound to the lifetime of the main resource. As long as the main resource is alive, so are the subresources, but subresources will not keep the main resource alive.
  • The user doesn’t necessarily have to store subresource identifiers, since they are monotonically increasing numbers. If it’s known that one subresource was created for each mip for example, we implicitly know what number refers to which subresource. This is a very common case with mip generation, texture arrays, 3D textures, cubemaps.
  • In the vast majority of the cases, we don’t want to access anything other than the entire resource, so we can simply omit the subresource identifier from all related functions to assume the whole resource. Related functions are for example binding resources to shaders.

An example of creating a shader resource view for every individual mip level of a texture:

for (uint32_t i = 0; i < texture.desc.MipLevels; ++i)
{
	int subresource = device->CreateSubresource(
		&texture,
		SRV,					// shader resource view type
		/*firstSlice=*/ 0,
		/*sliceCount=*/ 1,
		/*firstMip=*/ i,
		/*mipCount=*/ 1
	);
}

Recording render commands

It is very important to utilize the CPU well when recording a large amount of commands. For this, the CommandList is exposed by the interface which essentially identifies a CPU thread from the point of view of the graphics interface. The CommandList is a monotonically increasing integer number starting from zero. A new CommandList can be started with the GraphicsDevice::BeginCommandList() function, which will give us the next command list and ensure that it is in a correct state ready to record graphics commands. All functions that record commands have a CommandList parameter, which means that the CommandList will be written to, and only one thread at a time should write any CommandList. All these are CPU operations, and no actual GPU work is started by recording commands. Example of using a CommandList:

CommandList cmd = device->BeginCommandList();
Viewport vp;
vp.Width = 100;
vp.Height = 200;
device->BindViewports(1, &vp, cmd);

The CommandList is represented by a number because at the time I saw it as a thread identifier, so several systems use static arrays that can be indexed by CommandList. Also, these are temporary resources which are not stored anywhere in the engine, so it didn’t make sense to make them proper graphics resources with an internal_state. Currently I think the decision to expose these as monotonically increasing numbers was maybe a bit too much and makes them a bit rigid to use. In the future I could consider eliminating this, but certain engine systems which use CommandList as thread indices might have to be rewritten.

Starting GPU work

The GraphicsDevice::SubmitCommandLists() function simply closes all CommandLists that were used so far and submits them in order (same order as BeginCommandList() was called) for GPU execution. This is a highly simplified interface compared to what the native graphics APIs provide. It can be seen as a limitation, but it is also a guideline about how to use the renderer, since it is usually a good idea to batch multiple command lists into a single submit operation, to get the most efficient driver path.

Furthermore, the SubmitCommandLists() is also a point at which swapchains are presented and buffers are synchronized and swapped. This is a place where the CPU can be blocked until the resources that were used by the GPU are freed up and can be modified from the CPU (such as command buffers). There is a possibility to perform the submission from a separate thread to avoid a potential immediate CPU blocking, and start preparing the next frame’s logic that’s not rendering related.

A more fine-grained version of Submit could be written in the future that lets the user submit only specified command lists. This could be beneficial when some command lists are not tied to the current frame but perform unrelated workloads. It is not the only way to achieve submission that is more loosely related to the frame: CreateTexture() and CreateBuffer() are special functions that can be used from any thread to submit a small workload to the copy queue, executed asynchronously (used to initialize resources with data).

Async compute

Even with this simplified submission model, async compute configuration is possible. The async compute model was retrofitted into this design, because previous graphics APIs didn’t have this possibility. Even so, the interface is straightforward and unobtrusive in my opinion. The BeginCommandList() function has an optional parameter that can be used to specify a certain GPU_QUEUE. By default, QUEUE_GRAPHICS is used, which can execute any type of command. If QUEUE_COMPUTE is specified, then the commands will be executed on a separate GPU queue or timeline that is only for compute jobs. These can be scheduled at the same time as graphics queue jobs. The SubmitCommandLists() implementation handles the API specifics of matching command buffers with queues automatically. Synchronization between queues is performed with the GraphicsDevice::WaitCommandList() command, which is used to specify dependencies between command lists on separate queues. The DX12 and Vulkan APIs can only synchronize queues between separate submits, so the implementation will break up command lists into multiple submits automatically when necessary. Dependencies on the same queue are different; those are handled by GPU barriers. In the case of DX11, which doesn’t have different queues, the implementation can simply ignore the queue parameter and execute everything on the main queue; the result will be the same, only some optimization opportunities are lost.

Example of using async compute:

CommandList cmd0 = device->BeginCommandList(QUEUE_GRAPHICS);
CommandList cmd1 = device->BeginCommandList(QUEUE_COMPUTE);
device->WaitCommandList(cmd1, cmd0); // cmd1 waits for cmd0 to finish
CommandList cmd2 = device->BeginCommandList(QUEUE_GRAPHICS); // cmd2 doesn't wait, it runs async with cmd1
CommandList cmd3 = device->BeginCommandList(QUEUE_GRAPHICS);
device->WaitCommandList(cmd3, cmd1); // cmd3 waits for cmd1 to finish

device->SubmitCommandLists(); // execute all of the above by the GPU

GPU Barriers

Barriers were introduced in modern PC graphics APIs like DX12 and Vulkan and exposed in this interface to a reasonable extent. After all, there is clearly a reason to give developers this level of control to obtain optimum performance. DX11 has no notion of these, so the GraphicsDevice::Barrier() command’s implementation for that API is simply empty. This also means that using this abstraction, implementing something correctly in DX11 will not ensure that DX12 and Vulkan will work correctly, however it is true the other way around. The barriers are a mix between DX12’s [UAV and Transition] barriers and Vulkan’s [pipeline] barriers. Aliasing barriers are not implemented as of now simply because I haven’t tried them yet. The GPUBarrier struct is a simple struct type which holds no API specific data. A GPUBarrier can be:

  • Memory barrier: Also called a UAV barrier in DX12, this is used to wait until resource writes or shaders are finished. The GPUBarrier::Memory() function can be used to simply declare such a barrier. If no resources are provided as arguments, then all previous GPU work must finish.
  • Buffer barrier: Used to transition a GPUBuffer between different BUFFER_STATEs.
  • Image barriers: Used to transition a Texture between different IMAGE_LAYOUTs.

The GraphicsDevice::Barrier() function can be used to set a batch of barrier commands into the CommandList. Example of setting two barriers:

GPUBarrier barriers[] = {
	GPUBarrier::Memory(),
	GPUBarrier::Image(&texture, IMAGE_LAYOUT_UNORDERED_ACCESS, texture.desc.layout),
};
device->Barrier(barriers, arraysize(barriers), cmd);

The memory barrier is used to wait for the compute shader to finish, the image barrier transitions the texture from unordered access layout back to its default layout.

In both cases [DX12, Vulkan] the implementation doesn’t issue barriers immediately, but there is an opportunity to process and defer them further when possible. Consider the case when there are two separate functions in the engine for doing some graphics effects. These could be unrelated, and don’t know about each other, so they could issue redundant barrier commands on the same CommandList. It is largely beneficial when the engine can provide high level graphics helper functions which operate independently of each other, but on the lower API specific level, there could be further optimizations happening to remove redundant commands. Deferrals such as these happen elsewhere in the implementation as well, like when setting multiple pipeline states without drawing, or binding multiple resources to a slot.
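The redundancy elimination described above can be sketched as a small layout tracker: a transition whose target layout matches the currently recorded layout is simply dropped. This is a standalone illustration with invented names, not the engine's actual implementation:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct ImageBarrier
{
	int resource_id;
	std::string layout_before;
	std::string layout_after;
};

struct BarrierBatcher
{
	// the layout each resource is currently recorded to be in
	std::unordered_map<int, std::string> current_layout;
	// barriers deferred here, flushed right before a draw/dispatch
	std::vector<ImageBarrier> pending;

	void transition(int resource_id, const std::string& layout_after)
	{
		auto it = current_layout.find(resource_id);
		if (it != current_layout.end() && it->second == layout_after)
		{
			return; // redundant: resource is already in the requested layout
		}
		std::string before = (it != current_layout.end()) ? it->second : "UNDEFINED";
		pending.push_back({resource_id, before, layout_after});
		current_layout[resource_id] = layout_after;
	}
};
```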

A useful feature in this abstraction is the ability to define starting layouts for textures and buffers, so the API can transition them to that. It is also useful to make the engine be aware of the expected starting layout of resources at any time. High level graphics code should most of the time transition from the default layout to a temporary layout when necessary and transition back to the default layout. The default layout should be chosen to be the most commonly used layout of the resource for performance and convenience reasons. Without declaring this before resource creation, the default layout is assumed to be a read-only state. Even when two consecutive high level graphics functions transition layouts of one resource rapidly back and forth, there is a chance that some of those transitions will be discovered as redundant by the API layer and removed as described above.

Render Passes

DirectX uses SetRenderTargets, Vulkan uses render passes. It is possible to implement the SetRenderTargets behaviour with Vulkan render passes, and initially this was the chosen way. But going the other way (implementing render passes with SetRenderTargets) is a lot easier, so in time the interface was changed to require render passes. These are now GPU resources that must be created, but they allow more features, like declaring texture layout transitions and choosing whether to preserve the render contents, clear them, or just discard them. DX12 now also supports render passes natively, so that is yet another reason to use them. In my experience they enforce a slightly stricter graphics programming model, but one that lets the developer make fewer mistakes, like forgetting to set/unset render targets. Sometimes the render target contents are also completely temporary, like a depth buffer that is never read and used only in a single render pass – this doesn’t need to be written out to GPU memory and can stay entirely in tile-cache on some GPU hardware, and now we have a way to declare this intention.

Example of creating and using a render pass:

RenderPassDesc desc;
desc.attachments.push_back( // add a depth render target
	RenderPassAttachment::DepthStencil(
		&depthBuffer_Main,
		RenderPassAttachment::LOADOP_CLEAR,		// clear the depth buffer
		RenderPassAttachment::STOREOP_STORE,	// preserve the contents of depth buffer
		IMAGE_LAYOUT_DEPTHSTENCIL				// renderpass layout
	)
);
desc.attachments.push_back( // add a color render target
	RenderPassAttachment::RenderTarget(
		&rtMain,								// (texture name assumed for this example)
		RenderPassAttachment::LOADOP_DONTCARE	// texture contents are undefined at start
		// rest of parameters are defaults
	)
);
if (depthBuffer_Main.desc.SampleCount > 1)
{
	// if MSAA rendering, then add a resolve pass for color render target:
	desc.attachments.push_back(RenderPassAttachment::Resolve(&rtMain_resolved)); // (name assumed)
}
device->CreateRenderPass(&desc, &renderpass_depthprepass);

device->RenderPassBegin(&renderpass_depthprepass, cmd);
// render into texture...
device->RenderPassEnd(cmd);

Pipeline States

DX11 has separate state objects for different pipeline stages, which goes against the DX12 and Vulkan way. This was easiest to accomplish by exposing combined PipelineState objects and letting the DX11 implementation build the pipeline state from separate parts behind the scenes, while letting the DX12 and Vulkan implementations work in their native way. At least this was the way at first, but declaring render target formats, sample counts or render passes for every PipelineState was becoming inconvenient from the graphics programming point of view. The interface now supports dynamically compiling PipelineState objects based on which RenderPass they are used in, without the programmer needing to declare that before creation. Most of the pipeline state is still declared up front however, which is also very useful to avoid forgetting to set individual states, and I found this reduced lots of graphics programming mistakes. The implementation will still do a lot of busy work as early as it can for pipelines, which includes decisions based on shader reflection and hashing every available state at creation time. This way runtime state hashing is minimal; only the pipeline_state_hash x render_pass_hash combination is computed, which is used to determine whether there is already a compiled PSO for the current render pass or it needs to be compiled. The PSO compilation happens on the thread that is recording the command list, so in a way it is multithreaded, since the engine uses multiple threads in its rendering path.
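The runtime lookup described above (combining the precomputed pipeline state hash with the render pass hash to find an already compiled PSO) might look roughly like this; a standalone sketch where the names and the hash-combine function are my own assumptions:

```cpp
#include <cstddef>
#include <unordered_map>

struct CompiledPSO
{
	int dummy = 0; // an API-specific pipeline object would live here
};

// both hashes are computed once at creation time, not per draw
struct PSOKey
{
	size_t pipeline_state_hash;
	size_t render_pass_hash;

	bool operator==(const PSOKey& o) const
	{
		return pipeline_state_hash == o.pipeline_state_hash &&
			render_pass_hash == o.render_pass_hash;
	}
};

struct PSOKeyHash
{
	size_t operator()(const PSOKey& k) const
	{
		// simple boost-style hash combine
		return k.pipeline_state_hash ^ (k.render_pass_hash + 0x9e3779b9 +
			(k.pipeline_state_hash << 6) + (k.pipeline_state_hash >> 2));
	}
};

// returns true if a new PSO had to be compiled for this render pass
bool get_or_compile(std::unordered_map<PSOKey, CompiledPSO, PSOKeyHash>& cache,
	size_t pso_hash, size_t renderpass_hash)
{
	PSOKey key{pso_hash, renderpass_hash};
	if (cache.find(key) != cache.end())
	{
		return false; // already compiled for this render pass
	}
	cache[key] = CompiledPSO{}; // compilation happens here, on the recording thread
	return true;
}
```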

As an improvement, I am still considering the option to specify the RenderPass beforehand, but optionally, since it could be useful for some more hard coded rendering effects, while getting in the way in other places where the renderer was not designed with knowledge of the current render pass in mind.

Resource binding

The most common method of providing resources to shaders is declaring slot numbers on the shader side, and binding a resource to that slot number on the application side. This is the only way it works in DX11, while DX12 and Vulkan can emulate this behaviour, although in quite different ways. In DX12, a global shader visible descriptor heap (one for samplers, one for resources) is managed as a ring buffer, and all descriptors that the shader uses are allocated from it and filled out by copying from staging descriptor heaps. In DX12 it is highly recommended to avoid binding descriptor heaps more than once per frame, so large ones are used that can hold the tier1 limit (1 million CBV_SRV_UAV descriptors and 2048 samplers). The Vulkan paradigm is very different: it uses vkAllocateDescriptorSets() from a separate descriptor pool per thread every time a descriptor layout changes or is invalidated due to binding changes. The Vulkan implementation will dynamically grow the descriptor pools if they run out of sets, by deleting the old ones and allocating larger ones. Even though the implementations are vastly different, and that poses implementation challenges, they fit well into an interface model resembling DX11.
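The DX12 side's ring-buffer management of the shader-visible heap can be sketched as a simple wrapping allocator. This is an illustrative sketch (the real implementation must also check that the wrapped-over region is no longer in use by the GPU):

```cpp
#include <cstdint>

// hands out contiguous descriptor ranges from a fixed-size shader-visible
// heap, wrapping to the start when the end is reached
struct DescriptorRingAllocator
{
	uint32_t capacity = 1000000; // the tier1 CBV_SRV_UAV limit mentioned above
	uint32_t head = 0;

	// returns the first descriptor index of a range of 'count' descriptors
	uint32_t allocate(uint32_t count)
	{
		if (head + count > capacity)
		{
			head = 0; // wrap around to the beginning of the heap
		}
		uint32_t first = head;
		head += count;
		return first;
	}
};
```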

In this example, I will bind mip level 5 of a texture that has subresources as created in a previous example, one subresource per mip contiguously so we implicitly know the subresource index:

device->BindResource(PS, &texture, 26, cmd, 5); // binds subresource 5 to pixel shader slot 26

Or binding the full resource:

device->BindResource(PS, &texture, 26, cmd);

One personal annoyance with DX11 here is that it will automatically unbind a texture SRV if it’s bound as a UAV, which can cause spamming of debug warnings or unintended behaviour (Vulkan and DX12 don’t have this limitation). Currently this must be taken into account when developing graphics code.

Bindless descriptors

There was an earlier blog post for Wicked Engine which discussed bindless descriptors in more detail than I will here.

In short, they are supported in DX12 and Vulkan, while not supported in DX11. The GraphicsDevice interface provides a way of determining whether this feature can currently be used, by calling the capability query along these lines:

bool bindless_supported = device->CheckCapability(GRAPHICSDEVICE_CAPABILITY_BINDLESS_DESCRIPTORS);
Shaders

From the GraphicsDevice perspective, shaders are provided as already compiled binary data blobs to the GraphicsDevice::CreateShader() function. The DX11 implementation expects up to shader model 5.0 HLSL format, DX12 expects shader model 6.0 or higher HLSL format, while Vulkan needs the SPIR-V format. The shaders can be compiled with any tool of preference, since the native shader formats are accepted. DX12 and Vulkan must make use of shader reflection to determine optimal pipeline layouts, so it is a requirement that reflection data isn’t stripped in those cases. Furthermore, Wicked Engine supplies a shader compiler interface that can compile shaders written in HLSL to be used in DX11, DX12 and Vulkan. This tool invokes the d3dcompiler or dxcompiler DLLs, which are the native shader compilers. As an interesting feature, this tool supports compiling every shader in the engine into a C++ header file, to be embedded into the exe, so the engine won’t need to perform additional file loading operations for every shader. The default way is however to compile every shader to a separate .cso file which contains the shader binary. This also supports shader hot reloading (when a shader source changes, the engine can detect it, recompile and reload it).

Mipmap generation

DX11 provided a function called GenerateMips, which is used to generate a full mipchain of a texture in one API call. This is no longer the case with DX12 and Vulkan, where applications are expected to implement this functionality using shaders. This makes sense, as generating mip levels can be dependent on content or art direction. For example, Wicked Engine can generate mip levels with various filters, or bias alpha to help combat too much alpha testing in low levels of detail.

As such, the GraphicsDevice exposes no GenerateMips functionality, and it is expected to be implemented using shaders. This reduces the implementation surface of the GraphicsDevice, which is a good thing because it makes it easier to maintain. In general I am a fan of not exposing anything from the GraphicsDevice that can be implemented on a higher level – copying buffers to textures, or copying texture regions, come to mind. The engine already implements these using shaders, which makes more sense, as shaders can scale, filter textures, convert between texture formats and apply border paddings.
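The core of a shader-based GenerateMips is a repeated 2×2 reduction per mip level; here is the same math on the CPU for a single-channel, power-of-two square texture. This is a standalone illustration only; real shader implementations also handle filtering modes, format conversion, and the alpha biasing mentioned above:

```cpp
#include <cstdint>
#include <vector>

// produce the next mip level of a square, power-of-two, single-channel image
// by averaging each 2x2 block of the parent mip (box filter)
std::vector<float> downsample_mip(const std::vector<float>& src, uint32_t src_size)
{
	uint32_t dst_size = src_size / 2;
	std::vector<float> dst(dst_size * dst_size);
	for (uint32_t y = 0; y < dst_size; ++y)
	{
		for (uint32_t x = 0; x < dst_size; ++x)
		{
			float sum =
				src[(y * 2) * src_size + (x * 2)] +
				src[(y * 2) * src_size + (x * 2 + 1)] +
				src[(y * 2 + 1) * src_size + (x * 2)] +
				src[(y * 2 + 1) * src_size + (x * 2 + 1)];
			dst[y * dst_size + x] = sum * 0.25f;
		}
	}
	return dst;
}
```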


This graphics interface in Wicked Engine is the result of several previous iterations and still continues to evolve. It is not developed only as a toy; a huge amount of graphics effects have already been written with it. I also like to think that it could be a guide or learning resource for other people. If this system of abstraction is not to everyone’s liking, the raw low level API code is still usable and easily accessible, as it is contained in its own separate files (only two files per graphics API: .h and .cpp).

Supporting DX11 becomes more pointless every day. Right now I still see benefit in keeping it operational, as it can be an easy way to bring graphics effects to life and test them in DX11, later adding corrections and optimizations for Vulkan and DX12, for example by adding barriers. This could change in the future and DX11 could be removed; that would certainly free up some API limitations.

Thank you for reading! If you have any comments, post them below. To read a more complete documentation about the graphics interface, visit the Wicked Engine Documentation’s Graphics chapter (which is updated independently from this blog and could have differences in the future).

Bindless Descriptors

The Vulkan and DX12 graphics devices now support bindless descriptors in Wicked Engine. Earlier and in DX11, it was only possible to access textures, buffers (resource descriptors) and samplers in the shaders by binding them to specific slots. First, the binding model limitations will be described briefly, then the bindless model will be discussed.

Binding model

Slots are available in limited amounts, so a shader could only access up to 128 shader resource views (read only resources), 8 unordered access views (read/write resources), 16 constant buffers and 16 samplers. These limits were configurable in the DX12 and Vulkan devices before now, but it still meant that shaders had to declare resources on hard coded resource slots, so it was not flexible enough. Just to give an example, this is how you access a texture on a slot by a shader:

Texture2D<float4> texture : register(t27);
// ...
float4 color = texture.Sample(sampler_linear_clamp, uv); 

To bind the texture on the CPU side, do this:

Texture texture;
// ...
device->BindResource(PS, &texture, 27, cmd);

The other problem is the CPU cost of binding descriptors, which is the main cause of CPU performance problems in rendering code. In some cases, the CPU code will bind a lot of descriptors, especially when rendering the scene with complex shaders that require many inputs like textures, buffers, etc. Binding descriptors is CPU intensive because the implementation will copy all descriptors used by a shader into a contiguous memory area resembling the layout that the shader expects (which will be mostly different for every shader). In the Vulkan and DX12 implementations it was possible to make some simplified assumptions regarding descriptor binding, so they would perform more optimally compared to DX11. These simplifications included:

  • using shader reflection, determine for each shader exactly how many descriptors will be needed, instead of creating a layout that fits all
  • only use one set of slots for all shader stages combined, so no separate slots for vertex and pixel shader
  • put some constant buffers to “root descriptor” instead of descriptor table

But the function with the highest CPU cost that still always showed up in profiles was CopyDescriptorsSimple (DX12) or vkUpdateDescriptorSets (Vulkan).

Bindless model

The bindless model solves both the flexibility and CPU performance problems. Instead of declaring slots in shaders, just say that we access the entire descriptor heap as an unbounded array and access a specific one by a simple index:

Texture2D<float4> bindless_textures[] : register(t0, space1);
// ...
float4 color = bindless_textures[27].Sample(sampler_linear_clamp, uv);

The great thing is that the index value “27” in this example can come from anywhere, for example a material structure:

struct Material
{
  int texture_basecolormap_index;
  int texture_normalmap_index;
  int texture_emissivemap_index;
  // ...
};

Which means that instead of binding all textures required by the material on the CPU, we just bind the material constant buffer (previously you’d need to bind the material constant buffer and all the textures as well). The shader can then access all textures referenced by the material trivially. From the CPU side there is only one thing we need to do: get the descriptor index, i.e. the value of “texture_basecolormap_index” in this example. For that, you can now use this API on the application side:

Texture texture;
// ...
int index = device->GetDescriptorIndex(&texture, SRV);

The GetDescriptorIndex() API can be used after the resource/sampler was created (for example by the CreateTexture function). If the resource wasn’t created yet, or the device doesn’t support bindless, this API will return the value -1. This is important, because shaders must make sure to not use any descriptor which was not created. For example, it is common that a material doesn’t contain a normal map, so the texture_normalmap_index would be -1. In this case the shader must use branching to avoid reading this texture:

if(material.texture_normalmap_index >=0)
  normal = bindless_textures[material.texture_normalmap_index].Sample(sampler_linear_clamp, uv);

To avoid branching in shaders, you can choose to provide a dummy texture (like a 1×1 white texture or something similar) if you detect that one of the texture indices is -1.

If the shader tries to access an invalid resource, undefined behaviour will occur, most likely a GPU hang. To help detecting this, you can start the application with the debugdevice and gpuvalidation command line arguments and the Visual Studio debugger attached. If a problem is detected, the debugging will break and the error messages will be posted to the Visual Studio output window.

Using the binding model alongside the bindless model is supported and makes a lot of sense. You wouldn't want to access your per frame constant buffer in a bindless way; that is less convenient (but still possible). This also lets you transition to the bindless model in small steps, replacing your shaders with bindless ones one at a time. The only thing to keep in mind is that space0 (or when no space binding is provided) is reserved for descriptor bindings in Wicked Engine, so use explicit space1 and greater for bindless declarations.

You can go even further than always binding a material constant buffer; for example, you can keep all your materials inside a StructuredBuffer, or in bindless buffers. If you also reference all your vertex buffers in a bindless way, you will not have to bind anything separately for any of your scene objects. At some point you will need a way to provide descriptor indices to the shaders, and the simplest facility for that is push constants. This is the most efficient way to set a small number of constant values for shaders. To use them, declare a push constant block in the shader using the PUSHCONSTANT macro (which provides a consistent interface for both Vulkan and DX12):

struct MyPushConstants
{
  int mesh;
  int material;
  int instances;
  uint instance_offset;
};
PUSHCONSTANT(push, MyPushConstants);

From the CPU side, set the push constants for the next draw/dispatch call like this:

MyPushConstants push;
push.mesh = device->GetDescriptorIndex(&meshBuffer, SRV);
// fill the rest of push structure...
device->PushConstants(&push, sizeof(push), cmd);

This API is designed for simple, per draw call small data uploads (up to 128 bytes). Each shader can use one push constant block for simplicity. This doesn’t exhibit the same CPU cost as binding a small constant buffer, and has potentially better GPU performance (less memory indirection cost), so prefer this way of specifying per draw data. However, this is only available in DX12 and Vulkan devices and will have no effect in DX11. Check whether push constants can be used by checking bindless support:

if (device->CheckCapability(GRAPHICSDEVICE_CAPABILITY_BINDLESS_DESCRIPTORS))
{
  // OK to use push constants and bindless
}

By having this flexibility of specifying Mesh and Material structures to be used on the GPU, it becomes possible to avoid binding any per draw call vertex/instance/material buffers, and CPU performance will be significantly improved.


Subresources are fully supported in bindless just like in non-bindless. The CreateSubresource() API is used to create a subresource on a resource and returns an integer ID, and the GetDescriptorIndex() API will accept this as an optional parameter if you want to query a specific subresource’s own descriptor index. By default, the whole resource is always viewed. Because the subresources have the same lifetime as their main resource, this doesn’t require any special consideration. Small example of retrieving the descriptor index of MipLevel=5 SRV (shader resource view) for a texture:

int mip5 = device->CreateSubresource(&texture, SRV, 0, 1, 5, 1);
int descriptor = device->GetDescriptorIndex(&texture, SRV, mip5);

Constant Buffer problem in Vulkan

A corner case to look out for is the bindless constant buffers. There is no problem with this declaration in DX12:

ConstantBuffer<MyType> bindless_cbuffers[] : register(b0, space1);

This makes it easy to use bindless constant buffers; however, in Vulkan you can run into unexpected limits on the number of constant buffer descriptors. This limit can be queried from here: VkPhysicalDeviceVulkan12Properties::maxDescriptorSetUpdateAfterBindUniformBuffers

For example, on an RTX 2060 GPU I got an "unlimited" amount for this, but on a GTX 1070 I got a limit of 15, which is not sufficient at all for bindless usage. At the same time I had no problem using bindless constant buffers in DX12 on the same GPU. In this case I had to rewrite the bindless constant buffers as bindless ByteAddressBuffers instead. With the ByteAddressBuffer, you can use the Load<T>(address) templated function in HLSL 6.0+ to get the same logical result as reading from a constant buffer. An example of how a bindless constant buffer can be rewritten:

// Bindless constant buffer:
ConstantBuffer<ShaderMesh> bindless_meshes[] : register(b0, space2);
ShaderMesh mesh = bindless_meshes[42];

// Can be rewritten as:
ByteAddressBuffer bindless_buffers[] : register(t0, space2);
ShaderMesh mesh = bindless_buffers[42].Load<ShaderMesh>(0);

The ByteAddressBuffer is also nice because fewer descriptor tables must be bound by the API implementation layer, while supporting more buffer types. I haven't noticed a performance difference compared to constant buffers on the GPUs/scenes I tested.


Bindless makes it easier to support raytracing too, as using local root signatures is no longer required. Instead we can provide the mesh descriptor index via the top level acceleration structure InstanceID (which is basically userdata) and InstanceID() intrinsic function that is available in raytracing hit-shaders. The raytracing tier 1.1 (and Vulkan) also provides the GeometryIndex() intrinsic in the shader, which is useful to query a subset of a mesh. For example a flexible bindless scene description in a raytracing shader can look like this:

Texture2D<float4> bindless_textures[] : register(t0, space1);
ByteAddressBuffer bindless_buffers[] : register(t0, space2);
StructuredBuffer<ShaderMeshSubset> bindless_subsets[] : register(t0, space3);

// Then inside the hit shader:
ShaderMesh mesh = bindless_buffers[InstanceID()].Load<ShaderMesh>(0);
ShaderMeshSubset subset = bindless_subsets[mesh.subsetbuffer][GeometryIndex()];
ShaderMaterial material = bindless_buffers[subset.material].Load<ShaderMaterial>(0);
// continue shading...

This example works well, but watch out: there is a lot of dependent resource dereferencing here, so it is worth experimenting with flattening some of the hierarchy, which could improve shader performance. For example, to read a MeshSubset you currently need to load the Mesh first, but in most cases there is only 1 subset per mesh, so it could be better to duplicate mesh data for each subset and load them as one. Remember that the Mesh and Subset data here are just descriptor indices referencing vertex buffers and such, so the duplication would not be very excessive.

The above example declares a variety of bindless resource arrays, which is a bit questionable. Consider the StructuredBuffer case: for every type of StructuredBuffer, we now have to declare a separate bindless resource with the appropriate type. Internally, the DX12 and Vulkan implementations will bind a descriptor table (descriptor set in Vulkan) viewing the entire descriptor heap (Vulkan: descriptor pool) for each declaration. This means the same heap is bound multiple times, once per bindless declaration, which just doesn't make a lot of sense (Update: As Simon Coenen pointed out in the comments, it is valid to create one descriptor table with multiple unbounded ranges, so there won't be a need for more than 2 SetDescriptorTable calls per pipeline. Thanks!). DX12 addresses this with the shader model 6.6 dynamic resources feature, which lets you index the global resource or sampler descriptor heap directly and cast the result to the appropriate descriptor type:

Texture2D<float4> tex = ResourceDescriptorHeap[descriptor_index];
SamplerState sam = SamplerDescriptorHeap[sampler_index];
This makes sense in DX12, but Vulkan needs separate descriptor pools for every distinct resource type (storage buffer, uniform buffer, sampled image, etc.), so I'm not yet sure how the two APIs could be made compatible like this.


If the descriptor index is not a uniform value across the wave, you must mark it with the NonUniformResourceIndex() function in HLSL before using it to index into a bindless heap. If the value comes from a constant buffer, a push constant, or WaveReadFirstLane(), it is scalar and no marking is needed. Otherwise, NonUniformResourceIndex() ensures that the indexing happens correctly for each divergent lane. Unfortunately this is easy to miss, because the HLSL compiler doesn't complain about it. Apparently using a divergent index without it happens to work on some hardware, which seems to be the case for me on Nvidia. To read more about this, see the blog post: Direct3D 12 – Watch out for non-uniform resource index!

Under the hood

Perhaps it’s worth to describe briefly how all this is implemented with Vulkan and DX12 graphics APIs. In DX12, there are two descriptor heaps at any time that are visible by shaders: one for samplers only, the other is for constant buffers, shader resource views and unordered access views (I call the second one the “resource descriptor heap”). It is not recommended to switch shader visible heaps at any time, so at application start they are created to fit the max amount of descriptors: 2048 samplers and 1 million resource descriptors (tier1 limit of DX12 model). These heaps will hold both bindless descriptors and binding slot layouts alike. The way it works is to just split the heap into two parts: the lower half will contain the bindless descriptors, the upper half of the heap is used as a lock-free ring buffer allocator and if a draw call requires new binding layout, it will be dynamically allocated. I really like the DX12 way of just giving you access to the heap and a way to address and copy descriptors within it, leaves a lot of freedom for implementation. For the bindless allocation, a simple free list is used. When shaders use bindless descriptors (as seen from shader reflection), then the sampler or resource heap will be automatically set to the appropriate root parameter just before the draw/dispatch is called – the user will not need to care about this.

In Vulkan, support for bindless relies on the VkPhysicalDeviceVulkan12Features::descriptorIndexing feature. I decided to create one descriptor pool per descriptor type for bindless, so that's currently 8 pools that can hold a large number of descriptors. These are created once (using the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag) and the full descriptor sets are allocated immediately. A custom free list allocator is responsible for keeping track of which descriptors are free (in a similar way as in DX12). The binding layouts are allocated from a separate descriptor pool that can hold multiple different descriptor types and doesn't support freeing descriptors – the whole pool is reset once the GPU has finished with it. Based on shader reflection, the device layer detects which descriptor sets need to be bound where. The downside with Vulkan is that the register space declaration maps to the descriptor set index. So if you declare a bindless table in space12, for example, and nothing else, the device layer will need to bind dummy descriptor sets from set=0 through set=11. DX12 doesn't have this problem: you can declare something in space687 and not care, since the root signature handles the mapping. But if you plan to use Vulkan, the advice is to use the lowest space numbers you can.

In both of these implementations, allocating bindless descriptors will happen the first time you create the resource, or when creating a subresource, and these are immediately accessible for shaders. Their lifetime is bound to the resource itself, so the user mostly doesn’t need to care about it.

Difficulty of moving to bindless

Because it is possible to use bindless and the old school binding model together, it becomes really easy to move to bindless in baby steps – or even larger steps. The fun problem is coming up with new flexible data structures that make sense. The less fun part is supporting DX11 at the same time as bindless; this will be slightly intrusive depending on how far you go. I decided to go full bindless in the most frequently used shaders – the scene rendering "object shaders" – when using DX12 or Vulkan. This means the CPU code that renders all meshes has a large if(bindless) branch. If bindless is supported, only a push constant block is set; otherwise a material constant buffer and all textures are bound, and it seems manageable so far. The "object shaders" use a compile time #ifdef path for bindless, and all texture accesses and vertex buffer reads are redefined to read from bindless resources; a simplified example:

#ifdef BINDLESS
Texture2D<float4> bindless_textures[] : register(t0, space1);
SamplerState bindless_samplers[] : register(t0, space2);
ByteAddressBuffer bindless_buffers[] : register(t0, space3);
PUSHCONSTANT(push, ObjectPushConstants);

inline ShaderMaterial GetMaterial()
{
  return bindless_buffers[push.material].Load<ShaderMaterial>(0);
}

#define texture_basecolormap  bindless_textures[GetMaterial().texture_basecolormap_index]

#else // Non-bindless path below:
cbuffer materialCB : register(b0)
{
  ShaderMaterial g_xMaterial;
};

inline ShaderMaterial GetMaterial()
{
  return g_xMaterial;
}

Texture2D<float4> texture_basecolormap : register(t0);

#endif // BINDLESS

Thankfully the shaders were already using branching to avoid reading material textures that are not loaded, so guarding against uninitialized descriptors was not a problem that needed to be tackled in bindless. It’s pretty important to have this branching because bindless will not set “null descriptors” (descriptors are declared volatile), and referencing a descriptor that’s uninitialized can result in GPU hangs.

The vertex input struct is using functions to get vertex position and other properties and those functions will be redefined based on bindless. In DX11, input layouts are still used to provide vertex data, but in bindless I fetch everything by SV_VertexID and SV_InstanceID system provided input semantics:

struct VertexInput
{
#ifdef BINDLESS
  // Data coming from bindless fetching:
  uint vertexID : SV_VertexID;
  uint instanceID : SV_InstanceID;

  float4 GetPosition()
  {
    return float4(bindless_buffers[push.vertexbuffer_pos_wind].Load<float3>(vertexID * 16), 1);
  }

#else
  // Data coming from input layout:
  float4 pos : POSITION_NORMAL_WIND;

  float4 GetPosition()
  {
    return float4(pos.xyz, 1);
  }

#endif // BINDLESS
};

Wicked Engine has an ubershader that includes most of the "object shader" code in one file and relies on compile time defines to enable specific paths. With that, moving to bindless was not a huge effort.


Expect big savings in CPU performance with bindless. In large scenes this can easily be worth multiple milliseconds. The GPU time can be worse though, as the new data structures will probably use more indirections. I didn't manage to arrive at a consistent conclusion here unfortunately, but I also didn't discover any big performance problems. It's not always true that the bindless model just runs faster overall: some render passes saw a slight reduction in GPU performance, while some ran faster. The difference between Vulkan and DX12 is also not trivial: I seemed to have more CPU gains with DX12, while the Vulkan GPU times were usually better.


All the main graphics debuggers I use (Renderdoc, Nsight, Pix) now support bindless pretty well. Using bindless in these tools can be quite a bit slower than not using it, so there is room for improvement. Enabling the GPU Based Validation layer is very useful, as it will report uninitialized descriptors that are accessed by shaders. However, it makes the application very slow in my experience, so it's not a joy to use.

Closing thoughts

Looking into the future, more systems will be moved to bindless where it makes sense, especially texture atlases. I think texture atlases are now "deprecated" by bindless. There is just so much effort in updating them and ensuring MIP levels are correct. Not to mention the limitation on texture formats: you can't put different texture formats into the same atlas, and it is really difficult to update atlases with block compressed formats. There is also the GLTF model format, which adds multiple new textures to be sampled with each extension, and that begins to stress the bind slot limits, especially with material blending shaders. It will be a lot easier to add new textures to materials now.

There will be a lot of places too where bindless doesn’t provide a benefit, for example global per frame resource bindings. I’m still not comfortable just dropping DX11, as it’s very stable now and supports a lot of older GPUs with acceptable performance. For the newer graphics API, I’d rather focus on working with cutting edge features and not worrying so much about backwards compatibility at the moment.

Thank you for reading!

Some other resources:
Bindless Deferred Decals in The Surge 2

Variable Rate Shading: first impressions

Variable Rate Shading (VRS) is a new DX12 feature introduced recently, which can be used to control the shading rate. To be more precise, it is used to reduce the shading rate, as opposed to the Multi Sampling Anti Aliasing (MSAA) technique, which is used to increase it. When using MSAA, every pixel gets multiple samples allocated in the render target, but unless multiple triangles touch it, it will only be shaded once. VRS, on the other hand, doesn't allocate multiple samples per pixel; instead it can broadcast one shaded pixel to nearby pixels, shading a whole group of pixels only once. The shading rate determines how big a group of pixels gets shaded as one.


DirectX 12 lets the developer specify the shading rate as a block of pixels: it can be 1×1 (default, most detailed), 1×2, 2×1 or 2×2 (least detailed) in the basic hardware implementation. Optionally, hardware can also support 2×4, 4×2 and 4×4 pixel groups at an additional capability level. With basic Tier1 VRS hardware, the shading rate can be selected per draw call. Controlling it per draw call is already a huge improvement over MSAA, because it means the shading rate is not uniform across the screen. Setting the shading rate couldn't be easier:

commandlist5->RSSetShadingRate(D3D12_SHADING_RATE_2X2, nullptr); // later about second parameter

That’s it, unlike MSAA, we don’t need to do any resolve passes, it just works as is.

The Tier2 VRS feature level lets the developer specify the shading rate granularity even per triangle by using the SV_ShadingRate HLSL semantic for a uint shader input parameter. The SV_ShadingRate can be written as output from the vertex shader, domain shader, geometry shader and mesh shader. In all of the cases, the shading rate will be set per primitive, not per vertex, even though vertex and domain shaders only support the per vertex execution model. The triangle will receive the shading rate of the provoking vertex, which is the first vertex of the three vertices that make up the triangle. The pixel shader can also read the shading rate as an input parameter, which could be helpful in visualizing the rate.

The Tier2 VRS implementation also supports controlling the shading rate by a screen aligned texture. The screen aligned texture is a R8_UINT formatted texture, which contains the shading rate information per tile. A tile can be 8×8, 16×16 or 32×32 pixel block, it can be queried from DX12 as part of the D3D12_FEATURE_DATA_D3D12_OPTIONS6 structure:

D3D12_FEATURE_DATA_D3D12_OPTIONS6 features_6;
device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6, &features_6, sizeof(features_6));
features_6.VariableShadingRateTier; // shading rate image and per primitive selection only on tier2
features_6.ShadingRateImageTileSize; // tile size will be 8, 16 or 32
features_6.AdditionalShadingRatesSupported; // Whether 2x4, 4x2 and 4x4 rate is supported

This means that the shading rate image resolution will be:

width = (screen_width + tileSize - 1) / tileSize;
height = (screen_height + tileSize - 1) / tileSize;

To bind the shading rate image, a call exists:

commandlist5->RSSetShadingRateImage(shadingRateImage);

The shading rate image will need to be written from a compute shader through an Unordered Access View (RWTexture2D<uint>). Before binding it with the RSSetShadingRateImage command, the resource needs to be in the D3D12_RESOURCE_STATE_SHADING_RATE_SOURCE state.

So there are multiple ways to set the shading rate: RSSetShadingRate, SV_ShadingRate and RSSetShadingRateImage, but which one takes effect? This can be specified through the second parameter of the RSSetShadingRate() call, which takes an array of combiners. The combiners specify which shading rate selector wins: for example, the least detailed one (D3D12_SHADING_RATE_COMBINER_MAX), the most detailed one (D3D12_SHADING_RATE_COMBINER_MIN), or other logic. Right now I just want to apply the coarsest shading rate that was selected anywhere, so I call this at the beginning of every command list:

D3D12_SHADING_RATE_COMBINER combiners[] =
{
  D3D12_SHADING_RATE_COMBINER_MAX,
  D3D12_SHADING_RATE_COMBINER_MAX,
};
GetDirectCommandList(cmd)->RSSetShadingRate(D3D12_SHADING_RATE_1X1, combiners);

Next, I’d like to show some of the potential effects these can play into.

  • Materials
    For example, if a material is expensive or not very important, lower its shading rate. It can easily be a per material setting specified at authoring time.
Comparison of native (full rate) and variable rate shading (4×4 reduction). Texture sampling quality is reduced with VRS.
Lighting quality is reduced with VRS (below) compared to full resolution shading (above), but geometry edges are retained.
  • Particle systems
    When drawing off screen particles into a low resolution render target, we save performance easily, but it is difficult to compose the particles back and retain smooth edges against the geometry in the depth buffer. Instead, we can render at full resolution and reduce the shading rate. This way we can also keep using hardware depth testing and still improve performance.
4×4 shading rate reduction for the particles. Large overlapping particles with lighting must reduce shading rate or be rendered at a lower resolution for good performance. (ground plane also using VRS here)
  • Objects in the distance
    Objects in the distance can easily use a reduced shading rate via per draw call or per primitive rate selection.
  • Objects behind motion blur or depth of field
    Fast moving or out of focus objects can be shaded more coarsely, and the shading rate image feature can be used for this. In my first integration, I am using a compute shader that dispatches one thread group per tile, and each thread in the group reads pixel velocities until all pixels in the tile are read. Each pixel's velocity is mapped to a shading rate, and then the most detailed one in the tile is determined via an atomic operation. The shader internally must write values taken from the D3D12_SHADING_RATE enum, so in order to keep the implementation API independent, these values are not hard coded but provided in a constant buffer. [My classification shader source code here, but it will probably change in the future]
Shading rate classification by velocity buffer and moving camera.
  • Alpha testing
    For vegetation, we often want to do alpha testing, and often a depth prepass is used with alpha testing. In that case, we don't have to alpha test in the second pass, where we render colors with more expensive pixel shaders, because the depth buffer is already computed and we can rely on depth testing against previous results. The idea is that we can reduce the shading rate for the alpha tested vegetation in the second pass only, while retaining high resolution alpha testing quality from the depth prepass.
Vegetation particle system using alpha testing


  • One of my observations is that when zoomed in on a reduced shading rate object, we may see some blockiness, as if it used point/nearest neighbour sampling, but after some thinking it makes sense: only a single pixel value is broadcast to all the neighbours, and no filtering or resolving takes place.
  • Also, mip level selection will be different in coarse shaded regions, because derivatives are larger when using larger pixel blocks (the samples are farther away from each other). For me personally it doesn't matter, because the result is blocky anyway, and coarse shading should be applied in places where users are less likely to notice. I am not sure how I would handle it with the Tier2 features, but the per draw call rate selection could be balanced by setting a lower mip LOD bias for the samplers in the draw call when coarse shading is selected.
  • Although off screen particles compose more correctly with the depth buffer when using VRS, if we are rendering soft particles (blending computed in the shader from the linear depth buffer and particle plane difference), the soft regions where depth testing doesn't happen will show some blockiness:
There are particles in the foreground, not depth tested and instead blended in the shader which causes blockiness with VRS enabled (left)
  • Classification
    There are many more aspects to consider when classifying tiles for image based shading rate selection. Right now, the simple thing to try was to select increasingly coarser shading rates with increasing minimum tile velocity. Other aspects to consider that I can think of or have heard of: depth of field focus, depth discontinuity, visible surface detail. All of these would most likely be fed from the previous frame and reprojected with the current camera matrices. Tweaking and trying all of these will have to wait, and probably depends on the kind of game/experience one is making. Strategy games will likely not care about motion blur, unlike racing games.


Enabling VRS gets me a significant performance boost, especially when applied to large geometries, such as the floor in Sponza (which also uses an expensive parallax occlusion mapping shader for displacement mapping), or the large billboard particles that overlap and use an expensive lighting shader. Some performance results using an RTX 2060 at 4k resolution:

  • Classification:
    0.18ms – from velocity buffer only
  • Forward rendering:
    5.6ms – stationary camera (full res shading)
    1.8ms – moving camera (variable rate shading)
    4ms – only floor (with parallax occlusion mapping) set to constant 4×4 shading rate
  • Motion blur:
    0.75ms – stationary camera (plus curtain moving on screen)
    3.6ms – moving camera
left: visualizing variable shading rate
right: motion blur amount
Motion blur increases cost when blur amount increases, but VRS reduces cost at the same time
  • Billboard particle system (large particles close to camera)
    4ms – unlit, full resolution shading
    3ms – unlit, 4×4 shading rate
    24.7ms – shaded, full resolution shading
    3.4ms – shaded, 4×4 shading rate
particle performance test – very large particles on screen, overlapping, sampling shadow map and lighting calculation

Thanks for reading, you can read about VRS in more detail in the DX12 specs.

As Philip Hammer called out on Twitter, the Nvidia VRS extension is also available in Vulkan, OpenGL and DX11.


Vulkan now has a cross vendor extension for variable rate shading, called VK_KHR_fragment_shading_rate. This is somewhat different from the DX12 spec, as the shading rate image needs to be a render pass attachment instead of a separate binding. This differs from the older VK_NV_shading_rate_image, which was closer to DX12 in that regard; the new one follows a more Vulkan-like approach. For an example implementation, you can look at Wicked Engine's Vulkan interface.

Capsule Collision Detection

Capsule shapes are useful tools for handling simple game physics. Here you will find out why and how to detect and handle collisions between capsule and triangle mesh, as well as with other capsules, without using a physics engine.

(Note on not using a physics engine: there is a reason for doing this without a physics engine, even if you have one up and running in your game engine. A physics engine is probably configured for physically correct behaviour, which is not necessarily the most gameplay friendly. It also comes with a lot of parameters and overhead, such as the need to set up and manage a physics collision world, probably separately from your own entity system. Also, it is fun to learn how these things work.)

The capsule collisions in action

Capsules are often a better choice to represent game entities than spheres or boxes. Think about a human player character for example:

The bounding sphere doesn’t let the player close to the wall

As you can see, the sphere collision shape will not let the player close to walls if it is large enough to contain the player. But it might work well for other types of more rounded characters. For example imagine this type of roundish entity:

And now more seriously, take a look at the bounding box collision:

Bounding box is better than a sphere, but it has corner cases (literally)

The box shape has another problem: it is not rounded, so it will not be able to traverse stairs, and it can be jerky when sliding along walls (shown later). These lead to a weird or unsatisfying gameplay experience.

Capsule is even better

A capsule shape will provide a result that matches player expectations more closely. From a programming standpoint it is more complicated to implement, but not by much. The implementation is just an extension of sphere collision detection, so let's start with that.

Sphere – Sphere collision

This is quite simple: take the distance between the two sphere centres, and if it is less than the sum of both radii, they are intersecting. We are also interested in the penetration depth and normal vector, to handle game physics such as sliding along surfaces/walls (shown later).

Sphere – Triangle collision

This is probably the most complicated part, so we should break it down. First, we determine the triangle's plane from its corner points and check whether the sphere intersects the plane (if not, we stop). This is done by taking the signed distance between the plane and the sphere centre. I tried to make the following samples resemble HLSL code, but treat them as pseudo-code:

float3 p0, p1, p2; // triangle corners
float3 center; // sphere center
float3 N = normalize(cross(p1 - p0, p2 - p0)); // plane normal
float dist = dot(center - p0, N); // signed distance between sphere and plane
if(!mesh.is_double_sided() && dist > 0)
  continue; // can pass through back side of triangle (optional)
if(dist < -radius || dist > radius)
  continue; // no intersection

If the code is still running, that means the sphere is intersecting the infinite plane of the triangle. Next, we determine whether any point of the triangle is inside the sphere. There are two cases. The first case: if the sphere centre projected onto the plane is inside the triangle perimeter, the sphere intersects the triangle:

When sphere centre projection is inside the triangle…
float3 point0 = center - N * dist; // projected sphere center on triangle plane

// Now determine whether point0 is inside all triangle edges:
float3 c0 = cross(point0 - p0, p1 - p0);
float3 c1 = cross(point0 - p1, p2 - p1);
float3 c2 = cross(point0 - p2, p0 - p2);
bool inside = dot(c0, N) <= 0 && dot(c1, N) <= 0 && dot(c2, N) <= 0;

If inside is true, then we can just take point0 (the projected sphere centre) as the closest point on the triangle to the sphere. Otherwise, we are not yet sure whether we intersect the triangle, so for all three edge line segments we determine the closest point to the sphere centre, pick the closest of the three, and check whether it is within the sphere radius:

When sphere centre projection is not inside triangle…
float3 ClosestPointOnLineSegment(float3 A, float3 B, float3 Point)
  float3 AB = B - A;
  float t = dot(Point - A, AB) / dot(AB, AB);
  return A + saturate(t) * AB; // saturate(t) can be written as: min(max(t, 0), 1)

float radiussq = radius * radius; // sphere radius squared

// Edge 1:
float3 point1 = ClosestPointOnLineSegment(p0, p1, center);
float3 v1 = center - point1;
float distsq1 = dot(v1, v1);
bool intersects = distsq1 < radiussq;

// Edge 2:
float3 point2 = ClosestPointOnLineSegment(p1, p2, center);
float3 v2 = center - point2;
float distsq2 = dot(v2, v2);
intersects |= distsq2 < radiussq;

// Edge 3:
float3 point3 = ClosestPointOnLineSegment(p2, p0, center);
float3 v3 = center - point3;
float distsq3 = dot(v3, v3);
intersects |= distsq3 < radiussq;

At the end of this, we will know whether we intersect with the triangle or not. If we intersect, we must compute the penetration depth and normal so that we will be able to properly handle the collision (bounce back or slide on surface). Like I said, we need to determine the closest intersection point for this. It is either inside the triangle, or on one of the edges:

if(inside || intersects)
  float3 best_point = point0;
  float3 intersection_vec;

  if(inside)
    intersection_vec = center - point0;
  else
    float3 d = center - point1;
    float best_distsq = dot(d, d);
    best_point = point1;
    intersection_vec = d;

    d = center - point2;
    float distsq = dot(d, d);
    if(distsq < best_distsq)
      best_distsq = distsq;
      best_point = point2;
      intersection_vec = d;

    d = center - point3;
    distsq = dot(d, d);
    if(distsq < best_distsq)
      best_distsq = distsq;
      best_point = point3;
      intersection_vec = d;

  float len = length(intersection_vec); // vector3 length calculation: sqrt(dot(v, v))
  float3 penetration_normal = intersection_vec / len; // normalize
  float penetration_depth = radius - len; // radius = sphere radius
  return true; // intersection success

That’s it for the sphere – triangle intersection!
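Putting the pieces together, here is a self-contained C++ sketch of the whole sphere – triangle routine described above. The vector helpers are minimal stand-ins for the HLSL intrinsics, and the one-sided-triangle early-out is omitted for brevity:

```cpp
#include <cmath>

struct float3 { float x, y, z; };
static float3 operator-(float3 a, float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float3 operator+(float3 a, float3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static float3 operator*(float3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float3 cross(float3 a, float3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
static float3 normalize(float3 v) { return v * (1.0f / std::sqrt(dot(v, v))); }
static float saturate(float t) { return t < 0 ? 0 : (t > 1 ? 1 : t); }

static float3 ClosestPointOnLineSegment(float3 A, float3 B, float3 Point)
{
    float3 AB = B - A;
    float t = dot(Point - A, AB) / dot(AB, AB);
    return A + AB * saturate(t);
}

bool SphereIntersectsTriangle(float3 center, float radius,
                              float3 p0, float3 p1, float3 p2,
                              float3& penetration_normal, float& penetration_depth)
{
    float3 N = normalize(cross(p1 - p0, p2 - p0)); // triangle plane normal
    float dist = dot(center - p0, N); // signed distance to the plane
    if (dist < -radius || dist > radius)
        return false; // sphere does not reach the plane

    // Case 1: projected centre inside the triangle perimeter
    float3 point0 = center - N * dist;
    float3 c0 = cross(point0 - p0, p1 - p0);
    float3 c1 = cross(point0 - p1, p2 - p1);
    float3 c2 = cross(point0 - p2, p0 - p2);
    bool inside = dot(c0, N) <= 0 && dot(c1, N) <= 0 && dot(c2, N) <= 0;

    float3 best_point = point0;
    if (!inside)
    {
        // Case 2: closest point on one of the three edges
        best_point = ClosestPointOnLineSegment(p0, p1, center);
        float best_distsq = dot(center - best_point, center - best_point);
        const float3 candidates[] = { ClosestPointOnLineSegment(p1, p2, center),
                                      ClosestPointOnLineSegment(p2, p0, center) };
        for (float3 c : candidates)
        {
            float distsq = dot(center - c, center - c);
            if (distsq < best_distsq) { best_distsq = distsq; best_point = c; }
        }
        if (best_distsq >= radius * radius)
            return false; // farther than the radius from every edge
    }

    float3 v = center - best_point;
    float len = std::sqrt(dot(v, v));
    penetration_normal = len > 0 ? v * (1.0f / len) : N;
    penetration_depth = radius - len;
    return true;
}
```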

Capsule collision

The capsule collision is a simple extension to the sphere intersection test. I define the capsule with two points, base and tip, and the radius:

Points A and B are derived from the Base and Tip points and the radius. They will also be useful in computations. In game code I found that it makes more sense to store the Base and Tip points instead of the A and B sphere centers.

The main idea of the implementation is that we want to do a sphere intersection test with the most relevant sphere along the capsule axis, clamped to the A-B line segment. Let’s see it in action in the following cases:

Capsule – Capsule collision

We are looking for BestA and BestB points.

As you can see, the intersection test is in the end performed against two spheres. To select the most relevant spheres, the previously defined ClosestPointOnLineSegment() function will come in handy. We compute the A and B line endpoints on both capsules, then take vectors and distances between all endpoints of the two capsules, and finally compute the closest points between the two line segments:

// capsule A:
float3 a_Normal = normalize(a.tip - a.base);
float3 a_LineEndOffset = a_Normal * a.radius;
float3 a_A = a.base + a_LineEndOffset;
float3 a_B = a.tip - a_LineEndOffset;

// capsule B:
float3 b_Normal = normalize(b.tip - b.base);
float3 b_LineEndOffset = b_Normal * b.radius;
float3 b_A = b.base + b_LineEndOffset;
float3 b_B = b.tip - b_LineEndOffset;

// vectors between line endpoints:
float3 v0 = b_A - a_A;
float3 v1 = b_B - a_A;
float3 v2 = b_A - a_B;
float3 v3 = b_B - a_B;

// squared distances:
float d0 = dot(v0, v0); 
float d1 = dot(v1, v1); 
float d2 = dot(v2, v2); 
float d3 = dot(v3, v3);

// select best potential endpoint on capsule A:
float3 bestA;
if (d2 < d0 || d2 < d1 || d3 < d0 || d3 < d1)
  bestA = a_B;
else
  bestA = a_A;

// select point on capsule B line segment nearest to best potential endpoint on A capsule:
float3 bestB = ClosestPointOnLineSegment(b_A, b_B, bestA);

// now do the same for capsule A segment:
bestA = ClosestPointOnLineSegment(a_A, a_B, bestB);

We selected the two best possible candidates on both capsule axes. What remains is to place spheres on those points and perform the sphere intersection routine:

float3 penetration_normal = bestA - bestB;
float len = length(penetration_normal);
penetration_normal /= len;  // normalize
float penetration_depth = a.radius + b.radius - len;
bool intersects = penetration_depth > 0;

Done, we can now detect collision between two capsules.

Capsule – Triangle collision

The principle is very similar to the previous example: we select the closest point on the capsule line to the triangle, place a sphere on that point and then perform the sphere – triangle test. We can find the closest point on the capsule line to the triangle with ray tracing. First we trace the plane with the capsule line (as a ray). Then we determine the closest point on the triangle to the trace point. The reference point is the closest point on the capsule line to that. We place the sphere on the reference point and do a normal sphere – triangle intersection test as usual.

Only tracing the plane and finding the closest point on the capsule line is not enough: we must find the closest point on the capsule line to the triangle. We are looking for the Reference point (pink).
It seems that this image has a mistake: the segment marked with “penetration depth” is really (radius – penetration depth)
// Compute capsule line endpoints A, B like before in capsule-capsule case:
float3 CapsuleNormal = normalize(tip - base);
float3 LineEndOffset = CapsuleNormal * radius;
float3 A = base + LineEndOffset;
float3 B = tip - LineEndOffset;

// Then for each triangle, ray-plane intersection:
//  N is the triangle plane normal (it was computed in the sphere - triangle intersection case)
float t = dot(N, p0 - base) / dot(N, CapsuleNormal); // ray parameter along the capsule line
float3 line_plane_intersection = base + CapsuleNormal * t;

float3 reference_point = {find closest point on triangle to line_plane_intersection};

// The center of the best sphere candidate:
float3 center = ClosestPointOnLineSegment(A, B, reference_point);

The “find closest point on triangle to line_plane_intersection” refers to the same approach that was used in the sphere – triangle intersection case, when we try to determine if the sphere centre projection is inside the triangle, or find a closest edge point. Here, instead of the sphere centre projection, we use the line_plane_intersection point:

// Determine whether line_plane_intersection is inside all triangle edges:
float3 c0 = cross(line_plane_intersection - p0, p1 - p0);
float3 c1 = cross(line_plane_intersection - p1, p2 - p1);
float3 c2 = cross(line_plane_intersection - p2, p0 - p2);
bool inside = dot(c0, N) <= 0 && dot(c1, N) <= 0 && dot(c2, N) <= 0;

if(inside)
  reference_point = line_plane_intersection;
else
  // Edge 1:
  float3 point1 = ClosestPointOnLineSegment(p0, p1, line_plane_intersection);
  float3 v1 = line_plane_intersection - point1;
  float distsq = dot(v1, v1);
  float best_dist = distsq;
  reference_point = point1;

  // Edge 2:
  float3 point2 = ClosestPointOnLineSegment(p1, p2, line_plane_intersection);
  float3 v2 = line_plane_intersection - point2;
  distsq = dot(v2, v2);
  if(distsq < best_dist)
    reference_point = point2;
    best_dist = distsq;

  // Edge 3:
  float3 point3 = ClosestPointOnLineSegment(p2, p0, line_plane_intersection);
  float3 v3 = line_plane_intersection - point3;
  distsq = dot(v3, v3);
  if(distsq < best_dist)
    reference_point = point3;
    best_dist = distsq;

After this (now that the sphere is placed onto the reference point), just finish it off with the sphere – triangle intersection, and you are done.

There is a corner case with this. When the capsule line is parallel to the triangle plane, the ray – plane trace will fail. You can detect this case by checking the dot product between the ray direction (CapsuleNormal) and triangle normal (N). If it’s zero, then they are parallel. When this happens, instead of relying on the tracing, a simple fix is to just take an arbitrary point (for example the first corner) from the triangle and use that as reference point. The reference point will be used the same way as the intersection point, to determine the closest point on the capsule A-B segment.
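In practice the parallelism check should compare the dot product against a small epsilon rather than exact zero, since floating point rarely produces an exact 0. A small C++ sketch of this check, with illustrative names matching the listings above:

```cpp
#include <cmath>

struct float3 { float x, y, z; };
static float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// True when the capsule axis is (nearly) parallel to the triangle plane,
// in which case the ray - plane trace would be unreliable and we fall back
// to using an arbitrary triangle corner as the reference point.
bool AxisParallelToPlane(float3 N, float3 CapsuleNormal, float epsilon = 1e-6f)
{
    return std::fabs(dot(N, CapsuleNormal)) < epsilon;
}
```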


Now that you have capsule collision, you will want to use the results. You now know when two primitives intersect, but you also get the penetration normal and depth. You can use these to let the character slide on surfaces.

If the capsule collides with a surface, then modify the velocity so that it will be absorbed in the direction of the impact and instead slide parallel to the surface:

Absorbing penetration force (red arrow) will make the box collider stop in place because it is perpendicular to collision plane (purple line). The collision plane is perpendicular to the penetration normal (blue arrow)
The penetration force (red arrow) is nicely absorbed and allows sliding upwards (green arrow).
collisionobject, normal, depth = SceneIntersectCapsule(capsule);
if(collisionobject)
  // Modify player velocity to slide on contact surface:
  float velocity_length = length(velocity);
  float3 velocity_normalized = velocity / velocity_length;
  float3 undesired_motion = normal * dot(velocity_normalized, normal);
  float3 desired_motion = velocity_normalized - undesired_motion;
  velocity = desired_motion * velocity_length;

  // Remove penetration (penetration epsilon added to handle infinitely small penetration):
  capsule.Translate(normal * (depth + 0.0001f));

Sliding on the contact surface is a very nice and expected experience for the player. It handles collisions with walls as well as with the floor, and it will automatically handle moving up slopes or stairs as well. However, if there is gravity, the player will constantly slide down slopes and small stairs, and gravity also makes it difficult to climb them, which is not desirable. In my experience, it is best to perform the collision detection twice: once for the player motion forces, and once with only the gravity force. In the gravity collision phase, skip the sliding along the surface if the surface is flat enough to stand on. You can check how sloped it is by taking the dot product of the penetration normal and the “up” vector:

if(dot(normal, float3(0, 1, 0)) > 0.3f)
  ground_intersection = true;
  gravity_force = 0;

Discrete collision detection

This kind of collision detection can detect when a stationary capsule intersects with another one or with a triangle, but if the capsule is moving very fast, it can miss objects between two game update ticks. This phenomenon is called the bullet through paper problem. To avoid this, we must do continuous collision detection. A very simple solution is to just do multiple steps of collision detection during one game frame. For example, let’s say that we want to divide the game frame into 5 collision sub-steps:

int ccd_max = 5;
for(int ccd = 0; ccd < ccd_max; ++ccd)
  float3 step = velocity / ccd_max;
  capsule.Translate(step); // move the capsule forward by one sub-step
  collisionobject, normal, depth = SceneIntersectCapsule(capsule);
  // …

The more sub-steps we do, the more accurate the result will be, but also the more expensive.
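To see why sub-stepping helps, here is a deliberately simplified 1D illustration (hypothetical names, point-vs-wall instead of capsule-vs-triangle): one big step can jump straight over a thin obstacle, while sub-steps sample positions close enough together to catch it. A real capsule has extent, so it tolerates coarser steps than a point does.

```cpp
// A point starts at x = 0 and moves 12 units in one frame.
// A thin wall occupies [9.5, 10.5] on the x axis.
bool HitsWall(float x) { return x >= 9.5f && x <= 10.5f; }

bool SweepHits(float start, float velocity, int substeps)
{
    float x = start;
    float step = velocity / substeps; // like: capsule.Translate(velocity / ccd_max)
    for (int i = 0; i < substeps; ++i)
    {
        x += step;
        if (HitsWall(x))
            return true; // early out on the first detected collision
    }
    return false; // tunneled through (or never reached the wall)
}
```

With one step the point jumps from 0 straight to 12 and never samples inside the wall; with five sub-steps it samples 9.6, which is inside.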

As a bonus, an idea worth trying in the future is to represent a moving sphere as a capsule (sweeping the sphere over the frame) and run a single capsule collision detection step, instead of multiple-step continuous collision detection, for example to implement geometry collision for a particle system.


Some simple optimizations can get us a long way here. The most obvious one is to perform AABB (axis aligned bounding box) checks before doing a capsule collision test. An AABB for a capsule is easily generated: for example, generate AABBs for the two opposite end spheres and merge them. Only check collisions between objects whose AABBs overlap the capsule. I also got a very prominent speedup by doing the AABB check between the capsule AABB and each triangle AABB; this was very important for a large terrain mesh. In continuous collision detection fashion, you can also create one large AABB encompassing the starting and ending capsule positions (according to the velocity) for all the CCD steps, so the sub-steps will only check a subset of potential collider objects.
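The capsule AABB construction described above can be sketched like this (the types and helper names are illustrative): build AABBs around the two end spheres at A and B, then merge them. Since every point of the capsule is within the radius of the A-B segment, the merged box is guaranteed to contain it.

```cpp
#include <cmath>

struct float3 { float x, y, z; };
struct AABB { float3 min, max; };

static float3 sub(float3 a, float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float3 add(float3 a, float3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static float3 mul(float3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float3 normalize(float3 v) { return mul(v, 1.0f / std::sqrt(dot(v, v))); }

static AABB SphereAABB(float3 c, float r)
{
    return { { c.x - r, c.y - r, c.z - r }, { c.x + r, c.y + r, c.z + r } };
}

static AABB Merge(AABB a, AABB b)
{
    return { { std::fmin(a.min.x, b.min.x), std::fmin(a.min.y, b.min.y), std::fmin(a.min.z, b.min.z) },
             { std::fmax(a.max.x, b.max.x), std::fmax(a.max.y, b.max.y), std::fmax(a.max.z, b.max.z) } };
}

AABB CapsuleAABB(float3 base, float3 tip, float radius)
{
    float3 n = normalize(sub(tip, base));
    float3 A = add(base, mul(n, radius)); // bottom sphere centre
    float3 B = sub(tip, mul(n, radius)); // top sphere centre
    return Merge(SphereAABB(A, radius), SphereAABB(B, radius));
}
```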

It would also be preferable to use a BVH (bounding volume hierarchy, a hierarchical bounding box acceleration structure) to reduce the number of comparisons when detecting overlapping objects/triangles.

Another thing: you probably want to return as early as possible when doing the continuous collision detection sub-steps. So don’t search for the closest intersection; return immediately when you find one. Let the gameplay code adjust position/velocity and continue with the next sub-step. It might be that no further collision occurs after it anyway. At least in my experience, this approach posed no visible problems.

I based the sphere collision detection approach on how the DirectXMath library does it, but modified in some places to work better for my case.

I hope you found it useful. If you notice any mistakes or have any ideas to make it better, hit the comment section below.

Tile-based optimization for post processing

One way to optimize heavy post processing shaders is to determine which parts of the screen could use a simpler version. The simplest form of this is to use branching in the shader code to early exit or to switch to a variant with reduced sample count or computations. This comes with a downside: even the parts where the early exit occurs must allocate as many hardware resources (registers or groupshared memory) as the heaviest path. Also, branching in the shader code can be expensive when the branch condition is not uniform (for example: constant buffer values or SV_GroupID are uniform, but texture coordinates or SV_DispatchThreadID are not), because multiple threads can branch differently and cause divergence in execution and additional instructions.

Instead of branching, we can consider having multiple different shaders with different complexities and using the DispatchIndirect functionality of DX11+ graphics APIs. The idea is to first determine which parts of the screen require which kind of complexity; this is done per tile. A tile could be for example 32×32 or 16×16 pixels in size, depending on the kind of post process. What do I mean by complexity? Let’s take depth of field for example. Depth of field is used to focus on an object and make it sharp, while making the foreground and background blurry. The part of the screen that will appear completely sharp can use an early exit shader, without doing any blurring. The other parts of the screen will instead perform a blur. For a more complex depth of field effect, such as the “scatter-as-you-gather” depth of field described by Jorge Jimenez at Siggraph 2014 (such a great presentation!), the blur can be separated into two kinds: a simple blur where the whole tile contains similar CoC (Circle of Confusion – the amount of blur, basically) values, and a more expensive weighted blur in the tiles with vastly different CoC values. With this knowledge we can design a prepass shader that classifies tiles by the shader complexity they require. After that, we just DispatchIndirect multiple times back to back – one for each complexity, using different shaders. This requires a lot of setup work, so it’s a good idea to leave it as a last step in the optimizations.

The implementation will consist of the following steps:

1) Tile classification.

Usually you are already gathering some per-tile information for these kinds of post processes. For example, for the depth of field I put the classification in the same shader that computes the min-max CoC per tile. This shader writes to 3 tile lists in my case:

  • early exit tiles (when tile maximum CoC is small enough)
  • cheap tiles (maxCoC – minCoC is small enough)
  • expensive tiles (everything else)

Take a look at this example:

Focusing on the character face…
(The cute model is made by Sketchfab user woopoodle)
The face is classified as early exit (blue), while the parts of the geometry that are facing the camera are cheap (green). The rest is expensive (red), because the CoC range is large
Focusing on the statue in the background…
The lion head in the back is early exit (blue, barely visible); there are a lot of cheap tiles (green) because the CoC is capped to a maximum value, while the character silhouette is expensive (red) because it contains both focused and blurred pixels (or just a high CoC difference)

For motion blur, we can also do similar things, but using min-max of velocity magnitudes.
For screen space reflection, we could classify by reflectivity and roughness.
For tiled deferred shading, we can classify tiles for different material types; Uncharted 4 did this.
Possibly other techniques can benefit as well.
The tile classification shader is implemented via stream compaction, like this:

static const uint POSTPROCESS_BLOCKSIZE = 8;


RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

void main(uint3 DTid : SV_DispatchThreadID) // this runs one thread per tile
  // ...
  const uint tile = (DTid.x & 0xFFFF) | ((DTid.y & 0xFFFF) << 16); // pack current 2D tile index to uint

  uint prevCount;
  if (max_coc < 0.4f)
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EARLYEXIT, 1, prevCount);
    tiles_earlyexit[prevCount] = tile;
  else if (abs(max_coc - min_coc) < 0.2f)
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_CHEAP, 1, prevCount);
    tiles_cheap[prevCount] = tile;
  else
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EXPENSIVE, 1, prevCount);
    tiles_expensive[prevCount] = tile;

2) Kick indirect jobs.

If your tile size is the same as the POSTPROCESS_BLOCKSIZE that will render the post process, you could omit this step and just stream compact to the indirect argument buffers inside the classification shader itself. But in my case I am using 32×32 pixel tiles, while the thread count of the compute shaders is 8×8. So the “kick jobs” shader will compute the actual dispatch counts and write the indirect argument buffers. This shader is also responsible for resetting the counts for the next frame. It uses only 1 thread:

static const uint DEPTHOFFIELD_TILESIZE = 32;


#define sqr(a)		((a)*(a))

RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

[numthreads(1, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
  // Load statistics:
  const uint earlyexit_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EARLYEXIT);
  const uint cheap_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_CHEAP);
  const uint expensive_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EXPENSIVE);

  // Reset counters:
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EARLYEXIT, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_CHEAP, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EXPENSIVE, 0);

  // Create indirect dispatch arguments:
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE); // for all tiles, we will replicate this amount of work
  tile_statistics.Store3(INDIRECT_OFFSET_EARLYEXIT, uint3(earlyexit_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_CHEAP, uint3(cheap_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_EXPENSIVE, uint3(expensive_count * tile_replicate, 1, 1));

Note that the tile_statistics buffer will also be used as the indirect argument buffer for DispatchIndirect later, but using offsets into the buffer. The value tile_replicate is key to have different tile size than the threadcount of the post processing shaders (POSTPROCESS_BLOCKSIZE). Essentially we will dispatch multiple thread groups per tile to account for this difference. However, TILESIZE should be evenly divisible by POSTPROCESS_BLOCKSIZE to keep the code simple.
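As a sanity check on the arithmetic, a small CPU-side sketch (the `GroupCountX` helper is illustrative, not engine code): with the sizes above, each classified tile maps to (32/8)² = 16 thread groups, so the indirect x dimension is simply the tile count times 16.

```cpp
#include <cstdint>

constexpr uint32_t DEPTHOFFIELD_TILESIZE = 32; // classification tile, in pixels
constexpr uint32_t POSTPROCESS_BLOCKSIZE = 8;  // compute thread group width

// Thread groups needed to cover one tile:
constexpr uint32_t tile_replicate =
    (DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE) *
    (DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE); // (32/8)^2 = 16

// The x dimension written into the indirect argument buffer:
constexpr uint32_t GroupCountX(uint32_t tile_count) { return tile_count * tile_replicate; }
```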

3) Use DispatchIndirect

DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EARLYEXIT);

DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_CHEAP);

DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EXPENSIVE);

Note that if you are using DX12 or Vulkan, you don’t even need to synchronize between these indirect executions, because they touch mutually exclusive parts of the screen. Unfortunately, DX11 will always wait for a compute shader to finish before starting the next one, which is a slight inefficiency in this case.

4) Execute post process

You will need to determine which tile and which pixel you are currently shading. I will refer to the tile, which is the 32-pixel wide large tile that was classified, and the subtile, which is the 8-pixel wide small tile that corresponds to the thread group size. You can see what I mean in this drawing:

It is also called “Tileception”

So first we read from the corresponding tile list (early exit shader reads from early exit tile list, cheap shader from cheap tile list, and so on…) and unpack the tile coordinate like this:

// flattened array index to 2D array index
inline uint2 unflatten2D(uint idx, uint dim)
  return uint2(idx % dim, idx / dim);

RWStructuredBuffer<uint> tiles;

void main(uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID)
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint tile_idx = Gid.x / tile_replicate;
  const uint tile_packed = tiles[tile_idx];
  const uint2 tile = uint2(tile_packed & 0xFFFF, (tile_packed >> 16) & 0xFFFF);

After we got the tile, we can continue to compute the pixel we want to shade:

  const uint subtile_idx = Gid.x % tile_replicate;
  const uint2 subtile = unflatten2D(subtile_idx, DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint2 subtile_upperleft = tile * DEPTHOFFIELD_TILESIZE + subtile * POSTPROCESS_BLOCKSIZE;
  const uint2 pixel = subtile_upperleft + unflatten2D(GTid.x, POSTPROCESS_BLOCKSIZE);

Note that we are running a one-dimensional kernel instead of a two-dimensional one. I think this simplifies the implementation, because the tile lists are also one-dimensional, but we need the unflatten2D helper function to convert the 1D array index to 2D when computing the pixel coordinate. Also note that this code adds instructions to the shader, but up until the last line those instructions are not divergent (they rely on the uniform SV_GroupID semantic), so they can be considered cheap: they are not using the most precious hardware resource, the VGPRs (Vector General Purpose Registers).
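The unpacking math above can be verified on the CPU. A small C++ sketch mirroring the shader (the tile list lookup is omitted, so the packed tile value is passed in directly; the input values are hypothetical): for tile (2, 3), group index 5 and thread index 10 within the group, the shaded pixel comes out at (74, 105).

```cpp
#include <cstdint>

struct uint2 { uint32_t x, y; };

constexpr uint32_t DEPTHOFFIELD_TILESIZE = 32;
constexpr uint32_t POSTPROCESS_BLOCKSIZE = 8;
constexpr uint32_t tile_replicate =
    (DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE) *
    (DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);

// flattened array index to 2D array index (same helper as in the shader)
uint2 unflatten2D(uint32_t idx, uint32_t dim) { return { idx % dim, idx / dim }; }

// Reproduces the shader's pixel computation for one thread:
// groupID is SV_GroupID.x, threadID is SV_GroupThreadID.x,
// tile_packed is the value read from tiles[groupID / tile_replicate].
uint2 ComputePixel(uint32_t groupID, uint32_t threadID, uint32_t tile_packed)
{
    uint2 tile = { tile_packed & 0xFFFF, (tile_packed >> 16) & 0xFFFF };
    uint32_t subtile_idx = groupID % tile_replicate;
    uint2 subtile = unflatten2D(subtile_idx, DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
    uint2 upperleft = { tile.x * DEPTHOFFIELD_TILESIZE + subtile.x * POSTPROCESS_BLOCKSIZE,
                        tile.y * DEPTHOFFIELD_TILESIZE + subtile.y * POSTPROCESS_BLOCKSIZE };
    uint2 thread = unflatten2D(threadID, POSTPROCESS_BLOCKSIZE);
    return { upperleft.x + thread.x, upperleft.y + thread.y };
}
```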

After we have the pixel coordinate, we can continue writing the regular post processing code as usual.

Interestingly, I only saw a performance benefit from this optimization once the loops were unrolled in the blurring shaders. Before that, they were not bottlenecked by register usage, but the dynamic loops were not performing very well. After unrolling, I measured an improvement from the tiling optimization: about 0.3 milliseconds saved in depth of field at 4K resolution on an Nvidia GTX 1070 GPU.

Further tests should probably be conducted, trying different tile sizes and classification threshold values. However, this is already a good technique to have in your graphics optimization toolbox.

You can find my implementation of this in WickedEngine for the motion blur and depth of field effects.

Let me know if you spot any mistakes or have feedback in the comments!

Entity-component system

Here goes my idea of an entity-component system written in C++. I’ve been using this in my home-made game engine, Wicked Engine, for exactly a year now and I am still very happy with it. The focus is on simplicity and performance, not on having many features.

This entity-component system (ECS from now on) is used to manage the game scene. That means storing game objects in memory, updating and rendering them, writing them to disk and loading them from disk to memory. The simple ECS that can handle these consists of:

  • Entity: a unique identifier (a number). Imagine it like an ID in a database.
  • Component: a data blob (a struct). This is the data that is associated with an entity. It is up to the user to create the data type, by creating a struct or a class in C++.
  • System: data transformation (a function). This is any function that operates on a number of components. The user will write simple functions that take components as inputs and produce components as outputs. It is just any C++ function, really.
  • ComponentManager: entity to component mapping container. It holds at most one component per entity. The user can check whether an entity’s component exists, and query it. This is the thing that we will implement here and give to the user as a tool to work with.

The ECS will not only be simple; performance is a big concern too. We want to have a lot of game objects and manage them as fast as we can. Today, memory access is often the slowest part of an application. The best way to avoid fetching memory and waiting for it to arrive is to make good use of the cache and load from cache as often as we can. By keeping data in linear arrays, we are in a very good position, because most of the time we will iterate through every component and do something with each of them. The CPU is good at this when data is accessed linearly, as memory can be prefetched into cache ahead of time. Another nice feature is that linear arrays lend themselves very well to multithreading; you can use parallel for loops simply and efficiently to process large amounts of data. The ComponentManager will be responsible for managing components in this linear fashion. In many places this is also called Data Oriented Design (DoD).

Let’s look at an example first, what kind of code the user of the ECS will need to write:

// Create a new unique entity:
Entity entity = CreateEntity();

// A component:
struct Position
  float x, y, z;

// ComponentManager to hold components:
ComponentManager<Position> positions;

// Check whether the entity has a position component:
positions.Contains(entity); // returns false

// Create a new Position component for the entity:
Position& position = positions.Create(entity);

// Check whether the entity has a position component:
positions.Contains(entity); // returns true now

// A system that moves every entities' position up by one:
void UpdatePositionsSystem(ComponentManager<Position>& positions)
  for(size_t i = 0; i < positions.GetCount(); ++i)
    positions[i].y += 1;

Apart from the simple example above, the ECS will have some other functionality too, but that comes later, when we need it. Let’s implement what we have so far, one by one.


First, we have CreateEntity(). This should give us a unique numeric identifier. The Entity is just a typedef for a 32-bit unsigned int:

#include <cstdint>
typedef uint32_t Entity;

In the simplest implementation, CreateEntity() could return a monotonically increasing number:

Entity CreateEntity()
  static Entity next = 0;
  return ++next;

However, that comes with a lot of issues if you ever need to serialize the entities, write them to disk and load them later. Just imagine starting your program, creating some entities, and saving them; then closing the program, starting it again, creating some entities (the counter starts from one again), and then loading the previously saved entities (which also started from one). Your entities will no longer be unique in this case. We will address this later, in the Serialization part; for now you can use this, and later replace this function if you are interested in serialization.

By the way, we start from one because we reserve zero as an invalid entity handle. This is useful, for example, for giving a default value to entities that have not yet been created:

static const Entity INVALID_ENTITY = 0;

This is it for the Entity for now. The crux of the implementation will be after this, the ComponentManager.


The ComponentManager is a container that we will implement. It stores components and entities in linear arrays, for the previously mentioned performance reasons. By storing both components and entities, we can query a component’s entity or an entity’s component at a specific index position inside the ComponentManager. This is very useful if we are iterating through all components and want to look up a component’s entity in a different type of ComponentManager. Speaking of lookups, sometimes we want to look up a specific entity’s component without iterating through the whole array; for that, a hash map is perfect. The ComponentManager can be templated on the component type, so it looks sort of like a C++ standard container:

#include <vector>
#include <unordered_map>

template<typename Component>
class ComponentManager
  std::vector<Component> components;
  std::vector<Entity> entities;
  std::unordered_map<Entity, size_t> lookup;

We then start implementing the functionality for our previous example. With a Contains() method, we can check if an entity has this kind of component or not:

bool Contains(Entity entity) const
  return lookup.find(entity) != lookup.end();

Of course, first, we should be able to create components for entities with a Create() function:

Component& Create(Entity entity)
  // INVALID_ENTITY is not allowed!
  assert(entity != INVALID_ENTITY);

  // Only one of this component type per entity is allowed!
  assert(lookup.find(entity) == lookup.end());

  // Entity count must always be the same as the number of components!
  assert(entities.size() == components.size());
  assert(lookup.size() == components.size());

  // Update the entity lookup table:
  lookup[entity] = components.size();

  // New components are always pushed to the end:
  components.push_back(Component());

  // Also push corresponding entity:
  entities.push_back(entity);

  return components.back();

I like to put in a lot of asserts wherever I can, so a debug version of the program will catch any issues. For simplicity’s sake, I like the limitation that an entity can have only one component of each kind. If the user ever wants more, more ComponentManager instances can be used, but I have never wanted more than one component of the same kind for an entity.

To iterate through every component (or entity, because there is the same amount of components and entities), we can query the amount of components/entities:

size_t GetCount() const { return components.size(); }

Then to iterate in a linear fashion, we can get a component reference from the linear array by a size_t indexer and the [] operator:

Component& operator[](size_t index) { return components[index]; }

All of these methods are part of the templated ComponentManager class, so they reside in the header file. In fact, all of our implementation will be in the same header file as part of the ComponentManager class. So far, this is enough to run the simple example that I wrote above. However, this is not all the functionality we need. Next, I will give some other useful functions that the ComponentManager should have, and use cases for them.

For example, I often iterate through components and use the [] operator to index a component. But what if I need the Entity for the same component? The component data struct is not required to contain an Entity reference. Instead, the ComponentManager can give the corresponding Entity back to the user:

Entity GetEntity(size_t index) const { return entities[index]; }

The use case is if we have different components for one entity, and one component wants to look at another component's data:

ComponentManager<Position> positions;
// ... continued from previous example, entity has a position now...

struct Mass
{
  float value;
};
ComponentManager<Mass> masses;

for(size_t i = 0; i < positions.GetCount(); ++i)
{
  Position& position = positions[i];
  Entity entity = positions.GetEntity(i);
  Mass* mass = masses.GetComponent(entity);
  if(mass != nullptr)
    position.y += 1.0f / mass->value;
}

In the above example, in order to update the position, we need to get access to the mass, but that is stored in another component. We check if the entity has a mass, and if it has, we update the position according to the mass. The GetComponent() function also needs to be implemented now:

Component* GetComponent(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
    return &components[it->second];
  return nullptr;
}

As you can see, this will return a pointer to the component if it exists, or nullptr otherwise. It performs a hash map lookup, which is slower than linear indexing, but we won't use it as frequently. Or at least, we shouldn't! We should lay out our data so that map lookups like this are minimized. It is often better to duplicate data in an ECS and keep the cache warm.

What about deleting game objects? This can be a frequently used functionality of the ECS, so we must implement it. We can remove the last element of a linear array trivially: we just decrease the array size by one (std::vector's pop_back() function) and destroy the last object. Removing from the middle of the array is a bit more tricky. We replace the element to be removed with the last element in the array, and decrease the array's size. This is done by the Remove() function (duh):

void Remove(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
  {
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
    {
      // Swap out the dead element with the last one:
      components[index] = std::move(components.back()); // try to use move
      entities[index] = entities.back();

      // Update the lookup table:
      lookup[entities[index]] = index;
    }

    // Shrink the container:
    components.pop_back();
    entities.pop_back();
    lookup.erase(entity);
  }
}

As you can see, we also have to keep the lookup table in sync with the entity and component arrays. I also try to use C++ move semantics to avoid copying component data if possible. That will happen if your components use only default constructors and destructors, or if you write move constructors for them yourself. Otherwise, a copy will be used.

Removing an element like this can be problematic, however, if you rely on component ordering. There was at least one case where I relied on the order. Still, we already have a pretty usable ECS at this point. The following sections of the blog will be about two bigger topics:

  • Scene graph hierarchy: the problem is that we rely on component ordering in the ECS hierarchy traversal
  • Serialization: the problem is that Entity uniqueness must persist after serialization/deserialization.

If you are interested in the above topics, stick around. 🙂

Scene Graph

For me, this was actually the most interesting part of implementing the ECS, because it was the part I had the most question marks about. In a usual pointer based scene graph tree implementation, we can just use recursion and traverse all tree nodes starting from the root node trivially. With the ECS implementation, this is done a bit differently. We have components here, stored in linear arrays, and we can't use pointers to index components, because they can move around in the ComponentManager when they are being added and removed. So how to do it?

The basic scene graph is a tree of nodes, where each node has a spatial position, and all its children’s positions will be relative to their parent. This means, that we can have:

  • TransformComponent: position, scale, rotation, world matrix
  • HierarchyComponent: entity’s parent, parent’s inverse world matrix at the time of attaching

With these two components, and iterating components linearly from first to last, we are able to represent hierarchy in a scene. There is one other missing piece: we will rely on the order of components in the linear arrays. Because children can refer to their single parent, but parents have no knowledge of their (possibly many) children, parents must be placed before their children in the component arrays, so that when a child computes its transform world matrix, the parent is already up to date because it was computed earlier.

For example, let’s say that we have an Attach() function:

void Attach(Entity entity, Entity parent);

This function will operate on the HierarchyComponent array and add a new component at the end, that will be the child:

struct HierarchyComponent
{
  Entity parentID = INVALID_ENTITY;
  Matrix world_parent_inverse_bind;
};
ComponentManager<HierarchyComponent> hierarchy;

void Attach(Entity entity, Entity parent)
{
  hierarchy.Create(entity).parentID = parent;
  // ... to be continued
}

However, the child we are adding might already be a parent of other children itself. In that case, we want to search for a component whose parent is the entity we are attaching, and move the newly added component before it:

// ...continued

for (size_t i = 0; i < hierarchy.GetCount(); ++i)
{
  if (hierarchy[i].parentID == entity)
  {
    hierarchy.MoveItem(hierarchy.GetCount() - 1, i);
    break;
  }
}

But that’s not sufficient as someone pointed out in the comments, because if a whole sub-tree is reattached, not only the parent can get in the wrong place, but all children of that, so we have to carefully check every entity and move them if they are actually a parent of some component that’s before them. So replace the previous block with this more complicated one instead:

// ...continued

if (hierarchy.GetCount() > 1)
{
  for (size_t i = hierarchy.GetCount() - 1; i > 0; --i)
  {
    Entity parent_candidate_entity = hierarchy.GetEntity(i);
    for (size_t j = 0; j < i; ++j)
    {
      const HierarchyComponent& child_candidate = hierarchy[j];

      if (child_candidate.parentID == parent_candidate_entity)
      {
        hierarchy.MoveItem(i, j);
        ++i; // next outer iteration will check the same index again as parent candidate, however things were moved upwards, so it will be a different entity!
        break;
      }
    }
  }
}

Explanation: We start from the end of the hierarchy (from the component we just added), and check if any previous components are its children or not. If we find a child before it, we move it before the child (but keep the ordering of all components after the child intact). We move towards the beginning of the hierarchy chain and for every component, check if there is a child before them and move if necessary.

MoveItem() is a new method for the ComponentManager class. It moves an element to a specified index inside the array, shifting all elements that are in the way by one position:

void MoveItem(size_t index_from, size_t index_to)
{
  assert(index_from < GetCount());
  assert(index_to < GetCount());
  if (index_from == index_to)
    return;

  // Save the moved component and entity:
  Component component = std::move(components[index_from]);
  Entity entity = entities[index_from];

  // Every other entity-component that's in the way gets moved by one and lut is kept updated:
  const int direction = index_from < index_to ? 1 : -1;
  for (size_t i = index_from; i != index_to; i += direction)
  {
    const size_t next = i + direction;
    components[i] = std::move(components[next]);
    entities[i] = entities[next];
    lookup[entities[i]] = i;
  }

  // Saved entity-component moved to the required position:
  components[index_to] = std::move(component);
  entities[index_to] = entity;
  lookup[entity] = index_to;
}

That could have been done differently, by allocating temporary memory and moving all the prior elements in one go instead of one by one. The current solution avoids the extra memory allocation instead.
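
As a side note, the shifting loop is the classic rotate-by-one; on a plain array the same reordering can be expressed with std::rotate, which is handy for sanity checking the semantics (simplified sketch without the entity array and lookup table):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Move the element at index_from to index_to, shifting everything in
// between by one position (same element order as MoveItem in the text):
void MoveItem(std::vector<int>& v, size_t index_from, size_t index_to)
{
	if (index_from == index_to)
		return;
	if (index_from < index_to)
		// Rotate left so v[index_from] ends up at index_to:
		std::rotate(v.begin() + index_from, v.begin() + index_from + 1, v.begin() + index_to + 1);
	else
		// Rotate right so v[index_from] ends up at index_to:
		std::rotate(v.begin() + index_to, v.begin() + index_from, v.begin() + index_from + 1);
}
```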

After calling MoveItem() in the Attach() function, we are not sure anymore that references remained intact, so we query the currently added child:

HierarchyComponent& parentcomponent = *hierarchy.GetComponent(entity);

Then we continue to query the TransformComponent of the parent (create it if it doesn’t exist):

TransformComponent* transform_parent = transforms.GetComponent(parent);
if (transform_parent == nullptr)
  transform_parent = &transforms.Create(parent);
// Save the parent's inverse worldmatrix:
parentcomponent.world_parent_inverse_bind = MatrixInverse(transform_parent->worldmatrix);

This is it for the Attach() function. But sometimes, we also detach children from parents. The Detach() function could look like this:

void Scene::Component_Detach(Entity entity)
{
  const HierarchyComponent* parent = hierarchy.GetComponent(entity);
  if (parent != nullptr)
  {
    TransformComponent* transform = transforms.GetComponent(entity);
    if (transform != nullptr)
      transform->ApplyTransform(); // bake world matrix into position, rotation, scale
    hierarchy.Remove_KeepSorted(entity);
  }
}

So we look into the hierarchy array; if the entity is in it, we detach it by applying the transformation (the current position, rotation, scale is converted to a world matrix) and then removing the component from the hierarchy. Again, we need to remove it while keeping the ordering sorted. For that, a Remove_KeepSorted() function was implemented:

void Remove_KeepSorted(Entity entity)
{
  auto it = lookup.find(entity);
  if (it != lookup.end())
  {
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
    {
      // Move every component left by one that is after this element:
      for (size_t i = index + 1; i < components.size(); ++i)
        components[i - 1] = std::move(components[i]);
      // Move every entity left by one that is after this element:
      for (size_t i = index + 1; i < entities.size(); ++i)
      {
        entities[i - 1] = entities[i];
        lookup[entities[i - 1]] = i - 1;
      }
    }

    // Shrink the container:
    components.pop_back();
    entities.pop_back();
    lookup.erase(entity);
  }
}

So we remove an element from the middle, but instead of swapping in the last one, we move every element after this one to the left, so ordering is preserved.

So we can attach, detach components now. All of it makes sense when we iterate the hierarchy and update transforms in the right order. This is the Hierarchy Update System:

void RunHierarchyUpdateSystem(
  const ComponentManager<HierarchyComponent>& hierarchy,
  ComponentManager<TransformComponent>& transforms)
{
  for (size_t i = 0; i < hierarchy.GetCount(); ++i)
  {
    const HierarchyComponent& parentcomponent = hierarchy[i];
    Entity entity = hierarchy.GetEntity(i);

    TransformComponent* transform_child = transforms.GetComponent(entity);
    TransformComponent* transform_parent = transforms.GetComponent(parentcomponent.parentID);
    if (transform_child != nullptr && transform_parent != nullptr)
      transform_child->UpdateTransform_Parented(*transform_parent, parentcomponent.world_parent_inverse_bind);
  }
}

We iterate the hierarchy components, but also look up from transforms, and use the parent’s inverse bind matrix when updating the transform matrix.

This concludes the basic implementation of the ECS scene graph traversal. The Attach and Detach functions are heavier than with a pointer based approach, but the traversal itself, which is the commonly executed path, should be faster if we have a flat hierarchy (eg. one parent has many children). If the hierarchy is deeper, however, a pointer based tree traversal can beat the ECS implementation. This needs more investigation from me, but I was happy to keep only the ECS hierarchy path for everything, instead of also having a pointer based scene graph for things like skeleton bones, which usually have a deep hierarchy.

A closing side note: even though the hierarchy is contained in a linear array now, multithreading with a parallel for loop is still not possible in this specific case. Because the order of the components matters, the execution order also matters. You can still find ways to parallelize, for example by doing the whole hierarchy update on a separate thread, while running the material update system or other unrelated systems in parallel with it.


Serialization

One of my main objectives was to efficiently serialize and deserialize the whole entity component system in the engine. The ComponentManager lends itself nicely to this with its linear arrays. We can just dump the array contents (entity array and component array) into a file and write them to disk. Loading them is also very convenient. There is only one issue: Entity serialization needs some extra care to keep entities unique. We can no longer use simple monotonically increasing IDs for Entities (Update: Actually, we can; I updated the post below with my new approach*). Instead, let's generate random numbers:

#include <climits>
#include <random>

std::random_device rand_dev;
std::mt19937 generator(rand_dev());

Entity CreateEntity()
{
  std::uniform_int_distribution<int> distr(1, INT_MAX);
  return distr(generator);
}

The above code uses the C++ random generator to generate uniformly distributed random numbers between 1 and INT_MAX. That range is large enough that we are unlikely to get collisions very soon (Note: it is recommended to use 64-bit keys instead, as I discovered since writing this blog). But there is another problem as well: what if we want to load the same entity collection twice? Imagine that we serialized a model and saved it to disk. We want to load the model twice into the scene. The Entity numbers were randomized, but if we load them twice, they will no longer be unique!
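
To put numbers on the collision risk: for n random IDs drawn uniformly from N possible values, the birthday bound gives a collision probability of roughly 1 − exp(−n(n−1)/2N). A small sketch of this estimate (an approximation, assuming uniformly random IDs):

```cpp
#include <cassert>
#include <cmath>

// Approximate birthday-bound collision probability for n random IDs
// drawn uniformly from N possible values: p ≈ 1 - exp(-n*(n-1) / (2N)).
double CollisionProbability(double n, double N)
{
	return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * N));
}
```

With 31-bit IDs, 65536 entities already make a collision more likely than not, while the same count against a 64-bit space is vanishingly unlikely, which supports the 64-bit recommendation above.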

I use the following solution: have a specific serialization post-process for Entities. We combine them with a random seed that is unique per deserialization pass. So instead of simply deserializing Entity like this:

Archive archive;
// ...
archive >> entity;

We use a slightly more complicated method:

uint32_t seed = get_random_number_once_per_serialization();
// ...
archive >> entity;
if (entity != INVALID_ENTITY && seed > 0)
  entity = ((entity << 1) ^ seed) >> 1;

Of course, I just combined that into a SerializeEntity(Entity entity, uint32_t seed); function. So I just make sure to always serialize Entities with that function.

Sometimes, you don’t want to process the entity IDs when serializing, for example when you don’t save them to disk. In that case, you can leave the seed as zero and entity IDs will remain unmodified.

I also gave the ComponentManager the ability to serialize itself, by having a Serialize() function. That will also accept the Entity seed. However, we must watch out, because in some cases components can hold Entity members to reference something. This is the case with HierarchyComponent as discussed before, but other cases, like a mesh instance holding the mesh Entity ID, are possible too. If the ComponentManager's Serialize function is used, then we must ensure that the component array gets serialized and the components' Entity reference members are seeded correctly. Thus, a Serialize(Archive archive, uint32_t seed) function will be necessary for Components in this case.

*Update: Serialization with monotonic entities

Instead of randomizing entities which could result in hash collisions, we can keep using the monotonically increasing IDs, and a lookup table in the serialization phase to avoid collisions with already existing entities while serializing. My solution is to incorporate an EntitySerializer struct with a hash map for all serialization methods:

struct EntitySerializer
{
  std::unordered_map<Entity, Entity> remap;
  bool allow_remap = true;
};

The unordered map is a hash map where we will look up whether an entity that’s being deserialized was already deserialized before in the current session. If it wasn’t, we generate a new monotonic ID for it, otherwise, we will look it up from the map. I create a new entity serializer object for each separate deserialization, which is important because we can deserialize the same memory (containing the same IDs) twice, but in those cases we also want to avoid ID collisions.

The allow_remap flag is used for cases when we in fact don't want to deserialize to unique IDs. For example, we want to duplicate an entity with all of its components; say it is a mesh instance, so we will serialize a new mesh instance. But the component stores an entity ID which points to a mesh entity, and that reference we want to keep intact.

inline void SerializeEntity(wiArchive& archive, Entity& entity, EntitySerializer& seri)
{
   if (archive.IsReadMode())
   {
     Entity mem;
     archive >> mem;

     if (seri.allow_remap)
     {
       auto it = seri.remap.find(mem);
       if (it == seri.remap.end())
       {
         // Not seen in this session yet, generate a new unique ID:
         entity = CreateEntity();
         seri.remap[mem] = entity;
       }
       else entity = it->second;
     }
     else entity = mem;
   }
   else archive << entity;
}
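
Stripped of the archive plumbing, the remap logic can be sketched and tested standalone (hypothetical names; CreateEntity hands out monotonically increasing IDs as described above):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Entity = uint32_t;

static Entity next_entity = 1;
Entity CreateEntity() { return next_entity++; } // monotonically increasing IDs

struct EntitySerializer
{
	std::unordered_map<Entity, Entity> remap;
};

// Deserialize a stored entity ID into a session-unique one:
Entity RemapEntity(Entity stored, EntitySerializer& seri)
{
	auto it = seri.remap.find(stored);
	if (it == seri.remap.end())
	{
		// First time this stored ID is seen in this session:
		Entity fresh = CreateEntity();
		seri.remap[stored] = fresh;
		return fresh;
	}
	return it->second;
}
```

The two guarantees are easy to verify: within one session the same stored ID always maps to the same new entity, while a fresh session maps it to a different one.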

Well, it was a very long write up, I hope someone finds it useful. At least I kind of documented this thing for myself, so that’s already worth it.

My implementation of the ECS can be found here. It contains some additional functionality that was out of scope for this blog. The main use case is Scene management, most of the systems and components are implemented in wiScene.

Let me know in the comments if you find mistakes or have questions, or just simply want to discuss something!

Improved normal reconstruction from depth

In a 3D renderer, we might want to read the scene normal vectors at some point, for example in post processes. We can write them out using MRT – multiple render target outputs – from the object rendering shaders and write the surface normals to a texture. But that normal map texture usually contains normals that have normal map perturbation applied to them, which some algorithms don't like. Algorithms that need to ray trace/ray march from the surface in multiple directions around the normal direction are not a good fit for these normal mapped textures, because there is a disconnect between what is in the normal map and what the real surface triangle normal is. SSAO for example falls into this category. We want to sample rays in the normal-oriented hemisphere from the surface and check those rays against the depth buffer. The depth buffer, however, contains flat triangle geometry that doesn't include the normal map surface perturbation. Because of this, rays can go below the surface that the ray is started from, and that will result in some unwanted darkening.

We could write out the flat surface normals with MRT, which would give us the perfect solution. But we probably don't want yet another render target to write in our object shader pass, to conserve memory bandwidth.

There are ways to reconstruct the flat surface normals from the depth buffer. The idea is that we want to find the normal of the triangle that the depth buffer pixel belongs to. We can find out the slope of the triangle by first computing the pixel world position from the depth buffer, then using ddx(), ddy() and cross() to determine the normal direction:

float depth = texture_depth.SampleLevel(sampler_point_clamp, uv, 0).r;
float3 P = reconstructPosition(uv, depth, InverseViewProjection);
float3 normal = normalize(cross(ddx(P), ddy(P)));

By the way, there are multiple ways to reconstruct world position from the depth buffer. One of those ways is by using the native depth buffer and the inverse view projection matrix that was used to render that depth:

float3 reconstructPosition(in float2 uv, in float z, in float4x4 InvVP)
{
  float x = uv.x * 2.0f - 1.0f;
  float y = (1.0 - uv.y) * 2.0f - 1.0f;
  float4 position_s = float4(x, y, z, 1.0f);
  float4 position_v = mul(InvVP, position_s);
  return position_v.xyz / position_v.w;
}
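
The same math can be sanity checked on the CPU. Below is a C++ transliteration of reconstructPosition (hypothetical minimal float3/float4/float4x4 types; mul applies a row-major matrix to a column vector); projecting a view space point with a simple D3D-style perspective matrix and feeding the resulting uv/depth back through the inverse recovers the original position:

```cpp
#include <cassert>
#include <cmath>

struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

// Row-major 4x4 matrix; transforms column vectors: out = M * v.
struct float4x4 { float m[4][4]; };

float4 mul(const float4x4& M, const float4& v)
{
	return {
		M.m[0][0]*v.x + M.m[0][1]*v.y + M.m[0][2]*v.z + M.m[0][3]*v.w,
		M.m[1][0]*v.x + M.m[1][1]*v.y + M.m[1][2]*v.z + M.m[1][3]*v.w,
		M.m[2][0]*v.x + M.m[2][1]*v.y + M.m[2][2]*v.z + M.m[2][3]*v.w,
		M.m[3][0]*v.x + M.m[3][1]*v.y + M.m[3][2]*v.z + M.m[3][3]*v.w,
	};
}

// Same math as the HLSL reconstructPosition in the text:
float3 reconstructPosition(float u, float v, float z, const float4x4& InvVP)
{
	float x = u * 2.0f - 1.0f;
	float y = (1.0f - v) * 2.0f - 1.0f;
	float4 p = mul(InvVP, {x, y, z, 1.0f});
	return { p.x / p.w, p.y / p.w, p.z / p.w };
}
```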

This solution to compute normals is only usable in a pixel shader. In a compute shader you don’t have access to ddx() and ddy() functions. What you can do instead is to compute world positions of nearby pixels, and use them as triangle corners. If you have the triangle corners, you can use cross() to compute the triangle normal, which gives you the flat normal of the current pixel:

float2 uv0 = uv; // center
float2 uv1 = uv + float2(1, 0) / depth_dimensions; // right 
float2 uv2 = uv + float2(0, 1) / depth_dimensions; // top

float depth0 = texture_depth.SampleLevel(sampler_point_clamp, uv0, 0).r;
float depth1 = texture_depth.SampleLevel(sampler_point_clamp, uv1, 0).r;
float depth2 = texture_depth.SampleLevel(sampler_point_clamp, uv2, 0).r;

float3 P0 = reconstructPosition(uv0, depth0, InverseViewProjection);
float3 P1 = reconstructPosition(uv1, depth1, InverseViewProjection);
float3 P2 = reconstructPosition(uv2, depth2, InverseViewProjection);

float3 normal = normalize(cross(P2 - P0, P1 - P0));

At this point, you might call it a day and use your flat normals. However, you might also notice some edge cases – both of these techniques fail on object edges. Even though different objects might be at nearby pixels, we are naively using the depth buffer and assuming that nearby samples form actual triangles that exist in the scene.

The image above is a screenshot of the reconstructed normal buffer of the Cornell Box scene. Most notably, the edges of the boxes and the silhouette of the bunny became outlined, but that shouldn’t be the case. If we zoom in we see better:

If you use these flat normals, you can get strong artifacts with SSAO on the object edges:

Notice the left edge of the boxes and the bunny. Those normals caused havoc on those parts. Interestingly, the right edges don't show the artifact. That depends on which nearby samples you select to reconstruct the triangle corners from; you could select triangles that would fail on right edges and not fail on left edges. In more complicated scenes, with more color variety and less prominent edges, these artifacts could remain hidden.

So what’s the solution? We should select the triangle that will be most correct for our center sample. The most correct triangle is the one whose corners are most probably on the same object as the center sample. We can approximate that by for example comparing depths of nearby samples to the center sample and picking the ones that are closest, and forming a triangle from those.

An effective way to do this is to sample the pixel neighbors in a cross pattern. That will give us the center sample, the left from it, the one above it, the one on the right and finally the one below it. From these five samples, four triangles can be formed that we could be interested in:

We will select the best triangle corner from the horizontal cross axis, and one best corner from the vertical cross axis, by comparing their depth values to the center sample and selecting the closest one. For the base triangle corner, we will always use the center sample. When we have decided on the best P0, P1 and P2 triangle corners, we use the cross product just like before to determine the triangle normal:

float3 normal = normalize(cross(P2 - P0, P1 - P0));

Watch out however for the triangle winding order. If you assume counter clockwise order like me, then stick with it for all candidate triangles! That means, triangles in the cross pattern can be formed like this for counter clockwise ordering:

triangle 0 = P0: center, P1: right, P2: up
triangle 1 = P0: center, P1: down, P2: right
triangle 2 = P0: center, P1: up, P2: left
triangle 3 = P0: center, P1: left, P2: down
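
As a sanity check of the winding table, all four candidate triangles evaluated on a flat, camera-facing surface should produce the same normal direction. A minimal C++ sketch of the same cross product math (hypothetical float3 helpers):

```cpp
#include <cassert>

struct float3 { float x, y, z; };

float3 sub(float3 a, float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
float3 cross(float3 a, float3 b)
{
	return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
}

// normal = cross(P2 - P0, P1 - P0), as in the text (not normalized here):
float3 triangleNormal(float3 P0, float3 P1, float3 P2)
{
	return cross(sub(P2, P0), sub(P1, P0));
}
```

On a view space plane facing the camera, all four counter clockwise candidates yield the same normal pointing back toward the camera, confirming the winding table is consistent.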

And with this technique, those edge artifacts are gone:

Now we are taking a lot of samples per pixel, doing complicated math for all neighbors (position reconstruction) and branching in the shader, so it will introduce a performance penalty. The good thing is that it can be optimized further in a compute shader with group shared memory. We can store the depths and reconstructed positions for a whole 8×8 pixel tile in shared memory efficiently. We will also have an extra border of pixels stored around the tile. Additionally, we can avoid storing depths and reconstructed positions separately, and instead just store reconstructed positions in view space – the view space positions' Z value stores linear distance, which we can use instead of depth to determine the best triangle candidate. I will demonstrate the full technique in HLSL shader code.

First, we create the group shared memory for the 8×8 pixel tile + one pixel border:

static const uint POSTPROCESS_BLOCKSIZE = 8; // 8x8 pixel tile
static const uint TILE_BORDER = 1;
static const uint TILE_SIZE = POSTPROCESS_BLOCKSIZE + TILE_BORDER * 2;
groupshared float2 tile_XY[TILE_SIZE*TILE_SIZE]; // view space position XY
groupshared float tile_Z[TILE_SIZE*TILE_SIZE]; // view space position Z

The groupshared memory is storing view space positions. I separated the Z values into a deinterleaved array, because those will be loaded more frequently than XY components, when we determine the best corner points.

The compute shader is running in an 8×8 block size:

[numthreads(POSTPROCESS_BLOCKSIZE, POSTPROCESS_BLOCKSIZE, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID, uint groupIndex : SV_GroupIndex)
{
  // ... shader body continues here...

First thing, we will preload the tile shared memory with the border region with view space positions:

const int2 tile_upperleft = Gid.xy * POSTPROCESS_BLOCKSIZE - TILE_BORDER;
for (uint t = groupIndex; t < TILE_SIZE * TILE_SIZE; t += POSTPROCESS_BLOCKSIZE * POSTPROCESS_BLOCKSIZE)
{
  const uint2 pixel = tile_upperleft + unflatten2D(t, TILE_SIZE);
  const float2 uv = (pixel + 0.5f) * xPPResolution_rcp;
  const float depth = texture_depth.SampleLevel(sampler_linear_clamp, uv, 0);
  const float3 position = reconstructPosition(uv, depth, InverseProjection);
  tile_XY[t] = position.xy;
  tile_Z[t] = position.z;
}

Some things to note from the above code block:
– xPPResolution_rcp is a shader constant containing the postprocess texture resolution reciprocal. That is: 1.0f / resolution.xy
– I am using the InverseProjection matrix for reconstructPosition. That gives the view space position, instead of the world space position I would get with the InverseViewProjection matrix.
– The unflatten2D() function converts a 1D array index to a 2D array index, and looks like this:

uint2 unflatten2D(uint idx, uint2 dim)
{
  return uint2(idx % dim.x, idx / dim.x);
}

We must synchronize at the end with GroupMemoryBarrierWithGroupSync() to wait for all threads to finish populating their part of the group shared memory.

Next thing is to load the depth/view space Z values in the cross pattern and perform the comparisons:

const uint cross_idx[5] = {
  flatten2D(TILE_BORDER + GTid.xy, TILE_SIZE),			// 0: center
  flatten2D(TILE_BORDER + GTid.xy + int2(1, 0), TILE_SIZE),	// 1: right
  flatten2D(TILE_BORDER + GTid.xy + int2(-1, 0), TILE_SIZE),	// 2: left
  flatten2D(TILE_BORDER + GTid.xy + int2(0, 1), TILE_SIZE),	// 3: down
  flatten2D(TILE_BORDER + GTid.xy + int2(0, -1), TILE_SIZE),	// 4: up
};

const float center_Z = tile_Z[cross_idx[0]];

const uint best_Z_horizontal = abs(tile_Z[cross_idx[1]] - center_Z) < abs(tile_Z[cross_idx[2]] - center_Z) ? 1 : 2;
const uint best_Z_vertical = abs(tile_Z[cross_idx[3]] - center_Z) < abs(tile_Z[cross_idx[4]] - center_Z) ? 3 : 4;

Here, the flatten2D() function is responsible for converting from a 2D array index to a 1D array index. It can be implemented like this:

uint flatten2D(uint2 coord, uint2 dim)
{
  return coord.x + coord.y * dim.x;
}
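
Both helpers are exact inverses of each other for in-range indices; a quick C++ transliteration (hypothetical uint2 stand-in type) makes that easy to verify:

```cpp
#include <cassert>
#include <cstdint>

struct uint2 { uint32_t x, y; };

// 2D -> 1D index (row-major), as in the shader:
uint32_t flatten2D(uint2 coord, uint2 dim) { return coord.x + coord.y * dim.x; }

// 1D -> 2D index, the inverse operation:
uint2 unflatten2D(uint32_t idx, uint2 dim) { return { idx % dim.x, idx / dim.x }; }
```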

At the end of this, we will know which neighbor is the best in horizontal direction (best_Z_horizontal) and which is best in vertical direction (best_Z_vertical). The horizontal can be either 1 (right) or 2 (left), while the vertical can be 3 (down) or 4 (up).

Finally, determine the best triangle corners with branching and create the triangle normal:

float3 P1 = 0, P2 = 0;
if (best_Z_horizontal == 1 && best_Z_vertical == 4)
{
  P1 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
  P2 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
}
else if (best_Z_horizontal == 1 && best_Z_vertical == 3)
{
  P1 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);
  P2 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
}
else if (best_Z_horizontal == 2 && best_Z_vertical == 4)
{
  P1 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
  P2 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
}
else if (best_Z_horizontal == 2 && best_Z_vertical == 3)
{
  P1 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
  P2 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);
}

const float3 P0 = float3(tile_XY[cross_idx[0]], tile_Z[cross_idx[0]]);
const float3 normal = normalize(cross(P2 - P0, P1 - P0));

Of course, computing the normals in view space means that you will either need to convert them to world-space, or do the SSAO in view space. I chose to do the rest of the shader in view space instead of converting, and reuse the view space position and normal for shooting the rays in view space.

And the normal buffer result – although not perfect – is closer to what we want. There are still some errors along the edges, but in a reduced amount, and they don't interfere with the SSAO:

However, I’ve had a slight mistake here, because the SSAO postprocess resolution is half of the depth buffer resolution. When I match the postprocess resolution with the depth buffer resolution, the normal reconstruction is actually nearly perfect (there are a few rogue pixels here and there on sharp corners, but you have to look for them really hard):

And this is with matching resolution, but not using the improvement, the outline is still present and will mess up the SSAO:

However, I don’t have to match their resolution in practice, because the SSAO results were already good with the improved reconstruction. At the end of the day, implementing the reconstruction improvement was worth it.

Well, that’s it, I hope you found it useful. I think it was definitely useful for me at least to document this for myself for future reference. 🙂

Take a look at SSAO implementation in Wicked Engine using this technique and feel free to use it for whatever you want.

I also saw this improvement technique described in a stackoverflow answer.

This is another technique for normal reconstruction, from Yuwen Wu. It discusses and solves a corner case that was not described here.

Mistakes, typos, questions? Let me know in the comments.

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparison between the two techniques for each point follow each other immediately, but decided against it, and instead there are two big blocks: stream compaction technique, then the flat bit array technique following it. The points are in similar order for both, so comparison could be easily made by having two browser windows opened for example. It wouldn’t be so simple to do the opposite thing.)

Stream compaction

Idea is based on Johan Andersson's 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

  • Each tile contains a list of lights. Each entry is one uint that is an index into the light array. So each tile contains max_lightcount_per_tile number of uints. For me it was 256 entries per tile, decided at compile time.
    • Possible to compact even further in a way that the average light count per tile is specified (instead of max) when allocating tile buffer and have one more level of indirection. The other indirection here is introducing a “light grid” buffer/texture that would contain the offset and count into the “light index buffer”, while the “light index buffer” contains the light indices in a very compact stream (next tile list’s first entry immediately follows the last light index in the previous tile). This way one tile can contain an unbounded number of lights (well, still bounded on the global tile buffer allocation size). The construction of such structure is complicated and not worth the trouble in my experience (it uses even more atomics). You can see such technique presented in [this other blog]. I have implemented this technique, but in the end I chose to keep the hard coded max items per tile and have one less level of indirection.
  • Culling shader performs stream compaction, that is, using a light index list in groupshared memory and an atomic counter to append lights to the list. Something like this:
groupshared uint ArrayLength = 0;
groupshared uint Array[256];

void AppendLight(uint lightIndex)
{
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
    {
        Array[index] = lightIndex;
    }
}
  • Stream compaction can be scalarized with wave intrinsics so that up to 64x fewer atomic operations are performed. The idea is that we have a per-wavefront bitmask containing set bits for all lanes that wanted to append. This is retrieved by WaveActiveBallot(IsLightCulledInCurrentThread()). The bitmask first gives us how many lanes wanted to append (countbits() function), and the first lane does a single InterlockedAdd() with this value on the groupshared light counter. The rest of the lanes don’t do the atomic, so atomics are significantly reduced. The first lane’s InterlockedAdd() result is the wavefront’s offset into the groupshared light index list, and it is broadcast to the other lanes via the WaveReadLaneFirst() intrinsic. Each remaining lane then just writes its light index to the list, starting from the broadcast offset, plus a prefix sum of the ballot bitmask (WavePrefixSum()) that gives its own offset within the wavefront.
  • Stream compaction is unordered, because the item append order relies on one atomic counter. That is fine for lights, which use additive blending, where the order of operations doesn’t matter for the final output. But once we have decals or environment probes, those are not blended additively, so the undefined order of the items would result in flickering between tiles. To avoid that, we can do a bitonic sort in the compute shader. A bitonic sort in a compute shader is a heavyweight operation, but we can keep track of whether there were any decals or probes in the tile and only do the sorting if there are. A bitonic sort that runs in a single thread group looks like this (from Gareth Thomas’s tile based particle rendering talk):
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
{
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
    {
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
        {
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
            if( nSwapElem < numArray && index < numArray )
            {
                if( Array[ index ] < Array[ nSwapElem ] )
                {
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
                }
            }
            GroupMemoryBarrierWithGroupSync(); // synchronize the group between merge steps
        }
    }
}
  • The best culling granularity in my experience turned out to be having 16×16 pixel tiles and 16×16 threads in the culling threadgroup. A large thread group is beneficial, because we will have a large groupshared memory reservation, and the bitonic sort works best with a large thread count. Remember that even though we are only interested in the decal and probe ordering, we have to sort the whole entity array, including lights, because lights are in the same array as decals and probes. A large thread group and large groupshared memory usage means reduced occupancy of the shader, so parallel execution between thread groups is reduced.
  • We are still only in the culling shader phase, and another difference is how the groupshared memory (also called LDS or local data share) is initialized at the beginning. The depth range min-max and the 2.5D culling mask are initialized, and the counter that is responsible for inserting the light indices into the array is zeroed out. The LDS light array doesn’t need to be initialized.
  • At the end of the culling shader, we export the light list by having each thread export one light index. Also, the first thread exports a “tile statistics” uint that says how long the list is: how many decals, probes and lights are inside it.

Rendering phase:

  • We process the light list in the forward+ object shader or the tiled deferred compute shader trivially. We read the “tile statistics” uint first and determine the decal count, probe count and light count in the tile. The following uints are indices into the global entity array that the camera can see, so we just iterate through all of them until we reach the last one. Having the decal and probe count is handy, because for some objects, particles, etc. we don’t want to apply decals, so we can just skip over them and start with the first light index instead. Also, if we process decals and the top decal fully covers the ones under it, we can skip processing those and continue with the lights.
  • We have high divergence if we process like this inside a pixel shader (forward+), but we can help a bit by using wave intrinsics to scalarize it. It is not trivial and in the worst case it can lead to redundant processing of items. See more info in Drobot’s presentation, the section about “hierarchical container scalarization”.
  • The good thing about processing a compact light index list is exactly that it is compact: every light index that we process will be contributing to the tile.
Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

  • Each tile contains bit buckets. One bit bucket is a 32-bit uint that indexes up to 32 lights. So each tile contains max_lightcount_in_view/32 number of uints. That means that for the same max light count per tile as the stream compaction method, we use 32x less memory.
  • The culling shader simply computes which bucket can index the currently processed light, then computes which bit maps to the light index, then performs an atomic OR operation on the groupshared light bucket list. The groupshared light bucket list is max_lightcount_in_view/32 number of uints. The following code piece demonstrates this:
groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
{
    const uint bucket_index = lightIndex / 32;
    const uint bucket_place = lightIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
}
  • Bitmask insertion could probably be scalarized in a way that no atomic operations would be necessary. I imagine we would just do a WaveActiveBitOr() instead of InterlockedOr() on the bucket, and if it is the first lane (WaveIsFirstLane()) it would just write the bucket without using atomics. (unfortunately I can’t get sm6.0 working on my current laptop, so I haven’t tried this yet)
  • Bitmask insertion is ordered: we always get a deterministic order of items, which is the same order in which the CPU assembled the global light array. If we have decals and environment probes, we only need to ensure that they are culled and added to the global light list (or should we say entity list at this point) before the lights, on the CPU. The reason is that we want to apply decals before applying lights, so that they are lit just like the surface underneath. Environment probes can be applied before or after the lights, because they are added to the specular result. However, ordering between locally placed probes does matter, because they are blended amongst themselves with alpha blending. So in the end, we don’t have to sort in the culling shader with the flat bit array technique. That is a big relief and greatly simplifies the shader.
  • The best culling granularity for flat bit arrays also turned out to be 16×16 pixel tiles; however, we can reduce the thread group size to, say, 8×8 for better occupancy, because there is no sort and little LDS use compared to the compaction technique. This adds some shader complexity, because each thread becomes responsible for a small subgroup of the tile. Consider that with 16×16 pixels and 16×16 threads, each thread can work on a single pixel when reading the depth buffer and creating the depth bounds and maybe a 2.5D culling bitmask (Takahiro Harada’s 2012 Siggraph Asia talk, and also my blog about it). But when reducing the thread count to 8×8, each thread must load a 2×2 pixel block instead. However, this also reduces the atomic contention when computing the min-max depth bounds or the 2.5D bitmask (4x fewer atomic operations for these phases), because the operations inside the subgroup are performed non-atomically; the whole tile result is then produced by atomic operations between the subgroups, so in the end fewer atomics are performed overall. We don’t have to modify the culling part of the shader though, because that doesn’t work at thread/pixel granularity, but at thread/entity granularity instead. The last thing worth mentioning here is that an 8×8 thread group seems like a perfect fit for the AMD GCN architecture, which has 64 threads in a wavefront, so the barriers in the compute shader can even be omitted completely. On Nvidia, the wavefront (warp) size is 32 threads, but modifying the thread group to be for example 8×4 threads would also require redesigning how subgroups are processed, which I didn’t tackle for now. It would be an interesting experiment to see how different subgroup distributions perform. In my implementation, a thread operates on a 2×2 pixel block, but it could be rewritten so that a thread operates on a row of 4 pixels or something else. Maybe an 8×8 pixel block should be processed first, then move on to the next 8×8 pixel block, and do this 2×2 times. I need to try these and compare. Another idea here: the 2×2 pixel block is also nice for gathering from the depth buffer; we could use one GatherRed() instruction to load 4 samples for the whole subgroup, however that would require a custom code path that won’t work if we choose another kind of distribution.
  • We don’t have a light array counter in the flat bit array technique; instead all the light buckets need to be set to zero, indicating that no lights are contained yet. This is done by each thread zeroing out one bucket (one uint) until all of them are zeroed out. Apart from that, the common things such as the tile depth range are also initialized as always.
  • At the end of the culling shader, each thread exports one bucket, until all buckets have been written to global memory. There is no need for a “tile statistics” entry, because the bitfield directly corresponds to all entities inside the main camera. So the tile statistics can be replaced with just a constant buffer value that contains how many decals/envprobes/lights are inside the main camera.

Rendering phase:

  • Processing the flat bit arrays relies on the firstbitlow() intrinsic in HLSL that gives us the first set bit’s index in the uint bucket, starting from the lowest bit. If the bucket is zero, we continue with the next bucket (thus skipping over 32 non-visible lights); otherwise we call firstbitlow(bucket), compute the light index from the bucket index and the bit index, then process the light. Then remove the bit and repeat to get the next lowest set bit, until the bucket becomes zero. If we want to skip decals, we can do that here too: we can trivially start from the specific bucket’s specific bit that we want to start iterating on. For example, if we want to leave out the decals and there are 42 decals inside the camera (so the first light is entity index 42), we can start processing from the second bucket and mask off its first 10 bits (42-32=10). Here is pseudo-code for how we can extract light index bits from the buckets:
// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because tile buffer is linear array:
//  g_tileCount = ScreenResolution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for (uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
{
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while (bucket_bits != 0)
    {
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        Light light = LightArray[entity_index];
        // process light here...
    }
}
  • Processing these in pixel shaders for forward+ will still yield high VGPR usage because of divergent access to the buckets. However, the scalarization of the buckets themselves is really elegantly done with a single line: bucket_bits = WaveReadLaneFirst(WaveActiveBitOr(bucket_bits)); After the buckets are scalarized, loading from the global light array (entity array) is also scalar.
  • A thing to note about flat bit arrays is that there can be “holes” in the buckets. Let’s say lightIndex 2 and lightIndex 987 are the only ones contained inside the tile: then the buckets between bucket 0 and bucket 30 are zero, but they can’t be detected in a simple way, so those buckets will be read but not processed. With another level of indirection this effect could be reduced, for example having some coarse tiles that are culled with stream compaction, where the finer tile bit buckets would contain indices into those coarse tile light lists. I don’t expect this to be a huge problem, so I haven’t gone there, and it would require the sorting again too. Also, as pointed out by Matt Pettineo on Twitter, the indirection doesn’t have to use stream compaction, but could be another bitmask, where each bit represents one bucket of 32 entities.
Light culling in Bistro


I’ll compare some results on the Bistro scene. The timings are GPU times measured with timestamp queries; the configuration:

  • Common:
    • Nvidia GTX 1050 laptop GPU
    • Bistro scene (inside + outside)
    • 1000 lights, 0 decals, 0 probes
    • Forward+ rendering
    • 4096 max lights in frustum
    • non-scalarized versions (DX11)
  • Stream compaction method:
    • 255 max lights per tile (255 lights + 1 tile statistics)
    • light culling (4k): 5.6 ms
    • light culling (1080p): 1.4 ms
    • opaque geometry (4k): 43.6 ms
    • opaque geometry (1080p): 13 ms
  • Flat bit array method:
    • 4096 max lights per tile (4096 / 32 = 128 buckets per tile)
    • light culling (4k): 2.5 ms
    • light culling (1080p): 0.64 ms
    • opaque geometry (4k): 41.7 ms
    • opaque geometry (1080p): 12.6 ms

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

Another small test I made was for the decals. I just used a simple Cornell box, then placed 100 decals into one spot, overlapping with each other, at 4k resolution and covering a small part of the screen:

  • Test: 100 small overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 3.63 ms
    • opaque geometry: 3.67 ms
  • Flat bit arrays method:
    • decal culling: 1.33 ms
    • opaque geometry: 2.96 ms
(That was the 100 small decals; red tile = 100 decals, black = 0 decals)

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

  • Test: 100 full screen overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 17.82 ms
    • opaque geometry: 84.57 ms
  • Flat bit arrays method:
    • decal culling: 2.12 ms
    • opaque geometry: 71.48 ms

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

  • One interesting thing I noticed: at first I had one huge loop, and the decal + probe + light evaluation were all included in that single loop. Performance was horrible, especially on the Intel 620 GPU. It was something like +10 ms for the opaque geometry pass on a scene with 4 big shadow casting lights, even if no decals or probes were processed at all. After that, I broke up the loop into 3 loops: one for decals, one for probes and one for lights. The performance jump was very noticeable. This applies to both techniques; try to break up entities that are processed differently to achieve the best performance. This is because the compiler can do a better job reserving and reusing registers, so in the end you can have fewer registers in use even if there are more loops, since register lifetime can be constrained to loop scope (if the loops are not unrolled).
  • Another thing related to the Intel 620 GPU was the annoying fact that I noticed some small flickering of the tiles with the stream compaction technique, which only happened rarely, even with a completely static scene. I could never figure it out, but it doesn’t happen with the flat bit array technique. Now I suspect that it might be related to the thread group size, because I also had bugs with the tiled deferred compute shader when it was running with a 16×16 thread group size and using a RW texture.
  • After getting to know the flat bit array technique, I also implemented a similar one for the old-school forward rendering path. For that, the CPU can cull lights per object and provide a per draw call bitmask in a constant buffer. The bitmask is a uint4, where 1 uint represents 32 decals, 1 uint represents 32 probes, and 2 uints allow 64 lights to be evaluated for forward rendered objects. I never wanted to do this previously, because then a big constant buffer needs to be updated with light indices per draw call. Another candidate for sending this bitmask to the shader would be the instance buffer, but having it in the constant buffer means scalarized loads (and we also don’t have to propagate the mask from the vertex to the pixel shader; fewer parameters to send is good to avoid a parameter cache bottleneck). Unfortunately, culling efficiency is reduced if many instances are visible, because a constant buffer has per draw call granularity instead of per instance.
  • It will also be interesting to try not using a culling compute shader, but relying on conservative rasterization (or MSAA) and drawing the light proxies instead; those would then do atomic writes into the tile buffer in the pixel shader. Michal Drobot suggested that this can be faster than running the compute shader and doing software culling, because rasterization will make use of the zbuffer, and zbuffer tile acceleration (coarse conservative z-test in hardware that supports it, not user controlled). A minor issue with this might be that not everything supports conservative rasterization or a forced sample count setting for MSAA today (conservative rasterization can be approximated with MSAA).

Implementation samples:

As always, you can find my implementation of these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will be updated, these will remain as they are to represent the state of this blog:

Some other light culling resources:

Simple job system using standard C++

After experimenting with the entity-component system this fall, I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus exploiting cache prefetching when iterating), and also to make it easier to write parallelized code (because data dependencies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system; generally it should perform well for any large data processing task. I wanted to remain in standard C++ (~11) realm, which I found is entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

  1. Interface – how do we want to use it (I hope you’re familiar with lambdas)
  2. Implementation – let’s see the code
  3. Performance – I guess this would be pointless without it
  4. Windows platform specific extension – optional, if you can still keep going
  5. Closure – all must come to an end

I think the best way to get my point across is to just post the interface code here, because it is very short, and then make a few comments:

#include <cstdint>       // uint32_t
#include <functional>

// A Dispatched job will receive this as function argument:
struct JobDispatchArgs
{
	uint32_t jobIndex;
	uint32_t groupIndex;
};

namespace JobSystem
{
	// Create the internal resources such as worker threads, etc. Call it once when initializing the application.
	void Initialize();

	// Add a job to execute asynchronously. Any idle thread will execute this job.
	void Execute(const std::function<void()>& job);

	// Divide a job onto multiple jobs and execute in parallel.
	//	jobCount	: how many jobs to generate for this task.
	//	groupSize	: how many jobs to execute per thread. Jobs inside a group execute serially. It might be worth increasing for small jobs
	//	job		: receives a JobDispatchArgs as parameter
	void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job);

	// Check if any threads are working currently or not
	bool IsBusy();

	// Wait until all threads become idle
	void Wait();
}

How would this work? First, we would call the JobSystem::Initialize() function somewhere in our application, once (before we want to use the job system for the first time). Probably the best place is when we initialize the application, before doing any resource loading, because the job system can also handle resource loading efficiently.
Then the simplest way we can make a job is by calling JobSystem::Execute(myTask); where myTask is just a simple function without parameters. I use it for example when initializing engine systems, like this:

JobSystem::Execute([] { /*do stuff 1*/ InitGraphics(); });
JobSystem::Execute([] { /*do stuff 2 which is not related to stuff 1*/ InitPhysics(); });
JobSystem::Execute([] { /*do stuff 3 which is not related to the previous stuff*/ InitInput(); });
JobSystem::Execute([] { /*do stuff 4 which is not related to the previous stuff*/ InitNetwork(); });
JobSystem::Execute([] { /*you probably already get the idea*/ InitSound(); });

The above code creates 5 jobs, which are probably already utilizing 5 worker threads (if Initialize() created that many workers, but more on that later).
Then we can call JobSystem::Wait(); to simply wait for all the jobs to finish, before doing anything that depends on their completion. For example this code would utilize 3 threads:

JobSystem::Execute([] { /*do stuff 1*/ });
JobSystem::Execute([] { /*do stuff 2*/ });

Where the main thread is responsible for notifying two worker threads that there are some jobs, and then waiting. It also means that overall the main thread is doing extra bookkeeping work instead of just doing the jobs one by one. But this should be fine when looking at the big picture, because some work is a good fit for being executed in parallel (if there is no dependency between the pieces). We should only do jobs in parallel if they are a good fit; a lot of the time it is still just better to do the work on the main thread. Oh, and by dependency I mean that if there is any variable that is written by one job and read by another, then there is a dependency.

The IsBusy() function is a way for the main thread to just see if any worker threads are busy. It can be used if we don’t actually want to wait for work to finish, and the main thread has other meaningful work to do. For instance, drawing a loading screen is a good example.

The last one is the Dispatch function. This is the most interesting and most useful one in my opinion. I was inspired by how compute shaders work on the GPU, and really badly wanted to use something like that on the CPU. You can think of it like the Execute function, but it actually creates multiple jobs instead of one (the jobCount argument). Let’s skip the groupSize argument for now and return to it shortly. The third argument is the job itself, which will receive a JobDispatchArgs argument identifying which instance of the job is currently running, and which group (hang in there). To demonstrate how it will be used, look at this loop iterating over some large dataset:

const uint32_t dataSize = 1000000;
Data dataSet[dataSize];
for (uint32_t i = 0; i < dataSize; ++i)
{
	dataSet[i].Compute(i); // Compute() stands in for the expensive per-element work
}

And let’s rewrite the same thing but this time using Dispatch, to offload it to multiple threads:

const uint32_t groupSize = 1;
JobSystem::Dispatch(dataSize, groupSize, [&dataSet] (JobDispatchArgs args) {
	dataSet[args.jobIndex].Compute(args.jobIndex); // the same per-element work, now indexed by jobIndex
});

The above code does the same thing as the loop, but now several worker threads each operate on elements of dataSet. Here is where the groupSize argument comes into play. We actually don’t want each thread to only work on a single element. Not only because of caching, but because then 1000000 jobs would need to be generated, which would put a significant burden on the main thread that is creating the jobs. Instead, we should specify, let’s say, const uint32_t groupSize = 1000; or something like that. That means one thread will operate on a batch of 1000 elements of the dataSet. In every case where we put a Dispatch in our application, the groupSize must be considered, just like when we are writing compute shaders – which are really similar to this and have the same purpose: performing work over a large dataset, where data is mostly independent of other data in the set. If our dataSet has elements with dependencies, then it might not be a good candidate – unless we can think of a clever way to utilize the groupSize argument. The thing with groupSize is that with groupSize = 1000, the 1000 sub-jobs of a group will run serially on the same thread – so there can be dependencies inside a group. But here it starts to get off topic, so I recommend getting experienced with compute shaders and groupshared memory, because the same principles can be applied here. I won’t go into handling groupshared memory in this post.

Now that we see how we want to use the job system, let’s implement the internal workings of it, using standard C++ constructs.

I will go over this mostly in the order I described the interface. But before getting to the first function, which is Initialize(), let’s define the internal state, include files, stuff like that:

#include "JobSystem.h"    // include our interface

#include <algorithm>    // std::max
#include <atomic>    // to use std::atomic<uint64_t>
#include <thread>    // to use std::thread
#include <condition_variable>    // to use std::condition_variable
#include <mutex>    // to use std::mutex and std::unique_lock

namespace JobSystem
{
	uint32_t numThreads = 0;    // number of worker threads, it will be initialized in the Initialize() function
	ThreadSafeRingBuffer<std::function<void()>, 256> jobPool;    // a thread safe queue to put pending jobs onto the end (with a capacity of 256 jobs). A worker thread can grab a job from the beginning
	std::condition_variable wakeCondition;    // used in conjunction with the wakeMutex below. Worker threads just sleep when there is no job, and the main thread can wake them up
	std::mutex wakeMutex;    // used in conjunction with the wakeCondition above
	uint64_t currentLabel = 0;    // tracks the state of execution of the main thread
	std::atomic<uint64_t> finishedLabel;    // tracks the state of execution across the background worker threads

	// ...And here will go all of the functions that we will implement
}

Mostly everything here is standard C++, except the jobPool. It could also be done with a std::deque, but then we would also have to make it thread safe by using mutexes or some other kind of synchronization primitive. Or if you are interested, here is my very simple thread safe queue (or ring buffer) implementation:

// Fixed size very simple thread safe ring buffer
template <typename T, size_t capacity>
class ThreadSafeRingBuffer
{
public:
	// Push an item to the end if there is free space
	//	Returns true if successful
	//	Returns false if there is not enough space
	inline bool push_back(const T& item)
	{
		bool result = false;
		lock.lock();
		size_t next = (head + 1) % capacity;
		if (next != tail)
		{
			data[head] = item;
			head = next;
			result = true;
		}
		lock.unlock();
		return result;
	}

	// Get an item if there are any
	//	Returns true if successful
	//	Returns false if there are no items
	inline bool pop_front(T& item)
	{
		bool result = false;
		lock.lock();
		if (tail != head)
		{
			item = data[tail];
			tail = (tail + 1) % capacity;
			result = true;
		}
		lock.unlock();
		return result;
	}

private:
	T data[capacity];
	size_t head = 0;
	size_t tail = 0;
	std::mutex lock; // this just works better than a spinlock here (on windows)
};

It is templated, so you can use it for any type of data. The nice thing about it is that it is very simple and only uses a fixed size memory allocation. It can’t reallocate memory and doesn’t do anything clever, so performance might be more reliable than std::deque. Also, when an element doesn’t fit because there is no more space, push_back returns false; and if the buffer is empty and you want to remove an element, pop_front returns false. If you got false, you can just try again, and maybe that time it will be successful, because another thread grabbed/put a job in the meantime.

Now we have everything we need, except the actual functions. Here is what Initialize() looks like:

void Initialize()
{
	// Initialize the worker execution state to 0:
	finishedLabel.store(0);

	// Retrieve the number of hardware threads in this system:
	auto numCores = std::thread::hardware_concurrency();

	// Calculate the actual number of worker threads we want:
	numThreads = std::max(1u, numCores);

	// Create all our worker threads while immediately starting them:
	for (uint32_t threadID = 0; threadID < numThreads; ++threadID)
	{
		std::thread worker([] {

			std::function<void()> job; // the current job for the thread, it's empty at start.

			// This is the infinite loop that a worker thread will do 
			while (true)
			{
				if (jobPool.pop_front(job)) // try to grab a job from the jobPool queue
				{
					// It found a job, execute it:
					job(); // execute job
					finishedLabel.fetch_add(1); // update worker label state
				}
				else
				{
					// no job, put thread to sleep
					std::unique_lock<std::mutex> lock(wakeMutex);
					wakeCondition.wait(lock);
				}
			}

		});

		// *****Here we could do platform specific thread setup...

		worker.detach(); // forget about this thread, let it do its job in the infinite loop that we created above
	}
}

So, in case the comments didn’t do it justice, let me recap: first, we initialize the worker-side label state, but don’t worry about it yet, it will make sense later (hopefully). Then C++ has the neat hardware_concurrency function that can tell us how many cores (including hyperthreaded virtual cores) our CPU has, so we create that many worker threads in the for loop. Each thread is created with an infinite loop that it will be running until our application exits. But most of the time each thread will just sleep (until the main thread wakes it up), to not spin the CPU wastefully.

Let’s see the Execute function next:

// This little helper function will not let the system deadlock while the main thread is waiting for something
inline void poll()
{
	wakeCondition.notify_one(); // wake one worker thread
	std::this_thread::yield(); // allow this thread to be rescheduled
}

void Execute(const std::function<void()>& job)
{
	// The main thread label state is updated:
	currentLabel += 1;

	// Try to push a new job until it is pushed successfully:
	while (!jobPool.push_back(job)) { poll(); }

	wakeCondition.notify_one(); // wake one thread
}

Sorry if you got distracted by the poll function above; I made it a separate function because it will be used in more places. I will explain, but first, let’s digest the Execute function: The main thread’s label is incremented. This means that one job is being submitted, so the JobSystem only becomes idle when the worker label reaches the same value. In the previous Initialize function, you can see that in the infinite loop, after a job is executed, the worker label state is also incremented by one. But there it is an atomic operation, because all the workers will be updating that single label.
Then we just push the new job onto the end of the jobPool. As I said, my ringbuffer could be full, so we will try to push the job until it actually succeeds. If it doesn’t succeed, then we invoke the poll function, which will wake up one worker, and yield the main thread. Doing this, we can be sure that the jobPool will eventually be drained by the workers, and the main thread can then schedule the current job.
But maybe, the job was just pushed fine the first time around, and the poll() never executed. In that case, we just wake up any one worker thread that will start working on the job immediately.

I hope I am not putting anyone to sleep yet. Let’s follow with the IsBusy and Wait functions:

bool IsBusy()
{
	// Whenever the main thread label is not reached by the workers, it indicates that some worker is still alive
	return finishedLabel.load() < currentLabel;
}

void Wait()
{
	while (IsBusy()) { poll(); }
}

You can see how we are checking the state of the main thread and the state of the workers in the IsBusy function. This is also used by the Wait function if we just want to put the main thread to idle and drain all the pending jobs.

Now, with implementing the Dispatch function, you will get full insight into the Job System’s inner workings. This is the most complicated of them all, but in a way it is not very different from the Execute function; it just does the same thing multiple times, with a few little twists:

void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)>& job)
{
	if (jobCount == 0 || groupSize == 0)
	{
		return;
	}

	// Calculate the amount of job groups to dispatch (overestimate, or "ceil"):
	const uint32_t groupCount = (jobCount + groupSize - 1) / groupSize;

	// The main thread label state is updated:
	currentLabel += groupCount;

	for (uint32_t groupIndex = 0; groupIndex < groupCount; ++groupIndex)
	{
		// For each group, generate one real job:
		auto jobGroup = [jobCount, groupSize, job, groupIndex]() {

			// Calculate the current group's offset into the jobs:
			const uint32_t groupJobOffset = groupIndex * groupSize;
			const uint32_t groupJobEnd = std::min(groupJobOffset + groupSize, jobCount);

			JobDispatchArgs args;
			args.groupIndex = groupIndex;

			// Inside the group, loop through all job indices and execute job for each index:
			for (uint32_t i = groupJobOffset; i < groupJobEnd; ++i)
			{
				args.jobIndex = i;
				job(args);
			}
		};

		// Try to push a new job until it is pushed successfully:
		while (!jobPool.push_back(jobGroup)) { poll(); }

		wakeCondition.notify_one(); // wake one thread
	}
}

Take a deep breath, and let’s tackle this slowly. First, the jobCount (which you can think of as the “loop size”) can be a dynamic parameter, so if it is 0, just return immediately, because we don’t want to schedule 0 jobs.
Then the groupCount is calculated. Don’t confuse this with groupSize, that you specified as an argument to Dispatch. The groupCount is calculated from the jobCount and groupSize and tells us how many worker jobs (or groups) will be put onto the jobPool. There is an integer divide, but we must overestimate it, because let’s say that we want to have jobCount=3, groupSize=2. From that we want to create 2 actual worker jobs (groups), but the integer divide of jobCount/groupSize would give us only 1 group. We must instead schedule 2 groups, each with 2 jobs. But wait, that would be 4 jobs then, even though we specified jobCount=3. We will just discard the out of bounds jobs later.
Then the main thread label is updated with adding in the groupCount. That is how many workers must finish from this point for the JobSystem to become idle.
The following for loop goes over every group, creates a worker job for it with a lambda, passing in the required parameters to populate the JobDispatchArgs which our dispatched job needs to receive. That lambda is the one that will be executed by a worker thread. It contains logic to populate a JobDispatchArgs and run through all inner sub-jobs while making sure that no extra out of bounds iterations are performed. I think this might be the most complicated part if you are going to try to wrap your mind around someone else’s code. Hang in there!
After that group-worker-lambda thing, we do just like the Execute function, try to schedule the group and get the worker threads working immediately on it.

I’m not sure if I made sense of all this, but at this point the JobSystem is actually ready to be used! Let’s take a look at what kind of performance improvements we can expect right off the bat, then continue reading:

The first thing I tried was to use JobSystem::Execute() in my Wicked Engine when initializing the main engine systems, so that each system initialization would be performed in its own Execute. The system initialization tasks were large independent tasks, operating on arbitrary/different data, a good fit for the Execute function. We can make up a test case by just Executing Spin functions, which busy-loop for some amount of time and don’t let the CPU rest until the time has passed. Let me give you some helpers for spin and a simple timer that you can use too in a small console application:

#include "JobSystem.h"

#include <iostream>
#include <chrono>
#include <string>

using namespace std;

void Spin(float milliseconds)
{
	milliseconds /= 1000.0f; // the duration below is measured in seconds
	chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
	double elapsed = 0;
	while (elapsed < milliseconds)
	{
		chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
		chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(t2 - t1);
		elapsed = time_span.count();
	}
}

struct timer
{
	string name;
	chrono::high_resolution_clock::time_point start;

	timer(const string& name) : name(name), start(chrono::high_resolution_clock::now()) {}
	~timer()
	{
		auto end = chrono::high_resolution_clock::now();
		cout << name << ": " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << " milliseconds" << endl;
	}
};

int main()
{
	JobSystem::Initialize();

	// Serial test
	{
		auto t = timer("Serial test: ");
		Spin(100);
		Spin(100);
		Spin(100);
		Spin(100);
	}

	// Execute test
	{
		auto t = timer("Execute() test: ");
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Wait();
	}

	return 0;
}
The Spin function just emulates a working CPU for a set amount of milliseconds, the timer struct is a scoped timer which will record how much time has passed until the destructor was called – so I just made two scopes for the two tests in the main function. The results are for me:

Serial test: : 400 milliseconds
Execute() test: : 101 milliseconds

The machine I currently ran this test on is a dual core, but has 4 hardware threads, so the results make sense. If I increase the jobs to 6, the times are as follows:

Serial test: : 600 milliseconds
Execute() test: : 200 milliseconds

The reason for this is that at most 4 jobs were executing in parallel, then the last two remaining jobs took over after that, also in parallel to each other. The results are just as we intended!

Now, let’s test the Dispatch performance. I came up with this quick test case:

struct Data
{
	float m[16];
	void Compute(uint32_t value)
	{
		for (int i = 0; i < 16; ++i)
		{
			m[i] += float(value + i);
		}
	}
};
uint32_t dataCount = 1000000;

// Loop test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("loop test: ");

		for (uint32_t i = 0; i < dataCount; ++i)
		{
			dataSet[i].Compute(i);
		}
	}
	delete[] dataSet;
}

// Dispatch test:
{
	Data* dataSet = new Data[dataCount];
	{
		auto t = timer("Dispatch() test: ");

		const uint32_t groupSize = 1000;
		JobSystem::Dispatch(dataCount, groupSize, [&dataSet](JobDispatchArgs args) {
			dataSet[args.jobIndex].Compute(args.jobIndex);
		});
		JobSystem::Wait();
	}
	delete[] dataSet;
}

Notice that I create the dataSet two times, once for each test. This is important to avoid reusing cached values for the second run of the test. And the results (optimizations disabled):

loop test: : 140 milliseconds
Dispatch() test: : 59 milliseconds

Neat. But when I enabled optimizations, I got different results:

loop test: : 28 milliseconds
Dispatch() test: : 31 milliseconds

That’s not cool, now the simple loop is faster! Then I tried increasing the groupSize to 10000:

loop test: : 28 milliseconds
Dispatch() test: : 15 milliseconds

Ok, now dispatch is nearly two times faster, but still not 4 times like we would expect. After trying some other groupSizes, I could not improve much on this. This behaviour can depend on several factors, such as caching, threads switching cores, the operating system preempting the threads, contention for the jobPool, or just the nature of the workload being executed, all of which need much more investigation, possibly with a dedicated profiling tool. These days there are at least two big ones, Intel Vtune and AMD CodeXL, both of which I am unfamiliar with at the moment.

Windows platform specific extension
If we are on Windows, we can try some WinAPI functions to extend threading functionality. I will just name a few useful ones. You can put these in the Initialize function after the thread has been created (you can search for ***** to find that place). Also, I am doing an #include <Windows.h>, #include <sstream> and #include <assert.h> for this next part.

First, we must retrieve the native thread handle from the std::thread, like this:

HANDLE handle = (HANDLE)worker.native_handle();

Once we have that, we are free to pass it to the WinAPI functions. For example, we can set a thread affinity mask to put the thread on a dedicated hardware thread, instead of letting the operating system switch it between all possible cores on demand. This piece of code puts the thread we just created onto the one core specified by threadID:

DWORD_PTR affinityMask = 1ull << threadID;
DWORD_PTR affinity_result = SetThreadAffinityMask(handle, affinityMask);
assert(affinity_result > 0);

So the affinityMask is a bitmask in which each bit corresponds to one hardware thread, starting from thread 0; only bits below what hardware_concurrency gives us should be set. By default, the affinity mask should be 2^hardware_concurrency() – 1 (all cores allowed), which is also what SetThreadAffinityMask returns into the affinity_result variable, since it returns the thread’s previous mask.

We can increase thread priority like so:

BOOL priority_result = SetThreadPriority(handle, THREAD_PRIORITY_HIGHEST);
assert(priority_result != 0);

By default, each thread we create runs at THREAD_PRIORITY_NORMAL. Modifying this could help threads not be preempted by the operating system in favour of lower priority threads. I’ve found no way to increase performance with this yet, only decrease it.

I found this next one very useful. We can give a name to our thread so that it will show up in the Visual Studio debugger:

std::wstringstream wss;
wss << "JobSystem_" << threadID;
HRESULT hr = SetThreadDescription(handle, wss.str().c_str());

My general suggestion is not to mess with platform specific extensions unless you are sure that you are getting a benefit. In some cases, such as modifying priority, you can cause a large performance reduction too.

If you stayed till the end, thank you. This simple job system is ready to be used and can offer a generous speedup, but there is definitely room for improvements, especially for the Dispatch function. I am planning to give an in-depth look to Intel Vtune (they have a trial version now) and AMD CodeXL soon, and possibly follow up on this with some improvements. As always, you can take a look at the current implementation in my Wicked Engine.
The code in the engine is likely to change over time, but I also made a sample for this blog post alone which will stay like this. It contains all the code for the JobSystem and the tests I showed.

UPDATE: This job system doesn’t have the ability to run jobs from within jobs. The implementation in Wicked Engine was updated to support this with not much more added complexity.

Thank you for reading! If you find a mistake, or have suggestions, I would like to hear it!

Scalable GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provides the general steps of implementation. It is not a step-by-step tutorial, but rather an introduction to algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, and a dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):


Smooth Particle Hydrodynamics

As a starting point, let’s get familiar with SPH. You can probably already find several papers on the subject. I chose this short paper. I would like to skip the in-depth theory behind it here and focus on implementation instead, to keep it as short as possible. The basic idea is that we are approximating fluid with particles, each with a finite range of influence. Each particle can influence every particle inside its range (smoothing radius). To simulate the interaction forces, we will compute the following for each particle: density -> pressure -> pressure-based forces -> viscosity-based forces -> constant gravitational force. After we have computed these forces, we can use the combined force value to perform integration of velocity and update particle positions. In the paper we can find the following formulas that we must implement as shader code (I highlighted and annotated key areas):

  • Density evaluation with Poly6 kernel smoothing function + pressure computation
    • Let’s translate that into (pseudo-)code:
      particleA.density = 0; // we will compute this (float)
      // Precompute part of Poly6
      //  this should be constant buffer value
      //  h = smoothing radius parameter (float) (default = 1.0f)
      //  h9 = pow(h, 9)
      //  PI = 3.1415...
      const float Poly6_constant = (315.0f / (64.0f * PI * h9)); 
      for(all particleB that is either near particleA or is particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        if(r2 < h2) // h2 = h*h
          const float W = Poly6_constant * pow(h2 - r2, 3);
          particleA.density += particleB.mass * W;
      // avoid negative pressure by clamping density to reference value:
      particleA.density = max(p0, particleA.density);
      // Compute pressure:
      //  K = pressure constant parameter (float) (default = 250.0f)
      //  p0 = reference density param (float) (default = 1.0f)
      particleA.pressure = K * (particleA.density - p0);
    • particleA corresponds to the particle we are evaluating, particleB is the particle which is affecting the current particle. In the formulas, particleA is the “i” iterator, particleB is the “j” iterator. I found that names like this are more readable in code.
    • The loop iteration is a placeholder for now; how we should implement it will be addressed later.
    • Note that we not only evaluate nearby particles, but the particle itself, too. This is essentially just omitting the check inside the loop for whether the particle is itself or not. Without this detail, the simulation will be jerkier, because we would lose the contribution of the particle’s own mass.
    • Note that we can avoid storing the pressure, because it can be computed from the density. We can compute it right before computing the forces instead of loading it, which will increase computation time but reduce memory access stalls.
  • Pressure-based force computation with Spiky kernel smoothing function
    • Looks like this in code:
      particleA.force1 = 0; // we will compute this (float3)
      // Precomputed part of Spiky kernel:
      //  h6 = pow(h, 6);
      //  again, it should be coming from constant buffer
      const float Spiky_constant = (-45 / (PI * h6));
      for(all particleB that is near particleA but is not particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        const float r = sqrt(r2);
        if(r > 0 && r < h) // **avoid division by zero!
          const float3 rNorm = diff / r; // same as normalize(diff)
          const float W = Spiky_constant * pow(h - r, 2);
          particleA.force1 += (particleB.mass / particleA.mass) * 
            ((particleA.pressure + particleB.pressure) / 
             (2 * particleA.density * particleB.density)) * W * rNorm;
      particleA.force1 *= -1;
    • Very important that unlike in the previous step, we must avoid computing the current particle against itself! The simulation immediately blows up if you don’t check!
  • Viscosity-based force computation with Laplacian smoothing function
    • In code:
      particleA.force2 = 0; // we will compute this (float3)
      // Most of it is the same as the previous snippet, 
      //  only the innermost calculations will differ
      for(all particleB that is near particleA but is not particleA)
       const float3 diff = particleA.position - particleB.position;
       const float r2 = dot(diff, diff);
       const float r = sqrt(r2);
       if(r > 0 && r < h) // **avoid division by zero!
         const float3 rNorm = diff / r;
         const float r3 = r2 * r;
         const float W = -(r3 / (2 * h3)) + (r2 / h2) + 
           (h / (2 * r)) - 1;
         particleA.force2 += (particleB.mass / particleA.mass) * 
           (1.0f / particleB.density) * 
           (particleB.velocity - particleA.velocity) * 
           W * rNorm;
      // e = viscosity constant (float) (default = 0.018f)
      particleA.force2 *= e;
    • All of this should be combined with the pressure-based force computations, into the same loop!
    • The viscosity constant will control how fluid our simulation is.
  • Forces integration
    • Consists of following steps:
      • Add force to velocity
      • Step position along velocity vector
      • Reset force
    • Using a fixed timestep so that the simulation will not blow up
    • This step must be separate from the SPH force evaluation, because the viscosity-based force computation takes velocity into account, thus velocity must not be modified while the forces are still being computed!
    • They say that a code piece can say a thousand words:
      const float deltaTime = 0.016f; // fixed step
      const float3 G = float3(0, -9.8f, 0);
      particleA.velocity += deltaTime * ((particleA.force1 + particleA.force2) / particleA.density + G);
      particleA.position += deltaTime * particleA.velocity;
      particleA.force1 = 0;
      particleA.force2 = 0;
    • Of course, the particle should not be storing two distinct forces, they can be combined into one.

N-body simulation

N-body simulation stands for the simple algorithm of checking each particle against each other particle in the simulation. This is how we should implement SPH-based fluid simulation first for simplicity. This can be still done real time for a couple thousand particles on a GPU, so it is ideal for us to test out how the formulas work and find good default input values for the simulation. However, this algorithm has a complexity of O(n^2), so it will not scale well after a certain threshold. The SPH smoothing radius will give us opportunity to come up with a better algorithm to reduce complexity and make it scalable later. The first attempt to create an N-body simulation in compute shaders, would involve a double loop iterating through the particle list:

#define THREADCOUNT 256

[numthreads(THREADCOUNT, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
  Particle particleA = particleBuffer[dispatchThreadID.x];
  for (uint i = 0; i < particleCount; ++i)
  {
    Particle particleB = particleBuffer[i];
    resolveParticles(particleA, particleB);
  }
}

Well, not *really* a double loop; the outer loop is the compute kernel itself. But there are already several problems with this. The first is that the SPH force computation needs the intermediate results of particle densities, but our compute shader is running thread groups in parallel, so we can’t just combine both in the same shader, as we can’t be sure that all the densities are computed by the time we want to evaluate forces. So we are splitting the simulation into two shaders: density evaluation and force evaluation. (Pressure evaluation can be done immediately before force computation, because it can be computed from density.) So far we have two shaders, each doing O(n^2) particle evaluations, each loading from global memory. This just screams slow. I also experimented with using shared memory to preload particles ahead of time and reduce latency. You can skip this step if you want, because the acceleration structure later will be a complete replacement, but this is a good exercise nonetheless and increases the performance of the N-body simulation:

#define THREADCOUNT 256
groupshared Particle sharedParticles[THREADCOUNT];

Particle particleA = particleBuffer[dispatchThreadID.x];

uint numTiles = (particleCount + THREADCOUNT - 1) / THREADCOUNT; // "ceil" divide

for (uint tile = 0; tile < numTiles; ++tile)
{
  uint offset = tile * THREADCOUNT;
  sharedParticles[flattenedGroupIndex] = particleBuffer[offset + flattenedGroupIndex];

  GroupMemoryBarrierWithGroupSync(); // wait until the whole tile is loaded into shared memory

  for (uint i = 0; i < THREADCOUNT; ++i)
  {
    Particle particleB = sharedParticles[i];
    resolveParticles(particleA, particleB);
  }

  GroupMemoryBarrierWithGroupSync(); // don't overwrite shared memory while other threads still read it
}

It is important to note that we are doing thread sync operations inside a dynamic loop. You are not allowed to do this if the loop is potentially divergent across the thread group. To have it compile, ensure that the loop range is loaded from coherent memory, such as constant buffer values or read only buffers. For example, you can’t load particleCount from an unordered access view, because the loop termination depends on it and it could potentially be divergent across the thread group.

Obviously, it is a bit simplified, to be more readable. We eliminated a bunch of latency by avoiding accessing the main memory and using group shared memory instead. This is the general way to speed up N-body simulation. It should be already good (real time) for simple fluid simulations up to ~20000 particles on an average GPU. We still need to do it two times though, first for computing densities, then for forces.

So, with this implementation, our fluid simulation can be written as running two extra shaders before particle velocity integration:

  1. SPH Density evaluation
  2. SPH force computation

Pressure computation can either be in density evaluation phase and exported to memory/read back in force phase, or just computed in force phase. I chose to actually precompute it in the force phase, and store it into shared memory per “tile”.

So far we can simulate a moderate amount of particles in real time with N-body collisions (4000 particles):


Dynamic Hashed Grid

So, hopefully you have the fluid simulator up and running before coming to this part. 🙂 Let’s eliminate that nasty O(n^2) complexity! The very basic idea behind SPH is that there is a smoothing radius for each particle and only particles inside the smoothing radius should be affecting the current particle. If we subdivide space to a uniform grid, with a grid cell size of smoothing radius, and sort each particle into its corresponding cell, we can perform efficient constant lookups of smaller particle lists. Let’s say, that we have a particle somewhere. We can look up which grid cell it belongs to by taking its position, and dividing it by smoothing radius:

int3 cellIndex = floor(particleA.position / h);

Then we have to evaluate all the particles inside that grid cell, and all the direct neighbour cells as well. In a three-dimensional grid, those are 3*3*3 = 27 cells. There are some problems that come to mind immediately:

  • We don’t want to just constrain the simulation to a finite grid, but we don’t have infinite memory
  • We don’t know how many particles can be inside one cell at once. We don’t want to impose limitation, because that could affect simulation results.
  • We can’t allocate memory with GPU code, so we will be using a fixed size pre-allocated pool.
  • We will need to build this data structure entirely on the GPU.

Fortunately, all of these are solvable with the dynamic hashed grid algorithm!

The first part we should solve is that each cell can contain an unbounded number of particles (well, bounded to the maximum number of allowed particles per simulation). I took this idea from a GDC 2008 presentation by Nvidia. I needed some time to wrap my mind around it, so hopefully I can describe it so that it will be easier for you. It is called “sorted grid” and it is described like this:


It can be tricky to digest, let me explain. Let’s start from left to right. We have a picture and three columns. The picture of the grid shows that we have grid cells 4, 6 and 9 containing some particles. It can be represented by the unsorted list (first column). It can be built on the GPU by iterating through the particle list and each particle writes its grid cell index into a list. But it is not yet usable to perform a constant time lookup. The second column is the sorted list. It’s the previous list, only the particle index list is now sorted by grid cell index. It is good for us, because now if we iterate through the particle index list, the corresponding grid cell indices are continuously laid out in ascending order. But we need the ability to look up an arbitrary cell in constant time and have a continuous array of particle list which it contains. We are nearly there. As the last step, we can process this list and create an “offset list” for each grid cell. We will have an array of grid cells each holding either

  • UINT32_MAX (0xFFFFFFFF) When the cell is empty
  • integer value >= 0 indicating the offset into sorted particle list

With this information, we can now say everything about a particle when we iterate through them:

  • Its corresponding grid cell
  • The grid cell’s offset into the sorted list
  • We know that the sorted list contains some number of particles in the same cell from the offset until we read a particle which is in another cell.

As we can see, there is no need for memory allocation apart from the pre-allocation, so it is doable on the GPU. We should follow these steps:

  1. Create the unsorted list. We will be running a compute shader with the SV_DispatchThreadID indexing the particle index list, which indexes the main particle property buffer containing position, simulation properties, etc. We will have a cellIndexBuffer, which is the same size as the particle index buffer. It could also be a property of the particle structure, but this layout is a better memory access pattern, because it is only used in the optional particle simulation path. Also, this way we can have a general purpose sorting shader more easily.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    Particle particle = particleBuffer[particleIndex];
    int3 cellIndex = floor(particle.position / h);
    uint flatCellIndex = GetFlatCellIndex(cellIndex);
    cellIndexBuffer[particleIndex] = flatCellIndex;
  2. Create the sorted list. Run a bitonic merge sort, which performs comparisons on the cellIndexBuffer but swaps elements of the particleIndexBuffer. Bitonic sort is a well researched parallel sorting algorithm, and you can pull an implementation from AMD’s particle demo on GitHub.
  3. Create the offset list. The offset list will be recreated from scratch each frame, so first ensure that it is completely cleared to invalid offsets (0xFFFFFFFF). Then we can exploit atomic operations to efficiently compute the cell offsets. We will dispatch a compute shader that reads one particle index from the SV_DispatchThreadID semantic. We can either load the corresponding grid cell index or compute it. Loading is preferable here, because we can achieve better cache coherency by not loading the position from the bloated particle buffer and converting it to a cell index. The atomic Min operation works because the offset buffer initially contains high values everywhere, so for each grid cell we can store the minimum over all the threads that had a particle corresponding to that cell.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    uint cellIndex = cellIndexBuffer[particleIndex];
    InterlockedMin(cellOffsetBuffer[cellIndex], dispatchThreadID.x);
  4. The structure is complete, all we have to do is iterate it!

Before going further, let’s address our remaining desire, that the grid should be “infinite”, not only constrained to a small area. This is a really simple extension, called hashing. The only thing that will change is the computation of the flat cell index. Let’s define the function

inline uint GetFlatCellIndex(in int3 cellIndex)
{
  const uint p1 = 73856093; // some large primes
  const uint p2 = 19349663;
  const uint p3 = 83492791;
  uint n = p1 * cellIndex.x ^ p2 * cellIndex.y ^ p3 * cellIndex.z;
  n %= TOTAL_CELL_COUNT; // TOTAL_CELL_COUNT: wrap into your pre-allocated, fixed size cell buffer
  return n;
}

The rest of the algorithm is unchanged; just ensure that wherever we look up a grid cell, we use the same function. Also, the choice of prime numbers and grid cell count can have a big impact on the efficiency of the hashing (because of hash collisions, multiple world positions can index the same physical cell), so we should tune these to our needs.

All that’s left now is iterating this structure to resolve the SPH simulation. It will be traversed two times, first for density evaluation, then for force computation. But the iteration itself is the same in both cases:

int3 cellIndex = floor(particleA.position / h);

for (int i = -1; i <= 1; ++i)
{
  for (int j = -1; j <= 1; ++j)
  {
    for (int k = -1; k <= 1; ++k)
    {
      int3 neighborIndex = cellIndex + int3(i, j, k);
      uint flatNeighborIndex = GetFlatCellIndex(neighborIndex);

      // look up the offset to the cell:
      uint neighborIterator = cellOffsetBuffer[flatNeighborIndex];

      // iterate through particles in the neighbour cell (if iterator offset is valid)
      while (neighborIterator != 0xFFFFFFFF && neighborIterator < particleCount)
      {
        uint particleIndexB = particleIndexBuffer[neighborIterator];
        if (cellIndexBuffer[particleIndexB] != flatNeighborIndex)
        {
          break; // it means we stepped out of the neighbour cell list!
        }

        // Here you can load particleB and do the SPH evaluation logic

        neighborIterator++; // iterate...
      }
    }
  }
}


This is it. It might be a good idea to eliminate the triple loop with a lookup table (but don’t unroll!), but it is better visualized like this for simplicity/code size. This will yield us a huge performance gain, and we are able to simulate 100 000 particles on a low end GPU in real time at ~30 FPS:


Apart from particle collision, it is a good idea (at least for debug purposes) to implement planar collisions, just to keep the simulation in bounds and to test how it can fill out a volume. Planar collision is simple: if a particle would go behind a plane, clamp its position to the plane and reflect the particle velocity about the plane normal. Beyond that, we can implement rigid collisions with unsimulated particles (for example, particles driven by the animation system). But that is out of scope for today.
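The planar response described above can be sketched like this (a minimal illustration; planeOrigin, planeNormal and elasticity are names I am assuming here):

```hlsl
// Plane given by a point and a unit normal; e.g. a ground plane:
float3 planeOrigin = float3(0, 0, 0);
float3 planeNormal = float3(0, 1, 0);
float elasticity = 0.5f; // damping on the bounce, assumed tunable parameter

// Signed distance of the particle from the plane:
float distanceToPlane = dot(particle.position - planeOrigin, planeNormal);
if (distanceToPlane < 0)
{
  // Clamp the position back onto the plane surface:
  particle.position -= planeNormal * distanceToPlane;

  // Reflect the velocity about the plane normal, with some energy loss:
  particle.velocity = reflect(particle.velocity, planeNormal) * elasticity;
}
```

An axis-aligned box bound is then just six such planes evaluated in sequence.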

As with any technique, there are multiple possibilities for further optimization. I will not cover hardware-level GPU optimizations here, because I have not yet researched this area thoroughly and it is already a long article. My general direction would be to first profile for any obvious bottlenecks. Without having done so, I imagine that memory latency could be improved by somehow using shared memory for the grid-based lookups. This is no longer as trivial as with the N-body simulation, because a thread group can span an arbitrary number of different grid cells at once. Apart from that, we should always aim to stay in cache wherever we can, so deinterleaving less frequently used particle properties might be a good idea. Also, watch out for bank conflicts when writing to (shared) memory: align your data to the bank size, and interleave properties that are frequently used together into a structure. Could wave intrinsics speed up certain parts of the simulation? I have some ideas I still haven’t tested out.

You can see that there is more room for additional optimizations, and I still haven’t talked about rendering the fluid. These are very good candidates for future blog posts, so stay tuned!

As always, you are free to pull my implementation from my open source engine:

Also, my older article describes a general-purpose GPU particle system, which is a good candidate to fit a fluid simulator into, or simply to build a cool large-scale particle system.

If you find any mistakes or have any questions, let me know in the comments!

UPDATE: There was a divide-by-zero bug, which is now fixed. In the force computation step, we must check that the distance between two particles is greater than zero; otherwise we must skip computing the SPH forces between those two particles. It sometimes manifested as a flickering anomaly that looked more like a synchronization issue. Sorry for the inconvenience! I marked the occurrences in the code samples with **
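The guard in question amounts to something like the following in the force pass (a sketch only; the lines marked with ** in the actual code samples are the authoritative fix):

```hlsl
// Force pass, for a neighbour pair (particleA, particleB):
float3 diff = particleA.position - particleB.position;
float r2 = dot(diff, diff);
float r = sqrt(r2);

if (r > 0 && r < h) // ** skip coincident particles to avoid dividing by zero
{
  float3 rNorm = diff / r; // safe to normalize now that r > 0
  // ... pressure and viscosity force terms using rNorm and r ...
}
```

Two particles landing on the exact same position produces r == 0, and the kernel gradient direction diff / r becomes NaN without this check, which then propagates through the whole simulation.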

[Download sample model] for [Wicked Engine Editor]