Entity-component system

Here goes my idea of an entity-component system written in C++. I’ve been using this in my home-made game engine, Wicked Engine for exactly a year now and I am still very happy with it. The focus is on simplicity and performance, not adding many features.

This entity-component system (ECS from now on) is used to manage the game scene. That means storing game objects in memory, updating and rendering them, writing them to disk and loading them from disk to memory. The simple ECS that can handle these consists of:

  • Entity: a unique identifier (a number). Imagine it like an ID in a database.
  • Component: a data blob (a struct). This is the data that is associated with an entity. It is up to the user to create the data type, by creating a struct or a class in C++.
  • System: data transformation (a function). This is any function that operates on a number of components. The user will write simple functions that take components as inputs and produce components as outputs. It is just any C++ function, really.
  • ComponentManager: entity to component mapping container. Each entity can have one component held in it. The user can check if an entity and component exists, and query an entity’s component. This is the thing that we will implement here and give to the user as a tool to work with.

The ECS will not only be simple, but performance is a big concern. We will want to have a lot of game objects, and manage them as fast as we can. Today, memory access is often the slow part of an application. The best way to avoid fetching memory and waiting for it to arrive, is to make good use of the cache and load from cache as often as we can. By keeping data in linear arrays, we are in a very good position, because most of the time we will iterate through every component and do something with them. The CPU is good at this when data is accessed linearly and memory can be prefetched into cache ahead of time. An other nice feature is that linear arrays will lend themselves very well to multithreading; you can use parallel for loops very simply and efficiently to process a large number of data. The ComponentManager will be responsible to manage components in a linear fashion. This is also called in many places Data Oriented Design (DoD).

Let’s look at an example first, what kind of code the user of the ECS will need to write:

// Create a new unique entity:
Entity entity = CreateEntity();

// A component:
struct Position
  float x, y, z;

// ComponentManager to hold components:
ComponentManager<Position> positions;

// Check whether the entity has a position component:
positions.Contains(entity); // returns false

// Create a new Position component for the entity:
Position& position = positions.Create(entity);

// Check whether the entity has a position component:
positions.Contains(entity); // returns true now

// A system that moves every entities' position up by one:
void UpdatePositionsSystem(ComponentManager<Position>& positions)
  for(size_t i = 0; i < position.GetCount(); ++i)
    positions[i].y += 1;

Apart from the simple example above, the ECS will have some other functionality too, but that comes later, when we need it. Let’s implement what we have so far, one by one.


First, we have the CreateEntity(). This should give us a unique number identifier. The Entity is just a typedef for a 32-bit unsigned int:

#include <cstdint>
typedef uint32_t Entity;

In the simplest implementation for CreateEntity(), you could give a monotonically increasing number:

Entity CreateEntity()
  static Entity next = 0;
  return ++next;

However, that comes with a lot of issues if you will ever need to serialize the entities, write them to disk and load later. Just imagine starting your program, creating some entities, saving them, then close the program, start again, create some entities (starts from one) then loading the previous entities (those that also started from one). Your entities will no longer be unique in this case. We will address this later, in the Serialization part, for now you can use this, then later replace this function if you are interested in serialization.

By the way, we start from one, because we reserve zero as an invalid entity handle. This is useful, for example giving default value to entities that have not yet been created:

static const Entity INVALID_ENTITY = 0;

This is it for the Entity for now. The crux of the implementation will be after this, the ComponentManager.


The ComponentManager is a container that we will implement. It will be storing components and entities in linear arrays, for the previously mentioned performance reasons. By storing both components and entities, we have the ability to query a component’s entity or an entity’s component for a specific index position inside the ComponentManager. Very useful if we just iterate through all components and want to look up into an other type of ComponentManager for a component’s entity. Speaking of lookups, sometimes we want to just look up a specific entity’s component, without iteration through the whole array, for that a hash map will be perfect. This ComponentManager can be templated based on component type, so it can look sort of like a C++ standard container:

#include <vector>
#include <unordered_map>

template<typename Component>
class ComponentManager
  std::vector<Component> components;
  std::vector<Entity> entities;
  std::unordered_map<Entity, size_t> lookup;

We then start implementing the functionality for our previous example. With a Contains() method, we can check if an entity has this kind of component or not:

bool Contains(Entity entity) const
  return lookup.find(entity) != lookup.end();

Of course, first, we should be able to create components for entities with a Create() function:

Component& Create(Entity entity)
  // INVALID_ENTITY is not allowed!
  assert(entity != INVALID_ENTITY);

  // Only one of this component type per entity is allowed!
  assert(lookup.find(entity) == lookup.end());

  // Entity count must always be the same as the number of components!
  assert(entities.size() == components.size());
  assert(lookup.size() == components.size());

  // Update the entity lookup table:
  lookup[entity] = components.size();

  // New components are always pushed to the end:

  // Also push corresponding entity:

  return components.back();

I like to put in a lot of asserts wherever I can, so a debug version of the program will catch any issues. For simplicity’s sake, I like the limitation that an entity can have only one components of the same kind. If the user would ever want more components, more ComponentManager instances can be used, but I never wanted to have more than one component of the same kind for an entity.

To iterate through every component (or entity, because there is the same amount of components and entities), we can query the amount of components/entities:

size_t GetCount() const { return components.size(); }

Then to iterate in a linear fashion, we can get a component reference from the linear array by a size_t indexer and the [] operator:

Component& operator[](size_t index) { return components[index]; }

All of these methods are a part of the templated ComponentManager class, so they reside in the header file. In fact, all our implementation will be in the same header file as part of the ComponentManager class. So far, this is enough to use the simple example that I written above. However, this is not enough functionality. Next, I will give some other useful functions that the ComponentManager should have and use cases for them.

For example, I often iterate through components and use the [] operator to index a component. But what if I need the Entity for the same component? It is not required that the component data struct contain an Entity reference. Instead, the ComponentManager can give the corresponding Entity back to the user:

Entity GetEntity(size_t index) const { return entities[index]; }

The use case is if we have different components for one entity, and one component wants to look at an other component’s data:

ComponentManager<Position> positions;
// ... continued from previous example, entity has a position now...

struct Mass
 float value;
ComponentManager<Mass> masses;

for(size_t i = 0; i < positions.GetCount(); ++i)
  Position& position = positions[i];
  Entity entity = positions.GetEntity(i);
  Mass* mass = masses.GetComponent(entity);
  if(mass != nullptr)
    position.y += 1.0f / mass->value;

In the above example, in order to update the position, we need to get access to the mass, but that is stored in an other component. We check if the entity has a mass, and if it has, we update the position according to the mass. The GetComponent() function also needs to be implemented now:

Component* GetComponent(Entity entity)
  auto it = lookup.find(entity);
  if (it != lookup.end())
    return &components[it->second];
  return nullptr;

As you can see, this will return the pointer to a component if it exists, nullptr otherwise. This performs a hash map lookup, that is slower than linear access indexing, but we won’t use it as frequently. Or at least, we shouldn’t! We should lay out our data that map lookups like this will be minimized. It is often better to duplicate data in an ECS and keep the cache warm.

What about deleting game objects? This can be a frequently used functionality of the ECS, so we must implement it. We can remove the last element of a linear array trivially, we just decrease the array size by one (std::vector’s pop() function) and destroy the last object. Removing from the middle of the array is a bit more tricky. We replace the element to be removed with the last element in the array, and decrease the array’s size. This is done by the Remove() function (duh):

void Remove(Entity entity)
  auto it = lookup.find(entity);
  if (it != lookup.end())
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
      // Swap out the dead element with the last one:
      components[index] = std::move(components.back()); // try to use move
      entities[index] = entities.back();

      // Update the lookup table:
      lookup[entities[index]] = index;

    // Shrink the container:

As you can see, we also have to keep the lookup table in sync with the entity and component arrays. I also try to use C++ move semantics and avoid copying component data if possible. It will be possible if you use only default constructors and destructors, or write move constructors for your components yourself. Otherwise, copy will be used.

If you remove an element like this, it can be problematic however, if you rely on component ordering. There was at least one case where I relied on the order. However, so far we have a pretty well usable ECS already. The following sections of the blog will be about two bigger topics:

  • Scene graph hierarchy: the problem is that we rely on component ordering in the ECS hierarchy traversal
  • Serialization: the problem is that Entity uniqueness must persist after serialization/deserialization.

If you are interested in the above topics, stick around. 🙂

Scene Graph

For me, this was actually the most interesting part of implementing the ECS because this had the most question marks in my head about. In a usual pointer based scene graph tree implementation, we can just use recursion and traverse all tree nodes starting from the root node trivially. With the ECS implementation, this is done a bit differently. We have components here, stored in linear arrays, and we can’t use pointers to index components, because they can move around in the ComponentManager when they are being added and removed. So how to do it?

The basic scene graph is a tree of nodes, where each node has a spatial position, and all its children’s positions will be relative to their parent. This means, that we can have:

  • TransformComponent: position, scale, rotation, world matrix
  • HierarchyComponent: entity’s parent, parent’s inverse world matrix at the time of attaching

With these two components, and iterating components linearly from first to last, we are able to represent hierarchy in a scene. There is one other missing piece: we will rely on the order of components in the linear arrays. Because children can refer to their single parent, but parents have no knowledge of their (possibly many) children, parents must be placed before their children in the component arrays, so that when a child computes its transform world matrix, the parent is already up to date because it was computed earlier.

For example, let’s say that we have an Attach() function:

void Attach(Entity entity, Entity parent);

This function will operate on the HierarchyComponent array and add a new component at the end, that will be the child:

struct HierarchyComponent
  Entity parentID = INVALID_ENTITY;
  Matrix world_parent_inverse_bind;
ComponentManager<HierarchyComponent> hierarchy;

void Attach(Entity entity, Entity parent)
  hierarchy.Create(entity).parentID = parent;
  // ... to be continued

However, the child we are adding might have been already a parent of other children. In that case, we want to search for a component whose parent is the current entity which we are attaching and move it before it.

// ...continued

for (size_t i = 0; i < hierarchy.GetCount(); ++i)
  const HierarchyComponent& child = hierarchy[i];
  if (child.parentID == entity)
    hierarchy.MoveItem(hierarchy.GetCount() - 1, i);

But that’s not sufficient as someone pointed out in the comments, because if a whole sub-tree is reattached, not only the parent can get in the wrong place, but all children of that, so we have to carefully check every entity and move them if they are actually a parent of some component that’s before them. So replace the previous block with this more complicated one instead:

// ...continued

if (hierarchy.GetCount() > 1)
  for (size_t i = hierarchy.GetCount() - 1; i > 0; --i)
    Entity parent_candidate_entity = hierarchy.GetEntity(i);
    const HierarchyComponent& parent_candidate = hierarchy[i];
    for (size_t j = 0; j < i; ++j)
      const HierarchyComponent& child_candidate = hierarchy[j];

      if (child_candidate.parentID == parent_candidate_entity)
        hierarchy.MoveItem(i, j);
        ++i; // next outer iteration will check the same index again as parent candidate, however things were moved upwards, so it will be a different entity!

Explanation: We start from the end of the hierarchy (from the component we just added), and check if any previous components are its children or not. If we find a child before it, we move it before the child (but keep the ordering of all components after the child intact). We move towards the beginning of the hierarchy chain and for every component, check if there is a child before them and move if necessary.

The MoveItem() is a new method for the ComponentManager class. It will move an element to a specified index inside the array and moves all elements that are in the way by one position:

void MoveItem(size_t index_from, size_t index_to)
  assert(index_from < GetCount());
  assert(index_to < GetCount());
  if (index_from == index_to)

  // Save the moved component and entity:
  Component component = std::move(components[index_from]);
  Entity entity = entities[index_from];

  // Every other entity-component that's in the way gets moved by one and lut is kept updated:
  const int direction = index_from < index_to ? 1 : -1;
  for (size_t i = index_from; i != index_to; i += direction)
    const size_t next = i + direction;
    components[i] = std::move(components[next]);
    entities[i] = entities[next];
    lookup[entities[i]] = i;

  // Saved entity-component moved to the required position:
  components[index_to] = std::move(component);
  entities[index_to] = entity;
  lookup[entity] = index_to;

That could have been done in a different way, by allocating a temporary memory and moving all the prior elements in one go instead of one by one. The current solution instead avoids extra memory allocations.

After calling MoveItem() in the Attach() function, we are not sure anymore that references remained intact, so we query the currently added child:

HierarchyComponent& parentcomponent = *hierarchy.GetComponent(entity);

Then we continue to query the TransformComponent of the parent (create it if it doesn’t exist):

TransformComponent* transform_parent = transforms.GetComponent(parent);
if (transform_parent == nullptr)
  transform_parent = &transforms.Create(parent);
// Save the parent's inverse worldmatrix:
parentcomponent.world_parent_inverse_bind = MatrixInverse(transform_parent->worldmatrix);

This is it for the Attach() function. But sometimes, we also detach children from parents. The Detach function() could be like this:

void Scene::Component_Detach(Entity entity)
  const HierarchyComponent* parent = hierarchy.GetComponent(entity);

  if (parent != nullptr)
    TransformComponent* transform = transforms.GetComponent(entity);
    if (transform != nullptr)

So we look into the hierarchy array, if the entity is in it, we detach it by applying transformation (the current position, rotation, scale is converted to world matrix) and then removing the component from the hierarchy. Again, we need to remove it while keping the ordering sorted. For that, a Remove_KeepSorted() function was implemented:

void Remove_KeepSorted(Entity entity)
  auto it = lookup.find(entity);
  if (it != lookup.end())
    // Directly index into components and entities array:
    const size_t index = it->second;
    const Entity entity = entities[index];

    if (index < components.size() - 1)
      // Move every component left by one that is after this element:
      for (size_t i = index + 1; i < components.size(); ++i)
        components[i - 1] = std::move(components[i]);
      // Move every entity left by one that is after this element:
      for (size_t i = index + 1; i < entities.size(); ++i)
        entities[i - 1] = entities[i];
        lookup[entities[i - 1]] = i - 1;

    // Shrink the container:

So we remove an element from the middle, but instead of swapping in the last one, we move every element after this one to the left, so ordering is preserved.

So we can attach, detach components now. All of it makes sense when we iterate the hierarchy and update transforms in the right order. This is the Hierarchy Update System:

void RunHierarchyUpdateSystem(
  const ComponentManager<HierarchyComponent>& hierarchy,
  ComponentManager<TransformComponent>& transforms,
  for (size_t i = 0; i < hierarchy.GetCount(); ++i)
    const HierarchyComponent& parentcomponent = hierarchy[i];
    Entity entity = hierarchy.GetEntity(i);

    TransformComponent* transform_child = transforms.GetComponent(entity);
    TransformComponent* transform_parent = transforms.GetComponent(parentcomponent.parentID);
    if (transform_child != nullptr && transform_parent != nullptr)
      transform_child->UpdateTransform_Parented(*transform_parent, parentcomponent.world_parent_inverse_bind);

We iterate the hierarchy components, but also look up from transforms, and use the parent’s inverse bind matrix when updating the transform matrix.

This concludes the basic implementation of the ECS scene graph traversal. The Attach and Detach functions are more heavy than with a pointer based approach, but the traversal itself which is the commonly executed path, should be faster, if we have a flat hierarchy (eg. one parent has many children). If the hierarchy is more deep however, the pointer based tree traversal can beat the ECS implementation. This needs more investigation from me, but I was happy with keeping only the ECS hierarchy path for everything, instead of having also a pointer based scene graph, for example skeleton bones, which usually have deep hierarchy.

A closing side note: Even though hierarchy is contained in a linear array now, multithreading is still not possible with a parallel for loop in this specific case. Because the order of the components matter, the execution order also matters. You can still find ways to parallelize, for example by doing the whole hierarchy update on a separate thread, while also doing the material update system or other kind of unrelated systems in parallel with it.


One of my main objectives was to efficiently serialize and deserialize the whole entity component system in the engine. The ComponentManager lends itself nicely for this with its linear arrays. We can just easily dump the array contents (entity array and component array) into a file and write them to disk. Loading them is also very convenient. There is only one issue. Entity serialization needs some extra care to keep them unique. We can no longer use simple monotonically increasing IDs for Entities. Instead, let’s generate random numbers:

#include <random>

std::random_device rand_dev;
std::mt19937 generator(rand_dev());

Entity CreateEntity()
  std::uniform_int_distribution<int>  distr(1, INT_MAX);
  return distr(generator);

The above code uses C++ random generator to generate uniformly distributed random numbers between 1 and INT_MAX. That range is enough so that we are unlikely to get collisions very soon. But! there is an other problem as well: What if we want to load the same entity collection twice? Imagine that we serialized a model and saved to the disk. We want to load the model twice into the scene. The Entity numbers were randomized, but if we load them twice, they will no longer be unique!

I use the following solution: have a specific serialization post-process for Entities. We combine them with a random seed that is unique per deserialization pass. So instead of simply deserializing Entity like this:

Archive archive;
// ...
archive >> entity;

We use a slightly more complicated method:

uint32_t seed = get_random_number_once_per_serialization();
// ...
archive >> entity;
if (entity != INVALID_ENTITY && seed > 0)
  entity = ((entity << 1) ^ seed) >> 1;

Of course, I just combined that into a SerializeEntity(Entity entity, uint32_t seed); function. So I just make sure to always serialize Entities with that function.

Sometimes, you don’t want to process the entity IDs when serializing, for example when you don’t save them to disk. In that case, you can leave the seed as zero and entity IDs will remain unmodified.

I also gave the ComponentManager the ability to serialize itself, by having a Serialize() function. That will also accept the Entity seed. However, we must look out that in some cases, components can hold Enity members to reference something. This is the case with HierarchyComponent as discussed before, but other cases, like a Mesh Instance holding the Mesh Entity ID are possible too. If the ComponentManager’s Serialize function is used, then we must ensure that the Component array gets serialized and the components’ Entity reference members are seeded correctly. Thus, a Serialize(Archive archive, uint32_t seed) function will be necessary for Components in this case.

Well, it was a very long write up, I hope someone finds it useful. At least I kind of documented this thing for myself, so that’s already worth it.

My implementation of the ECS can be found here. It contains some additional functionality that was out of scope for this blog. The main use case is Scene management, most of the systems and components are implemented in the Scene System.

Let me know in the comments if you find mistakes or have questions, or just simply want to discuss something!

Improved normal reconstruction from depth

In a 3D renderer, we might want to read the scene normal vectors at some point, for example post processes. We can write them out using MRT – multiple render target outputs from the object rendering shaders and write the surface normals to a texture. But that normal map texture usually contains normals that have normal map perturbation applied on them, which some algorithms don’t like. Algorithms that need to ray trace/ray march from the surface in multiple directions around the normal direction are not good fit with these normal mapped textures, because there is a disconnect between what is in the normal map and what is the real surface triangle normal. SSAO for example falls into this category. We want to sample rays in the normal-oriented hemisphere from the surface and check those rays against the depth buffer. The depth buffer however, contains flat triangle geometry, that doesn’t include the normal map sufrace perturbation. Because of this, rays can go below the surface that the ray is started from, and that will result in some unwanted darkening.

We could write out the flat surface normals with MRT, that would give us the perfect solution. But we probably don’t want to have yet an other render target to write in our object shader pass, to conserve memory bandwidth.

There are ways to reconstruct the flat surface normals from the depth buffer. The idea is that we want to find the normal of the triangle that the depth buffer pixel belongs to. We can find out the slope of the triangle by first computing the pixel world position from the depth buffer, than using ddx(), ddy() and cross() to determine the normal direction:

float depth = texture_depth.SampleLevel(sampler_point_clamp, uv, 0).r;
float3 P = reconstructPosition(uv, depth, InverseViewProjection);
float3 normal = normalize(cross(ddx(P), ddy(P)));

By the way, there are multiple ways to reconstruct world position from the depth buffer. One of those ways is by using the native depth buffer and the inverse view projection matrix that was used to render that depth:

float3 reconstructPosition(in float2 uv, in float z, in float4x4 InvVP)
  float x = uv.x * 2.0f - 1.0f;
  float y = (1.0 - uv.y) * 2.0f - 1.0f;
  float4 position_s = float4(x, y, z, 1.0f);
  float4 position_v = mul(InvVP, position_s);
  return position_v.xyz / position_v.w;

This solution to compute normals is only usable in a pixel shader. In a compute shader you don’t have access to ddx() and ddy() functions. What you can do instead is to compute world positions of nearby pixels, and use them as triangle corners. If you have the triangle corners, you can use cross() to compute the triangle normal, which gives you the flat normal of the current pixel:

float2 uv0 = uv; // center
float2 uv1 = uv + float2(1, 0) / depth_dimensions; // right 
float2 uv1 = uv + float2(0, 1) / depth_dimensions; // top

float depth0 = texture_depth.SampleLevel(sampler_point_clamp, uv0, 0).r;
float depth1 = texture_depth.SampleLevel(sampler_point_clamp, uv1, 0).r;
float depth2 = texture_depth.SampleLevel(sampler_point_clamp, uv2, 0).r;

float3 P0 = reconstructPosition(uv0, depth0, InverseViewProjection);
float3 P1 = reconstructPosition(uv1, depth1, InverseViewProjection);
float3 P2 = reconstructPosition(uv2, depth2, InverseViewProjection);

float3 normal = normalize(cross(P2 - P0, P1 - P0));

At this point, you might call it a day and use your flat normals. However, you might also notice some edge cases – both of these techniques literally fail on object edges. Even though different object might be at nearby pixels, we are only naively using the depth buffer and make the assumption that nearby samples are forming actual triangles that exist in the scene.

The image above is a screenshot of the reconstructed normal buffer of the Cornell Box scene. Most notably, the edges of the boxes and the silhouette of the bunny became outlined, but that shouldn’t be the case. If we zoom in we see better:

If you use these flat normals, you can get strong artifacts with SSAO on the object edges:

Notice the left edge of the boxes and the bunny. Those normals caused havoc on those parts. Interestingly, the right edges for example don’t show the artifact. But that depends on which nearby samples you select and reconstruct the triangle corners from, you could select triangles that would fail on right edges and not fail on left edges. In more complicated scenes however, with more color variety and not so huge edges, these artifacts could remain hidden.

So what’s the solution? We should select the triangle that will be most correct for our center sample. The most correct triangle is the one whose corners are most probably on the same object as the center sample. We can approximate that by for example comparing depths of nearby samples to the center sample and picking the ones that are closest, and forming a triangle from those.

An effective way to do this is to sample the pixel neighbors in a cross pattern. That will give us the center sample, the left from it, the one above it, the one on the right and finally the one below it. From these five samples, four triangles can be formed that we could be interested in:

We will select the best triangle corner of the horizontal cross axis, and one best corner from the vertical cross axis by comparing their depth values to the center sample and selecting the closest one. For the base triangle corner, we will always use the center sample. When we decided on the best P0, P1 and P2 triangle corners, use the cross product just like before to determine the triangle normal:

float3 normal = normalize(cross(P2 - P0, P1 - P0));

Watch out however for the triangle winding order. If you assume counter clockwise order like me, then stick with it for all candidate triangles! That means, triangles in the cross pattern can be formed like this for counter clockwise ordering:

triangle 0 = P0: center, P1: right, P2: up
triangle 1 = P0: center, P1: down, P2: right
triangle 2 = P0: center, P1: up, P2: left
triangle 3 = P0: center, P1: left, P2: down

And with this technique, those edge artifacts are gone:

Now we are taking a lot of samples per pixel and also doing complicated math for all neighbors (position reconstruction) and branching in the shader, so it will introduce perfromance penalty. The good thing is that it can be optimized more in a compute shader, with group shared memory. We can store the depths and reconstructed positions for a whole 8×8 pixel tile in shared memory efficiently. We will also have an extra border of pixels stored around the tile. Additionally, we can avoid storing depths and reconstructed positions separately, and instead just store reconstructed positions in view space – the view space positions’ Z value stores linear distance, which we can use instead of depth to determine the best triangle candidate. I will demonstrate the full technique in HLSL shader code.

First, we create the group shared memory for the 8×8 pixel tile + one pixel border:

static const uint POSTPROCESS_BLOCKSIZE = 8; // 8x8 pixel tile
static const uint TILE_BORDER = 1;
groupshared float2 tile_XY[TILE_SIZE*TILE_SIZE]; // view space position XY
groupshared float tile_Z[TILE_SIZE*TILE_SIZE]; // view space position Z

The groupshared memory is storing view space positions. I separated the Z values into a deinterleaved array, because those will be loaded more frequently than XY components, when we determine the best corner points.

The compute shader is running in a 8×8 block size:

void main(uint3 DTid : SV_DispatchThreadID, uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID, uint groupIndex : SV_GroupIndex)
  // ... shader body continues here...

First thing, we will preload the tile shared memory with the border region with view space positions:

const int2 tile_upperleft = Gid.xy * POSTPROCESS_BLOCKSIZE - TILE_BORDER;
  const uint2 pixel = tile_upperleft + unflatten2D(t, TILE_SIZE);
  const float2 uv = (pixel + 0.5f) * xPPResolution_rcp;
  const float depth = texture_depth.SampleLevel(sampler_linear_clamp, uv,0);
  const float3 position = reconstructPosition(uv, depth, InverseProjection);
  tile_XY[t] = position.xy;
  tile_Z[t] = position.z;

Some things to note from the above code block:
xPPResolution_rcp is a shader constant, containing the postprocess texture resolution reciprocal. That is: 1.0f / resolution.xy
– I am using InverseProjection matrix for the reconstructPosition. That will give the view space position, instead of world space position if I were using InverseViewProjection matrix.
– The unflatten2D() function converts 1D array index to 2D array index, and looks like this:

uint2 unflatten2D(uint idx, uint2 dim)
  return uint2(idx % dim.x, idx / dim.x);

We must synchronize at the end with GroupMemoryBarrierWithGroupSync() to wait all threads to finish populating its part of the group shared memory.

Next thing is to load the depth/view space Z values in the cross pattern and perform the comparisons:

const uint cross_idx[5] = {
  flatten2D(TILE_BORDER + GTid.xy, TILE_SIZE),			// 0: center
  flatten2D(TILE_BORDER + GTid.xy + int2(1, 0), TILE_SIZE),	// 1: right
  flatten2D(TILE_BORDER + GTid.xy + int2(-1, 0), TILE_SIZE),	// 2: left
  flatten2D(TILE_BORDER + GTid.xy + int2(0, 1), TILE_SIZE),	// 3: down
  flatten2D(TILE_BORDER + GTid.xy + int2(0, -1), TILE_SIZE),	// 4: up

const float center_Z = tile_Z[cross_idx[0]];

const uint best_Z_horizontal = abs(tile_Z[cross_idx[1]] - center_Z) < abs(tile_Z[cross_idx[2]] - center_Z) ? 1 : 2;
const uint best_Z_vertical = abs(tile_Z[cross_idx[3]] - center_Z) < abs(tile_Z[cross_idx[4]] - center_Z) ? 3 : 4;

Here, the flatten2D() function is resposible for converting from 2D array index to 1D array index. It can be implemented like this:

uint flatten2D(uint2 coord, uint2 dim)
  return coord.x + coord.y * dim.x;

At the end of this, we will know which neighbor is the best in horizontal direction (best_Z_horizontal) and which is best in vertical direction (best_Z_vertical). The horizontal can be either 1 (right) or 2 (left), while the vertical can be 3 (down) or 4 (up).

Finally, determine the best triangle corners with branching and create the triangle normal:

float3 P1 = 0, P2 = 0;
if (best_Z_horizontal == 1 && best_Z_vertical == 4)
  P1 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
  P2 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
else if (best_Z_horizontal == 1 && best_Z_vertical == 3)
  P1 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);
  P2 = float3(tile_XY[cross_idx[1]], tile_Z[cross_idx[1]]);
else if (best_Z_horizontal == 2 && best_Z_vertical == 4)
  P1 = float3(tile_XY[cross_idx[4]], tile_Z[cross_idx[4]]);
  P2 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
else if (best_Z_horizontal == 2 && best_Z_vertical == 3)
  P1 = float3(tile_XY[cross_idx[2]], tile_Z[cross_idx[2]]);
  P2 = float3(tile_XY[cross_idx[3]], tile_Z[cross_idx[3]]);

const float3 P0 = float3(tile_XY[cross_idx[0]], tile_Z[cross_idx[0]]);
const float3 normal = normalize(cross(P2 - P0, P1 - P0));

Of course, computing the normals in view space means that you will either need to convert them to world-space, or do the SSAO in view space. I chose to do the rest of the shader in view space instead of converting, and reuse the view space position and normal for shooting the rays in view space.

And the normal buffer result is – although not perfect – is closer to what we want. There are still some errors along the edges, but in a reduced amount, and they don’t interfere with the SSAO:

However, I’ve had a slight mistake here, because the SSAO postprocess resolution is half of the depth buffer resolution. When I match the postprocess resolution with the depth buffer resolution, the normal reconstruction is actually nearly perfect (there are a few rogue pixels here and there on sharp corners, but you have to look for them really hard):

And this is with matching resolution, but not using the improvement, the outline is still present and will mess up the SSAO:

However, I don’t have to match their resolution in practice, because the SSAO results were already good with the improved reconstruction. At the end of the day, implementing the reconstruction improvement was worth it.

Well, that’s it, I hope you found it useful. I think it was definitely useful for me at least to document this for myself for future reference. 🙂

Take a look at SSAO implementation in Wicked Engine using this technique and feel free to use it for whatever you want.

I also saw this improvement technique described in a stackoverflow answer.

Mistakes, typos, questions? Let me know in the comments.

Thoughts on light culling: stream compaction vs flat bit arrays

I had my eyes set on the light culling using flat bit arrays technique for a long time now and finally decided to give it a try. Let me share my notes on why it seemed so interesting and why I replaced the stream compaction technique with this. I will describe both techniques and make comparisons between them. Wall of text incoming with occasional code pieces, in brain dump style.

(For this post, I considered a layout where the comparison between the two techniques for each point follow each other immediately, but decided against it, and instead there are two big blocks: stream compaction technique, then the flat bit array technique following it. The points are in similar order for both, so comparison could be easily made by having two browser windows opened for example. It wouldn’t be so simple to do the opposite thing.)

Stream compaction

Idea is based on Johan Andresson’s 2011 GDC talk: DirectX 11 rendering in Battlefield 3

Culling phase:

  • Each tile contains list of lights. Each entry is one uint that is an index into light array. So each tile contains max_lightcount_per_tile number of uints. For me it was 256 entries per tile, and decided at compile time.
    • Possible to compact even further in a way that the average light count per tile is specified (instead of max) when allocating tile buffer and have one more level of indirection. The other indirection here is introducing a “light grid” buffer/texture that would contain the offset and count into the “light index buffer”, while the “light index buffer” contains the light indices in a very compact stream (next tile list’s first entry immediately follows the last light index in the previous tile). This way one tile can contain an unbounded number of lights (well, still bounded on the global tile buffer allocation size). The construction of such structure is complicated and not worth the trouble in my experience (it uses even more atomics). You can see such technique presented in [this other blog]. I have implemented this technique, but in the end I chose to keep the hard coded max items per tile and have one less level of indirection.
  • Culling shader performs stream compaction, that is, using a light index list in groupshared memory and an atomic counter to append lights to the list. Something like this:
groupshared uint ArrayLength = 0;
groupshared uint Array[256];

void AppendLight(uint lightIndex)
    uint index;
    InterlockedAdd(ArrayLength, 1, index);
    if (index < 256)
        Array[index] = lightIndex;
  • Stream compaction can be scalarized with wave intrinsics so that 64x less amount of atomic operations are performed. The idea here is that we will have a per-wavefront bitmask containing set bits for all lanes that wanted to append. This is retrieved by WaveActiveBallot(IsLightCulledInCurrentThread()). This bitmask first gives us how many lanes wanted to append (countbits() function), and the first lane does an InterlockedAdd() with this value to the groupshared light counter. The rest of the lanes don’t do the atomic, so atomics are significantly reduced. The first lane’s result of the InterlockedAdd() returns the wavefront’s offset to the groupshared light index list and broadcasted to the other lanes via WaveReadLaneFirst() intrinsic function. The rest of the lanes need to just write their light index to the list, starting from the broadcasted offset, and doing a prefix sum on the ballot bitmask (WavePrefixSum()) that will give the real offset to the groupshared light index list.
  • Stream compaction is unordered because the item append order relies on one atomic counter. That is fine for lights which use additive blending where the order of operations doesn’t matter for the final output. But once we have decals or environment probes, those are not blended additively, so the undefined order of the items would result in flickering between tiles. To avoid that, we can do a bitonic sort in the compute shader. Doing a bitonic sort in the compute shader is a heavy weight operation, but we can keep track if there were any decals or probes in the tile and only do the sorting if there are. A bitonic sort that runs in a single thread group looks like this (from Gareth Thomas’s tile based particle rendering talk):
// Parameter is SV_GroupIndex
void BitonicSort( in uint localIdxFlattened )
    uint numArray = ArrayLength;
    uint numArrayPowerOfTwo = 2 << firstbithigh(numArray - 1);

    for( uint nMergeSize = 2; nMergeSize <= numArrayPowerOfTwo; nMergeSize = nMergeSize * 2 )
        for( uint nMergeSubSize = nMergeSize >> 1; nMergeSubSize > 0; nMergeSubSize = nMergeSubSize >> 1 )
            uint tmp_index = localIdxFlattened;
            uint index_low = tmp_index & ( nMergeSubSize - 1 );
            uint index_high = 2 * ( tmp_index - index_low );
            uint index = index_high + index_low;

            uint nSwapElem = nMergeSubSize == nMergeSize >> 1 ? index_high + ( 2 * nMergeSubSize - 1 ) - index_low : index_high + nMergeSubSize + index_low;
            if( nSwapElem < numArray && index < numArray )
                if( Array[ index ] < Array[ nSwapElem ] )
                    uint uTemp = Array[ index ];
                    Array[ index ] = Array[ nSwapElem ];
                    Array[ nSwapElem ] = uTemp;
  • The best culling granularity in my experience turned out to be having 16×16 pixel tiles and 16×16 threads in the culling threadgroup. A large thread group is beneficial, because we will have a large groupshared memory reservation, and the bitonic sort works best with a large thread count. Remember that even though we are only interested in the decal and probe ordering, we have to sort the whole entity array, including lights, because lights are in the same array as decals and probes. A large thread group and large groupshared memory usage means reduced occupancy of the shader, so parallel execution between thread groups is reduced.
  • We are still only in the culling shader phase, and an other difference is how the groupshared memory (also called LDS or local data share) is initialized at the beginning. The depth range min-max and the 2.5D culling mask are initialized, and the counter that is responsible to insert the light indices into the array is zeroed out. The LDS light array doesn’t need to be initialized.
  • At the end of the culling shader, we export the light list by having each thread export one light index. Also, the first thread exports a “tile statistics” uint that says how long is the list, how many decals,probes and lights are inside it.

Rendering phase:

  • We process the light list in the forward+ object shader or the tiled deferred compute shader trivially. We read the “tile statistics” uint first, and determine the decals count, probe count and light count in the tile. The following uints are indices into the global entity array that the camera can see, so just iterate through all of them until we reach the last one. Having the decal and probe count is handy, because for some objects, or particles, etc. we don’t want to apply decals, so we can just skip through them and start with the first light index instead. Also, if we process decals, and the top decal fully covers the ones under it, we can skip processing them and continue with the lights.
  • We have high divergence if we process like this inside a pixel shader (forward+), but we can help a bit by using wave intrinsics to scalarize it. It is not trivial and in the worst case it can lead to redundant processing of items. See more info in Drobot’s presentation, the section about “hierarchical container scalarization”.
  • The good thing about processing a compact light index list is that it is compact, so every light index that we process will be contributing to the tile.
Light culling in Sponza

Flat bit arrays

Idea from Michal Drobot’s 2017 Siggraph presentation: Improved culling for tiled and clustered rendering

Culling phase:

  • Each tile contains bit buckets. One bit bucket is a 32-bit uint, that is responsible to index up to 32 lights. So each tile contains max_lightcount_in_view/32 number of uints. That means that for the same amount of max lights per tile as stream compaction method, we use 32x less memory.
  • Culling shader simply computes which bucket can index the currently processed light, then computes which bit maps to the light index, then performs atomic OR operation to the groupshared light bucket list. The groupshared light bucket list is max_lightcount_in_view/32 number of units. The following code piece demonstrates this:
groupshared uint tile[256 / 32];

void AppendLight(uint lightIndex)
    const uint bucket_index = entityIndex / 32;
    const uint bucket_place = entityIndex % 32;
    InterlockedOr(tile[bucket_index], 1 << bucket_place);
  • Bitmask insertion could probably be scalarized in a way that no atomic operations would be necessary. I imagine we would just do a WaveActiveBitOr() instead of InterlockedOr() on the bucket, and if it is the first lane (WaveIsFirstLane()) it would just write the bucket without using atomics. (unfortunately I can’t get sm6.0 working on my current laptop, so I haven’t tried this yet)
  • Bitmask insertion is ordered, we will always get a deterministic order of items, which is the same order as the CPU assembled the global light array. If we have decals and environment probes, we only ensure that they are culled and added to the global light list (or should we say entity list at this point) before the lights (on the CPU). The reason is, because we want to apply decals before applying lights, so that they are lighted just like the surface underneath. Environment probes can be applied before or after the lights, because they will be added to the specular result. However, ordering between locally placed probes does matter, because they are blended amongst themselves with alpha blending. So in the end, we don’t have to sort in the culling shader with the flat bit array technique. That is a big relief and greatly simplifies the shader.
  • The best culling granularity for flat bit arrays turned out to be having 16×16 pixel tiles, however we can also reduce the thread group size to be 8×8 for example to have better occupancy, because there is no sort and little LDS use compared to the compaction technique. But then some shader complexity is added, because each thread will be responsible to work on a small subgroup of the tile. Consider that with 16×16 pixels and 16×16 threads, each thread can work on a single pixel when reading the depth buffer and creating the depth bounds and maybe a 2.5D culling bitmask (Takahiro Harada’s 2012 talk in Siggraph Asia, and also my blog about it). But when reducing the thread count to 8×8, each thread must load 2×2 pixel block instead. However, this also reduces the atomic contention for computing the min-max depth bounds or the 2.5D bitmask (4x times less atomic operations for these phases), because the operations inside the subgroup are performed as non atomic. The whole tile result then will be the result of atomic operations between the subgroups, so in the end that just means less atomics were performed. We don’t have to modify the culling part of the shader however, because that doesn’t work on a thread/pixel granularity, but thread/entity granularity instead. The last thing here worth to mention is that 8×8 thread group size seems like a perfect fit for AMD GCN architecture, which has 64 threads in a wavefront, so the barriers in the compute shader will be even omitted completely. On Nvidia, the wavefront (warp) size is 32 threads, but modifying the thread size to be for example 8×4 threads would also require redesigning how subgroups are processed, which I didn’t tackle for now. It would be an interesting experiment to see how different subgroup distributions would perform. In my implementation, a thread is operating on a 2×2 pixel block, but that could be rewritten so that a thread operates on a row of 4 pixels or something else. Maybe a 8×8 pixel block should be processed first, then move on to the next 8×8 pixel block, and do this 2×2 times. Need to try these and compare. An other idea here: the 2×2 pixel block is also nice for gathering from the depth buffer, we could use one GatherRed() instruction to load 4 samples for the whole subgroup, however that will require a custom code path that won’t work if we choose an other kind of distribution.
  • We don’t have a light array counter in the flat bit aray technique, but instead all the light buckets need to be set to zero, indicating that no lights are contained yet. This is done by each thread just zeroing out one bucket (one uint) until all of them are zeroed out. Apart from that, the common things such as tile depth range are also initialized like always.
  • At the end of the culling shader, each thread exports one bucket, until all buckets were written to global memory. There is no need to have a “tile statistics” entry, because the bitfield directly corresponds to all entities inside the main camera. So the tile statistics can be replaced with just a constant buffer value that contains how many decals/envprobes/lights are inside the main camera.

Rendering phase:

  • Processing the flat bit arrays relies on the firstbitlow() instrinsic in HLSL that gives us the first set bit’s index in the uint bucket, starting from the lowest bit. If the bucket is zero, we continue with the next bucket (thus skipping over 32 non-visible lights), otherwise call firstbitlow(bucket), compute the light index from the bucket index and the bit index, then process the light. Then remove the bit and continue to get the next first low bit until the bucket is not zero. If we want to skip decals, we can do that here too: we can trivially start from the specific bucket’s specific bit that we want to start iterating on (for example if we want to leave out the decals but there are 42 decals inside the camera (first light in entity 43), we can start processing from the second bucket, and mask off the first 10 bits (42-32=10). Here is a pseudo-code how we can extract light index bits from the buckets:
// retrieve tile index (for 16x16 block size):
const uint2 tileIndex = uint2(floor(pixel.xy / 16));

// compute flattened tile index because tile buffer is linear array:
//  g_tileCount = ScreenResoulution.xy / 16
//  g_bucketCountPerTile = max_light_count / 32
const uint flatTileIndex = (tileIndex.x + tileIndex.y * g_tileCount.x) * g_bucketCountPerTile;

for(uint bucket = 0; bucket < g_bucketCountPerTile; ++bucket)
    uint bucket_bits = Tiles[flatTileIndex + bucket];

    while(bucket_bits != 0)
        const uint bucket_bit_index = firstbitlow(bucket_bits);
        const uint entity_index = bucket * 32 + bucket_bit_index;
        bucket_bits ^= 1 << bucket_bit_index;
        Light light = LightArray[entity_index];
        // process light here...
  • Processing these in pixel shaders for forward+ will still yield high VGPR usage because of divergent access to the buckets. However, the scalarization of the buckets themselves is really elegantly done with a single line: bucket_bits = WaveReadLaneFirst(WaveActiveBitOr(bucket_bits)); After the buckets are scalarized, loading from the global light array (entity array) is also scalar.
  • A thing to note about flat bit arrays, is that there can be “holes” in the buckets, so let’s say lightIndex 2 and lightIndex 987 are the only ones contained inside the tile, then the buckets between bucket 0 and bucket 30 are zero, but they can’t be detected in a simple way, so the buckets will be read but not processed. With an other level of indirection, this effect could be reduced, for example having some coarse tiles that are culled with stream compaction, then the finer tile bit buckets would contain indices to those coarse tile light lists. I don’t expect this to be a huge problem, so I haven’t gone there. And it would require the sorting again too. Also, as pointed out by Matt Pettineo on Twitter, the indirection doesn’t have to use stream compaction, but could be an other bitmask, where each bit represents one bucket of 32 entities.
Light culling in Bistro


I’ll compare some results on the bistro scene. The configuration is for GPU times and measured with timestamp queries:

  • Common:
    • Nvidia GTX 1050 laptop GPU
    • Bistro scene (inside + outside)
    • 1000 lights, 0 decals, 0 probes
    • Forward+ rendering
    • 4096 max lights in frustum
    • non-scalarized versions (DX11)
  • Stream compaction method:
    • 255 max lights per tile (255 lights + 1 tile statistics)
    • light culling (4k): 5.6 ms
    • light culling (1080p): 1.4 ms
    • opaque geometry (4k): 43.6 ms
    • opaque geometry (1080p): 13 ms
  • Flat bit array method:
    • 4096 max lights per tile (4096 / 32 = 128 buckets per tile)
    • light culling (4k): 2.5 ms
    • light culling (1080p): 0.64 ms
    • opaque geometry (4k): 41.7 ms
    • opaque geometry (1080p): 12.6 ms

The flat bit array method is the winner here, no matter how I look at it. The biggest boost is in the culling shader that is greatly simplified, but the opaque geometry pass is consistently better as well. (4k means 3840×2160 resolution here, 1080p is full HD 1920×1080)

An other small test I made was for the decals, and I just used a simple cornell box, then placed 100 decals into one spot, overlapping with each other, 4k resolution and covering a small part of the screen:

  • Test: 100 small overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 3.63 ms
    • opaque geometry: 3.67 ms
  • Flat bit arrays method:
    • decal culling: 1.33 ms
    • opaque geometry: 2.96 ms
that was the 100 small decals, red tile = 100 decals, black = 0 decals

It’s a win again, and the closer I zoomed the camera into the decals, the bigger the difference was between the two techniques. This is the result for full screen decal coverage, 100 overlapping decals, still 4k resolution:

  • Test: 100 full screen overlapping decals (4k)
  • Stream compaction method:
    • decal culling + sorting: 17.82 ms
    • opaque geometry: 84.57 ms
  • Flat bit arrays method:
    • decal culling: 2.12 ms
    • opaque geometry: 71.48 ms

To be honest, I did not expect that kind of speedup, so I am very pleased and can say time was well spent implementing this.

Other notes:

  • One interesting thing I noticed is that I first had one huge loop, and the decal + probe + light evaluation were all included in that single loop. Performance was horrible, especially on the Intel 620 GPU. It was something like +10 ms for opaque geometry pass on a scene with 4 big shadow casting lights, even if no decals or probes were processed at all. After that, I broken up the loop into 3 loops, one for decals, one for probes and one for lights. The performance jump was very noticable. This applies for both techniques, try to break up entities that are processed differently to achieve the best performance. This is due to the compiler can manage to do a better job reserving and reusing registers, so in the end you can have a reduced amount of registers used, even if there are more loops. This is because register lifetime can be constrained to loop scope (if they are not unrolled).
  • An other thing related to the Intel 620 GPU was the annoying fact that I noticed some small flickering of the tiles with the stream compaction technique that only happened rarely, even with a completely static scene. I could never figure it out, but it doesn’t happen with the flat bit array technique. Now I suspect that it might be related to the thread group size, because I also had bugs with tiled deferred compute shader if it was running with 16×16 thread group size and using a RW texture.
  • After getting to know the flat bit array technique, I also implemented a similar one for old-school forward rendering path. For that, the CPU can cull lights per object and provide a per draw call bitmask in a constant buffer. The bitmask is a uint4, where 1 uint represents 32 decals, 1 uint represents 32 probes, and 2 uints allow 64 lights to be evaluated for forward rendered objects. I never wanted to do this previously, because then a big constant buffer needs to be updated with light indices per draw call. An other candidate how to send this bitmask to the shader would be the instance buffer, but having it in the constant buffer means scalarized loads (and also don’t have to propagate the mask from vertex to pixel shader – less parameters to send is good to avoid parameter cache bottleneck). Unfortunately, culling efficiency is reduced if many instances are visible, because a constant buffer will be per draw call granularity instead of per instance.
  • It will be also interesting to try not using a culling compute shader, but rely on conservative rasterization (or MSAA), and draw the light proxies instead, then those would do atomic writes into the tile buffer in the pixel shader. Michal Drobot suggested that this can be faster than running the compute shader and doing software culling, because rasterization will make use of the zbuffer, and zbuffer tile acceleration (coarse conservative z-test in the hardware that supports it, not user controlled). A minor issue with this might be that not everything supports conservative rasterization or forced sample count setting for MSAA today (conservative rasterization can be approximated with MSAA).

Implementation samples:

As always, you can find my implementation for these in Wicked Engine. I have made two snapshot branches for the two separate techniques, and while the main branch will updated, these will remain like this to represent the state of this blog:

Some other light culling resources:

Simple job system using standard C++

After experimenting with the entity-component system this fall (sorry there is no blog for that yet), I wanted to see how difficult it would be to put my other unused CPU cores to good use. I never really got into CPU multithreading seriously, so this is something new for me. The idea behind the entity-component system is both to make more efficient use of a single CPU by having similar data laid out linearly in memory (thus using cache prefetching when iterating), and also making it easier to write parallelized code (because data dependecies are more apparent). In this post I want to talk about the job system I came up with. It will not only make sense in the entity-component system, but generally it should perform well for any large data processing task. I wanted to remain in standard C++ (~11) realm, which I found that it is entirely possible. However, it can be extended with platform specific extensions if wanted. Let’s break this blog up into multiple parts:

  1. Interface – how do we want to use it (I hope you’re familiar with lambdas)
  2. Implementation – let’s see the code
  3. Performance – I guees this would be pointless without it
  4. Windows platform specific extension – optional, if you can still keep going
  5. Closure – all must come to an end

I think the best way to get my point across would be to just post the interface code here, because it is very short and make a few comments:

#include <functional&gt;

// A Dispatched job will receive this as function argument:
struct JobDispatchArgs
	uint32_t jobIndex;
	uint32_t groupIndex;

namespace JobSystem
	// Create the internal resources such as worker threads, etc. Call it once when initializing the application.
	void Initialize();

	// Add a job to execute asynchronously. Any idle thread will execute this job.
	void Execute(const std::function<void()&gt;&amp; job);

	// Divide a job onto multiple jobs and execute in parallel.
	//	jobCount	: how many jobs to generate for this task.
	//	groupSize	: how many jobs to execute per thread. Jobs inside a group execute serially. It might be worth to increase for small jobs
	//	func		: receives a JobDispatchArgs as parameter
	void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)&gt;&amp; job);

	// Check if any threads are working currently or not
	bool IsBusy();

	// Wait until all threads become idle
	void Wait();

How this would work? First, we would call JobSystem::Initialize() function somewhere in our application once (before we want to use the job system for the first time around). Probably the best place is when we initialize the application and before doing any resource loading, because the job system can also handle resource loading efficiently.
Then the simplest way we can make a job is by calling JobSystem::Execute(myTask); where myTask is just a simple function without parameters. I use it for example when initializing engine systems, like this:

JobSystem::Execute([] { /*do stuff 1*/ InitGraphics(); });
JobSystem::Execute([] { /*do stuff 2 which is not realted to stuff 1*/ InitPhysics(); });
JobSystem::Execute([] { /*do stuff 3 which is not related to the previous stuff*/ InitInput(); });
JobSystem::Execute([] { /*do stuff 4 which is not related to the previous stuff*/ InitNetwork(); });
JobSystem::Execute([] { /*you probably already get the idea*/ InitSound(); });

The above code creates 5 jobs, which are already probably utilizing 5 worker threads (if the Initialize() created that many workers, but later on that).
Then we can call JobSystem::Wait(); to simply wait for all the jobs to finish, before doing anything that depends on their completion. For example this code would utilize 3 threads:

JobSystem::Execute([] { /*do stuff 1*/ });
JobSystem::Execute([] { /*do stuff 2*/ });

Where the main thread is responsible to notify two worker threads that there are some jobs, and then wait. It also means that overall we are doing extra bookkeeping work by the main thread instead of just doing the jobs one by one. But this should be fine when looking at the big picture, because some work is a good fit to be executed in parallel (if there is no dependency between them). We should only do jobs in parallel if they are a good fit, a lot of times it is still just better to do it on the main thread. Oh and by dependency, I mean if there is any variable that is being written by a job and read by an other, then there is a dependency.

The IsBusy() function is a way for the main thread to just see if any worker threads are busy. It can be used if we don’t actually want to wait for work to finish, and the main thread has other meaningful work to do. For instance, drawing a loading screen is a good example.

The last one is the Dispatch function. This is the most interesting and most useful one in my opinion. I was inspired by how compute shaders work on the GPU, and wanted really bad to use something like that on the CPU. You can think of it like the Execute function, but it actually creates multiple jobs instead of one (jobCount argument). Let’s skip the groupSize argument now and return to it shortly. The third argument is the job itself that will receive a JobDispatchArgs argument that will identify which instance of the job is currently running, and which group (hang in there). To demonsrate how it will be used, look at this loop iterating some large dataset:

const uint32_t dataSize = 1000000;
Data dataSet[dataSize];
for(int i = 0; i < dataSize; ++i)

And let’s rewrite the same thing but this time using Dispatch, to offload it to multiple threads:

const uint32_t groupSize = 1;
JobSystem::Dispatch(dataSize, groupSize, [&dataSet] (JobDispatchArgs args) {

The above code does the same thing as the loop, but now several worker threads each operate on a single element of dataSet. Here is when the groupSize argument comes into play. We actually don’t want each thread to only work on a single element. Not only because of caching, but then 1000000 jobs need to be generated that would put a significant burden on the main thread which is creating the jobs. Instead, we should specify let’s say const uint32_t groupSize = 1000; or something like that. That means, one thread will operate on a batch of 1000 elements of the dataSet. In every case that we are putting a Dispatch in our application, the groupSize must be considered. Just like we have to when we are writing compute shaders – which are really similar to this and have the same purpose: performing work over a large dataset, where data is mostly independent of other data in the set. If our dataSet has elements with dependencies, then it might not be a good candidate – unless we can think of a clever way to utilize the groupSize argument. Because the thing with groupSize, is that with groupSize = 1000, it means that 1000 sub-jobs will run serially on the same thread – then there can be dependencies inside the group. But here it starts to get out of topic, so I recommend to get experienced in compute shaders and using groupshared memory, because the same principles could be applied here. I won’t go into handling groupshared memory in this post.

Now that we see how we want to use the job system, let’s implement the internal workings of it, using standard C++ constructs.

I will go over this mostly in the order I described the interface. But before getting into the first function which is Initialize(), define the internal state and include files, stuff like that:

#include "JobSystem.h"    // include our interface

#include <algorithm&gt;    // std::max
#include <atomic&gt;    // to use std::atomic<uint64_t&gt;
#include <thread&gt;    // to use std::thread
#include <condition_variable&gt;    // to use std::condition_variable

namespace JobSystem
	uint32_t numThreads = 0;    // number of worker threads, it will be initialized in the Initialize() function
	ThreadSafeRingBuffer<std::function<void()&gt;, 256&gt; jobPool;    // a thread safe queue to put pending jobs onto the end (with a capacity of 256 jobs). A worker thread can grab a job from the beginning
	std::condition_variable wakeCondition;    // used in conjunction with the wakeMutex below. Worker threads just sleep when there is no job, and the main thread can wake them up
	std::mutex wakeMutex;    // used in conjunction with the wakeCondition above
	uint64_t currentLabel = 0;    // tracks the state of execution of the main thread
	std::atomic<uint64_t&gt; finishedLabel;    // track the state of execution across background worker threads

	// ...And here will go all of the functions that we will implement

Mostly everything here is using standard C++, except the jobPool. It can also be done with a std::deque, but then also make it thread safe by using mutexes or some kind of synchronization primitives. Or if you are interested, here is my very simple thread safe queue (or ring buffer) implementation:

// Fixed size very simple thread safe ring buffer
template <typename T, size_t capacity&gt;
class ThreadSafeRingBuffer
	// Push an item to the end if there is free space
	//	Returns true if succesful
	//	Returns false if there is not enough space
	inline bool push_back(const T&amp; item)
		bool result = false;
		size_t next = (head + 1) % capacity;
		if (next != tail)
			data[head] = item;
			head = next;
			result = true;
		return result;

	// Get an item if there are any
	//	Returns true if succesful
	//	Returns false if there are no items
	inline bool pop_front(T&amp; item)
		bool result = false;
		if (tail != head)
			item = data[tail];
			tail = (tail + 1) % capacity;
			result = true;
		return result;

	T data[capacity];
	size_t head = 0;
	size_t tail = 0;
	std::mutex lock; // this just works better than a spinlock here (on windows)

It is templated, so you can also use it for any type of data. The nice thing about it is that is very simple, only uses a fixed size memory allocation. It can’t reallocate memory, doesn’t do anything clever, so performance might be more reliable than std::deque. Also, when an element doesn’t fit because there is no more space, it returns false on push_back, or if it is empty and you want to remove an element, it returns false on pop_back. if you got false, you can just try again and maybe that time, it will be successful, because an other thread grabbed/put a job onto it.

Now we have everything we need, except the actual functions. Here is what Initialize() looks like:

void Initialize()
	// Initialize the worker execution state to 0:

	// Retrieve the number of hardware threads in this system:
	auto numCores = std::thread::hardware_concurrency();

	// Calculate the actual number of worker threads we want:
	numThreads = std::max(1u, numCores);

	// Create all our worker threads while immediately starting them:
	for (uint32_t threadID = 0; threadID < numThreads; ++threadID)
		std::thread worker([] {

			std::function<void()&gt; job; // the current job for the thread, it's empty at start.

			// This is the infinite loop that a worker thread will do 
			while (true)
				if (jobPool.pop_front(job)) // try to grab a job from the jobPool queue
					// It found a job, execute it:
					job(); // execute job
					finishedLabel.fetch_add(1); // update worker label state
					// no job, put thread to sleep
					std::unique_lock<std::mutex&gt; lock(wakeMutex);

		// *****Here we could do platform specific thread setup...
		worker.detach(); // forget about this thread, let it do it's job in the infinite loop that we created above

So, in case the comments didn’t do it justice, let me recap: First, we initialize the worker-side label state, but don’t worry about it yet, it will make sense later (hopefully). Then C++ has the neat hardware_concurrency function that can give us how many cores (incuding hyperthreaded virtual cores) our CPU has, so we create that many worker threads in the for loop. Each thread is created with an infinite loop, that it will be doing until our application exits. But most of the time, each thread will just sleep (until the main threads wakes it up), to not spin the CPU wastefully.

Let’s see the Execute function next:

// This little helper function will not let the system to be deadlocked while the main thread is waiting for something
inline void poll()
	wakeCondition.notify_one(); // wake one worker thread
	std::this_thread::yield(); // allow this thread to be rescheduled

void Execute(const std::function<void()&gt;&amp; job)
	// The main thread label state is updated:
	currentLabel += 1;

	// Try to push a new job until it is pushed successfully:
	while (!jobPool.push_back(job)) { poll(); }

	wakeCondition.notify_one(); // wake one thread

Sorry if you got distracted by the poll function above, I did a function for it because it will be used in more places. I will explain, but first, ket’s digest the Execute function: The main thread’s label is incremented. This means, that one job is being submitted, so the JobSystem only becomes idle when the worker label reached the same value. In the previous Initialize function, you can see that in the infinite loop, after a job is executed, the worker label state is also incremented by one. But there it is an atomic operation, because all the workers will be updating that single label.
Then we just push the new job onto the end of the jobPool. As I said, my ringbuffer could be full, so we will try to push the job until it actually succeeds. If it doesn’t succeed, then we invoke the poll function, which will wake up one worker, and yield the main thread. Doing this, we can be sure that the jobPool will eventually be drained by the workers, and the main thread can then schedule the current job.
But maybe, the job was just pushed fine the first time around, and the poll() never executed. In that case, we just wake up any one worker thread that will start working on the job immediately.

I hope I am not putting anyone to sleep yet. Let’s follow with the IsBusy and Wait functions:

bool IsBusy()
	// Whenever the main thread label is not reached by the workers, it indicates that some worker is still alive
	return finishedLabel.load() < currentLabel;

void Wait()
	while (IsBusy()) { poll(); }

You can see how we are checking the state of the main thread and the state of the workers in the IsBusy function. This is also used by the Wait function if we just want to put the main thread to idle and drain all the pending jobs.

Now, with implementing the Dispatch function, you will get full insight into the Job System’s inner workings. This is the most complicated of them all, but if you look at it in a way, it is not very different from the Execute function, just does the same thing multiple times with some little twists:

void Dispatch(uint32_t jobCount, uint32_t groupSize, const std::function<void(JobDispatchArgs)&gt;&amp; job)
	if (jobCount == 0 || groupSize == 0)

	// Calculate the amount of job groups to dispatch (overestimate, or "ceil"):
	const uint32_t groupCount = (jobCount + groupSize - 1) / groupSize;

	// The main thread label state is updated:
	currentLabel += groupCount;

	for (uint32_t groupIndex = 0; groupIndex < groupCount; ++groupIndex)
		// For each group, generate one real job:
		auto&amp; jobGroup = [jobCount, groupSize, job, groupIndex]() {

			// Calculate the current group's offset into the jobs:
			const uint32_t groupJobOffset = groupIndex * groupSize;
			const uint32_t groupJobEnd = std::min(groupJobOffset + groupSize, jobCount);

			JobDispatchArgs args;
			args.groupIndex = groupIndex;

			// Inside the group, loop through all job indices and execute job for each index:
			for (uint32_t i = groupJobOffset; i < groupJobEnd; ++i)
				args.jobIndex = i;

		// Try to push a new job until it is pushed successfully:
		while (!jobPool.push_back(jobGroup)) { poll(); }

		wakeCondition.notify_one(); // wake one thread

Take a deep breath, and let’s tackle this slowly. First, the jobCount (which you can think of as the “loop size”), can be a dynamic parameter, so if it is 0, just return immediately, because we don’t want to shedule 0 jobs.
Then the groupCount is calculated. Don’t confuse this with groupSize, that you specified as an argument to Dispatch. The groupCount is calculated from the jobCount and groupSize and tells us how many worker jobs (or groups) will be put onto the jobPool. There is an integer divide, but we must overestimate it, because let’s say that we want to have jobCount=3, groupSize=2. From that we want to create 2 actual worker jobs (groups), but the integer divide of jobCount/groupSize would give us only 1 group. We must instead schedule 2 groups, each with 2 jobs. But wait, that would be 4 jobs then, even though we specified jobCount=3. We will just discard the out of bounds jobs later.
Then the main thread label is updated with adding in the groupCount. That is how many workers must finish from this point for the JobSystem to become idle.
The following for loop goes over every group, creates a worker job for it with a lambda, passing in the required parameters to populate the JobDispatchArgs which our dispatched job needs to receive. That lambda is the one that will be executed by a worker thread. It contains logic to populate a JobDispatchArgs and run through all inner sub-jobs while making sure that no extra out of bounds iterations are performed. I think this might be the most complicated part if you are going to try to wrap your mind around someone else’s code. Hang in there!
After that group-worker-lambda thing, we do just like the Execute function, try to schedule the group and get the worker threads working immediately on it.

I’m not sure if I made sense of all this, but at this point the JobSystem is ready to be used actually! Shall we take a look of what kind of performance improvements we can expect right off the bat, then continue reading:

The first thing I tried, was that I started to use the JobSystem::Execute() in my Wicked Engine when initializing the main engine systems, so each system initialization would be performed in its own execute. The system initialization tasks were large independent tasks, operating on some arbitrary/different data, a good fit for the Execute function. We can make up a test case if we just Execute Spin functions that will perform infinite loop for some time and not let the CPU rest until the time passed. Let me give you some helpers for spin and a simple timer that you can use too in a small console application:

#include "JobSystem.h"

#include <iostream&gt;
#include <chrono&gt;
#include <string&gt;

using namespace std;

void Spin(float milliseconds)
	milliseconds /= 1000.0f;
	chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
	double ms = 0;
	while (ms < milliseconds)
		chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
		chrono::duration<double&gt; time_span = chrono::duration_cast<chrono::duration<double&gt;&gt;(t2 - t1);
		ms = time_span.count();

struct timer
	string name;
	chrono::high_resolution_clock::time_point start;

	timer(const string&amp; name) : name(name), start(chrono::high_resolution_clock::now()) {}
		auto end = chrono::high_resolution_clock::now();
		cout << name << ": " << chrono::duration_cast<chrono::milliseconds&gt;(end - start).count() << " milliseconds" << endl;

int main()

	// Serial test
		auto t = timer("Serial test: ");

	// Execute test
		auto t = timer("Execute() test: ");
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });
		JobSystem::Execute([] { Spin(100); });

    return 0;

The Spin function just emulates a working CPU for a set amount of milliseconds, the timer struct is a scoped timer which will record how much time has passed until the destructor was called – so I just made two scopes for the two tests in the main function. The results are for me:

Serial test: : 400 milliseconds
Execute() test: : 101 milliseconds

The machine I currently ran this test on is a dual core, but has 4 hardware threads, so the results make sense. If I increase the jobs to 6, the times are as following:

Serial test: : 600 milliseconds
Execute() test: : 200 milliseconds

The reason for this is that at most 4 jobs were executing in parallel, then the last two remaining jobs took over after that, also in parallel to each other. Looks like the results look like we intended!

Now, let’s test the Dispatch performance. I came up with this quick test case:

struct Data
	float m[16];
	void Compute(uint32_t value)
		for (int i = 0; i < 16; ++i)
			m[i] += float(value + i);
uint32_t dataCount = 1000000;

// Loop test:
	Data* dataSet = new Data[dataCount];
		auto t = timer("loop test: ");

		for (uint32_t i = 0; i < dataCount; ++i)
	delete[] dataSet;

// Dispatch test:
	Data* dataSet = new Data[dataCount];
		auto t = timer("Dispatch() test: ");

		const uint32_t groupSize = 1000;
		JobSystem::Dispatch(dataCount, groupSize, [&amp;dataSet](JobDispatchArgs args) {
	delete[] dataSet;

Notice that I create the dataSet two times, once for each test. This is important to avoid reusing cached values for the second run of the test. And the results (optimizations disabled):

loop test: : 140 milliseconds
Dispatch() test: : 59 milliseconds

Neat. But when I enabled optimizations, I got other results:

loop test: : 28 milliseconds
Dispatch() test: : 31 milliseconds

That’s not cool, now the simple loop is faster! Then I tried increasing the groupSize to 10000:

loop test: : 28 milliseconds
Dispatch() test: : 15 milliseconds

Ok, now dispatch is nearly two times faster, still not 4 times like we would expect. After trying some other different groupSizes, I could not improve this much better. This behaviour can depend on several factors, such as cache, thread core switching, the operating system overtaking the threads, contention for the jobPool, or just the nature of the workload that is being executed, all of which needs much more investigation, possibly with a dedicated profiling tool. These days there are at least two big ones, Intel Vtune and AMD CodeXL, both of which I am unfamiliar with at the moment.

Windows platform specific extension
If we are on windows, we can try some Winapi functions to extend threading functionality. I will just name a few useful ones. You can put these in the Initialize function after the thread has been created (you can search for * to find that place). Also, I am doing an #include <Windows.h> and #include <sstream> and #include <assert.h> for this next part.

First, we must retrieve the native thread handle from the std::thread, like this:

HANDLE handle = (HANDLE)worker.native_handle();

Once we have that, we are free to pass it to the Winapi functions. For example, we can set a thread affinity mask, to put it on a dedicated hardware thread instead of having it for all possible cores and let the operating system switch them around on demand. This piece of code puts the thread we just created onto one core specified by threadID:

DWORD_PTR affinityMask = 1ull << threadID;
DWORD_PTR affinity_result = SetThreadAffinityMask(handle, affinityMask);
assert(affinity_result &gt; 0);

So the affinityMask is a bitmask where each bit specifies a hardware core that is less than what hardware_concurrency gives us and start from thread 0. By default, the affinity mask should be 2^hardware_concurrecy() – 1 which is also returned by the SetThreadAffinityMask into the affinity_result variable.

We can increase thread priority like so:

BOOL priority_result = SetThreadPriority(handle, THREAD_PRIORITY_HIGHEST);
assert(priority_result != 0);

By default, each thread we create is THREAD_PRIORITY_DEFAULT. Modifying this could help threads not be overtaken by the operating system by lesser priority threads. I’ve found no way to increase performance with this yet, only decrease it.

I found this next one very useful. We can give a name to our thread so that it will show up in the Visual Studio debugger:

std::wstringstream wss;
wss << "JobSystem_" << threadID;
HRESULT hr = SetThreadDescription(handle, wss.str().c_str());

My general suggestion is not to just mess with platform specific extensions, unless you are sure that you are getting a benefit. In some cases, such as modifying priority can lead to great performance reduction too.

If you stayed till the end, thank you. This simple job system is ready to be used and can offer a generous speedup, but there is definitely room for improvements, especially for the Dispatch function. I am planning to give an in-depth look to Intel Vtune (they have a trial version now) and AMD CodeXL soon, and possibly follow up on this with some improvements. As always, you can take a look at the current implementation in my Wicked Engine.
The code in the engine is likely to change over time, but I also made a sample for this blog post alone which will stay like this. It contains all the code for the JobSystem and the tests I showed.

Thank you for reading! If you find a mistake, or have suggestions, I would like to hear it!

Scalable GPU Fluid Simulation

Let’s take a look at how to efficiently implement a particle based fluid simulation for real time rendering. We will be running a Smooth Particle Hydrodynamics (SPH) simulation on the GPU. This post is intended for experienced developers and provide the general steps of implementation. It is not a step-by step tutorial, but rather introducing algorithms, data structures, ideas and optimization tips. There are multiple parts I will write about: computing SPH, N-body simulation, dynamic hashed grid acceleration structure. You will find example code pieces here written as HLSL compute shaders.

To be able to implement a simulator like this, you should already have a basic particle simulation/rendering up and running on the GPU. Like this. Then the fluid simulation step can just be inserted between particle emission and particle integration phase. Everything should just work from this point. At the end of the article, you should be able to do this on a low end GPU in real time (30 000 particles, 80 FPS, GTX 1050):


Smooth Particle Hydrodynamics

As a starting point, let’s get familiar with SPH. You can probably already find several papers on the subject. I chose this short paper. I would like to skip the in-depth theory behind it here and focus on implementation instead to keep it as short as possible. The basic idea is that we are approximating fluid with particles, with a finite range of affection. Each particle can influence every particle inside its range (smoothing radius). To simulate affection forces for each particle, we will compute the following for each particle: density -> pressure -> pressure-based forces -> viscosity-based forces -> constant gravitational force. After we computed these forces, we can use the combined force value and perform integration of velocity and update particle positions. In the paper we can find the following formulas that we must implement as shader code (I highlited and annotated key areas):

  • Density evaluation with Poly6 kernel smoothing function + pressure computationdensity_pressure
    • Let’s translate that into (pseudo-)code:
      particleA.density = 0; // we will compute this (float)
      // Precompute part of Poly6
      //  this should be constant buffer value
      //  h = smoothing radius parameter (float) (default = 1.0f)
      //  h9 = pow(h, 9)
      //  PI = 3.1415...
      const float Poly6_constant = (315.0f / (64.0f * PI * h9)); 
      for(all particleB that is either near particleA or is particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        if(r2 < h2) // h2 = h*h
          const float W = Poly6_constant * pow(h2 - r2, 3);
          particleA.density += particleB.mass * W;
      // avoid negative pressure by clamping density to reference value:
      particleA.density = max(p0, particleA.density);
      // Compute pressure:
      //  K = pressure constant parameter (float) (default = 250.0f)
      //  p0 = reference density param (float) (default = 1.0f)
      particleA.pressure = K * (particleA.density - p0);
    • particleA correspons to the particle we are evaluating, particleB is the particle which is affecting the current particle. In the formulas, particleA is the “i” iterator, particleB is the “j” iterator. I found that names like this are just better readable in code.
    • The loop iteration is placeholder now, that will be addressed later how we should implement it.
    • Note that we not only evaluate nearby particles, but the particle itself, too. This is essentially just omitting that check inside the loop which is checking if the particle is itself or not. Without this detail, the simulation will be jerkier because essentially we avoid contribution from the particle’s own mass.
    • Note that we can avoid storing the pressure, because it can be computed from the particle itself. We can compute it before computing forces, instead of loading it, which will increase computation time but reduce memory access stalls.
  • Pressure-based force computation with Spiky kernel smoothing functionpressure_force
    • Looks like this in code:
      particleA.force1 = 0; // we will compute this (float3)
      // Precomputed part of Spiky kernel:
      //  h6 = pow(h, 6);
      //  again, it should be coming from constant buffer
      const float Spiky_constant = (-45 / (PI * h6));
      for(all particleB that is near particleA but is not particleA)
        const float3 diff = particleA.position - particleB.position;
        const float r2 = dot(diff, diff);
        const float r = sqrt(r2);
        if(r > 0 && r < h) // **avoid division by zero!
          const float3 rNorm = diff / r; // same as normalize(diff)
          const float W = Spiky_constant * pow(h - r, 2);
          particleA.force1 += (particleB.mass / particleA.mass) * 
            ((particleA.pressure + particleB.pressure) / 
             (2 * particleA.density * particleB.density)) * W * rNorm;
      particleA.force1 *= -1;
    • Very important that unlike in the previous step, we must avoid computing the current particle against itself! The simulation immediately blows up if you don’t check!
  • Viscosity-based force computation with Laplacian smoothing functionviscosity_force
    • In code:
      particleA.force2 = 0; // we will compute this (float3)
      // Most of it is the same as the previous snippet, 
      //  only the innermost calculations will differ
      for(all particleB that is near particleA but is not particleA)
       const float3 diff = particleA.position - particleB.position;
       const float r2 = dot(diff, diff);
       const float r = sqrt(r2);
       if(r > 0 && r < h) // **avoid division by zero!
         const float3 rNorm = diff / r;
         const float r3 = r2 * r;
         const float W = -(r3 / (2 * h3)) + (r2 / h2) + 
           (h / (2 * r)) - 1;
         particleA.force2 += (particleB.mass / particleA.mass) * 
           (1.0f / particleB.density) * 
           (particleB.velocity - particleA.velocity) * 
           W * rNorm;
      // e = viscosity constant (float) (default = 0.018f)
      particleA.force2 *= e;
    • All of this should be combined with the pressure-based force computations, into the same loop!
    • The viscosity constant will control how fluid our simulation is.
  • Forces integration
    • Consists of following steps:
      • Add force to velocity
      • Step position along velocity vector
      • Reset force
    • Using fixed timestep so that simulation will not blow up
    • This step must be separate from the SPH forces evaluation, because viscosity based force computation takes velocity into account, thus it cannot be modified while the forces are not computed yet!
    • They say that a code piece can say a thousand words:
      const float deltaTime = 0.016f; // fixed step
      const float3 G = float3(0, -9.8f, 0);
      particleA.velocity += deltaTime * ((particleA.force1 + particleA.force2) / particleA.density + G);
      particleA.position += deltaTime * particleA.velocity;
      particleA.force1 = 0;
      particleA.force2 = 0;
    • Of course, the particle should not be storing two distinct forces, they can be combined into one.

N-body simulation

N-body simulation stands for the simple algorithm of checking each particle against each other particle in the simulation. This is how we should implement SPH-based fluid simulation first for simplicity. This can be still done real time for a couple thousand particles on a GPU, so it is ideal for us to test out how the formulas work and find good default input values for the simulation. However, this algorithm has a complexity of O(n^2), so it will not scale well after a certain threshold. The SPH smoothing radius will give us opportunity to come up with a better algorithm to reduce complexity and make it scalable later. The first attempt to create an N-body simulation in compute shaders, would involve a double loop iterating through the particle list:

#define THREADCOUNT 256

[numthreads(THREADCOUNT, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
  Particle particleA = particleBuffer[dispatchThreadID.x];
  for(uint i = 0; i < particleCount; ++i)
    Particle particleB = particleBuffer[i];
    resolveParticles(particleA, particleB);

Well, not *really* a double loop, but the outer loop is the compute kernel itself. But there are already several problems with this. First, is that the SPH force computation needs intermediate results of particle densities, but our compute shader is running thread groups in parallel, so we can’t just combine both in the same shader, as we can’t be sure that all the densities are computed by the time we want to evaluate forces. So we are splitting the simulation into two shaders: density evaluation and force evaluation. (Pressure evaluation can be done immediately before force computation because it can be computed from density) So far we have two shaders, each doing O(n^2) particle evaluations, each loading from global memory. This just screams slow. I also experimented with using shared memory to preload particles ahead of time and reduce latency. You can skip this step if you want, because the acceleration structure later will be a complete replacement, but this is a good exercise nonetheless and increases performance of the N-body simulation:

#define THREADCOUNT 256
groupshared Particle sharedParticles[THREADCOUNT];

Particle particleA = particleBuffer[dispatchThreadID.x];

uint numTiles = 1 + particleCount / THREADCOUNT;

for (uint tile = 0; tile < numTiles; ++tile)
  uint offset = tile * THREADCOUNT;
  sharedParticles[flattenedGroupIndex] = particleBuffer[offset + flattenedGroupIndex];

  for(uint i = 0; i < THREADCOUNT; ++i)
    Particle particleB = sharedParticles[i];
    resolveParticles(particleA, particleB);

Important to look out that we are doing thread sync operations inside a dynamic loop. You are not allowed to do this if the loop is potentially divergent across the thread group. To have it compile, ensure that the loop range is loaded from coherent memory such as constant buffer values or read only buffers. For example you can’t load particleCount from an unordered access view, because the loop termination depends on it and it can potentially be divergent across the thread group.

Obviously, it is a bit simplified, to be more readable. We eliminated a bunch of latency by avoiding accessing the main memory and using group shared memory instead. This is the general way to speed up N-body simulation. It should be already good (real time) for simple fluid simulations up to ~20000 particles on an average GPU. We still need to do it two times though, first for computing densities, then for forces.

So, with this implementation, our fluid simulation can be written as running two extra shaders before particle velocity integration:

  1. SPH Density evaluation
  2. SPH force computation

Pressure computation can either be in density evaluation phase and exported to memory/read back in force phase, or just computed in force phase. I chose to actually precompute it in the force phase, and store it into shared memory per “tile”.

So far we can simulate a moderate amount of particles in real time with N-body collisions (4000 particles):


Dynamic Hashed Grid

So, hopefully you have the fluid simulator up and running before coming to this part. 🙂 Let’s eliminate that nasty O(n^2) complexity! The very basic idea behind SPH is that there is a smoothing radius for each particle and only particles inside the smoothing radius should be affecting the current particle. If we subdivide space to a uniform grid, with a grid cell size of smoothing radius, and sort each particle into its corresponding cell, we can perform efficient constant lookups of smaller particle lists. Let’s say, that we have a particle somewhere. We can look up which grid cell it belongs to by taking its position, and dividing it by smoothing radius:

int3 cellIndex = floor(particleA.position / h);

Then we have to evaluate all the particles inside that grid cell and all the direct neighbour cells as well. In a three dimension grid, those are 3*3*3 = 27 cells. There are some problems that come to mind immediately:

  • We don’t want to just constrain the simulation to a finite grid, but we don’t have infinite memory
  • We don’t know how many particles can be inside one cell at once. We don’t want to impose limitation, because that could affect simulation results.
  • We can’t allocate memory with GPU code, so we will be using a fixed size pre-allocated pool.
  • We will need to build this data structure entirely on the GPU.

Fortunately, all of these are solvable with the dynamic hashed grid algorithm!

The first part we should solve is that each cell can contain an unbounded number of particles (well, bounded to the maximum number of allowed particles per simulation). I took this idea from a GDC 2008 presentation by Nvidia. I needed some time to wrap my mind around it, so hopefully I can describe it so that it will be easier for you. It is called “sorted grid” and it is described like this:


It can be tricky to digest, let me explain. Let’s start from left to right. We have a picture and three columns. The picture of the grid shows that we have grid cells 4, 6 and 9 containing some particles. It can be represented by the unsorted list (first column). It can be built on the GPU by iterating through the particle list and each particle writes its grid cell index into a list. But it is not yet usable to perform a constant time lookup. The second column is the sorted list. It’s the previous list, only the particle index list is now sorted by grid cell index. It is good for us, because now if we iterate through the particle index list, the corresponding grid cell indices are continuously laid out in ascending order. But we need the ability to look up an arbitrary cell in constant time and have a continuous array of particle list which it contains. We are nearly there. As the last step, we can process this list and create an “offset list” for each grid cell. We will have an array of grid cells each holding either

  • UINT32_MAX (0xFFFFFFFF) When the cell is empty
  • integer value >= 0 indicating the offset into sorted particle list

With this information, we can now say everything about a particle when we iterate through them:

  • Its corresponding grid cell
  • The grid cell’s offset into the sorted list
  • We know that the sorted list contains some number of particles in the same cell from the offset until we read a particle which is in an other cell.

As we can see, there is no need for memory allocation apart from the pre-allocation, so it is doable on the GPU. We should follows these steps:

  1. Create the unsorted listWe will be running a compute shader with the SV_DispatchThreadID indexing the particle index list, which indexes the main particle property buffer containing position, simulation properties, etc..We will have a cellIndexBuffer, which is the same size as the particle index buffer. It could also be a property of the particle structure, but this is better memory access pattern because it is only used in optional particle simulation path. Also, we can have a general purpose sorting shader more easily.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    Particle particle = particleBuffer[particleIndex];
    int3 cellIndex = floor(particle.position / h);
    uint flatCellIndex = GetFlatCellIndex(cellIndex);
    cellIndexBuffer[particleIndex] = flatCellIndex;
  2. Create the sorted listRun a Bitonic Merge Sort Algorithm which performs comparisons on the cellIndexBuffer but swaps elements of the particleIndexBuffer. Bitonic sort is a well researched parallel sorting algorithm, and you can pull the implementation from AMD’s particle demo on GitHub.
  3. Create the offset listThe offset list will be recreated from scratch each frame, so first ensure that it is completely cleared to invalid offsets (0xFFFFFFFF).Then, we can exploit Atomic operations to efficiently compute the cell offsets. We will dispatch a compute shader that will read one particle index from SV_DispatchThreadID semantic. We can read the corresponding grid cell index or compute it. Loading is preferable here, because we can achieve better cache coherency by not loading position from the bloated particle buffer and converting it to cell index.The Atomic Min operation works because we have high values everywhere in the offset buffer initially, so we can store for each grid cell the minimum of all the threads that had a particle corresponding to that cell.
    uint particleIndex = particleIndexBuffer[dispatchThreadID.x];
    uint cellIndex = cellIndexBuffer[particleIndex];
    InterlockedMin(cellOffsetBuffer[cellIndex], dispatchThreadID.x);
  4. The structure is complete, all we have to do is iterate it!

Before going further, let’s address our remaining desire, that the grid should be “infinite”, not only constrained to a small area. This is a really simple extension, called hashing. The only thing that will change is the computation of the flat cell index. Let’s define the function

inline uint GetFlatCellIndex(in int3 cellIndex)
  const uint p1 = 73856093; // some large primes
  const uint p2 = 19349663;
  const uint p3 = 83492791;
  int n = p1 * cellIndex.x ^ p2*cellIndex.y ^ p3*cellIndex.z;
  return n;

The rest of the algorithm is unchanged, just ensure that wherever looking up a grid cell, we have to use the same function. Also, the choice of prime numbers and grid cell count should have a big impact on the efficiency of the hashing (because of hash collisions, multiple world positions can index the same physical cell), so we should tune this to our needs.

All that’s left now is iterating this structure to resolve the SPH simulation. It will be traversed two times, first for density evaluation, then for force computation. But the iteration itself is the same in both cases:

int3 cellIndex = floor(particleA.position / h);

for(int i = -1; i <= 1; ++i)
  for(int j = -1; j <= 1; ++j)
    for(int k = -1; k <= 1; ++k)
       int3 neighborIndex = cellIndex + int3(i, j, k);
       uint flatNeighborIndex = GetFlatCellIndex(neighborIndex);
       // look up the offset to the cell:
       uint neighborIterator = cellOffsetBuffer[flatNeighborIndex];

       // iterate through particles in the neighbour cell (if iterator offset is valid)
       while(neighborIterator != 0xFFFFFFFF && neighborIterator < particleCount)
         uint particleIndexB = particleIndexBuffer[neighborIterator];
         if(cellIndexBuffer[particleIndexB] != flatNeighborIndex)
           break;  // it means we stepped out of the neighbour cell list!

         // Here you can load particleB and do the SPH evaluation logic

         neighborIterator++;  // iterate...


This is it. It might be a good idea to eliminate the triple loop with a lookup table (but don’t unroll!). But it is better visualized like this for simplicity/code size. This will yield us a huge performance gain and we are able to simulate 100 000 particles on a low end GPU in real time ~30 FPS:


Apart from particle collision, at least for debug purposes, it is a good idea to implement planar collisions, just to keep the simulation in bounds to test how it can fill out a volume. Planar collision is simple, if a particle would go behind a plane, clamp its position and invert the particle velocity based on the plane reflection vector. Apart from that, we can implement rigid collisions with unsimulated particles (for example driven by animation system). But that is out of scope for today.

As with any technique, there are multiple possibilities to further optimize. I will not cover hardware level GPU optimizations here because I have not yet researched this area thoroughly for this and it is already a long article. My general direction would be first to profile for any obvious bottlenecks. Without doing so, I imagine that memory latency could be improved by using shared memory somehow for the grid based lookups. It is not trivial anymore like with N-body simulation, because the thread group can span arbitrary amount of different grid cells at once. Apart from that, we should always aim to stay in cache wherever we can, so deinterleaving certain less frequently used particle properties might be a good idea. Also, watch out for bank conflicts when writing to (shared) memory, you can do so by aligning your data to bank size, and interleave certain frequently used together properties together into a structure. Can we use wave intrinsics to speed up certain parts of the simulation? I have some ideas I still haven’t tested out.

You can see that there is more room for additional optimizations and I still haven’t talked about rendering the fluid. These are very good candidates for other blog posts in the future, so stay tuned!

As always, you are free to pull my implementation from my open source engine:

Also, my older article describes a general purpose GPU particle system which is a good candidate to fit a fluid simulator into. Or just to implement a cool large-scale particle system.

If you find any mistakes or have any questions, let me know in the comments!

UPDATE: There was a divide by zero bug, which is now fixed. In the force computation step, we must check if the distance between two particles is greater than zero, otherwise avoid computing the SPH forces between those two particles. It sometimes manifested as a flickering anomaly that looked more like a synchronization issue. Sorry for the inconvenience! I marked occurences in code samples with **

[Download sample model] for [Wicked Engine Editor]

Thoughts on Skinning and LDS


I’m letting out some thoughts on using LDS memory as a means to optimize a skinning compute shader. Consider the following workload: each thread is responsible for animating a single vertex, so it loads the vertex position, normal, bone indices and bone weights from a vertex buffer. After this, it starts doing the skinning: for each bone index, load a bone matrix from a buffer in VRAM then multiply the vertex positions and normals by these matrices and weight them by the vertex bone weights. Usually a vertex will contain 4 bone indices and 4 corresponding weights. Which means that for each vertex we are loading 4 matrices from VRAM. Each matrix is 3 float4 vectors, so 48 bytes of data. We have thousands of vertices for each model we will animate, but only a couple of hundred bones usually. So should each vertex load 4 bone matrices from a random place in the bone array?

Instead, we can utilize the LDS the following way: At the beginning of the shader, when the vertex information is being loaded, each thread also loads one matrix from the bone array and stores it inside the LDS. We must also synchronize the group to ensure the LDS bone array has been complete. After all the memory has been read from VRAM, we continue processing the skinning: iterate through all bone indices for the vertex, and load the corresponding bone matrix from LDS, then transform the vertex and blend by the bone weights. We just eliminated a bunch of memory latency from the shader.

Consider what happens when you have a loop which iterates through the bone indices for a vertex: First you load the bone and you want to immediately use it for the skinning computation, then repeat. Loading from a VRAM buffer causes significant latency until the data is ready to be used. If we unroll the loop, the load instructions can be rearranged and padded with ALU instructions that don’t depend on those to hide latency a bit. But unrolling the loop increases register allocations (VGPR = Vector General Purpose Register, for storing unique data to the thread; buffer loads consume VGPR unless they are known to be common to the group at compile time, then they can be placed to scalar registers – SGPR) and can result in lower occupancy as we have a very limited budget of them. We also want a dynamic loop instead because maybe a vertex has fewer bones than the maximum, so processing could exit early. So having tight dynamic loop with VRAM load then immediately ALU instructions is maybe not so good. But once that loop only accesses LDS, the latency can be significantly reduced, and performance should increase.

But LDS does not come for free and it is also a limited resource, like VGPR. Let’s look at the AMD GCN architecture: We have a maximum of 64 KB of LDS for a single compute unit (CU), though HLSL only lets us use 32 KB in a shader. If a shader uses the whole 32 KB, it means that the shader can only be running two instances of itself on the CU. We have a bone data structure which is a 3×4 float matrix, 48 bytes. We can fit 682 bones into LDS and still have two instances of the compute shader operate in parallel. But most of the time we hardly have skeletons consisting of that many bones. In my experience, less than 100 bones should be enough for most cases, but we surely won’t use more than say 256 bones for a highly detailed model, either in real time apps. So say that our shader will declare an LDS bone array of 256 bones, and the thread group size is also 256, so each thread will load one bone into LDS. 256*48 bytes = 12 KB. This means that 5 instances of this shader could be running in parallel on a CU, so 5*256 = 1280 vertices processed. That is if we don’t exceed the max VGPR count of 65536 for a CU. In this case it means that a single shader must at maximum fit into the  51 VGPR limit (calculated as 65536 VGPR / 1280 threads). Most cases we will easily fit into even a 128 bone limit, so an LDS bone array size of 128 and thread group size of 128 threads will just be enough and be much easier on the GPU.

However, I can imagine a scenario, which could be worse with the LDS method, if there is a complicated skeleton, but small mesh referencing only a few of the bones. In this case when there is one skeleton for multiple meshes, maybe we should combine the meshes into a single vertex buffer and use an index buffer to separate between them, so this way a single dispatch could animate all the meshes, while they can be divided into several draw calls when needed.

It sounds like a good idea to utilize LDS in a skinning shader and it is a further potential improvement over skinning in vertex/geometry shader and doing stream out. But as with anything on the GPU this should be profiled on the target hardware which you are developing on. Right now my test cases were unfortunately maybe not the best assets there are, and a puny integrated Intel mobile GPU, but even so I could find a small performance improvement with this method.

Thank you for reading, you can find my implementation on GitHUB! Please tell me if there is any incorrect information.

UPDATE: I’ve made a simple test application with heavy vertex count skinned models, and toggleable LDS skinning: Download WickedEngineLDSSkinningTest (You will probably need Windows 10 to run it and DirectX 11)

Further reading:

Large thread groups on GCN

Easy Transparent Shadow Maps


Supporting transparencies with traditional shadow mapping is straight forward and allows for nice effects but as with anything related to rendering transparents with rasterization, there are corner cases.

The implementation is really simple once you have implemented shadow mapping for opaque objects. After the opaque shadow pass, we must render the transparents into a color buffer, but reject samples which would be occluded by opaques, so using a depth read-only depth stencil state. The transparents should be blended multiplicatively. Sorting does not matter with a multiply blend state. In bullet points:

  1. Render opaque objects into depth stencil texture from light’s point of view
  2. Bind render target for shadow color filter: R8G8B8A8_UNORM works good
  3. Clear render target to 1,1,1,0 (RGBA) color
  4. Apply depth stencil state with depth read, but no write
  5. Apply multiplicative blend state eg:
    • SrcBlend = BLEND_DEST_COLOR
    • DestBlend = BLEND_ZERO
    • BlendOp = BLEND_OP_ADD
  6. Render transparents in arbitrary order

When reading shadow maps in shading passes, we only need to multiply the lighting value with the transparent shadow map color filter if the pixel is inside the light. There is a slight problem with this approach, that you will notice immediately. Transparent objects now receive their own colored self shadow too. The simplest fix is to just disable the colored part of the shadow calculation for transparent objects. We can already produce nice effects with this, this is not a huge price to pay.

See it in action, transparents are rendered without colored self-shadows:


But they receive shadows from opaque objects just fine:


There is a technique which would allow us to render colored shadows on top of transparents too. This involves keeping an additional shadow depth map for transparencies. The flow of this technique is like this (from a Blizzard presentation):

  1. Render opaque shadow map
  2. Render transparent shadow map
    • To a separate depth texture!
    • Depth writes ON
  3. Clear shadow color filter texture (like in the simple approach)
  4. Render transparent again to color filter render target
    • But use the opaque shadow map’s depth stencil
    • depth read ON
    • depth write OFF

And in the shading step, now there will be two shadow map checks, one for the opaque, one for the transparent shadow maps. Only multiply the light with the shadow filter color texture when the transparent shadow check fails.

This will eliminate false self shadows from transparent objects. But unfortunately now when a transparent receives a colored shadow, its own transparent shadow color will also contribute to itself.

What’s more interesting, are the additional effects we can achieve with transparent shadow maps:

Textured shadows, which can be used as a projector for example, just put a transparent textured geometry in front of a light:


And underwater refraction caustics:


The caustics are implemented as adding additional light to the surface, so even if the surface below the water would be in the water’s shadow it would add in the caustic effect. For this I am using the transparent color filter render target’s alpha channel with an additive blend mode. The water shadow shader renders the slope of the water normal map into the alpha channel, eg. this:

float4 outputColor.a = 1 – saturate(dot(normalMap.rgb * 2 – 1, float3(0, 0, 1)));

Which could probably use a more sophisticated method, which I didn’t bother to find yet. So in the light evaluation shaders, I add the transparent color filter alpha channel to the pixelIsInShadow value multiplied by some causticStrength modulator to render the caustics. Caustics are added because there can be brighter caustics than the light brightness, because in reality they are generated when multiple rays (photons) arrive at the same location from different sources.

That’s it, I think this is a worthwhile technique to include in a game engine. Even the simplest implementation can be used for many interesting effects. Thank you for reading!

Related article:

StarCraft 2 Effects and techniques