Bindless Descriptors

The Vulkan and DX12 graphics devices now support bindless descriptors in Wicked Engine. Previously, and still in DX11, it was only possible to access textures, buffers (resource descriptors) and samplers in shaders by binding them to specific slots. First, the limitations of the binding model will be described briefly, then the bindless model will be discussed.

Binding model

Slots are available in limited numbers, so a shader could only access up to 128 shader resource views (read only resources), 8 unordered access views (read/write resources), 16 constant buffers and 16 samplers. These limits were configurable in the DX12 and Vulkan devices, but shaders still had to declare resources on hard coded slots, so the model was not flexible enough. To give an example, this is how a shader accesses a texture bound to a slot:

Texture2D<float4> texture : register(t27);
// ...
float4 color = texture.Sample(sampler_linear_clamp, uv); 

To bind the texture on the CPU side, do this:

Texture texture;
// ...
device->BindResource(PS, &texture, 27, cmd);

The other problem is the CPU cost of binding descriptors, which is the main cause of CPU performance problems in rendering code. In some cases, the CPU code will bind a lot of descriptors, especially when rendering the scene with complex shaders that require many inputs like textures, buffers, etc. Binding descriptors is CPU intensive because the implementation will copy all descriptors used by a shader into a contiguous memory area resembling the layout that the shader expects (which will mostly be different for every shader). In the Vulkan and DX12 implementations it was possible to make some simplified assumptions regarding descriptor binding, so they would perform better than DX11. These simplifications included:

  • using shader reflection, determine for each shader exactly how many descriptors will be needed, instead of creating a layout that fits all
  • only use one set of slots for all shader stages combined, so no separate slots for vertex and pixel shader
  • put some constant buffers into “root descriptors” instead of descriptor tables

But the single most CPU-costly function that still always showed up in profiles was CopyDescriptorsSimple (DX12) or vkUpdateDescriptorSets (Vulkan).

Bindless model

The bindless model solves both the flexibility and the CPU performance problems. Instead of declaring slots in shaders, we declare the entire descriptor heap as an unbounded array and select a specific descriptor with a simple index:

Texture2D<float4> bindless_textures[] : register(t0, space1);
// ...
float4 color = bindless_textures[27].Sample(sampler_linear_clamp, uv);

The great thing is that the index value “27” in this example can come from anywhere, for example a material structure:

struct Material
{
  int texture_basecolormap_index;
  int texture_normalmap_index;
  int texture_emissivemap_index;
  // ...
};

This means that instead of binding all textures required by the material on the CPU, we just bind the material constant buffer, for example (previously you’d need to bind the material constant buffer and all textures as well), and the shader can trivially access every texture referenced by the material. There is only one thing we need to do from the CPU side: get the descriptor index, so the value of “texture_basecolormap_index” for example. For that, you can now use this API on the application side:

Texture texture;
// ...
int index = device->GetDescriptorIndex(&texture, SRV);

The GetDescriptorIndex() API can be used after the resource/sampler was created (for example by the CreateTexture function). If the resource wasn’t created yet, or the device doesn’t support bindless, this API will return the value -1. This is important, because shaders must make sure to not use any descriptor which was not created. For example, it is common that a material doesn’t contain a normal map, so the texture_normalmap_index would be -1. In this case the shader must use branching to avoid reading this texture:

if(material.texture_normalmap_index >=0)
{
  normal = bindless_textures[material.texture_normalmap_index].Sample(sampler_linear_clamp, uv);
}

To avoid branching in shaders, you can choose to provide a dummy texture (like a 1×1 white texture or something similar) if you detect that one of the texture indices is -1.
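For example, the substitution can be done on the CPU when filling the material’s GPU constants. A minimal sketch, assuming a hypothetical dummyWhiteTexture created at startup and illustrative ShaderMaterial member names (not necessarily the engine’s exact ones):

int basecolor_index = device->GetDescriptorIndex(&material.baseColorMap, SRV);
int normal_index = device->GetDescriptorIndex(&material.normalMap, SRV);
int dummy_index = device->GetDescriptorIndex(&dummyWhiteTexture, SRV); // 1x1 dummy texture; a flat normal dummy may be preferable for normal maps

ShaderMaterial gpu_material;
gpu_material.texture_basecolormap_index = basecolor_index >= 0 ? basecolor_index : dummy_index;
gpu_material.texture_normalmap_index = normal_index >= 0 ? normal_index : dummy_index;
// upload gpu_material to the material constant buffer as usual...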

If the shader tries to access an invalid resource, undefined behaviour will occur, most likely a GPU hang. To help detect this, you can start the application with the debugdevice and gpuvalidation command line arguments and the Visual Studio debugger attached. If a problem is detected, the debugger will break and the error messages will be posted to the Visual Studio output window.

Using the binding model alongside the bindless model is supported and makes a lot of sense. You wouldn’t want to access your per frame constant buffer in a bindless way; that is less convenient (but still possible). This also makes it easy to transition to the bindless model in small steps, replacing your shaders with bindless ones a step at a time. The only thing to keep in mind is that space0 (or the default when no space binding is provided) is reserved for slot-based descriptor bindings in Wicked Engine, so use explicit space1 and greater for bindless declarations.

You can go even further than always binding a material constant buffer; for example, you can have all your materials inside a StructuredBuffer, or just in bindless buffers. If you also reference all your vertex buffers in the bindless way, you will not have to bind anything separately for any of your scene objects. At some point you will need some way to provide descriptor indices to the shaders, and the simplest facility for that is push constants. This is the most efficient way to set a small number of constant values for shaders. To use it, declare a push constant block in the shader using the PUSHCONSTANT macro (which provides a consistent interface for both Vulkan and DX12):

struct MyPushConstants
{
  int mesh;
  int material;
  int instances;
  uint instance_offset;
};
PUSHCONSTANT(push, MyPushConstants);

From the CPU side, set the push constants for the next draw/dispatch call like this:

MyPushConstants push;
push.mesh = device->GetDescriptorIndex(&meshBuffer, SRV);
// fill the rest of push structure...
device->PushConstants(&push, sizeof(push), cmd);

This API is designed for simple, per draw call small data uploads (up to 128 bytes). Each shader can use one push constant block for simplicity. This doesn’t exhibit the same CPU cost as binding a small constant buffer, and has potentially better GPU performance (less memory indirection cost), so prefer this way of specifying per draw data. However, this is only available in DX12 and Vulkan devices and will have no effect in DX11. Check whether push constants can be used by checking bindless support:

if(device->CheckCapability(GRAPHICSDEVICE_CAPABILITY_BINDLESS_DESCRIPTORS))
{
  // OK to use push constants and bindless
}

With this flexibility of specifying Mesh and Material structures to be used on the GPU, it becomes possible to bind no per draw call vertex/instance/material buffers at all, and CPU performance improves significantly.
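As a minimal sketch of what such a draw loop could look like (the object members and the DrawIndexedInstanced-style call are illustrative assumptions, not the engine’s exact API), only the push constant block changes per object:

for (const Object& object : scene.objects)
{
  MyPushConstants push;
  push.mesh = device->GetDescriptorIndex(&object.meshBuffer, SRV);         // hypothetical member
  push.material = device->GetDescriptorIndex(&object.materialBuffer, SRV); // hypothetical member
  push.instances = instanceBufferDescriptorIndex;                          // descriptor index of a shared instance buffer
  push.instance_offset = object.instanceOffset;
  device->PushConstants(&push, sizeof(push), cmd);

  // No per object BindResource or constant buffer binding calls are needed:
  device->DrawIndexedInstanced(object.indexCount, object.instanceCount, 0, 0, 0, cmd);
}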

Subresources

Subresources are fully supported in bindless just like in non-bindless. The CreateSubresource() API is used to create a subresource on a resource and returns an integer ID, and the GetDescriptorIndex() API will accept this as an optional parameter if you want to query a specific subresource’s own descriptor index. By default, the whole resource is always viewed. Because the subresources have the same lifetime as their main resource, this doesn’t require any special consideration. Small example of retrieving the descriptor index of MipLevel=5 SRV (shader resource view) for a texture:

int mip5 = device->CreateSubresource(&texture, SRV, 0, 1, 5, 1);
int descriptor = device->GetDescriptorIndex(&texture, SRV, mip5);

Constant Buffer problem in Vulkan

A corner case to look out for is bindless constant buffers. There is no problem with this declaration in DX12:

ConstantBuffer<MyType> bindless_cbuffers[] : register(b0, space1);

This makes it easy to use bindless constant buffers; in Vulkan, however, you can run into an unexpectedly low limit on the number of constant buffer descriptors. This limit can be queried from VkPhysicalDeviceVulkan12Properties::maxDescriptorSetUpdateAfterBindUniformBuffers.
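A minimal sketch of querying this limit through the core Vulkan 1.2 API (assuming a valid VkPhysicalDevice is already available):

VkPhysicalDeviceVulkan12Properties props_1_2 = {};
props_1_2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_PROPERTIES;

VkPhysicalDeviceProperties2 props2 = {};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &props_1_2;

vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

// How many UPDATE_AFTER_BIND uniform buffer descriptors a descriptor set can hold:
uint32_t cbv_limit = props_1_2.maxDescriptorSetUpdateAfterBindUniformBuffers;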

For example, on an RTX 2060 GPU I got an “unlimited” amount for this, but on a GTX 1070 I got a limit of 15, which is not sufficient at all for bindless usage. At the same time I had no problem using bindless constant buffers in DX12 on the same GPU. In this case I had to rewrite the bindless constant buffers as bindless ByteAddressBuffers instead. With the ByteAddressBuffer, you can make use of the Load<T>(address) templated function in HLSL 6.0+ to get the same logical result as if reading from a constant buffer. An example of how a bindless constant buffer can be rewritten:

// Bindless constant buffer:
ConstantBuffer<ShaderMesh> bindless_meshes[] : register(b0, space2);
ShaderMesh mesh = bindless_meshes[42];

// Can be rewritten as:
ByteAddressBuffer bindless_buffers[] : register(t0, space2);
ShaderMesh mesh = bindless_buffers[42].Load<ShaderMesh>(0);

The ByteAddressBuffer is also nice because fewer descriptor tables must be bound by the API implementation layer, while supporting more buffer types. I haven’t noticed a performance difference compared to constant buffers on the GPUs/scenes I tested on.

Raytracing

Bindless makes it easier to support raytracing too, as using local root signatures is no longer required. Instead we can provide the mesh descriptor index via the top level acceleration structure InstanceID (which is basically user data) and the InstanceID() intrinsic function that is available in raytracing hit shaders. Raytracing tier 1.1 (and Vulkan) also provides the GeometryIndex() intrinsic in the shader, which is useful for querying a subset of a mesh. For example, a flexible bindless scene description in a raytracing shader can look like this:

Texture2D<float4> bindless_textures[] : register(t0, space1);
ByteAddressBuffer bindless_buffers[] : register(t0, space2);
StructuredBuffer<ShaderMeshSubset> bindless_subsets[] : register(t0, space3);

// Then inside the hit shader:
ShaderMesh mesh = bindless_buffers[InstanceID()].Load<ShaderMesh>(0);
ShaderMeshSubset subset = bindless_subsets[mesh.subsetbuffer][GeometryIndex()];
ShaderMaterial material = bindless_buffers[subset.material].Load<ShaderMaterial>(0);
// continue shading...

This example works well, but watch out: there is a lot of dependent resource dereferencing here, so it is worth experimenting with flattening some of the hierarchy, which could improve shader performance. For example: right now, to read a MeshSubset, you need to load the Mesh first, but in most cases there is only 1 subset per mesh, so it could be better to duplicate the mesh data for each subset and load them as one. Remember that the Mesh and Subset data here are just descriptor indices referencing vertex buffers and such, so the data duplication would not be very excessive.

The above example shows that a variety of bindless resources are declared, which is a bit questionable. Consider the StructuredBuffer case: for every type of StructuredBuffer, we now have to declare a bindless resource with the appropriate type. Internally, the DX12 and Vulkan API implementations will bind, for each declaration, a descriptor table (descriptor set in Vulkan) that views the entire descriptor heap (descriptor pool in Vulkan). This means that the same heap will be bound multiple times, once per bindless declaration, which just doesn’t make a lot of sense (Update: As Simon Coenen pointed out in the comments, it is valid to create one descriptor table with multiple unbounded ranges, so there won’t be a need for more than 2 SetDescriptorTable calls per pipeline. Thanks!). DX12 addresses this with the shader model 6.6 feature, which lets you just index a global resource descriptor heap or sampler descriptor heap and cast the result to the appropriate descriptor type: https://devblogs.microsoft.com/directx/in-the-works-hlsl-shader-model-6-6/

This makes sense in DX12, but Vulkan needs to create separate descriptor pools for every distinct resource type (storage buffer, uniform buffer, sampled image, etc…), so I’m not yet sure how it would be possible to make the two APIs compatible like this.

NonUniformResourceIndex

If the descriptor index is not a uniform value, you will need to mark it as such before you use it to index into a bindless heap; the NonUniformResourceIndex() function in HLSL is used for this. If the value comes from a constant buffer, a push constant, or WaveReadFirstLane(), then the value is already scalar. Otherwise, NonUniformResourceIndex() will ensure that the indexing is handled correctly for each divergent lane. It is unfortunate that this can easily be missed, because the HLSL compiler doesn’t complain about it. Apparently it is valid to use a divergent index without it on some hardware, which seems to be the case for me on Nvidia. To read more about this, visit this blog: Direct3D 12 – Watch out for non-uniform resource index! (asawicki.info)

Under the hood

Perhaps it’s worth describing briefly how all this is implemented with the Vulkan and DX12 graphics APIs. In DX12, there are two descriptor heaps visible to shaders at any time: one for samplers only, the other for constant buffers, shader resource views and unordered access views (I call the second one the “resource descriptor heap”). It is not recommended to switch shader-visible heaps at any time, so at application start they are created to fit the maximum amount of descriptors: 2048 samplers and 1 million resource descriptors (the Tier 1 limit in DX12). These heaps hold both bindless descriptors and binding slot layouts alike. The way it works is to just split the heap into two parts: the lower half contains the bindless descriptors, the upper half is used as a lock-free ring buffer allocator, and if a draw call requires a new binding layout, it will be dynamically allocated there. I really like the DX12 way of just giving you access to the heap and a way to address and copy descriptors within it; it leaves a lot of freedom for the implementation. For the bindless allocation, a simple free list is used. When shaders use bindless descriptors (as seen from shader reflection), the sampler or resource heap will be automatically set on the appropriate root parameter just before the draw/dispatch is called – the user does not need to care about this.
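The free list itself can be very simple; a minimal sketch of the idea (not the engine’s exact code, and ignoring thread safety for brevity):

#include <vector>

// Hands out bindless descriptor indices and recycles freed ones.
struct DescriptorFreeList
{
  std::vector<int> free_indices;

  void init(int capacity)
  {
    free_indices.resize(capacity);
    for (int i = 0; i < capacity; ++i)
      free_indices[i] = capacity - 1 - i; // fill in reverse so index 0 is handed out first
  }

  int allocate()
  {
    if (free_indices.empty())
      return -1; // heap exhausted
    int index = free_indices.back();
    free_indices.pop_back();
    return index;
  }

  void free(int index)
  {
    free_indices.push_back(index);
  }
};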

In Vulkan, support for bindless relies on the VkPhysicalDeviceVulkan12Features::descriptorIndexing feature. I decided to create one descriptor pool per descriptor type for bindless, so that’s currently 8 pools that can each hold a large number of descriptors. These are created once (using the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag) and the full descriptor sets are allocated immediately. A custom free list allocator is responsible for keeping track of which descriptors are free (in a similar way as in DX12). The binding layouts are allocated from a separate descriptor pool that can hold multiple different types of descriptors and doesn’t support freeing individual descriptors – the whole pool is simply reset once the GPU has finished with it. Based on shader reflection, the device layer detects which descriptor sets need to be bound where. The downside with Vulkan is that the register space declaration maps to the descriptor set index. So if you declare a bindless table in space12, for example, and nothing else, the device layer will need to bind dummy descriptor sets starting from set=0 and ending with set=11. DX12 doesn’t have this problem; you can declare something in space687 and not care about it, since the root signature handles the mapping. But if you plan to use Vulkan, the advice is to use the lowest space numbers you can.
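For reference, a rough sketch of how one such bindless pool might be created for sampled images (the descriptor count is an arbitrary illustrative value, and the matching descriptor set layout also needs the UPDATE_AFTER_BIND binding flag):

VkDescriptorPoolSize pool_size = {};
pool_size.type = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
pool_size.descriptorCount = 100000; // large bindless capacity

VkDescriptorPoolCreateInfo pool_info = {};
pool_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
pool_info.flags = VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT; // descriptors may be written after the set was bound
pool_info.maxSets = 1; // one big descriptor set holds the whole bindless array
pool_info.poolSizeCount = 1;
pool_info.pPoolSizes = &pool_size;

VkDescriptorPool bindless_sampled_image_pool = VK_NULL_HANDLE;
vkCreateDescriptorPool(device, &pool_info, nullptr, &bindless_sampled_image_pool);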

In both of these implementations, allocating bindless descriptors happens the first time the resource is created, or when creating a subresource, and they are immediately accessible to shaders. Their lifetime is bound to the resource itself, so the user mostly doesn’t need to care about it.

Difficulty of moving to bindless

Because it is possible to use bindless and the old school binding model together, it becomes really easy to move to bindless in baby steps – or even larger steps. The fun problem is coming up with new flexible data structures that make sense. The less fun part is supporting DX11 at the same time when bindless is used. This will be slightly intrusive depending on how far you are going. I decided to go full bindless in the most frequently used shaders – the scene rendering “object shaders” – when using DX12 or Vulkan. This means that the CPU part that renders all meshes has a large if(bindless) branch. If bindless is supported, only a push constant block is set; otherwise a material constant buffer and all textures are bound. It seems manageable so far. The “object shaders” use a compile time #ifdef path for bindless, and all the texture accesses and vertex buffer reads are redefined to read from bindless resources; a simplified example:

#ifdef BINDLESS
Texture2D<float4> bindless_textures[] : register(t0, space1);
SamplerState bindless_samplers[] : register(t0, space2);
ByteAddressBuffer bindless_buffers[] : register(t0, space3);
PUSHCONSTANT(push, ObjectPushConstants);

inline ShaderMaterial GetMaterial()
{
  return bindless_buffers[push.material].Load<ShaderMaterial>(0);
}

#define texture_basecolormap  bindless_textures[GetMaterial().texture_basecolormap_index]


#else // Non-bindless path below:
cbuffer materialCB : register(b0)
{
  ShaderMaterial g_xMaterial;
};

inline ShaderMaterial GetMaterial()
{
  return g_xMaterial;
}

Texture2D<float4> texture_basecolormap : register(t0);

#endif // BINDLESS

Thankfully the shaders were already using branching to avoid reading material textures that are not loaded, so guarding against uninitialized descriptors was not a problem that needed to be tackled in bindless. It’s pretty important to have this branching because bindless will not set “null descriptors” (descriptors are declared volatile), and referencing a descriptor that’s uninitialized can result in GPU hangs.

The vertex input struct uses functions to get the vertex position and other properties, and those functions are redefined for bindless. In DX11, input layouts are still used to provide vertex data, but in the bindless path I fetch everything using the SV_VertexID and SV_InstanceID system-provided input semantics:

struct VertexInput
{
#ifdef BINDLESS
  // Data coming from bindless fetching:
  uint vertexID : SV_VertexID;
  uint instanceID : SV_InstanceID;

  float4 GetPosition()
  {
    return float4(bindless_buffers[push.vertexbuffer_pos_wind].Load<float3>(vertexID * 16), 1);
  }

#else
  // Data coming from input layout:
  float4 pos : POSITION_NORMAL_WIND;

  float4 GetPosition()
  {
    return float4(pos.xyz, 1);
  }

#endif // BINDLESS
};

Wicked Engine has an ubershader that includes most of the “object shader” code in one file and relies on compile time defines to enable specific paths. With that, it was not a huge effort to move to bindless.

Performance

Expect big savings in CPU performance with bindless. In large scenes this can easily be worth many milliseconds. The GPU time can be worse though, as the new data structures will probably use more indirections. I didn’t manage to arrive at a consistent conclusion here unfortunately, but I also didn’t discover any big performance problems. It’s not a given that the bindless model just runs faster overall: some render passes saw a slight reduction in GPU performance, but some ran faster. The difference between Vulkan and DX12 is also not trivial. I seemed to have more CPU gains with DX12, while the Vulkan GPU times were usually better.

Debugging

All the main graphics debuggers that I use (Renderdoc, Nsight, Pix) now support bindless pretty well. Using bindless in these tools can be quite a bit slower than not using it, so there is room for improvement. Enabling the GPU Based Validation layer is very useful as it will report uninitialized descriptors that are accessed by shaders. However, this makes the application very slow in my experience, so it’s not a joy to use.

Closing thoughts

Looking into the future, more systems will be moved to bindless where it makes sense, especially texture atlases. I think texture atlases are now “deprecated” by bindless. There is just so much effort in updating them and ensuring MIP levels are correct. Not to mention the limitation on texture formats, since you can’t put different texture formats into the same atlas, and it is really difficult to update atlases with block compressed formats. There is also the GLTF model format, which adds multiple new textures to be sampled for each extension, and that begins to stress the bind slot limits, especially with some material blending shaders. It will be a lot easier to add new textures to materials now.

There will also be a lot of places where bindless doesn’t provide a benefit, for example global per frame resource bindings. I’m still not comfortable just dropping DX11, as it’s very stable now and supports a lot of older GPUs with acceptable performance. For the newer graphics APIs, I’d rather focus on working with cutting edge features and not worry so much about backwards compatibility at the moment.

Thank you for reading!

Some other resources:
http://alextardif.com/Bindless.html
https://therealmjp.github.io/posts/bindless-texturing-for-deferred-rendering-and-decals/
Bindless Deferred Decals in The Surge 2 (slideshare.net)

Variable Rate Shading: first impressions

Variable Rate Shading (VRS) is a recently introduced DX12 feature that can be used to control the shading rate. To be more precise, it is used to reduce the shading rate, as opposed to the Multi Sampling Anti Aliasing (MSAA) technique, which is used to increase it. When using MSAA, every pixel gets allocated multiple samples in the render target, but unless multiple triangles touch it, it will only be shaded once. VRS, on the other hand, doesn’t allocate multiple samples per pixel; instead it can broadcast one shaded result to nearby pixels, so a whole group of pixels is shaded only once. The shading rate describes how big the group of pixels is that gets shaded as one.

Basics

DirectX 12 lets the developer specify the shading rate as a block of pixels: it can be 1×1 (the default and most detailed), 1×2, 2×1 or 2×2 (least detailed) in the basic hardware implementation. Optionally, hardware can also support 2×4, 4×2 and 4×4 pixel groups at an additional capability level. With the basic Tier 1 VRS hardware, the shading rate can be selected per draw call. Controlling it per draw call is already a huge improvement over MSAA, because it means the shading rate doesn’t have to be uniform across the screen. Setting the shading rate couldn’t be easier:

commandlist5->RSSetShadingRate(D3D12_SHADING_RATE_2X2, nullptr); // more about the second parameter later

That’s it, unlike MSAA, we don’t need to do any resolve passes, it just works as is.

The Tier 2 VRS feature level lets the developer specify the shading rate even per triangle, by writing a uint value to the SV_ShadingRate HLSL semantic. SV_ShadingRate can be written as an output from the vertex shader, domain shader, geometry shader and mesh shader. In all of these cases, the shading rate is set per primitive, not per vertex, even though vertex and domain shaders only support the per vertex execution model. The triangle will receive the shading rate of the provoking vertex, which is the first of the three vertices that make up the triangle. The pixel shader can also read the shading rate as an input parameter, which can be helpful for visualizing the rate.

The Tier 2 VRS implementation also supports controlling the shading rate by a screen aligned texture. The screen aligned texture is an R8_UINT formatted texture, which contains the shading rate information per tile. A tile can be an 8×8, 16×16 or 32×32 pixel block; the tile size can be queried from DX12 as part of the D3D12_FEATURE_DATA_D3D12_OPTIONS6 structure:

D3D12_FEATURE_DATA_D3D12_OPTIONS6 features_6;
device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6, &features_6, sizeof(features_6));
features_6.VariableShadingRateTier; // shading rate image and per primitive selection only on tier2
features_6.ShadingRateImageTileSize; // tile size will be 8, 16 or 32
features_6.AdditionalShadingRatesSupported; // Whether 2x4, 4x2 and 4x4 rate is supported

This means that the shading rate image resolution will be:

width = (screen_width + tileSize - 1) / tileSize;
height = (screen_height + tileSize - 1) / tileSize;

To bind the shading rate image, there is a dedicated call:

commandlist5->RSSetShadingRateImage(texture);

The shading rate image needs to be written from a compute shader through an Unordered Access View (RWTexture2D<uint>). Before binding it with the RSSetShadingRateImage command, it needs to be in the D3D12_RESOURCE_STATE_SHADING_RATE_SOURCE state.
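For example, the transition from the compute shader’s UAV writes into the shading rate source state could look roughly like this with raw DX12 calls (shadingRateImage stands for the R8_UINT tile texture; in practice this would go through the engine’s device abstraction):

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource = shadingRateImage;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_SHADING_RATE_SOURCE;
commandlist5->ResourceBarrier(1, &barrier);

commandlist5->RSSetShadingRateImage(shadingRateImage);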

So there are multiple ways to set the shading rate: RSSetShadingRate, SV_ShadingRate and RSSetShadingRateImage – but which one will be in effect? This can be specified through the second parameter of the RSSetShadingRate() call with an array of combiners. The combiners specify how the different shading rate selectors are combined: for example, pick the least detailed shading rate (D3D12_SHADING_RATE_COMBINER_MAX), the most detailed (D3D12_SHADING_RATE_COMBINER_MIN), or apply other logic. Right now, I just want to apply the coarsest shading rate that was selected at all times, so I call this at the beginning of every command list:

D3D12_SHADING_RATE_COMBINER combiners[] =
{
	D3D12_SHADING_RATE_COMBINER_MAX,
	D3D12_SHADING_RATE_COMBINER_MAX,
};
GetDirectCommandList(cmd)->RSSetShadingRate(D3D12_SHADING_RATE_1X1, combiners);

Next, I’d like to show some of the use cases where these features can come into play.

  • Materials
    For example, if a material is expensive or not very important, lower its shading rate. It can easily be a per material setting that is specified at authoring time.
Comparison of native (full rate) and variable rate shading (4×4 reduction). Texture sampling quality is reduced with VRS.
Lighting quality is reduced with VRS (below) compared to full resolution shading (above), but geometry edges are retained.
  • Particle systems
    When drawing off screen particles into a low resolution render target, we can save performance easily, but it is difficult to compose the particles back and retain smooth edges against the geometry in the depth buffer. Instead, we can choose to render at full resolution and reduce the shading rate. This way we can also keep using hardware depth testing and still improve performance.
4×4 shading rate reduction for the particles. Large overlapping particles with lighting must reduce shading rate or be rendered at lower resolution for good performance. (The ground plane is also using VRS here.)
  • Objects in the distance
    Objects in the distance can easily use a reduced shading rate via per draw call or per primitive rate selection.
  • Objects behind motion blur or depth of field
    Fast moving or out of focus objects can be shaded more coarsely, and the shading rate image feature can be used for this. In my first integration, I am using a compute shader that dispatches one thread group for each tile, and each thread in the group reads a pixel’s velocity until all pixels in the tile are read. Each pixel’s velocity is mapped to a shading rate, and then the most detailed one is determined via an atomic operation inside the tile. The shader internally must write values taken from the D3D12_SHADING_RATE enum, so in order to keep the implementation API independent, these values are not hard coded but provided in a constant buffer. [My classification shader source code here, but it will probably change in the future]
Shading rate classification by velocity buffer and moving camera.
  • Alpha testing
    For vegetation, we often want to do alpha testing, and the depth prepass is often rendered with alpha testing as well. In that case, we don’t have to alpha test in the second pass where we render colors and use more expensive pixel shaders, because the depth buffer is already computed and we can rely on depth testing against the previous results. The idea is then that we can reduce the shading rate for the alpha tested vegetation only in the second pass, while retaining high resolution alpha testing quality from the depth prepass.
Vegetation particle system using alpha testing

Problems:

  • One of my observations is that when zoomed in on a reduced shading rate object on the screen, we may see some blockiness, as if it used a point/nearest neighbor sampling method. After some thinking it makes sense, because only a single shaded pixel value is broadcast to all neighbors, and no filtering or resolving takes place.
  • Also, mip level selection will be different in coarse shaded regions, because derivatives are larger when using larger pixel blocks (the samples are farther away from each other). For me personally it doesn’t matter, because the result is blocky anyway, and VRS should be applied in places where users are less likely to notice. I am not sure how I would handle it with the Tier 2 features, but the per draw call rate selection could be balanced by setting a lower mip LOD bias for the samplers in the draw call when coarse shading is selected.
  • Although full resolution particles with VRS can retain more correct composition with the depth buffer than off screen particles, if we are rendering soft particles (blending computed in the shader from the difference between the linear depth buffer and the particle plane), the soft regions where no hardware depth test happens will produce some blockiness:
There are particles in the foreground that are not depth tested and instead blended in the shader, which causes blockiness with VRS enabled (left)
  • Classification
    There are many more aspects to consider when classifying tiles for the image based shading rate selection. Right now, the simple thing to try was to select increasingly coarse shading rates with increasing minimum tile velocity. Other things to consider that I can think of or have heard of: depth of field focus, depth discontinuities, visible surface detail. All of these would most likely be fed from the previous frame and reprojected with the current camera matrices. Tweaking and trying all of these will have to wait for me, and it will probably depend on the kind of game/experience one is making. Strategy games will likely not care about motion blur, unlike racing games.

Performance

Enabling VRS gets me a significant performance boost, especially when applied to large geometry, such as the floor in Sponza (which also uses an expensive parallax occlusion mapping shader for displacement mapping), or the large billboard particles that overlap and use an expensive lighting shader. Some performance results using an RTX 2060 at 4K resolution:

  • Classification:
    0.18ms – from velocity buffer only
  • Forward rendering:
    5.6ms – stationary camera (full res shading)
    1.8ms – moving camera (variable rate shading)
    4ms – only floor (with parallax occlusion mapping) set to constant 4×4 shading rate
  • Motion blur:
    0.75ms – stationary camera (plus curtain moving on screen)
    3.6ms – moving camera
left: visualizing variable shading rate
right: motion blur amount
Motion blur increases cost when blur amount increases, but VRS reduces cost at the same time
  • Billboard particle system (large particles close to camera)
    4ms – unlit, full resolution shading
    3ms – unlit, 4×4 shading rate
    24.7ms – shaded, full resolution shading
    3.4ms – shaded, 4×4 shading rate
particle performance test – very large particles on screen, overlapping, sampling shadow map and lighting calculation

Thanks for reading; you can read about VRS in more detail in the DX12 specs.

As Philip Hammer pointed out on Twitter, the Nvidia VRS extension is also available in Vulkan, OpenGL and DX11.

UPDATE:

Vulkan now has a cross vendor extension for variable rate shading, called VK_KHR_fragment_shading_rate. This is somewhat different from the DX12 specs, as the shading rate image needs to be a render pass attachment instead of a separate binding. This differs from the older VK_NV_shading_rate_image extension, which was closer to DX12 in that regard; the new one follows a more Vulkan-like approach. For an example implementation, you can look at Wicked Engine’s Vulkan interface.

Tile-based optimization for post processing

One way to optimize heavy post processing shaders is to determine which parts of the screen could use a simpler version. The simplest form of this is to use branching in the shader code to early exit or to switch to a variant with a reduced sample count or fewer computations. This comes with the downside that even the parts where the early exit occurs must allocate as many hardware resources (registers or groupshared memory) as the heaviest path. Also, branching in the shader code can be expensive when the branch condition is not uniform (for example, constant buffer values or SV_GroupID are uniform, but texture coordinates or SV_DispatchThreadID are not), because multiple threads can potentially branch differently and cause divergence in execution and additional instructions.

Instead of branching, we can consider having multiple different shaders with different complexities and using the DispatchIndirect functionality of DX11+ graphics APIs. The idea is to first determine which parts of the screen require which kind of complexity, and this is done per tile. A tile could be, for example, 32×32 or 16×16 pixels in size, but it could depend on the kind of post process. What do I mean by complexity? Let’s take depth of field for example. Depth of field is used to focus on an object and make it sharp, while making the foreground and background blurry. The part of the screen that will appear completely sharp can use an early exit shader, without doing any blurring. The other parts of the screen will instead perform a blur. For a more complex depth of field effect, such as the “scatter-as-you-gather” depth of field described at Siggraph 2014 by Jorge Jimenez (such a great presentation!), the blur can be separated into two kinds: a simple blur can be used where the whole tile contains similar CoC (Circle of Confusion – the amount of blur, basically), and a more expensive weighted blur will be used in the tiles where there are vastly different CoC values. With this knowledge we can design a prepass shader that classifies tiles by how complex a shader they require. After that, we just DispatchIndirect multiple times back to back – one dispatch for each complexity, using a different shader. This requires a lot of setup work, so it’s a good idea to leave it as a last step in the optimization process.

The implementation will consist of the following steps:

1) Tile classification.

Usually you are already gathering some per tile information for these kinds of post processes. For example, for the depth of field I put the classification in the same shader that computes the min-max CoC per tile. This shader writes to 3 tile lists in my case:

  • early exit tiles (when tile maximum CoC is small enough)
  • cheap tiles (maxCoC – minCoC is small enough)
  • expensive tiles (everything else)

Take a look at this example:

Focusing on the character face…
(The cute model is made by Sketchfab user woopoodle)
The face is classified as early exit (blue), while parts of the geometry that are facing the camera are cheap (green). The rest is expensive (red), because the CoC range is large
Focusing on the statue in the background…
The lion head in the back is early exit (blue, barely visible), and there are a lot of cheap tiles (green) because the CoC is capped to a maximum value, while the character silhouette is expensive (red) because it contains both focused and blurred pixels (or just a high CoC difference)

For motion blur, we can also do similar things, but using min-max of velocity magnitudes.
For screen space reflection, we could classify by reflectivity and roughness.
For tiled deferred shading, we can classify tiles for different material types. Uncharted 4 did this as you can find here: http://advances.realtimerendering.com/s2016/index.html
Possibly other techniques can benefit as well.
The tile classification shader is implemented via stream compaction, like this:

static const uint POSTPROCESS_BLOCKSIZE = 8;

static const uint TILE_STATISTICS_OFFSET_EARLYEXIT = 0;
static const uint TILE_STATISTICS_OFFSET_CHEAP = TILE_STATISTICS_OFFSET_EARLYEXIT + 4;
static const uint TILE_STATISTICS_OFFSET_EXPENSIVE = TILE_STATISTICS_OFFSET_CHEAP + 4;

RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

[numthreads(POSTPROCESS_BLOCKSIZE, POSTPROCESS_BLOCKSIZE, 1)]
void main(uint3 DTid : SV_DispatchThreadID) // this runs one thread per tile
{
  // ...
  const uint tile = (DTid.x & 0xFFFF) | ((DTid.y & 0xFFFF) << 16); // pack current 2D tile index to uint

  uint prevCount;
  if (max_coc < 0.4f)
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EARLYEXIT, 1, prevCount);
    tiles_earlyexit[prevCount] = tile;
  }
  else if (abs(max_coc - min_coc) < 0.2f)
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_CHEAP, 1, prevCount);
    tiles_cheap[prevCount] = tile;
  }
  else
  {
    tile_statistics.InterlockedAdd(TILE_STATISTICS_OFFSET_EXPENSIVE, 1, prevCount);
    tiles_expensive[prevCount] = tile;
  }
}

2) Kick indirect jobs.

If your tile size is the same as the POSTPROCESS_BLOCKSIZE that will render the post process, you could omit this step and just stream compact into the indirect argument buffers inside the classification shader itself. But in my case I am using 32×32 pixel tiles, while the thread count of the compute shaders is 8×8. So the “kick jobs” shader will compute the actual dispatch count and write the indirect argument buffers. This shader is also responsible for resetting the counts for the next frame. It uses only 1 thread:

static const uint DEPTHOFFIELD_TILESIZE = 32;

static const uint INDIRECT_OFFSET_EARLYEXIT = TILE_STATISTICS_OFFSET_EXPENSIVE + 4;
static const uint INDIRECT_OFFSET_CHEAP = INDIRECT_OFFSET_EARLYEXIT + 4 * 3;
static const uint INDIRECT_OFFSET_EXPENSIVE = INDIRECT_OFFSET_CHEAP + 4 * 3;

#define sqr(a)		((a)*(a))

RWByteAddressBuffer tile_statistics;
RWStructuredBuffer<uint> tiles_earlyexit;
RWStructuredBuffer<uint> tiles_cheap;
RWStructuredBuffer<uint> tiles_expensive;

[numthreads(1, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
  // Load statistics:
  const uint earlyexit_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EARLYEXIT);
  const uint cheap_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_CHEAP);
  const uint expensive_count = tile_statistics.Load(TILE_STATISTICS_OFFSET_EXPENSIVE);

  // Reset counters:
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EARLYEXIT, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_CHEAP, 0);
  tile_statistics.Store(TILE_STATISTICS_OFFSET_EXPENSIVE, 0);

  // Create indirect dispatch arguments:
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE); // for all tiles, we will replicate this amount of work
  tile_statistics.Store3(INDIRECT_OFFSET_EARLYEXIT, uint3(earlyexit_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_CHEAP, uint3(cheap_count * tile_replicate, 1, 1));
  tile_statistics.Store3(INDIRECT_OFFSET_EXPENSIVE, uint3(expensive_count * tile_replicate, 1, 1));
}

Note that the tile_statistics buffer will also be used as the indirect argument buffer for DispatchIndirect later, just at different offsets into the buffer. The tile_replicate value is the key to allowing a different tile size than the thread count of the post processing shaders (POSTPROCESS_BLOCKSIZE): essentially we dispatch multiple thread groups per tile to account for this difference. However, the tile size should be evenly divisible by POSTPROCESS_BLOCKSIZE to keep the code simple.

3) Use DispatchIndirect

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN_EARLYEXIT]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EARLYEXIT);

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN_CHEAP]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_CHEAP);

BindComputeShader(&computeShaders[CSTYPE_DEPTHOFFIELD_MAIN]);
DispatchIndirect(&buffer_tile_statistics, INDIRECT_OFFSET_EXPENSIVE);

Note that if you are using DX12 or Vulkan, you don’t even need to synchronize between these indirect executions, because they touch mutually exclusive parts of the screen. Unfortunately, DX11 will always wait for a compute shader to finish before starting the next one, which is a slight inefficiency in this case.

4) Execute post process

You will need to determine which tile and which pixel you are currently shading. I will use “tile” for the 32 pixel wide large tile that was classified, and “subtile” for the 8 pixel wide small tile that corresponds to the thread group size. You can see what I mean in this drawing:

It is also called “Tileception”

So first we read from the corresponding tile list (early exit shader reads from early exit tile list, cheap shader from cheap tile list, and so on…) and unpack the tile coordinate like this:

// flattened array index to 2D array index
inline uint2 unflatten2D(uint idx, uint dim)
{
  return uint2(idx % dim, idx / dim);
}

RWStructuredBuffer<uint> tiles;

[numthreads(POSTPROCESS_BLOCKSIZE * POSTPROCESS_BLOCKSIZE, 1, 1)]
void main(uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID)
{
  const uint tile_replicate = sqr(DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint tile_idx = Gid.x / tile_replicate;
  const uint tile_packed = tiles[tile_idx];
  const uint2 tile = uint2(tile_packed & 0xFFFF, (tile_packed >> 16) & 0xFFFF);

After we got the tile, we can continue to compute the pixel we want to shade:

  const uint subtile_idx = Gid.x % tile_replicate;
  const uint2 subtile = unflatten2D(subtile_idx, DEPTHOFFIELD_TILESIZE / POSTPROCESS_BLOCKSIZE);
  const uint2 subtile_upperleft = tile * DEPTHOFFIELD_TILESIZE + subtile * POSTPROCESS_BLOCKSIZE;
  const uint2 pixel = subtile_upperleft + unflatten2D(GTid.x, POSTPROCESS_BLOCKSIZE);

Note that we are running a one dimensional kernel instead of a two dimensional one. I think this simplifies the implementation because the tile lists are also one dimensional, but we need to use the unflatten2D helper function to convert the 1D array index to 2D when computing the pixel coordinate. Also note that this code adds instructions to the shader, but up until the last line those instructions are not divergent (because they rely on the uniform SV_GroupID semantic), so they can be considered cheap as they don’t use the most precious hardware resource, the VGPRs (Vector General Purpose Registers).

After we have the pixel coordinate, we can continue writing the regular post processing code as usual.

Interestingly, I have only seen a performance benefit from this optimization when the loops in the blurring shaders were unrolled. With dynamic loops, the shaders were not bottlenecked by register usage, but the loops themselves were not performing very well. Once unrolled, the tiling optimization brought an improvement of about 0.3 milliseconds in depth of field at 4K resolution on an Nvidia GTX 1070 GPU.

Further tests should probably be conducted, trying different tile sizes and classification threshold values. However, this is already a good technique to have in your graphics optimization toolbox.

You can find my implementation of this in WickedEngine for the motion blur and depth of field effects.

Let me know if you spot any mistakes or have feedback in the comments!