Should we get rid of Vertex Buffers?

TLDR: If the only platform you need to support is a recent AMD GPU or a console, then yes. 🙂

I am working on a “game engine” nowadays, focusing mainly on the rendering aspect. Lately I wanted to get rid of some APIs in my graphics wrapper to make it easier to use, because I just hate the excessive amount of state setup (I am using DirectX 11-style rendering commands). Looking at the current code, my observation is that there are many of those ugly vertex buffer and input layout management pieces that feel unnecessary now that such flexible memory operations are available to shader developers.

My idea is this: why should we declare input layouts and bind them before rendering, and then also bind the appropriate vertex buffers for the appropriate shaders? At least in my pipeline, if I already know the shader I am using, then I already know the required “input layout” of the “vertices” it should process, so why not just read them in the shader however I see fit right there?

For example, we already have access to ByteAddressBuffer, from which it is trivial to read vertex data (unless it is typed data, for example in RGBA8_UNORM format, but even that can be converted easily). We can save ourselves a call to IASetInputLayout, and instead of IASetVertexBuffers with a stride and offset, we can just bind the buffer with VSSetShaderResources and do the reading at the beginning of the vertex shader. I find it easier, more to the point, and potentially more efficient, because we avoid loading from typed buffers and make one less API call.
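A minimal sketch of what such a vertex shader could look like (the buffer layout and all names here are illustrative, not my engine’s actual ones):

ByteAddressBuffer vertexBuffer_POS : register(t0); // float4 per vertex, bound via VSSetShaderResources

cbuffer TransformCB : register(b0)
{
    float4x4 g_xTransform; // World*View*Projection
};

float4 main(uint vID : SV_VertexID) : SV_Position
{
    // Each position is a float4 = 16 bytes; Load4 returns the four raw 32-bit values:
    float4 pos = asfloat(vertexBuffer_POS.Load4(vID * 16));

    // A typed attribute like RGBA8_UNORM would need manual unpacking, for example:
    //   uint raw = vertexBuffer_COL.Load(vID * 4);
    //   float4 col = float4(raw & 0xFF, (raw >> 8) & 0xFF, (raw >> 16) & 0xFF, raw >> 24) / 255.0;

    return mul(pos, g_xTransform);
}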

So I began rewriting my scene rendering code to fetch the vertex buffers manually. First I used typed buffer loads, but that left me with subpar performance on NVIDIA GPUs (GTX 960 and GTX 1070). I posted on Gamedev.net, and others suggested that I should be using raw buffers (ByteAddressBuffer) instead. So I did. The results were practically identical to the typed loads on my GTX 1070 and on another GTX 960. The AMD RX 470, however, performed nearly exactly the same as it did with the regular vertex buffer path. It seems the GCN architecture abandoned the fixed function vertex fetch and uses regular memory operations instead.

Not long ago I had a look at the current generation of console development SDKs, and there it is even the recommended practice to read the vertex data yourself. They even provide API calls to “emulate” regular vertex buffer usage (at least on the PS4), though if you inspect the final compiled shaders, you will find vertex fetching code in them.

I assembled a little benchmark in my engine, on the Sponza scene, on an AMD and an NVIDIA GPU. Take a look:

Program: Wicked Engine Editor
API: DX11
Test scene: Sponza

– 3 shadow cascades (2D) – 3 scene render passes
– 1 spotlight shadow (2D) – 1 scene render pass
– 4 pointlight shadows (Cubemap) – 4 scene render passes
– Z prepass – 1 scene render pass
– Opaque pass – 1 scene render pass

Timing method: DX11 timestamp queries
Methods:

– InputLayout: The default hardware vertex buffer usage with CPU side input layout declarations. The instance buffers are bound as vertex buffers with each render call.
– CustomFetch (typed buffer): Vertex buffers are bound as shader resource views with the DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as structured buffers holding a 4×4 matrix each.
– CustomFetch (RAW buffer 1): Vertex buffers are bound as shader resource views on buffers created with the D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS misc flag (creation is sketched after the buffer lists below). In the shader, the buffers are addressed by byte offsets from the beginning of the buffer. Instance buffers are bound as structured buffers holding a 4×4 matrix each.
– CustomFetch (RAW buffer 2): Even the instancing information is retrieved from raw buffers instead of structured buffers.

ShadowPass and ZPrepass use at most 3 buffers:

– position (float4)
– UV (float4) // only for alpha tested geometry
– instance buffer

OpaquePass uses 6 buffers:

– position (float4)
– normal (float4)
– UV (float4)
– previous frame position VB (float4)
– instance buffer (float4x4)
– previous frame instance buffer (float4x3)
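For reference, a vertex buffer that supports both the regular and the raw-view path could be created like this on the C++ side (a sketch; error handling is omitted and all variable names are illustrative):

// Buffer usable both as a regular vertex buffer and as a raw shader resource:
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = vertexCount * sizeof(XMFLOAT4);
desc.Usage = D3D11_USAGE_IMMUTABLE;
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_SHADER_RESOURCE;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;
D3D11_SUBRESOURCE_DATA initData = { vertexData, 0, 0 };
device->CreateBuffer(&desc, &initData, &vertexBuffer);

// A raw view must use R32_TYPELESS with the BUFFEREX RAW flag:
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R32_TYPELESS;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFEREX;
srvDesc.BufferEx.FirstElement = 0;
srvDesc.BufferEx.NumElements = vertexCount * 4; // number of 32-bit elements
srvDesc.BufferEx.Flags = D3D11_BUFFEREX_SRV_FLAG_RAW;
device->CreateShaderResourceView(vertexBuffer, &srvDesc, &vertexBufferSRV);

// Bind for custom fetching instead of IASetVertexBuffers:
context->VSSetShaderResources(0, 1, &vertexBufferSRV);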

RESULTS:

GPU             Method                       ShadowPass   ZPrepass   OpaquePass   All GPU
NVIDIA GTX 960  InputLayout                  4.52 ms      0.37 ms    6.12 ms      15.68 ms
NVIDIA GTX 960  CustomFetch (typed buffer)   18.89 ms     1.31 ms    8.68 ms      33.58 ms
NVIDIA GTX 960  CustomFetch (RAW buffer 1)   18.29 ms     1.35 ms    8.62 ms      33.03 ms
NVIDIA GTX 960  CustomFetch (RAW buffer 2)   18.42 ms     1.32 ms    8.61 ms      33.18 ms
AMD RX 470      InputLayout                  7.43 ms      0.29 ms    3.06 ms      14.01 ms
AMD RX 470      CustomFetch (typed buffer)   7.41 ms      0.31 ms    3.12 ms      14.08 ms
AMD RX 470      CustomFetch (RAW buffer 1)   7.50 ms      0.29 ms    3.07 ms      14.09 ms
AMD RX 470      CustomFetch (RAW buffer 2)   7.56 ms      0.28 ms    3.09 ms      14.15 ms

Sadly, it seems that we cannot get rid of the vertex buffer/input layout APIs of DX11 when developing for the PC platform, because NVIDIA GPUs perform much worse with this method of custom vertex fetching. But what about mobile platforms? I have a Windows Phone build of Wicked Engine and I want to test it on a Snapdragon 808 device, but that seems like a bit of extra work to set up, so I will probably do it later. I am already somewhat disappointed though, because my engine is designed for PC-like high performance usage, so the mobile setup will have to wait a bit.
So the final note: if current gen consoles are your only platform, you can fetch your vertex data by hand with no problem, and probably even more optimally, by bypassing the typed conversions or some other magic. If you are developing on PC, you have to keep the vertex buffer APIs intact for now, which can be a pain, as it requires a more ambiguous shader syntax. And why in the hell we should have to declare strings (e.g. TEXCOORD4) in the input layout is completely beyond me, and annoying as hell.
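For reference, this is the kind of boilerplate, semantic strings included, that the custom fetch path would have eliminated (a made-up two-element example):

// The dreaded semantic strings live in the D3D11 input layout declaration:
D3D11_INPUT_ELEMENT_DESC layout[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 4, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
device->CreateInputLayout(layout, _countof(layout), vsBytecode, vsBytecodeSize, &inputLayout);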

How to Resolve an MSAA DepthBuffer

If you want to implement MSAA (multisample antialiasing) rendering, you need to render into multisampled render targets. When you want to read an antialiased render target as a shader resource, you first need to resolve it. Resolving means copying it to a non-multisampled texture while averaging the subsamples (in D3D11 it is performed by calling ResolveSubresource on the device context). You quickly find out that it doesn’t work that way for a depth buffer.
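For a color render target, the whole resolve is this single call (a sketch; the texture variables and the format are illustrative):

// Averages the subsamples of the MSAA texture into a regular texture of the same format:
context->ResolveSubresource(resolvedTexture, 0, msaaTexture, 0, DXGI_FORMAT_R16G16B16A16_FLOAT);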

If you create a texture with D3D11_BIND_DEPTHSTENCIL and later try to resolve it, the D3D11 debug layer throws an error telling you that you can’t do that. You must do the resolve by hand in a shader.

I chose a compute shader to do the job, because there is less state setup involved. I do a min operation on the depth samples while reading them, to get the sample closest to the camera. I think most applications want this, but you could also take the 0th sample, or the maximum, depending on the computation’s needs.

Texture2DMS<float> input : register(t0);
RWTexture2D<float> output : register(u0);

[numthreads(16, 16, 1)]
void main(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint2 dim;
    uint sampleCount;
    input.GetDimensions(dim.x, dim.y, sampleCount);

    // Discard threads which fall outside the texture:
    if (dispatchThreadId.x >= dim.x || dispatchThreadId.y >= dim.y)
    {
        return;
    }

    // Min of all subsamples = the sample closest to the camera:
    float result = 1;
    for (uint i = 0; i < sampleCount; ++i)
    {
        result = min(result, input.Load(dispatchThreadId.xy, i).r);
    }

    output[dispatchThreadId.xy] = result;
}

I call this compute shader like this:

Dispatch(ceil(screenWidth / 16.0f), ceil(screenHeight / 16.0f), 1)
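In full, the surrounding C++ code could look roughly like this (a sketch; all resource and shader names are illustrative, and note that the depth texture must have been created with a typeless format such as R32_TYPELESS so it can have both a depth stencil view and a shader resource view):

context->CSSetShader(resolveDepthCS, nullptr, 0);
context->CSSetShaderResources(0, 1, &msaaDepthSRV); // the Texture2DMS<float> view
context->CSSetUnorderedAccessViews(0, 1, &resolvedDepthUAV, nullptr);
context->Dispatch((UINT)ceilf(screenWidth / 16.0f), (UINT)ceilf(screenHeight / 16.0f), 1);

// Unbind the UAV so the resolved texture can be read as a shader resource afterwards:
ID3D11UnorderedAccessView* nullUAV = nullptr;
context->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);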

That’s the simplest shader I could come up with: it just loops over all the samples and does a min operation on them.

When dispatching a compute shader with parameters like this, the dispatchThreadID gives us a direct pixel coordinate. Because the resolution might not be divisible by the thread count, we should make sure to discard the out-of-bounds texture accesses (hence the early return in the shader).

It could also be done with a pixel shader, but I wanted to avoid the state setup of that. For a pixel shader, we would need to bind rasterizer, depth-stencil and blend states, and even input layouts, vertex buffers or primitive topologies, unless we abuse the immediate constant buffer. I want to avoid state setup whenever possible, because it increases CPU overhead, and we can do better here.

However, I’ve heard that running a compute shader in the middle of a rasterization pipeline can incur additional pipeline overhead; I’ve yet to witness it (comment if you can prove it).

If I wanted to do a custom resolve for another type of texture, I would keep the shader as it is and only change the min operation to another one, for example an average, or a max, etc.

That is all, I wanted to keep this fairly short.

Abuse the immediate constant buffer!

Very often, I need to draw simple geometries, like cubes, and I want to do the minimal amount of graphics state setup. With this technique, you don’t have to set up a vertex buffer or input layout, which means we don’t have to write the boilerplate resource creation code for them, nor call the binding code, which also lightens the API overhead.

An immediate constant buffer differs from a regular constant buffer in a few aspects:

  • There is a reserved constant buffer slot for them, and there can only be one of them at a time.
  • They are created automatically from the static const variables in your HLSL code.
  • They cannot be updated from the API.

So when I declare a vertex array inside a shader, for example, like this:

static const float4 CUBE[] = {
    float4(-1.0,1.0,1.0,1.0),
    float4(-1.0,-1.0,1.0,1.0),
    float4(-1.0,-1.0,-1.0,1.0),
    float4(1.0,1.0,1.0,1.0),
    float4(1.0,-1.0,1.0,1.0),
    float4(-1.0,-1.0,1.0,1.0),
    float4(1.0,1.0,-1.0,1.0),
    float4(1.0,-1.0,-1.0,1.0),
    float4(1.0,-1.0,1.0,1.0),
    float4(-1.0,1.0,-1.0,1.0),
    float4(-1.0,-1.0,-1.0,1.0),
    float4(1.0,-1.0,-1.0,1.0),
    float4(-1.0,-1.0,1.0,1.0),
    float4(1.0,-1.0,1.0,1.0),
    float4(1.0,-1.0,-1.0,1.0),
    float4(1.0,1.0,1.0,1.0),
    float4(-1.0,1.0,1.0,1.0),
    float4(-1.0,1.0,-1.0,1.0),
    float4(-1.0,1.0,-1.0,1.0),
    float4(-1.0,1.0,1.0,1.0),
    float4(-1.0,-1.0,-1.0,1.0),
    float4(-1.0,1.0,1.0,1.0),
    float4(1.0,1.0,1.0,1.0),
    float4(-1.0,-1.0,1.0,1.0),
    float4(1.0,1.0,1.0,1.0),
    float4(1.0,1.0,-1.0,1.0),
    float4(1.0,-1.0,1.0,1.0),
    float4(1.0,1.0,-1.0,1.0),
    float4(-1.0,1.0,-1.0,1.0),
    float4(1.0,-1.0,-1.0,1.0),
    float4(-1.0,-1.0,-1.0,1.0),
    float4(-1.0,-1.0,1.0,1.0),
    float4(1.0,-1.0,-1.0,1.0),
    float4(1.0,1.0,-1.0,1.0),
    float4(1.0,1.0,1.0,1.0),
    float4(-1.0,1.0,-1.0,1.0),
};

…and if I want to draw this cube, then the simplest vertex shader should look like this:

float4 main(uint vID : SV_VERTEXID) : SV_Position
{
    return mul(CUBE[vID], g_xTransform);
}

(where g_xTransform is the World*View*Projection matrix from a regular constant buffer)

I would then call Draw from the DX11 API with a vertex count of 36, because that is the length of the CUBE vertex array. The shader automatically gets the SV_VERTEXID semantic from the input assembler, which directly indexes into the vertex array. I find this technique very clean both on the C++ side and the shader side, so I use it very frequently.
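For illustration, the whole C++ side of such a draw could look like this (a sketch; the shader and constant buffer objects are placeholder names):

context->IASetInputLayout(nullptr); // no input layout: the VS reads nothing from the input assembler
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
context->VSSetShader(cubeVS, nullptr, 0);
context->VSSetConstantBuffers(0, 1, &transformCB); // supplies g_xTransform
context->PSSetShader(cubePS, nullptr, 0);
context->Draw(36, 0); // 36 = length of the CUBE array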

A few example use-cases:

  • Deferred light geometries
  • Light volume geometries
  • Occlusion culling occludees
  • Decals
  • Skybox/skysphere
  • Light probe debug geometries

You can build similar vertex arrays for other simple meshes the same way.

That’s it, cheers!

Smooth Lens Flare in the Geometry Shader

This is a historical feature of Wicked Engine, meaning it was implemented a few years ago, but at the time it was a big step for me.


I wanted to implement simple textured lens flares, but at the time all I could find was based on occlusion queries to determine whether a lens flare should be visible or not. I needed a simpler solution. I was already using the geometry shader for billboard particles, so I wanted to make further use of it here. I also wanted the flare to transition smoothly from fully visible to invisible, without popping when the light source goes behind an occluder. This is also my first blog post, so I wanted to start with something simple.

The idea is that for a light source emitting a lens flare on the screen, I don’t check its visibility with an occlusion query, but by drawing a single vertex for it (one for each flare). The vertex goes through a pass-through vertex shader, then arrives at the geometry shader stage, where occlusion is detected by checking the light source against the scene’s depth buffer. A simple solution is comparing the light source’s screen space Z value against the depth buffer value at its XY position: if the pixel is not occluded, the flare is visible, otherwise it is not. This will not yield smooth results though. It could be enough in cases where the geometry is predictable, like buildings for example, but it looks extremely cheap when vegetation occludes the flare, because vegetation is full of small holes and can sway in the wind, making the flare flicker.

To smooth out the popping, I use the same technique that is used for PCF shadow softening: check all the depth values in the current sample’s neighborhood, then average them to measure the occlusion. You get the opacity value by dividing the number of unoccluded samples by the total number of samples taken.

If there is at least one value in the surroundings which is not occluded (opacity > 0), then I spawn the flare billboards with the corresponding textures.

Prior to the shader, I project the light’s world position onto the screen with the appropriate view-projection matrix, and send the projected light position to the shader.
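The CPU side projection could look something like this with DirectXMath (a sketch; only xSunPos corresponds to the constant buffer value the shader reads, everything else is an illustrative name):

// XMVector3TransformCoord applies the view-projection matrix and divides by w:
XMVECTOR clipPos = XMVector3TransformCoord(XMLoadFloat3(&lightWorldPos), viewProjection);
XMFLOAT4 xSunPos;
XMStoreFloat4(&xSunPos, clipPos);
// Remap xy from [-1,1] clip space to [0,1] UV space (y flipped),
// because the shader samples the depth map directly at xSunPos.xy:
xSunPos.x = xSunPos.x * 0.5f + 0.5f;
xSunPos.y = xSunPos.y * -0.5f + 0.5f;
// xSunPos.z stays the non-linear depth, used as the comparison reference value.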

Here comes the geometry shader:


// constant buffer
CBUFFER(LensFlareCB, CBSLOT_OTHER_LENSFLARE)
{
    float4 xSunPos; // light position (projected)
    float4 xScreen; // screen dimensions
};

struct InVert
{
    float4 pos : SV_POSITION;
    nointerpolation uint vid : VERTEXID;
};

struct VertextoPixel
{
    float4 pos : SV_POSITION;
    float3 texPos : TEXCOORD0; // texture coordinates (xy) + offset (z)
    nointerpolation uint sel : TEXCOORD1; // texture selector
    nointerpolation float4 opa : TEXCOORD2; // opacity + padding
};

// Append a screen space quad to the output stream:
inline void append(inout TriangleStream<VertextoPixel> triStream, VertextoPixel p1, uint selector, float2 posMod, float2 size)
{
    float2 pos = (xSunPos.xy - 0.5) * float2(2, -2);
    float2 moddedPos = pos * posMod;
    float dis = distance(pos, moddedPos);

    p1.pos.xy = moddedPos + float2(-size.x, -size.y);
    p1.texPos.z = dis;
    p1.sel = selector;
    p1.texPos.xy = float2(0, 0);
    triStream.Append(p1);

    p1.pos.xy = moddedPos + float2(-size.x, size.y);
    p1.texPos.xy = float2(0, 1);
    triStream.Append(p1);

    p1.pos.xy = moddedPos + float2(size.x, -size.y);
    p1.texPos.xy = float2(1, 0);
    triStream.Append(p1);

    p1.pos.xy = moddedPos + float2(size.x, size.y);
    p1.texPos.xy = float2(1, 1);
    triStream.Append(p1);
}

// pre-baked offsets
// These values work well for me, but should be tweakable
static const float mods[] = { 1, 0.55, 0.4, 0.1, -0.1, -0.3, -0.5 };

[maxvertexcount(4)]
void main(point InVert p[1], inout TriangleStream<VertextoPixel> triStream)
{
    VertextoPixel p1 = (VertextoPixel)0;

    // Determine flare size from texture dimensions:
    float2 flareSize = float2(256, 256);
    switch (p[0].vid)
    {
    case 0: texture_0.GetDimensions(flareSize.x, flareSize.y); break;
    case 1: texture_1.GetDimensions(flareSize.x, flareSize.y); break;
    case 2: texture_2.GetDimensions(flareSize.x, flareSize.y); break;
    case 3: texture_3.GetDimensions(flareSize.x, flareSize.y); break;
    case 4: texture_4.GetDimensions(flareSize.x, flareSize.y); break;
    case 5: texture_5.GetDimensions(flareSize.x, flareSize.y); break;
    case 6: texture_6.GetDimensions(flareSize.x, flareSize.y); break;
    default: break;
    }

    // Determine depth map dimensions (could also be the screen dimensions from the constant buffer):
    float2 depthMapSize;
    texture_depth.GetDimensions(depthMapSize.x, depthMapSize.y);
    flareSize /= depthMapSize;

    // Determine the flare opacity:
    // These values work well for me, but should be tweakable
    const float2 step = 1.0f / (depthMapSize * xSunPos.z);
    const float2 range = 10.5f * step;
    float samples = 0.0f;
    float accdepth = 0.0f;
    for (float y = -range.y; y <= range.y; y += step.y)
    {
        for (float x = -range.x; x <= range.x; x += step.x)
        {
            samples += 1.0f;
            // texture_depth holds non-linear depth (but it could work for linear too with a linear reference value).
            // SampleCmpLevelZero makes the comparison with a LESS_EQUAL comparison sampler.
            // It compares the reference value (xSunPos.z) to the depth map values:
            //   returns 1.0 if all samples in the bilinear kernel are greater than or equal to the reference value (not occluded),
            //   returns 0.0 if all samples are less than the reference value (occluded),
            //   and can return in-between values based on the bilinear filtering.
            accdepth += texture_depth.SampleCmpLevelZero(sampler_cmp_depth, xSunPos.xy + float2(x, y), xSunPos.z).r;
        }
    }
    accdepth /= samples;

    p1.pos = float4(0, 0, 0, 1);
    p1.opa = float4(accdepth, 0, 0, 0);

    // Make a new flare if it is at least partially visible:
    if (accdepth > 0)
    {
        append(triStream, p1, p[0].vid, mods[p[0].vid], flareSize);
    }
}

The pixel shader just samples the appropriate texture with the texture coordinates:


struct VertextoPixel
{
    float4 pos : SV_POSITION;
    float3 texPos : TEXCOORD0;
    nointerpolation uint sel : TEXCOORD1;
    nointerpolation float4 opa : TEXCOORD2;
};

float4 main(VertextoPixel PSIn) : SV_TARGET
{
    float4 color = 0;

    // todo: texture atlas or array
    switch (PSIn.sel)
    {
    case 0: color = texture_0.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 1: color = texture_1.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 2: color = texture_2.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 3: color = texture_3.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 4: color = texture_4.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 5: color = texture_5.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    case 6: color = texture_6.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0); break;
    default: break;
    }

    color *= 1.1 - saturate(PSIn.texPos.z);
    color *= PSIn.opa.x;
    return color;
}

That’s it, I hope it was useful. 🙂

Welcome brave developer!

This is a blog containing development insights into my game engine, Wicked Engine. Feel free to rip off any code, example or technique from here, just as you could from the open source engine itself: https://github.com/turanszkij/WickedEngine

I want to post about historical features as well as new ones. I try to pick topics that are sparsely blogged about on the web, or that I just feel like sharing. I don’t intend to write complete tutorials, but to share ideas instead, while providing minimalistic code samples.

Happy coding!