Should we get rid of Vertex Buffers?

TL;DR: If the only platform you have to support is a recent AMD GPU or a console, then yes. 🙂

I am working on a “game engine” nowadays, mainly focusing on the rendering aspect. Lately I wanted to get rid of some APIs in my graphics wrapper to make it easier to use, because I just hate the excessive amount of state setup (I am using DirectX 11-like rendering commands). Looking at the current code, my observation is that there are many of those ugly vertex buffer and input layout management pieces that just feel unnecessary when such flexible memory operations are available to shader developers nowadays.

My idea is this: why should we declare input layouts, bind them before rendering, and then also bind the appropriate vertex buffers to the appropriate shaders before rendering? At least in my pipeline, if I already know which shader I am using, I already know the required “input layout” of the “vertices” it should process, so why not just read them in the shader as I see fit right there?

For example, we already have access to ByteAddressBuffer, from which it is trivial to read vertex data (unless it is typed data, for example in RGBA8_UNORM format, but even that can be converted easily). We can save ourselves a call to IASetInputLayout, and instead of IASetVertexBuffers with a stride and offset, we can just bind the buffer with VSSetShaderResources and do the reading at the beginning of the vertex shader. I find it easier, more to the point, and also potentially more efficient, because we avoid loading from typed buffers and make one less API call.
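To make this concrete, here is a minimal sketch of what such a vertex shader could look like; the interleaved layout, stride, register slot and names are assumptions for illustration, not my engine's actual code:

// The vertex buffer is bound with VSSetShaderResources instead of IASetVertexBuffers,
// so there is no input layout object at all; we address the raw bytes by hand.
ByteAddressBuffer vertexBuffer : register(t0); // assumed layout: float3 position + packed RGBA8 color

static const uint VERTEX_STRIDE = 16; // 12 bytes position + 4 bytes color

struct VSOut
{
    float4 pos   : SV_Position;
    float4 color : COLOR;
};

VSOut main(uint vertexID : SV_VertexID)
{
    const uint offset = vertexID * VERTEX_STRIDE;

    // Raw loads return uints; reinterpret the first 12 bytes as the float3 position:
    float3 position = asfloat(vertexBuffer.Load3(offset));

    // There is no hardware RGBA8_UNORM conversion on a raw load, so unpack it manually:
    uint packedColor = vertexBuffer.Load(offset + 12);
    float4 color = float4(
        (packedColor >> 0)  & 0xFF,
        (packedColor >> 8)  & 0xFF,
        (packedColor >> 16) & 0xFF,
        (packedColor >> 24) & 0xFF) / 255.0;

    VSOut output;
    output.pos = float4(position, 1); // world/view/projection transforms omitted for brevity
    output.color = color;
    return output;
}

On the CPU side the only per-draw binding this needs is VSSetShaderResources; no input layout object and no IASetVertexBuffers call.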

So I began rewriting my scene rendering code to fetch the vertex buffers manually. First I used typed buffer loads, but that left me with subpar performance on NVIDIA GPUs (GTX 960 and GTX 1070). I posted on Gamedev.net and others suggested that I should be using raw buffers (ByteAddressBuffer) instead. So I did. The results were practically identical on my GTX 1070 and on another GTX 960. The AMD RX 470, however, performed nearly exactly the same as it had before with the regular vertex buffer path. It seems the GCN architecture abandoned fixed-function vertex fetching and uses regular memory operations instead.
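For reference, the two flavors of custom fetch I tried look roughly like this in the shader (a sketch with made-up names and slots): the typed view carries a DXGI format and the hardware converts on load, while the raw view is just reinterpreted bytes.

// Typed buffer: the SRV has a DXGI format, the hardware performs the format conversion.
Buffer<float4> vb_pos_typed : register(t0);

// Raw buffer: created with D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS, addressed in bytes.
ByteAddressBuffer vb_pos_raw : register(t1);

float4 LoadPositionTyped(uint vertexID)
{
    return vb_pos_typed[vertexID];
}

float4 LoadPositionRaw(uint vertexID)
{
    return asfloat(vb_pos_raw.Load4(vertexID * 16)); // 16 bytes per float4 position
}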

Not long ago I had a look at the current generation of console development SDKs, and there it is even the recommended practice to read the vertex data yourself. They even provide API calls to “emulate” regular vertex buffer usage (at least on the PS4), but if you inspect the final compiled shaders, you will find vertex fetching code in them anyway.

I assembled a little benchmark in my engine with the Sponza scene on an AMD and an NVIDIA GPU; take a look (sorry for the formatting issues, I can barely use WordPress it seems):

Program: Wicked Engine Editor
API: DX11
Test scene: Sponza

– 3 shadow cascades (2D) – 3 scene render passes

– 1 spotlight shadow (2D) – 1 scene render pass

– 4 pointlight shadows (Cubemap) – 4 scene render passes

– Z prepass – 1 scene render pass

– Opaque pass – 1 scene render pass

Timing method: DX11 timestamp queries
Methods:

– InputLayout: The default hardware vertex buffer usage with CPU-side input layout declarations. The instance buffers are bound as vertex buffers with each render call.

– CustomFetch (typed buffer): Vertex buffers are bound as shader resource views with DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as Structured Buffers holding a 4×4 matrix each.

– CustomFetch (RAW buffer 1): Vertex buffers are bound as shader resource views with a MiscFlag of D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS. In the shader the buffers are addressed in byte offsets from the beginning of the buffer. Instance buffers are bound as Structured Buffers holding a 4×4 matrix each.

– CustomFetch (RAW buffer 2): Even the instancing information is retrieved from raw buffers instead of structured buffers.

ShadowPass and ZPrepass: These use at most 3 buffers:

– position (float4)

– UV (float4) // only for alpha tested

– instance buffer

OpaquePass: This uses 6 buffers (a sketch of the fetch code follows after this list):

– position (float4)

– normal (float4)

– UV (float4)

– previous frame position VB (float4)

– instance buffer (float4x4)

– previous frame instance buffer (float4x3)
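Here is a rough sketch of how the OpaquePass fetch could look with the RAW buffer 1 method: the per-vertex streams are raw SRVs, the per-instance matrices come from structured buffers. Register slots and names are illustrative only, not the engine's actual shaders.

ByteAddressBuffer vb_position : register(t0); // float4 per vertex
ByteAddressBuffer vb_normal   : register(t1); // float4 per vertex
ByteAddressBuffer vb_uv       : register(t2); // float4 per vertex
ByteAddressBuffer vb_prevpos  : register(t3); // float4 per vertex, previous frame

StructuredBuffer<float4x4> instanceBuffer     : register(t4); // world matrix per instance
StructuredBuffer<float4x3> prevInstanceBuffer : register(t5); // previous frame world matrix

struct VertexData
{
    float4   position;
    float4   normal;
    float4   uv;
    float4   prevPosition;
    float4x4 world;
    float4x3 prevWorld;
};

VertexData FetchVertex(uint vertexID, uint instanceID)
{
    const uint offset = vertexID * 16; // every stream holds one float4 per vertex

    VertexData v;
    v.position     = asfloat(vb_position.Load4(offset));
    v.normal       = asfloat(vb_normal.Load4(offset));
    v.uv           = asfloat(vb_uv.Load4(offset));
    v.prevPosition = asfloat(vb_prevpos.Load4(offset));
    v.world        = instanceBuffer[instanceID];
    v.prevWorld    = prevInstanceBuffer[instanceID];
    return v;
}

// In the vertex shader, SV_VertexID and SV_InstanceID drive the fetch:
// VertexData v = FetchVertex(vertexID, instanceID);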

RESULTS:

GPU              Method                       ShadowPass   ZPrepass   OpaquePass   All GPU
NVIDIA GTX 960   InputLayout                  4.52 ms      0.37 ms    6.12 ms      15.68 ms
NVIDIA GTX 960   CustomFetch (typed buffer)   18.89 ms     1.31 ms    8.68 ms      33.58 ms
NVIDIA GTX 960   CustomFetch (RAW buffer 1)   18.29 ms     1.35 ms    8.62 ms      33.03 ms
NVIDIA GTX 960   CustomFetch (RAW buffer 2)   18.42 ms     1.32 ms    8.61 ms      33.18 ms
AMD RX 470       InputLayout                  7.43 ms      0.29 ms    3.06 ms      14.01 ms
AMD RX 470       CustomFetch (typed buffer)   7.41 ms      0.31 ms    3.12 ms      14.08 ms
AMD RX 470       CustomFetch (RAW buffer 1)   7.50 ms      0.29 ms    3.07 ms      14.09 ms
AMD RX 470       CustomFetch (RAW buffer 2)   7.56 ms      0.28 ms    3.09 ms      14.15 ms

Sadly, it seems that we cannot get rid of the vertex buffer/input layout APIs of DX11 when developing for the PC platform, because NVIDIA GPUs perform much worse with this method of custom vertex fetching. But what about mobile platforms? I have a Windows Phone build of Wicked Engine, and I want to test it on the Snapdragon 808's GPU, but it seems like a bit of extra work to set it up on mobile, so I will probably do it later. I am already somewhat disappointed though, because my engine is designed for PC-like high performance usage, so the mobile setup will have to wait a bit.
So the final note: if current-gen consoles are your only platform, you can fetch your vertex data by hand with no problem, and probably even more optimally, by bypassing the typed conversions or some other magic. If you are developing on PC, you have to keep the vertex buffer APIs intact for now, which can be a pain, as it requires a more ambiguous shader syntax. And why on earth we should declare strings (e.g. TEXCOORD4) in the input layout is completely beyond me and annoying as hell.

