Area Lights


I am trying to get back into blogging. I thought writing about implementing area light rendering might help me with that.

If you are interested in the full source code, pull my implementetion from the Wicked Engine lighting shader. I won’t post it here, because I’d rather just talk about it.

A 2014 Siggraph presentation from Frostbite caught my attention for showcasing their research on real time area light rendering. When learning graphics programming from various tutorials, there is explanations for punctual light source rendering, like point, spot and directional lights. Even most games get away with using these simplistic light sources.

For rendering area lights, we need much more complicated light equations and more performance requirements for our shaders. Luckily, the above mentioned presentation came with a paper with all the shaders for diffuse light equations for spherical, disc, rectangular and tube light sources.

The code for specular lighting for these type of lights was not included in that paper, but it mentioned the “representative point method“. What this technique essentially does is that you keep the specular calculation, but change the light vector. The light vector was the vector pointing from the light position to the surface position. But for our lights, we are not interested in the reflection between the light’s center and the surface, but between the light “mesh” and the surface.

Representative point method

If we modify the light vector to point from the surface to the closest point on the light mesh to the reflection vector, then we can keep using our specular BRDF equation and we will get a nice result; the specular highlight will be in the shape of the light mesh (or somewhere close to it). It is important, that this is not a physically accurate model, but it is something nice which is still performant in real time.

My first intuition was just that we could trace the mesh with the reflection ray. Then our light vector (L) is the vector from the surface point (P) to the intersection point (I), so L=I-P. The problem is, what if there is no intersection? Then we won’t have a light vector to feed into our specular brdf. This way we are only getting hard cutoff reflections, and surface roughness won’t work because the reflections can’t be properly “blurred” on the edges where there is no trace hit.

The correct approach to this is, that we have to find the closest point on mesh from the reflection ray. If the trace succeeds, then our closest point is the hit point, if not, then we have to “rotate” the light vector to sucessfully trace the mesh. We don’t actually rotate, just find the closest point (C), and so our new light vector is: L=C-P.

See the image below (V = view vector, R = reflection vector):

representative_pointFor all our four different light types, we have to come up with the code to find the closest point.

  • Sphere light:
    • This one is simple: first, calculate the real reflection vector (R) and the old lightvector (L). Additional symbols: surface normal vector (N) and view vector (V).

      R = reflect(V, N);

      centerToRay = dot(L,R) * R – L;

      closestPoint = L + centerToRay * saturate(lightRadius / length(centerToRay));

  • Disc light:
    • The idea is, first trace the disc plane with the reflection vector, then just calculate the closest point to the sphere from the plane intersection point like for the sphere light type. Tracing a plane is trivial:

      distanceToPlane = dot(planeNormal, (planeOrigin – rayOrigin) / dot(planeNormal, rayDirection));

      planeIntersectionPoint = rayOrigin + rayDirection * distanceToPlane;

  • Rectangle light:
    • Now this is a bit more complicated. The algorithm I use consists of two paths: The first path is when the reflection ray could trace the rectangle. The second path is, when the trace didn’t succeed. In that case, we need to find the intersection with the plane of the rectangle, then find the closest point on one of the four edges of the rectangle to the plane intersection point.
    • For tracing the rectangle, I trace the two triangles that make up the rect and take the correct intersection if it exists. Tracing a triangle involves tracing the triangle plane, then deciding if we are inside the triangle. A, B, C are the triangle corner points.

      planeNormal = normalize(cross(B – A, C – B));

      planeOrigin = A;

      t = Trace_Plane(rayOrigin, rayDirection, planeOrigin, planeNormal);

      p = rayOrigin + rayDirection * tN1 = normalize(cross(B – A, p – B));

      N2 = normalize(cross(C – B, p – C));

      N3 = normalize(cross(A – C, p – A));d0 = dot(N1, N2);

      d1 = dot(N2, N3);

      intersects = (d0 > 0.99) AND (d1 > 0.99);

    • The other algorithm is finding the closest point on line segment from point. A and B are the line segment endpoints. C is the point on the plane.

      AB = B – A;

      t = dot(C – A, AB) / dot(AB, AB);

      closestPointOnSegment = A + saturate(t) * AB;

  • Tube light:
    • First, we should calculate the closest point on the tube line segment to R. Then just place a sphere on that point and do as we did for the sphere light (that is, calculate the closest point on sphere to the reflection ray R). Every algorithm is already described already to this point, so all that needs to be done is just put them together.

So what to do when you have the closest point on the light surface? You have to convert it to the new light vector: newLightVector = closestPoint – surfacePos.

When you have your new light vector, you can feed it to the specular brdf function and in the end you will get a nice specular highlight!


With the regular shadow mapping techniques, we can do shadows for area lights as well. Results are again not accurate, but get the job done. In Wicked Engine, I am only doing regular cube map shadows for area lights like I would do for point lights. I can not say I am happy with them, especially for long tube lights for example. In an other engine however, I have been experimenting with dual paraboloid shadow mapping for point lights. I recommend a single paraboloid shadow map for the disc and rectangle are lights in the light facing direction. These are better in my opinion than regular perspective shadow maps, because they distort very much for high field of views (these light types would require near 180 degrees of FOV).

For the sphere and tube light types I still recommend cubemap shadows.

Original sources:

Voxel-based Global Illumination


People are always asking me of the Voxel Global Illumination technique in Wicked Engine so I thought writing a blog about it would be a good idea.

There are several use cases of a voxel data structure. One interesting application is using it to calculate global illumination. There are a couple of techniques for that, too. I have chosen the voxel cone tracing approach, because I found it the most flexible one for dynamic scenes, but CryEngine for example, uses Light propagation volumes instead with a sparse voxel octree which has smaller memory footprint. The cone tracing technique works best with a regular voxel grid because we perform ray-marching against the data like with screen space reflections for example. A regular voxel grid consumes more memory, but it is faster to create (voxelize), and more cache efficient to traverse (ray-march).

So let’s break down this technique into pieces. I have to disclose this at the beginnning: We can do everything in this technique real time if we do everything on the GPU. First, we have our scene model with polygonal meshes. We need to convert it to a voxel representation. The voxel structure is a 3D texture which holds the direct illumination of the voxelized geometries in each pixel. There is an optional step here which I describe later. Once we have this, we can pre-integrate it by creating a mipmap chain for the resource. This is essential for cone tracing because we want to ray-march the texture with quadrilinear interpolation (sampling a 3D texture with min-mag-mip-linear filtering). We can then retrieve the bounced direct illumination in a final screen space cone tracing pass. The additional step in the middle is relevant if we want more bounces, because we can dispatch additional cone tracing compute shader passes for the whole structure (not in screen space).

The nice thing about this technique is that we can retrieve all sorts of effects. We have “free” ambient occlusion by default when doing this cone tracing, light bouncing, but we can retrieve reflections, refractions and shadows as well from this voxel structure with additional ray march steps. We can have a configurable amount of light bounces. Cone tracing code can be shared between the bouncing and querying shader and different types of rays as well. The entire thing remains fully on the GPU, the CPU is only responsible for command buffer generation.

Following this, I will describe the above steps in more detail. I will be using the DirectX 11 graphics API, but any modern API will probably do the job. You will definetly need a recent GPU for the most efficient implementation. This technique is targeted for PCs or the most recent consoles (Playstation 4 or Xbox One). It most likely can not run on mobile or handheld devices because of their limited hardware.

I think this is an advanced topic and I’d like to aim for experienced graphics programmers, so I won’t present code samples for the more trivial parts, but the whole implementation is available to anyone in Wicked Engine.

Part 1: Voxelization on the GPU

The most involving part is definetly the first one, the voxelization step. It involves making use of advanced graphics API features like geometry shaders, abandoning the output merger and writing into resources “by hand”. We can also make use of new hardware features like conservative rasterization and rasterizer ordered views, but we will implement them in the shaders as well.

The main trick is to be able to run this real time is that we need to parallelize the process well. For that, we will exploit the fixed function rasterization hardware, and we will get a pixel shader invocation for each voxel which will be rendered. We also do only a single render pass for every object.

We need to integrate the following pipeline to our scene rendering algorithm:

1.) Vertex shader

The voxelizing vertex shader needs to transform vertices into world space and pass through the attributes to the geometry shader stage. Or just do a pass through and transform to world space in the GS, doesn’t matter.

2.) Geometry shader

This will be responsible to select the best facing axis of each triangle received from the vertex shader. This is important because we want to voxelize each triangle once, on the axis it is best visible, otherwise we would get seams and bad looking results.

// select the greatest component of the face normal input[3] is the input array of three vertices
float3 facenormal = abs(input[0].nor + input[1].nor + input[2].nor);
 uint maxi = facenormal[1] > facenormal[0] ? 1 : 0;
 maxi = facenormal[2] > facenormal[maxi] ? 2 : maxi;

After we determined the dominant axis, we need to project to it orthogonally by swizzling the position’s xyz components, then setting the z component to 1 and scaling it to clip space.

for (uint i = 0; i < 3; ++i)
 // voxel space pos:
 output[i].pos = float4((input[i] - g_xWorld_VoxelRadianceDataCenter) / g_xWorld_VoxelRadianceDataSize, 1);

// Project onto dominant axis:
 if (maxi == 0)
 output[i] = output[i].pos.zyx;
 else if (maxi == 1)
 output[i] = output[i].pos.xzy;

// projected pos:
 output[i].pos.xy /= g_xWorld_VoxelRadianceDataRes;

output[i].pos.z = 1;

output[i].N = input[i].nor;
 output[i].tex = input[i].tex;
 output[i].P = input[i];
 output[i].instanceColor = input[i].instanceColor;

At the end, we could also expand our triangle a bit to be more conservative to avoid gaps. We could also just be setting a conservative rasterizer state if we have hardware support for it and avoid the expansion here.

// Conservative Rasterization setup:
 float2 side0N = normalize(output[1].pos.xy - output[0].pos.xy);
 float2 side1N = normalize(output[2].pos.xy - output[1].pos.xy);
 float2 side2N = normalize(output[0].pos.xy - output[2].pos.xy);
 const float texelSize = 1.0f / g_xWorld_VoxelRadianceDataRes;
 output[0].pos.xy += normalize(-side0N + side2N)*texelSize;
 output[1].pos.xy += normalize(side0N - side1N)*texelSize;
 output[2].pos.xy += normalize(side1N - side2N)*texelSize;

It is important to pass the vertices’ world position to the pixel shader, because we will use that directly to index into our voxel grid daa structure and write into it. We will also need texture coords and normals for correct diffuse color and lighting.

3.) Pixel shader

After the geometry shader, the rasterizer unit scheduled some pixel shader invocations for our voxels, so in the pixel shader we determine the color of the voxel and write it into our data structure. We probably need to sample our base texture of the surface and evaluate direct lighting which affects the fragment (the voxel). While evaluating the lighting, use a forward rendering approach, so iterate through the nearby lights for the fragment and do the light calculations for the diffuse part of the light. Leave the specular out of it, because we don’t care about the view dependant part now, we want to be able to query lighting from any direction anyway later. I recommend using a simplified lighting model, but try to keep it somewhat consistent with your main lighting model which is probably a physically based model (at least it is for me and you should also have one :P) and account for the energy loss caused by leaving out the specularity.

When you calculated the color of the voxel, write it out by using the following trick: I didn’t bind a render target for the render pass, but I have set an Unordered Access View by calling OMSetRenderTargetsAndUnorderedAccessViews(). So the shader returns nothing, but we write into our voxel grid in the shader code. My voxel grid is a RWStructuredBuffer here to be able to support atomic operations easily, but later it will be converted to a 3D texture for easier filtering and better cache utilization. The Structured buffer is a linear array of VoxelType of size gridDimensions X*Y*Z. VoxelType is a structure holding a 32 bit uint for the voxel color (packed HDR color with 0-255 RGB, an emissive multiplier in 7 bits and the last bit indicates if the voxel is empty or not). The structure also contains a normal vector packed into a uint. Our interpolated 3D world position comes in handy when determining the write position into the buffer, just truncate and flatten the interpolated world position which you reveived from the geometry shader. For writing the results, you must use atomic max operations on the voxel uints. You could be writing to a texture here without atomic operations, but using rasterizer ordered views, bt they don’t support volume resources, so a multi pass approach would be necessary for the individual slices of the texture.

An additional note: If you have generated shadow maps, you can use them in your lighting calculations here to get more proper illumination when cone tracing. If you don’t have shadow maps, you can even use the voxel grid to retrieve (soft) shadow information for the scene later.


If you got so far, you just voxelized the scene. You should write a debugger to visualize the results. I am using a naive approach which is maybe a bit slow, but gets the job done. I issue a Draw() command with a vertex count of voxel grid dimensions X*Y*Z, read my voxel grid in the vertex shader indexed by the SV_VertexID, then expand to a cube in the geometry shader if the voxel color is not empty (greater than zero). The pixel shader outputs the voxel color for each screen pixel covered.

Part 2: Filtering the data

We voxelized our scene into a linear array of voxels with nicely packed data. The packed data helped in the voxelization process, but it is no good for cone tracing, we need a texture which we can filter and sample. I have a compute shader which unpacks the voxel data, copies it into a 3D texture with RGBA16 format for HDR colors and finally it also clears the packed voxel data by filling it with zeroes. A nice effect would be not just writing the target texture, but intepolating with old values so that abrupt changes in lighting, or moving objects don’t cause much flickering. But we have to account for moving camera and offsetting the voxel grid. We could lerp intelligently with a nice algorithm, but I found that the easiest method is just “do not lerp when the voxel grid got offset” was good enough for me.

Then we generate a mip chain for the 3D texture. DX11 can do this automatically for us by calling GenerateMips() on the device context, but we can also do it in shaders if we want better quality than the default box filter. I experimented with gaussian filtering, but I couldn’t write one to be fast enough to be worthwhile, so I am using the default filter.

But what about the normals, because we saved them in the voxelization process? They are only needed when doing multiple light bounces or in more advanced voxel algorithms, like anisotropic voxelization.


Part 3: Cone tracing

We have the voxel scene ready for our needs, so let’s query it for information. To gather the global illumination for the scene, we have to run the cone tracing in screen space for every pixel on the screen once. This can happen in the forward rendering object shaders or against the gbuffer in a deferred renderer, when rendering a full screen quad, or in a compute shader. In forward rendering, we may lose some performance because of the worse thread utilization if we have many small triangles. A Z-prepass is an absolute must have if we are doing this in forward rendering. We don’t want to shade a pixel multiple times because this is a heavy computation.

For diffuse light bounces, we need the pixel’s surface normal and world position at minimum. From the world position, calculate the voxel grid coordinate, then shoot rays in the direction of the normal and around the normal in a hemisphere. But the ray should not start at the surface voxel, but the next voxel along the ray, so we don’t accumulate the current surface’s lighting. Begin ray marching, and each step sample your voxel from increasing mip levels, accumulate color and alpha and when alpha reaches 1, exit and divide the distance travelled. Do this for each ray, and in the end divide the accumulated result with the number of rays as well. Now you have light bounce information and ambient occlusion information as well, just add it to your diffuse light buffer.

Assembling the hemisphere: You can create a hemisphere on a surface by using a static array of precomputed randomized positions on a sphere and the surface normal. First, if you do a reflect(surfaceNormal, randomPointOnSphere), you get a random point on a sphere with variance added by the normal vector. This helps with banding as discrete precomputed points get modulated by surface normal. We still have a sphere, but we want the upper half of it, so check if a point goes below the “horizon” and force it to go to the other direction if it does:

bool belowHorizon = dot(surfaceNormal, randomPointOnSphere) < 0;

coneDirection = belowHorizon ? – coneDirection : coneDirection;

Avoid self-occlusion: So far, the method of my choice to avoid self occlusion is to start the cone tracing with offset from the surface by the normal direction and also the cone direction. If I don’t do this, then the cone starts off the surface and immediately samples its own voxel, so each surface would get its own contribution from the GI, which is not good. But if we start further off, then that means close by surfaces will not contribute to each other’s GI and there will be a visible disconnect in lighting. I imagine it would help to use anisotropic voxels, which means store a unique voxel for a few directions and only sample the voxels facing the opposite direction to the cone. This of course would require much additional memory to store.

Accumulating alpha: The correct way to accumulate alpha is a bit different to regular alpha blending:

float4 color = 0, alpha = 0;

// …

// And inside cone tracing loop:

float4 voxel = SampleVoxels().rgba;

float4 a = 1 – alpha;

color += a * voxel.rgb;

alpha += a * voxel.a;

As you can see, this is more like a front-to back blending. This is important, because otherwise we would receive a black staircase artefact on the edge of voxels, where the unfilled (black) regions with zero alpha would bleed into the result very aggressively.

Stepping the cone: When we step along the ray in voxel-size increments (ray-marching) in world space, we can retrieve the diameter of the cone for a given position by calculating this:

float coneDiameter = 2 * tan(coneHalfAngle) * distanceFromConeOrigin;

Then we can retrieve the correct mip level to sample from the 3D texture by doing:

float mip = log2(coneDiameter / voxelSize);

With this, we have a single light bounce for our scene. But much better results can be achieved with at least a single secondary light bounce. Read on for that.


Part 4: Additinal light bounces

This is a simple step if you are familiar with compute shaders and you have wrapped the cone tracing function to be reusable. When we filtered our voxel grid, we spawn a thread in a compute shader for each voxel (better just for the non-empty voxels), unpack its normal vector and do the cone tracing like in the previous step, but instead for each pixel on the screen, we need to do it for each voxel. This needs to write into an additional 3D texture by the way, because we are sampling the filtered one in this pass, so mind the additional memory footprint.

Part 5: Cone traced reflections

To trace reflections with cone tracing, use the same technique, but the steps along mip levels should take the surface roughness into account. For rough surfaces, the cone should approach the diffuse cone tracing size, for smooth surfaces, keep the mip level increasing to minimum. Just experiment with it until you get results which you like. Or go physically based, and it will be much cooler and would probably go for a nice paper.

Maybe the voxel grid resolution which is used for the diffuse GI is not fine enough for reflections. You will probably want to use a finer voxelization for them. Maybe using separate voxel data for diffuse and specular reflections is a good idea, with some update frequency optimizations. You could, for example update the diffuse voxels in even frames and specular voxels in odd frames orsomething like that.

You probably want this as a fallback to screen space reflections, if they are available.


Part 6: Consider optimizations

The technique, at the current stage, will only work on very fast GPUs. But there already are some games using tech like this (Rise of Tomb Raider using voxel AO), or parts of it, even on consoles (The Tomorrow Children). This is possible with some aggressive optimization techniques. Sparse Voxel Octrees can reduce memory requirements, voxel cascades can bring up framerates with clever updating frequency changes. And of course do not re-voxelize anything that is not necessary, eg. static objects (however, it can be difficult to separate them, because dynamic lights should also force re-voxelization of static objects if they intersect).

And as always, you can see my source code at my GitHub! Points of interest:

Thank you for reading!

Should we get rid of Vertex Buffers?

TLDR: If your only platform to support is a recent AMD GPU, or console, then yes. 🙂

I am working on a “game engine” nowadays but mainly focusing on the rendering aspect. I wanted to get rid of some APIs lately in my graphics wrapper to be more easier to use, because I just hate the excessive amount of state setup (I am using DirectX 11 like rendering commands). Looking at the current code, my observation is that there are many of those ugly vertex buffer and input layout management pieces that just feel so unnecessary when there are so flexible memory management operations available to the shader developers nowadays.

My idea is that why should we declare input layouts, bind them before rendering, then also binding appropriate vertex buffers to appropriate shaders before rendering? At least in my pipeline, if I already know the shader I am using, I already know the required “input layout” of the “vertices” it should process, so why not just read them in the shader as I see fit right there?

For example we already have Access to ByteAddressBuffer from which is trivial to read vertex buffer data (unless it is a typed data for example RGBA8_UNORM format but it can still be converted easily). We can save ourselves a call to IASetInputLayout and instead of IASetVertexBuffers with a stride and offset, we can just bind it as VSSetShaderResources and do the reading in the beginning of the vertex shader. I find it easier, more to the point and also more efficient because we avoid loading from typed buffers and one less call to the API.

So I began rewriting my scene rendering code to make use of custom fetching the vertex buffers. First I used typed buffer loads, but that left me with subpar performance with Nvidia GPUs (GTX 960 and GTX 1070). I posted on and others suggested me that I should be using raw buffers (byteaddressbuffer) instead. So I did. The results were like exactly the same on my GTX 1070 GPU and on an other GTX 960. The AMD RX 470 however was performing nearly exactly the same as it was before. The GCN achitecture abandoned the fixed function pipeline when fetching vertex buffers and it uses regular memory operations as it seems.

Not long ago I’ve had some look at the current generation of console development SDKs and there it is even recommended practice to read the vertex data yourself, they even provide API calls to “emulate” regular vertex buffer usage (at least on the PS4) though if you inspect the final compiled shaders, you will even find vertex fetching code in them.

I assembled a little benchmark in my engine on the sponza scene on an AMD and NVIDIA GPU, take a look (sorry for formatting issues, I can barely use wordpress it seems):

Program: Wicked Engine Editor
Test scene: Sponza

– 3 shadow cascades (2D) – 3 scene render passes

– 1 spotlight shadow (2D) – 1 scene render pass

– 4 pointlight shadows (Cubemap) – 4 scene render passes

– Z prepass – 1 scene render pass

– Opaque pass – 1 scene render pass

Timing method: DX11 timestamp queries

– InputLayout : The default hardware vertex buffer usage with CPU side input layout declarations. The instance buffers are bound as vertex buffers with each render call.

– CustomFetch (typed buffer): Vertex buffers are bound as shader resource views with DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as Structured Buffers holding a 4×4 matrix each.

– CustomFetch (RAW buffer 1): Vertex buffers are bound as shader resource views with a MiscFlag of D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS. In the shader the buffers are addressed in byte offsets from the beginning of the buffer. Instance buffers are bound as Structured Buffers holding a 4×4 matrix each.

– CustomFetch (RAW buffer 2): Even instancing information is retrieved from raw buffers instead of structured buffers.ShadowPass and ZPrepass: These are using 3 buffers max:

– position (float4)

– UV (float4) // only for alpha tested

– instance buffer

OpaquePass: This is using 6 buffers:

– position (float4)

– normal (float4)

– UV (float4)

– previous frame position VB (float4)

– instance buffer (float4x4)

– previous frame instance buffer (float4x3)


GPU     Method        ShadowPass    ZPrepass   OpaquePass   All GPU
NVidia GTX 960  InputLayout       4.52 ms     0.37 ms    6.12 ms    15.68 ms

NVidia GTX 960  CustomFetch (typed buffer)   18.89 ms    1.31 ms    8.68 ms    33.58 ms

NVidia GTX 960  CustomFetch (RAW buffer 1)   18.29 ms    1.35 ms    8.62 ms    33.03 ms

NVidia GTX 960  CustomFetch (RAW buffer 2)   18.42 ms    1.32 ms    8.61 ms    33.18 ms

AMD RX 470   InputLayout       7.43 ms     0.29 ms    3.06 ms    14.01 ms

AMD RX 470   CustomFetch (typed buffer)   7.41 ms     0.31 ms    3.12 ms    14.08 ms

AMD RX 470   CustomFetch (RAW buffer 1)   7.50 ms     0.29 ms    3.07 ms    14.09 ms

AMD RX 470   CustomFetch (RAW buffer 2)   7.56 ms     0.28 ms    3.09 ms    14.15 ms

Sadly, it seems that we can not get rid of the vertexbuffer/inputlayout APIs of DX11 when developing for PC platform because NVIDIA GPUs are much less performant with this method of custom vertex fetching. But what about mobile platforms? I have a Windows Phone build of Wicked Engine, and I want to test it on the Snapdragon 808 GPU but it seems like a bit of extra work to set it up on mobile so I will do it later probably. I somewhat already quite disappointed though because my engine is a designed for PC-like high performance usage so the mobile set up will have to wait a bit.
So the final note: If current gen consoles are your only platform, you can fetch your vertex data by hand with no problem and probably even more optimally by bypassing the typed conversions/or some other magic. If you are developing on PC, you have to keep the vertex buffer APIs intact yet which can be a pain as it will require a more ambigous shader syntax. And why in the hell should we declare strings (eg. TEXCOORD4) in the input layout is completely beyond me and annoying as hell.

How to Resolve an MSAA DepthBuffer

If you want to implement MSAA (multisampled antialiasing) rendering, you need to render into multismpled render targets. When you want to read an anti aliased rendertarget as a shader resource, first you need to resolve it. Resolving means copying it to a non multisampled texture and averaging the subsamples (in D3D11 it is performed by calling ResolveSubresource on the device context). You can quickly find out, that it doesn’t work that way for a depthbuffer.

When you specify D3D11_BIND_DEPTHSTENCIL when creating a texture, and later try to resolve it, the D3D11 debug layer throws an error, telling you that you can’t do that. You must do the resolve by hand in a shader.

I chose the compute shader to do the job, because there is less state setup involved. I am doing a min operation on the depth buffer while reading it to get the closest one of the samples to the camera. I think most applications want to do this, but you could also get the 0th sample or the maximum, depending on the computation needs.

Texture2DMS<float> input : register(t0);

RWTexture2D<float> output : register(u0);

[numthreads(16, 16, 1)]

void main(uint3 dispatchThreadId : SV_DispatchThreadID)


uint2 dim;

uint sampleCount;

input.GetDimensions(dim.x, dim.y, sampleCount);

if (dispatchThreadId.x > dim.x || dispatchThreadId.y > dim.y)




float result = 1;

for (uint i = 0; i < sampleCount; ++i)


result = min(result, input.Load(dispatchThreadId.xy, i).r);


output[dispatchThreadId.xy] = result;


I call this compute shader like this:

Dispatch(ceil(screenWidth/16.0f), ceil(screenHeigh/16.0f), 1)

That’s the simplest shader I could do, it just loops over all the samples, and does a min operation on them.

When dispatching a compute shader with parameters like this, the dispatchThreadID gives us a direct pixel coodinate. Because there could be cases when the resolution is not dividable by the threadcount, we should make sure to discard the out of boundary texture accesses.

It could also be done with a pixel shader, but I wanted to avoid the state setup of it. In the pixel shader, we woud need to bind rasterizer, depthstencil, and blend states, and even input layouts, vertex buffers or primitive topologies unless we abuse the immediate constant buffer. I want ot avoid state setup whenever possibe because it increases CPU overhead and we can do better here.

However, I’ve heard that calling a compute shader in the middle of a rasterization pipeline can incur additional pipeline overhead, I’ve yet to witness it (comment if you can prove it).

If I’d like to do a custom resolve for an other type of texture, I would keep the shader as it is, but would change the min operation only for an other one, for example an average, or max, etc…

That is all I wanted to keep this fairly short.

Abuse the immediate constant buffer!

Very often, I need to draw simple geometries, like cubes, and I want to do the minimal amount of graphics state setup. With this technique, you don’t have to set up a vertex buffer or input layout, which means, we don’t have to write the boilerplate resource creation code for them, and don’t have to call the binding code, which also lightens the API overhead.

An immediate constant buffer differs from a regular constant buffer in a few aspects:

  • There is a reserved constant buffer slot for them, and there can be only one of them at the same time.
  • They are created automatically from the static const variables in your hlsl code.
  • They can not be updated from the API.

So when I declare a vertex array inside a shader, for example, like this:

static const float4 CUBE[]={






































…and if I want to draw this cube, then the simplest vertex shader should look like this:

float4 main(uint vID : SV_VERTEXID) : SV_Position


return mul(CUBE[vID], g_xTransform);


(where g_xTransform is the World*View*Projection matrix from a regular constant buffer)

I would then call the Draw from the DX11 API with a vertexcount of 36 because that is the array length of the CUBE vertex array. The shader automatically gets the SV_VERTEXID semantic from the input assembler, which directly indexes into the vertex array. I find this technique very clean both from the C++ side and the shader side, so I use it very frequently.

A few example use-cases:

  • Deferred light geometries
  • Light volume geometries
  • Occlusion culling occludees
  • Decals
  • Skybox/skysphere
  • Light probe debug geometries

If you need vertex arrays like this for some other simple meshes:

That’s it, cheers!

Smooth Lens Flare in the Geometry Shader

This is a historical feature from the Wicked Engine, meaning it was implemented a few years ago, but at the time it was a big step for me.


I wanted to implement simple textured lens flares but at the time all I could find was by using occlusion queries to determine if a lens flare should be visible or not. A simpler solution was needed for me. At the time I was already using the geometry shader for billboard particles, so I wanted to make further use of them here. I also wanted it to smoothly transition from fully visible, to invisible, withouth it popping when the light source goes behind an occluder. It is also my first blog post so I wanted to start with something simple.

The idea is that for a light source emitting a lensflare which is on the screen, I don’t check its visibility by occlusion query, but drawing a single vertex for it (for each flare). The vertex goes through a pass through vertex shader, then arrives at the geometry shader stage, where the occlusion is detected by checking the light source against the scene’s depth buffer. A simple solution is checking the light source’s screenspace position Z value to the depth at the XY value. This will not yield smooth results though. If the pixel is occluded, then the flare is visible, else it is not. It could be enough in cases where the geometry is predictable, like buildings, for example. However it looks extremely cheap when it is vegetation that occludes the flare, because it consists of many holes, which could be swaying in the wind making the flare flicker.

For smoothening out the popping, I use the technique which is used for the PCF shadow softening. Namely, check all the depth values in the current depth’s surroundings then average them to measure the occlusion. Thus you get the opacity value by dividing the not occluded sample count by the number of taken samples.

If there is at least one value in the surroundings which is not occluded (opacity > 0), then I spawn the flare billboards with the corresponding textures.

Prior to the shader, I project the light’s World position onto the screen with the appropriate viewprojection matrix, and send the projected light position to the shader.

Here comes the geometry shader (Can’t I format here better?):

// constant buffer



float4  xSunPos; // light position (projected)

float4  xScreen; // screen dimensions


struct InVert


float4 pos     : SV_POSITION;

nointerpolation uint vid : VERTEXID;


struct VertextoPixel{

float4 pos     : SV_POSITION;

float3 texPos    : TEXCOORD0; // texture coordinates (xy) + offset(z)

nointerpolation uint   sel : TEXCOORD1; // texture selector

nointerpolation float4 opa : TEXCOORD2; // opacity + padding


// Append a screen space quad to the output stream:

inline void append(inout TriangleStream<VertextoPixel> triStream, VertextoPixel p1, uint selector, float2 posMod, float2 size)


float2 pos = (xSunPos.xy-0.5)*float2(2,-2);

float2 moddedPos = pos*posMod;

float dis = distance(pos,moddedPos);
















// pre-baked offsets

// These values work well for me, but should be tweakable

static const float mods[] = { 1,0.55,0.4,0.1,-0.1,-0.3,-0.5 };


void main(point InVert p[1], inout TriangleStream<VertextoPixel> triStream)


VertextoPixel p1 = (VertextoPixel)0;

// Determine flare size from texture dimensions

float2 flareSize=float2(256,256);


case 0:



case 1:



case 2:



case 3:



case 4:



case 5:



case 6:





// determine depthmap dimensions (could be screen dimensions from the constantbuffer)

float2 depthMapSize;


flareSize /= depthMapSize;

// determine the flare opacity:

// These values work well for me, but should be tweakable

const float2 step = 1.0f / (depthMapSize*xSunPos.z);

const float2 range = 10.5f * step;

float samples = 0.0f;

float accdepth = 0.0f;

for (float y = -range.y; y <= range.y; y += step.y)


for (float x = -range.x; x <= range.x; x += step.x)


samples += 1.0f;

// texture_depth is non-linear depth (but it could work for linear too with linear reference value)

// SampleCmpLevelZero also makes a comparison by using a LESS_EQUAL comparison sampler

// It compares the reference value (xSunPos.z) to the depthmap value.

// Returns 0.0 if all samples in a bilinear kernel are greater than reference value

// Returns 1.0 if all samples in a bilinear kernel are less or equal than refernce value

// Can return in between values based on bilinear filtering

accdepth += (texture_depth.SampleCmpLevelZero(sampler_cmp_depth, xSunPos.xy + float2(x, y), xSunPos.z).r);



accdepth /= samples;

p1.pos = float4(0, 0, 0, 1);

p1.opa = float4(accdepth, 0, 0, 0);

// Make a new flare if it is at least partially visible:

if( accdepth>0 )



The pixel shader just samples the appropriate texture with the texture


struct VertextoPixel{

float4 pos     : SV_POSITION;

float3 texPos    : TEXCOORD0;

nointerpolation uint   sel : TEXCOORD1;

nointerpolation float4 opa : TEXCOORD2;


float4 main(VertextoPixel PSIn) : SV_TARGET


float4 color=0;

// todo: texture atlas or array



case 0:

color = texture_0.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 1:

color = texture_1.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 2:

color = texture_2.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 3:

color = texture_3.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 4:

color = texture_4.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 5:

color = texture_5.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);


case 6:

color = texture_6.SampleLevel(sampler_linear_clamp, PSIn.texPos.xy, 0);




color *= 1.1 - saturate(PSIn.texPos.z);

color *= PSIn.opa.x;

return color;


That’s it, I hope it was useful. 🙂

Welcome brave developer!

This is a blog containing development insight to my game engine, Wicked Engine. Feel free to rip off any code, example, techinque from here, as you could also do it from the open source engine itself:

I want to post info from historical features as well as new ones. I try to select the ones which are sparsely blogged on the web or just feel like sharing it. I don’t intend to write complete tutorials, but sharing ideas instead, while providing minimalistic code samples.

Happy coding!