Texture streaming is an important feature of modern 3D engines, as it can be the largest contributor to reducing loading times and memory usage. Wicked Engine just got the first implementation of this system, and here you can read about the details in depth.
Overview
There are many forms of texture streaming; my goal here was to implement a relatively simple system that can load specific mip levels from texture files and update GPU textures from them. All this is done on a background thread, while the application is running and rendering. Furthermore, updating the textures is handled as smoothly as possible visually, to avoid any noticeable popping when new mip levels are loaded in. Here is what it looks like in action (slowed down to show the effect):
The rough idea behind the approach I used is to continuously run a background thread that looks for streaming texture requests and loads the required parts of the textures from files. When the file data is loaded, the texture is recreated from scratch, and the existing texture is replaced with the new one at an appropriate point in the application. This is combined with fractional mipmap clamping and trilinear/anisotropic filtering to smoothly fade mipmaps in.
File loading
When initially loading the DDS texture, you can get away with reading only the header portion of the file to retrieve all the information about the image that you need. From that you can compute memory offsets and sizes for each individual mip. This will be useful for streaming, as later you will be able to open the DDS file and read in just the required portion of it, containing the data of the mipmaps that you want to stream in. Since the DDS file already contains the data in the layout that a GPU buffer-to-texture copy operation expects, it is ideal for this purpose and lets you avoid the memory allocations that are usually needed when reading a file into memory and then converting the data to the correct layout. Instead, it is possible to read the file contents directly into a mapped GPU buffer that will be used as the source of a GPU copy.
The file loading is performed in two parts. The first is “behind the loading screen”, when the texture is opened and streaming memory offsets are calculated for each mip level; the last few low resolution mips are also loaded and copied to a GPU texture resource. These initial mip levels will always be used when the texture is in an unused “stream out” state. This is because an unused texture can quickly become used when it gets on screen, and in this case we will show these mips until higher detail is streamed in. The way I choose the minimum number of mip levels is to compute the memory size of all mip levels and only keep the contiguous tail of the mipchain that fits into a 4 KB allocation (a sketch of this computation follows the lists below). This is because DX12, for example, has 4 KB as the minimum texture memory alignment, so any further reduction would not yield any memory savings. This gives us the following mipchain resolutions depending on texture format:
- R8G8B8A8: 16 x 16
- BC2, BC3, BC5, BC6, BC7 : 32 x 32
- BC1, BC4: 64 x 64
You might find the 4 KB resolutions too small and have noticeably low res textures appearing, especially when objects appear close to the camera. Another value you could reasonably choose is 64 KB, which is the next possible resource allocation alignment after 4 KB. With that you can fit the following resolutions (with mipmaps):
- R8G8B8A8: 64 x 64
- BC2, BC3, BC5, BC6, BC7 : 128 x 128
- BC1, BC4: 256 x 256
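For illustration, here is a minimal sketch of how the minimum resident mip count could be computed (not the engine’s actual code; ComputeMipSizeInBytes() is a hypothetical helper that knows the format’s bytes per pixel/block):
// Count how many of the smallest mips together fit into the minimum
// allocation size (4096 bytes here). These mips stay always resident.
uint32_t ComputeMinimumMipCount(uint32_t width, uint32_t height, uint32_t mip_count, size_t budget = 4096)
{
	size_t total = 0;
	uint32_t kept = 0;
	for (int mip = int(mip_count) - 1; mip >= 0; --mip) // smallest mip first
	{
		const uint32_t w = std::max(1u, width >> mip);
		const uint32_t h = std::max(1u, height >> mip);
		const size_t mip_size = ComputeMipSizeInBytes(w, h); // hypothetical, format dependent
		if (total + mip_size > budget)
			break; // one more mip would exceed the minimum allocation
		total += mip_size;
		kept++;
	}
	return kept;
}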
The other part happens after the “loading screen”, while the application is running and rendering: we load in higher resolution mip levels depending on where the camera is looking, and also unload high resolution mip levels that wouldn’t contribute to the rendering. This can run on a background thread, detached from the update frequency of the main loop.
Streaming thread
The background streaming thread’s job is to open DDS files, read the required mip levels from them, and then add those mips to the texture, or remove mips. To facilitate fast streaming from files to memory, it is imperative to minimize file reading where we can. Even though we are doing these file accesses on a background thread, it’s still important to load them as fast as possible, because requests are made in real time, and if loading is slow it will have a noticeable visual impact, such as textures appearing blurry up close. For this, the best thing we can do is to not load the whole DDS texture file every time, but instead only the parts of it that we are currently streaming in. When initially loading the DDS file behind the loading screen, we already read the header part of the DDS, which tells us all the information about the texture that we need. We can compute memory offsets and sizes of each sub-resource (mip level and slice) from the DDS header, and we can either remember these offsets as part of the streaming info, or just save the DDS header itself. The streaming info is a small structure containing information about the whole texture file and tracking the progress of the streaming; it can look something like this:
struct StreamingTexture
{
	struct StreamingSubresourceData
	{
		size_t data_offset = 0;
		uint32_t row_pitch = 0;
		uint32_t slice_pitch = 0;
	};
	StreamingSubresourceData streaming_data[16] = {}; // sub-resource data for every mip level
	uint32_t mip_count = 0; // mip count of full resource
	float min_lod_clamp_absolute = 0; // relative to mip_count of full resource
};
We can populate this structure when we initially load the texture, and reuse it when streaming requests arrive. It tells us the total number of mipmaps in the whole resource, and the sub-resource data for each mip level. Notice that the sub-resource data is nearly the same as what you would give to DirectX 11 texture creation, but instead of a data pointer, it contains an offset into the file. The offset simply tells you where to start reading the file (seek in file). Since you know the data format and width/height of the texture that you are streaming, you can compute the data size of the sub-resource from those, and use that to tell how many bytes to read from the file. In my implementation, I simply read from the beginning of a specific sub-resource through the end of the file, because I always re-create textures from scratch, so I always load all lower mips as well in a streaming request. For now, it seems fast enough, but it feels like a bit of a waste. This way I didn’t need to issue GPU copies from the previous resource to the new one; I simply provide all the data needed for texture creation.
To load these offsets from DDS, I have created a custom DDS file utility that you can download here. I wasn’t satisfied with other DDS loaders that I have used, because they make allocations internally and have dependencies. My DDS loader is cross-platform, as minimal as possible without any includes, and designed with streaming in mind. After using the dds::read_header() function, you get back the DDS header structure, which you can use to query all information about the texture, with relative memory offsets. A simple example:
dds::Header header = dds::read_header(filedata, filesize);
if (header.is_valid())
{
	TextureDesc desc;
	desc.width = header.width();
	desc.height = header.height();
	desc.depth = header.depth();
	desc.mip_levels = header.mip_levels();
	desc.array_size = header.array_size();
	desc.format = header.format();
	std::vector<SubresourceData> initdata;
	initdata.reserve(desc.array_size * desc.mip_levels);
	for (uint32_t slice = 0; slice < desc.array_size; ++slice)
	{
		for (uint32_t mip = 0; mip < desc.mip_levels; ++mip)
		{
			SubresourceData& subresourceData = initdata.emplace_back();
			subresourceData.data_ptr = filedata + header.mip_offset(mip, slice);
			subresourceData.row_pitch = header.row_pitch(mip);
			subresourceData.slice_pitch = header.slice_pitch(mip);
		}
	}
	device->CreateTexture(&desc, initdata.data(), &texture);
}
It also supports DDS writing, so you might find this useful for other purposes too.
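Combining this header API with the StreamingTexture struct from earlier, populating the streaming info at initial load time could look like the following sketch (my illustration, not the engine’s exact code; a single array slice is assumed):
StreamingTexture streaming;
streaming.mip_count = header.mip_levels();
for (uint32_t mip = 0; mip < streaming.mip_count; ++mip)
{
	// remember where each mip starts within the file, along with its pitches:
	streaming.streaming_data[mip].data_offset = header.mip_offset(mip, 0);
	streaming.streaming_data[mip].row_pitch = (uint32_t)header.row_pitch(mip);
	streaming.streaming_data[mip].slice_pitch = (uint32_t)header.slice_pitch(mip);
}
// From now on, streaming requests can seek and read directly, without parsing the file again.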
Every frame, the streaming thread is started as a low priority thread, if it’s not already running. If it’s already running, that means the previous streaming tasks have not finished. This is normal, as they will usually span longer than a render frame. If this happens, we just skip it until the next frame. The streaming thread is set to low priority so as not to disturb any high priority rendering or update jobs in the job system; I found that otherwise a long running high priority thread can cause stuttering. Inside this background thread, we go through each texture that is marked for streaming and check which resolutions they requested (more on this later). Compared to their current resolution, they either need to stream in higher res mips, they can remove mips, or do nothing if the current level of detail is already satisfied. Removing mips doesn’t happen as soon as there is a lower resolution request, but only after a certain amount of time has elapsed; this avoids removing mips of an object that moves in and out of the camera. This logic changes when we detect that the memory usage of the application is above a certain threshold, in which case dropping mip levels will happen earlier when appropriate. This is roughly what happens on the background thread in Wicked Engine at the time of writing this:
for (auto& resource : streaming_texture_jobs)
{
	TextureDesc desc = resource.GetTexture().GetDesc(); // current texture description, with the current (streamed) resolution, not the full resource resolution
	uint32_t requested_resolution = resource.get_request(); // read the requested resolution (assuming square texture)
	const GraphicsDevice::MemoryUsage memory_usage = device->GetMemoryUsage();
	const float memory_percent = float(double(memory_usage.usage) / double(memory_usage.budget)); // how much memory is used relative to the budget
	const bool memory_shortage = memory_percent > streaming_threshold; // check if we are above the threshold and short on memory
	const bool stream_in = memory_shortage ? false : (requested_resolution >= std::min(desc.width, desc.height));
	int mip_offset = int(resource.streaming_texture.mip_count - desc.mip_levels); // what mip we are currently on, relative to the full resource
	if (stream_in)
	{
		resource.streaming_unload_delay = 0; // unloading will be immediately halted
		if (mip_offset == 0)
			continue; // Can't go up a mip level because we are already on the most detailed, cancel
		// Mip level streaming IN, increase resolution:
		desc.width <<= 1;
		desc.height <<= 1;
		if (requested_resolution < std::min(desc.width, desc.height))
			continue; // Increased resolution would be too much, cancel
		desc.mip_levels++;
		mip_offset--;
	}
	else
	{
		resource.streaming_unload_delay++; // one more frame that this wants to unload
		if (!memory_shortage && resource.streaming_unload_delay < 255)
			continue; // only unload mips if it's been wanting to unload for a number of frames, or there is a memory shortage
		if (ComputeTextureMemorySizeInBytes(desc) <= 4096)
			continue; // Don't reduce the texture further; because of the 4 KB alignment it would not reduce memory usage
		// Mip level streaming OUT, reduce resolution:
		desc.width >>= 1;
		desc.height >>= 1;
		desc.mip_levels--;
		mip_offset++;
	}
	if (desc.mip_levels <= resource.streaming_texture.mip_count) // check if the new mip count is still in the valid range of the full texture mip count
	{
		std::ifstream file(resource.streaming_filename, std::ios::binary | std::ios::ate); // open the file in binary mode, positioned at the end
		if (!file.is_open())
			continue; // file cannot be found, cancel
		// memory offset of the first mip level in the current streaming range:
		const size_t mip_data_offset = resource.streaming_texture.streaming_data[mip_offset].data_offset;
		// Read in the file contents beginning from the specified mip:
		std::vector<uint8_t> streaming_filedata;
		size_t dataSize = (size_t)file.tellg() - mip_data_offset;
		file.seekg((std::streampos)mip_data_offset);
		streaming_filedata.resize(dataSize);
		file.read((char*)streaming_filedata.data(), dataSize);
		file.close();
		const uint8_t* firstmipdata = streaming_filedata.data(); // pointer to the first mip in the current streaming data
		// Convert relative offsets to absolute GPU initialization data:
		SubresourceData initdata[16] = {};
		for (uint32_t mip = 0; mip < desc.mip_levels; ++mip)
		{
			auto& streaming_data = resource.streaming_texture.streaming_data[mip_offset + mip];
			initdata[mip].data_ptr = firstmipdata + streaming_data.data_offset - mip_data_offset;
			initdata[mip].row_pitch = streaming_data.row_pitch;
			initdata[mip].slice_pitch = streaming_data.slice_pitch;
		}
		// The replacement struct will store the newly created texture until the replacement can be made later:
		StreamingTextureReplace replace;
		replace.resource = resource;
		device->CreateTexture(&desc, initdata, &replace.texture);
		streaming_replacement_mutex.lock();
		streaming_texture_replacements.push_back(replace); // Store the replacement texture
		streaming_replacement_mutex.unlock();
	}
}
That’s quite a big chunk of logic. What’s important is that we determine whether the texture requires any streaming (always only increasing or decreasing mip levels by one), do the data reading from file if it does, then create the texture. As I previously mentioned, there is some wasted processing because we always create the texture from scratch with fully initialized data. But the texture creation function is thread safe, and we can do this on a background thread without any additional considerations. Otherwise, if we wanted to only load the streamed-in mip and retain the previously existing ones, we would need to do GPU copies, which would involve creating command lists, GPU submission logic, and possibly using sparse/tiled resources to avoid recreating textures. I found that my solution of recreating textures has good enough performance for now.
One thing to note is that while we are creating the texture fully on the background thread, we are not replacing resources with the new textures there, because rendering might currently be using them. Instead, there is a part of the streaming logic running on the main thread that does minimal processing.
Streaming – finalize
The finalization logic, which must happen on the main thread and be synchronized with rendering, is responsible for replacing the current texture resources with the newly streamed-in ones. Note that this happens every frame, even when the previous streaming jobs haven’t finished yet. This allows updating textures as soon as they have been streamed, without waiting for the whole streaming thread to finish, which might take several frames’ time. In turn, we must synchronize this part using a mutex (see the usage of streaming_replacement_mutex in the previous code). The StreamingTextureReplace structure can look like this:
struct StreamingTextureReplace
{
	Resource resource; // resource containing current texture
	Texture texture; // new texture, result of streaming
};
The finalization logic which is performed every frame is simply this:
// If any streaming replacement requests arrived, replace the resources here (main thread):
streaming_replacement_mutex.lock();
for (auto& replace : streaming_texture_replacements)
{
	replace.resource.SetTexture(replace.texture); // previous texture is deleted automatically when GPU no longer uses it
}
streaming_texture_replacements.clear();
streaming_replacement_mutex.unlock();
With this, the textures will be updated, but not smoothly: they are simply replaced with textures that have a new mipmap count, so it will look like pop-in. To smoothly update the mipmaps of the textures, we need an additional step.
Smooth mipmap changing
You might have noticed the unexplained member of the StreamingTexture struct: min_lod_clamp_absolute. This is a float value used exactly for smoothing out the mipmap level changes. In DirectX 12, you can set up a texture SRV descriptor with the ResourceMinLODClamp parameter, which tells the GPU the most detailed mipmap that is allowed to be used at texture sampling; but unlike the MostDetailedMip value, this is a float, so it can be fractional. The fractional value makes sense only when you are using a trilinear or anisotropic sampler, because then the texture sampling will fade between two mipmaps based on the computed texture LOD, which is exactly what we need.
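For reference, this is roughly how that clamp appears in a raw DX12 SRV description (a sketch with placeholder names and values, not the engine’s abstraction):
D3D12_SHADER_RESOURCE_VIEW_DESC srv_desc = {};
srv_desc.Format = DXGI_FORMAT_BC1_UNORM; // placeholder format
srv_desc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srv_desc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srv_desc.Texture2D.MostDetailedMip = 0;
srv_desc.Texture2D.MipLevels = mip_count;
srv_desc.Texture2D.ResourceMinLODClamp = 1.5f; // fractional: sampling fades between mips 1 and 2
d3d_device->CreateShaderResourceView(d3d_texture, &srv_desc, descriptor_handle);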
Note that in Vulkan, you must use the VK_EXT_image_view_min_lod extension to have this functionality, and in DirectX 11, this is not part of the descriptor, but you can set this as a command list function, SetResourceMinLOD().
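On Vulkan, the clamp is chained into image view creation; a minimal sketch, assuming VK_EXT_image_view_min_lod and its feature are enabled on the device:
VkImageViewMinLodCreateInfoEXT min_lod_info = {};
min_lod_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_MIN_LOD_CREATE_INFO_EXT;
min_lod_info.minLod = min_lod_clamp_relative; // fractional value allowed

VkImageViewCreateInfo view_info = {};
view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
view_info.pNext = &min_lod_info;
// ... fill image, viewType, format and subresourceRange as usual ...
VkImageView image_view = VK_NULL_HANDLE;
vkCreateImageView(vk_device, &view_info, nullptr, &image_view);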
The min_lod_clamp_absolute value is not exactly the minLOD parameter that you give to the graphics API; I use it as a float value to track the current mip level relative to the full resource. From that, we can compute the value relative to the currently streamed resource. Once we detect that the relative value changed compared to the previous one, we update the texture descriptor. Updating the texture descriptor involves freeing the current one and creating a new one with the new minLOD parameter; this happens on the main thread every frame, while also noting that you must not delete descriptors that the GPU might still be using:
for (auto& resource : resources)
{
	const TextureDesc& desc = resource.GetTexture().GetDesc();
	const float mip_offset = float(resource.streaming_texture.mip_count - desc.mip_levels);
	float min_lod_clamp_absolute_next = resource.streaming_texture.min_lod_clamp_absolute - dt * streaming_fade_speed; // dt: delta time in seconds, streaming_fade_speed: 4.0f by default
	min_lod_clamp_absolute_next = std::max(mip_offset, min_lod_clamp_absolute_next);
	if (float_equal(min_lod_clamp_absolute_next, resource.streaming_texture.min_lod_clamp_absolute))
		continue;
	resource.streaming_texture.min_lod_clamp_absolute = min_lod_clamp_absolute_next;
	const float min_lod_clamp_relative = min_lod_clamp_absolute_next - mip_offset;
	device->DeleteSubresources(&resource.GetTexture());
	device->CreateSubresource(
		&resource.GetTexture(),
		SubresourceType::SRV,
		...
		min_lod_clamp_relative // apply new minLODClamp value
	);
}
Explanation: we always try to decrease min_lod_clamp_absolute towards the highest detailed mip, while clamping it to not go below the mip_offset that’s currently streamed in. If the updated min_lod_clamp_absolute_next is not equal to min_lod_clamp_absolute, then we update the descriptor relative to the mip_offset. This means that if new texture detail is streamed in, it will fade in smoothly; if no new updates happened for that texture, the descriptor won’t be updated either. For streaming that removed mip levels, there is no need to perform smooth LOD clamping, as that means those mips are not visible anyway.
For the sake of completeness, here is the float equality comparison that the above code uses; it returns true if two floats are nearly equal, and false otherwise:
inline bool float_equal(float f1, float f2) {
	return (std::abs(f1 - f2) <= std::numeric_limits<float>::epsilon() * std::max(std::abs(f1), std::abs(f2)));
}
One important step is still missing though, which is how we determine what mip levels should be visible.
Requesting texture detail
I’ve tried two methods to tell the streaming system which mipmaps should be visible. At first, I just determined objects’ visibility on the CPU (frustum culling, occlusion culling), and if an object was visible, told the streaming system to always stream IN, meaning it requested all mip levels to be available for all textures that the object uses. Otherwise, it would start discarding all mip levels above the 4 KB threshold. This is the simplest way, and it works okay; you can already get plenty of memory savings with this, and I recommend implementing it first.
But that solution is a bit too simplistic: think of when the camera is viewing a large distance, where objects far away will also request high resolution mip levels. When sampling the textures in the shaders, the GPU computes exactly which mip levels it will read at every pixel. It would be good to know exactly what the GPU will need, so we can get away with streaming only those specific mips. The approach I settled on works exactly like this.
In the Wicked Engine shaders we can directly index the materials with a material index, and in my implementation each material can request the required texture resolution based on the derivatives of the UV coordinates of the pixel. So I created a RWBuffer<uint>, storing one uint32 for every material in the scene, which any shader can write into with the material index that it is running with. The 32 bit uint contains a texture resolution request for two UV sets: the first UV set in the lower 16 bits, the second in the upper 16 bits. Since the maximum resolution of a texture asset is 16384 pixels, and 16 bits can store 65536 values, this is more than enough. From the UV derivatives it is easy to compute the required mipmap LOD, and from that it becomes straightforward to compute a resolution value for the request.
inline float get_lod(in uint2 dim, in float2 uv_dx, in float2 uv_dy)
{
	return log2(max(length(uv_dx * dim), length(uv_dy * dim)));
}

RWBuffer<uint> materialFeedbackBuffer;

inline void write_mipmap_feedback(uint materialIndex, float4 uvsets_dx, float4 uvsets_dy)
{
	const float lod_uvset0 = get_lod(65536u, uvsets_dx.xy, uvsets_dy.xy);
	const float lod_uvset1 = get_lod(65536u, uvsets_dx.zw, uvsets_dy.zw);
	const uint resolution0 = 65536u >> uint(max(0, lod_uvset0));
	const uint resolution1 = 65536u >> uint(max(0, lod_uvset1));
	const uint mask = resolution0 | (resolution1 << 16u);
	InterlockedOr(materialFeedbackBuffer[materialIndex], mask);
}
The write_mipmap_feedback shader function above is used per pixel to write a resolution request into the feedback buffer at the current material index. The get_lod function is taken from the DirectX 11 specs; it is an example implementation of computing the texture LOD for trilinear filtering. For anisotropic filtering it would be different if you wanted to use the LOD level as-is, but in this case it is enough, because casting the LOD to uint floors it, which overestimates the required resolution. It’s important to be conservative, because we want to request the highest resolution that the sampling will touch; in the end the hardware will sample with an anisotropic sampler anyway, so we don’t need to compute exact values here.
Take notice that we don’t write the LOD level itself into the feedback; instead, we compute the LOD relative to a 65536 resolution and convert that to an actual resolution value. This is important, because a material can contain several texture assets with varying resolutions, so we can use this single request for all of them. For example, a computed LOD of 4 relative to 65536 yields a request of 65536 >> 4 = 4096: a 2048 texture would stream in fully, while an 8192 texture would stream in up to its 4096 mip. Requesting a higher resolution for a texture than the asset contains is not a problem; it will simply stream in up to the resolution that is available in the file.
In a pixel shader you can call this feedback function like this (in this case the material index is coming from a push constant, uvsets is a float4 containing UV0 in .xy, and UV1 in .zw):
write_mipmap_feedback(push.materialIndex, ddx_coarse(uvsets), ddy_coarse(uvsets));
You probably only want to call it from your main camera rendering, and not from shadow map rendering, which will likely cover a larger area and request more streaming than is really necessary. It is also better if you can call it from a shader that uses [earlydepthstencil] and an equal depth test, because writing into a RWBuffer will be more optimal that way – but this is probably only available if you use a depth prepass.
One very easy optimization that you can do with DX12 and Vulkan is to reduce atomics with wave intrinsics. Just rewrite the feedback function like this:
inline void write_mipmap_feedback(uint materialIndex, float4 uvsets_dx, float4 uvsets_dy)
{
	const float lod_uvset0 = get_lod(65536u, uvsets_dx.xy, uvsets_dy.xy);
	const float lod_uvset1 = get_lod(65536u, uvsets_dx.zw, uvsets_dy.zw);
	const uint resolution0 = 65536u >> uint(max(0, lod_uvset0));
	const uint resolution1 = 65536u >> uint(max(0, lod_uvset1));
	const uint mask = resolution0 | (resolution1 << 16u);
	const uint wave_mask = WaveActiveBitOr(mask);
	if (WaveIsFirstLane())
	{
		InterlockedOr(materialFeedbackBuffer[materialIndex], wave_mask);
	}
}
It is still doing the same thing (if the material index is uniform across the whole wave), but it reduces to one atomic operation per wave. That means the number of atomics is reduced to 1/64 on AMD, or 1/32 on Nvidia (or newer AMD in wave32 mode). If you can’t use wave operations, for example on DX11, you can try to reduce atomics by only running this for every Nth pixel; that requires some testing to see how far you can reduce it before seeing inaccurate results.
That was the GPU part, but you will need to read these results back on the CPU. On DX12 and Vulkan, reading back results can be done quite optimally: do a GPU copy from the feedback RW buffer into a buffer on a READBACK heap. The buffer on the READBACK heap can be mapped persistently, and you can save the CPU pointer to its data. Then you can simply read from that uint32_t pointer like you would read from an array.
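A condensed DX12 sketch of this readback path (resource state barriers and per-frame buffering of the readback buffer are omitted for brevity; names are placeholders):
// Recorded at the end of the frame, after all feedback writes finished:
command_list->CopyResource(readback_buffer, feedback_buffer);

// Done once at creation time; a buffer on the READBACK heap can stay mapped forever:
uint32_t* feedback_mapped = nullptr;
readback_buffer->Map(0, nullptr, (void**)&feedback_mapped);
// From now on, feedback_mapped[material_index] is readable on the CPU
// (the data will be a few frames old, which is acceptable for streaming).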
When updating the materials, I read back the GPU results from the feedback buffer, and set the streaming request for every texture of the material:
if (textureStreamingFeedbackMapped != nullptr)
{
	const uint32_t request_packed = textureStreamingFeedbackMapped[args.jobIndex];
	const uint32_t request_uvset0 = request_packed & 0xFFFF;
	const uint32_t request_uvset1 = (request_packed >> 16u) & 0xFFFF;
	for (auto& slot : material.textures)
	{
		if (slot.resource.IsValid())
		{
			slot.resource.StreamingRequestResolution(slot.uvset == 0 ? request_uvset0 : request_uvset1);
		}
	}
}
Because the material update runs on multiple threads, and one texture can also be referenced by multiple materials, I use atomics to update the requests on the CPU. This is roughly implemented like this, using the atomic binary OR of a std::atomic<uint32_t>:
struct Resource
{
	// ...
	std::atomic<uint32_t> streaming_resolution{ 0 };
	void StreamingRequestResolution(uint32_t resolution)
	{
		streaming_resolution.fetch_or(resolution);
	}
};
One thing to look out for: since the streaming logic runs on a separate thread, you must use an atomic operation there as well to read the result, and that thread will also zero out the value for the next round of requests. You can do both in one go with fetch_and(0). The streaming thread code that I showed above contains a function called get_request(), which you can implement like this:
uint32_t get_request()
{
	uint32_t requested_resolution = streaming_resolution.fetch_and(0); // set to zero while returning prev value
	if (requested_resolution > 0) // if 0, it means there was no request
	{
		requested_resolution = 1ul << (31ul - firstbithigh((unsigned long)requested_resolution)); // largest power of two
	}
	return requested_resolution;
}
Note that because we used an atomic OR operation to write all resolution requests, the value contains a bitmask of all of them, but we want to get the largest power-of-two request (all of the separate requests were powers of two, see the implementation of the write_mipmap_feedback shader function). The firstbithigh is originally a built-in HLSL shader function, but you can implement it with CPU intrinsics like so:
#ifdef _WIN32
// Windows:
#include <intrin.h>
inline unsigned long firstbithigh(unsigned long value)
{
	unsigned long bit_index;
	if (_BitScanReverse(&bit_index, value))
	{
		return 31ul - bit_index;
	}
	return 0;
}
#else
// Linux:
inline unsigned long firstbithigh(unsigned int value)
{
	if (value == 0)
	{
		return 0;
	}
	return __builtin_clz(value);
}
#endif // _WIN32
This function returns the position of the first set bit, counting from the top. With that value we can shift a 1 bit into the correct place to get the power of two value we are looking for, which is the highest power-of-two resolution request for the current texture. For example, if requests for 256 and 1024 were OR-ed together, the bitmask is 0x500, and get_request() returns 1024.
Other details
Since in Wicked Engine it is possible to embed texture resources into the scene files (in fact the Editor does this by default), this case is handled with an extra step. The streaming textures remember not only their own file name, but also the name of the scene file they are part of. The scene file is a large binary file that can contain all assets embedded into it, so each streaming resource also needs its offset within this file. With this information, we can still stream mip levels from these large scene files.
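A sketch of how this could factor into the streaming thread’s file read shown earlier (container_filename and container_fileoffset are illustrative names for the scene file and the texture’s start offset within it):
// Open the scene archive instead of a standalone DDS file:
std::ifstream file(resource.container_filename, std::ios::binary);
// All mip offsets are simply shifted by where the texture begins inside the archive:
file.seekg((std::streampos)(resource.container_fileoffset + mip_data_offset));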
The resource manager also has a mode that retains all asset files in memory; this is currently used by the Editor to save resources into the scene later. In this case it is also ensured that files won’t be re-opened when streaming; mip levels are streamed from memory instead. This could be considered in a game too, as it can improve streaming performance, but it will result in increased RAM usage, while VRAM usage can still be controlled well by streaming in and out of it.
Tiled resources could be used for a smarter streaming scheme, where you don’t always re-create textures, but you can control tile residency like virtual memory management. My experience with this is that it can add unexpected slowdowns when doing the tile mapping (UpdateTileMappings/vkQueueBindSparse), and it also has some additional hardware requirements (see Tiled Resource Tier). If you support Tiled Resource Tier 2, then you can also avoid recreating descriptors when changing the resourceMinLODClamp, as that hardware will support the minLodClamp parameter in texture sampling instructions, so you can change it to a shader parameter instead.
For the time being, I avoided going to a smaller granularity, like streaming in smaller blocks of textures as in a virtual texture approach – the Wicked Engine terrain system uses virtual textures, but those blocks are generated procedurally on the GPU and not streamed from files. The problem with a block streaming approach is that ideally it would require storing the texture as blocks in the file, otherwise we would end up reading larger portions of the file to retrieve smaller parts of it. But the overall memory saving might be a lot better if we only load in the small parts of the texture that are visible.
I also haven’t experimented with sampler feedback or DirectStorage yet. Sampler feedback would be relevant if we were streaming texture blocks with tiled resources, but DirectStorage could be useful even for whole mipmap streaming.
Closing
That’s essentially it. I’m sure there are many opportunities for improvement, but this approach has worked out well for me so far.
If you want to check out the actual implementation, you can look at the Wicked Engine source code, for example:
- In wiResourceManager.cpp, look for the function named UpdateStreamingResources(), which contains the whole logic of the streaming thread, resource replacements and mip clamp updates. It is called once every frame.
- In globals.hlsli you will find the write_mipmap_feedback() shader function.
Thank you for reading this far, hope you found it useful!
