Compute-based Culling - Vulkan Guide

The culling compute core

With indirect drawing and the engine as a whole explained, the last piece that makes everything work is compute-based culling. All of it lives in indirect_cull.comp.

As explained in the previous article, we use the culling shader to build the final render list. The shader looks like this (simplified).

void main()
{
	uint gID = gl_GlobalInvocationID.x;
	if(gID < cullData.drawCount)
	{
		//grab object ID from the buffer
		uint objectID = instanceBuffer.Instances[gID].objectID;

		//check if object is visible
		bool visible  = IsVisible(objectID);

		if(visible)
		{
			//get the index of the draw to insert into
			uint batchIndex = instanceBuffer.Instances[gID].batchID;

			//atomic-add to +1 on the number of instances of that draw command
			uint countIndex = atomicAdd(drawBuffer.Draws[batchIndex].instanceCount,1);

			//write the object ID into the instance buffer that maps from gl_InstanceIndex into ObjectID
			uint instanceIndex = drawBuffer.Draws[batchIndex].firstInstance + countIndex;
			finalInstanceBuffer.IDs[instanceIndex] = objectID;
		}
	}
}

instanceBuffer is the AllocatedBuffer<GPUInstance> instanceBuffer from the previous article. Each entry stores an ObjectID and a BatchID (the ID of the indirect draw).

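From the ObjectID we can work out whether that object is visible. Once we have decided that the object is visible, we need to add it to the indirect draw command for that object.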

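When the compute shader starts, drawBuffer already holds all the draw data (vertex count, instance count, and so on), but with instanceCount set to 0. The compute shader does an atomic +1 on instanceCount and uses the returned value to "reserve" a slot, then stores the ObjectID in that slot of finalInstanceBuffer. finalInstanceBuffer is then used in the vertex shader to access the ObjectID.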
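The slot reservation can be emulated on the CPU to see how the pieces fit together. This is only a sketch: Instance, DrawBatch and cullInstances are hypothetical stand-ins for the GPU buffers and the shader body, the visible vector replaces IsVisible(), and the loop runs sequentially where the GPU would run invocations in parallel.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side mirrors of the GPU structures used by the culling shader.
struct Instance  { uint32_t objectID; uint32_t batchID; };
struct DrawBatch {
    uint32_t firstInstance = 0;               // start of this draw's slot range
    std::atomic<uint32_t> instanceCount{0};   // reset to 0 before culling
};

// Emulates one shader invocation per instance. visible[objectID] stands in for
// IsVisible(objectID). Returns the list the vertex shader would index per instance.
std::vector<uint32_t> cullInstances(const std::vector<Instance>& instances,
                                    std::vector<DrawBatch>& draws,
                                    const std::vector<bool>& visible)
{
    std::vector<uint32_t> finalIDs(instances.size(), UINT32_MAX);
    for (const Instance& inst : instances) {
        assert(inst.batchID < draws.size());
        if (!visible[inst.objectID]) continue;

        DrawBatch& draw = draws[inst.batchID];
        // fetch_add returns the value before the increment: that is the reserved slot.
        uint32_t slot = draw.instanceCount.fetch_add(1);
        finalIDs[draw.firstInstance + slot] = inst.objectID;
    }
    return finalIDs;
}
```

Each draw owns a contiguous range of slots starting at firstInstance, so surviving objects of one batch never overwrite those of another. Only the order inside a range depends on execution order.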

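This is a way of working around the fact that we are not using DrawIndirectCount. If we were, another possibility would be to give each object its own indirect draw command and then compact the command array down to the surviving objects. There are many ways to build this part, and the best one depends on what your engine does and on the target hardware.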

Frustum culling function

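With the culling core shown, we now need to decide what the cull function itself does. The culling here is based on the one Arseny showed in his open source Niagara streams, where the original version can be found. The version in the engine is a small tweak of it.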

The first thing we do is frustum culling.

bool IsVisible(uint objectIndex)
{
	//grab sphere cull data from the object buffer
	vec4 sphereBounds = objectBuffer.objects[objectIndex].spherebounds;

	vec3 center = sphereBounds.xyz;
	center = (cullData.view * vec4(center,1.f)).xyz;
	float radius = sphereBounds.w;

	bool visible = true;

	//frustum culling
	visible = visible && center.z * cullData.frustum[1] - abs(center.x) * cullData.frustum[0] > -radius;
	visible = visible && center.z * cullData.frustum[3] - abs(center.y) * cullData.frustum[2] > -radius;

	if(cullData.distCull != 0)
	{// the near/far plane culling uses camera space Z directly
		visible = visible && center.z + radius > cullData.znear && center.z - radius < cullData.zfar;
	}

	visible = visible || cullData.cullingEnabled == 0;

	return visible;
}

As for cullData, it is written from C++ when the compute shader is dispatched. It holds the frustum data and the culling configuration.

We start by grabbing the object's spherebounds using objectIndex. The sphere bounds are recomputed whenever the object moves or is initialized.

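Once we have the sphere, we transform it into view space and check it against the frustum. If the check passes, all is good and we return visible, which is then used when writing the indirect draw commands.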
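The article does not show how the four cullData.frustum values are filled in. One common way, sketched below, is to take the left/right and top/bottom planes of a symmetric perspective projection and normalize them. Everything here is an assumption for illustration: the names SideFrustum, makeSideFrustum and sphereInSideFrustum are hypothetical, view space is taken to have +z pointing into the screen, and P00 and P11 are the x and y scale terms of the projection matrix.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU mirror of cullData.frustum[0..3].
struct SideFrustum { float f[4]; };

// For a symmetric perspective projection whose clip-space w equals view-space z,
// the right plane is (-P00, 0, 1) and the left plane is (P00, 0, 1); abs(x) in the
// test covers both at once. The same holds for y with P11.
SideFrustum makeSideFrustum(float P00, float P11)
{
    float lenX = std::sqrt(P00 * P00 + 1.0f);
    float lenY = std::sqrt(P11 * P11 + 1.0f);
    return { { P00 / lenX, 1.0f / lenX, P11 / lenY, 1.0f / lenY } };
}

// Same two comparisons as the shader: the signed distance from the sphere center
// to each side plane must be greater than -radius.
bool sphereInSideFrustum(const SideFrustum& fr, float cx, float cy, float cz, float radius)
{
    assert(radius >= 0.0f);
    bool visible = true;
    visible = visible && cz * fr.f[1] - std::fabs(cx) * fr.f[0] > -radius;
    visible = visible && cz * fr.f[3] - std::fabs(cy) * fr.f[2] > -radius;
    return visible;
}
```

Because the planes are normalized, the left-hand side is a true distance in view-space units, which is what makes the comparison against the sphere radius meaningful.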

Frustum culling can easily cull away half of the objects, but we can do more.

Occlusion culling

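We want to avoid rendering objects that are completely hidden because they sit behind other objects. To do that, we implement occlusion culling using the depth buffer from the previous frame. This is a very common technique, with the downside of 1 frame of latency. Some engines instead render a few large objects first and use that depth buffer for culling.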

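The normal depth buffer is too detailed for efficient culling, so we convert it into a depth pyramid. The idea of a depth pyramid is to build a mipmap chain for the depth buffer in such a way that each depth value is always the maximum (farthest) depth of the region it covers. That way we can look up directly into the mipmap whose texel size is similar to the object's size, which gives a fairly accurate approximation. Note that the shaders in this article compare depth as znear / z, where farther surfaces have smaller values, so the farthest depth of a region is its minimum stored value.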

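The first step is to keep the depth image around after the main render pass, the same way we do for the shadow map, and then copy it into the depth pyramid.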

//forward pass renders depth into this
AllocatedImage _depthImage;
//pyramid depth used for culling
AllocatedImage _depthPyramid;

//special cull sampler
VkSampler _depthSampler;
//image view for each mipmap of the depth pyramid
VkImageView depthPyramidMips[16] = {};

To build the depth pyramid, we use a compute shader that copies from each mipmap into the next one, performing a reduction.

for (int32_t i = 0; i < depthPyramidLevels; ++i)
	{
		VkDescriptorImageInfo destTarget;
		destTarget.sampler = _depthSampler;
		destTarget.imageView = depthPyramidMips[i];
		destTarget.imageLayout = VK_IMAGE_LAYOUT_GENERAL;

		VkDescriptorImageInfo sourceTarget;
		sourceTarget.sampler = _depthSampler;

		//for the first iteration, we grab it from the depth image
		if (i == 0)
		{
			sourceTarget.imageView = _depthImage._defaultView;
			sourceTarget.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
		}
		//afterwards, we copy from a depth mipmap into the next
		else {
			sourceTarget.imageView = depthPyramidMips[i - 1];
			sourceTarget.imageLayout = VK_IMAGE_LAYOUT_GENERAL;
		}

		VkDescriptorSet depthSet;
		vkutil::DescriptorBuilder::begin(_descriptorLayoutCache, get_current_frame().dynamicDescriptorAllocator)
			.bind_image(0, &destTarget, VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, VK_SHADER_STAGE_COMPUTE_BIT)
			.bind_image(1, &sourceTarget, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, VK_SHADER_STAGE_COMPUTE_BIT)
			.build(depthSet);

		vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, _depthReduceLayout, 0, 1, &depthSet, 0, nullptr);

		uint32_t levelWidth = depthPyramidWidth >> i;
		uint32_t levelHeight = depthPyramidHeight >> i;
		if (levelHeight < 1) levelHeight = 1;
		if (levelWidth < 1) levelWidth = 1;

		DepthReduceData reduceData = { glm::vec2(levelWidth, levelHeight) };

		//execute downsample compute shader
		vkCmdPushConstants(cmd, _depthReduceLayout, VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(reduceData), &reduceData);
		vkCmdDispatch(cmd, getGroupCount(levelWidth, 32), getGroupCount(levelHeight, 32), 1);


		//pipeline barrier before doing the next mipmap
		VkImageMemoryBarrier reduceBarrier = vkinit::image_barrier(_depthPyramid._image, VK_ACCESS_SHADER_WRITE_BIT, VK_ACCESS_SHADER_READ_BIT, VK_IMAGE_LAYOUT_GENERAL, VK_IMAGE_LAYOUT_GENERAL, VK_IMAGE_ASPECT_COLOR_BIT);

		vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_DEPENDENCY_BY_REGION_BIT, 0, 0, 0, 0, 1, &reduceBarrier);
	}
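The loop calls getGroupCount without showing it. A plausible definition, assumed here rather than taken from the engine source, is a ceiling division so that a level whose size is not a multiple of the 32-wide workgroup still gets fully covered. The levelExtent helper is hypothetical and only restates the shift-and-clamp from the loop.

```cpp
#include <cassert>
#include <cstdint>

// Assumed behavior of getGroupCount: number of workgroups of size localSize
// needed to cover threadCount invocations (rounded up).
uint32_t getGroupCount(uint32_t threadCount, uint32_t localSize)
{
    assert(localSize > 0);
    return (threadCount + localSize - 1) / localSize;
}

// Size of one pyramid level along one axis, clamped to at least 1 texel,
// matching the levelWidth / levelHeight computation in the loop.
uint32_t levelExtent(uint32_t baseExtent, uint32_t level)
{
    assert(level < 32);
    uint32_t extent = baseExtent >> level;
    return extent < 1 ? 1 : extent;
}
```

With a rounded-up group count, some invocations in the last workgroup fall outside the level, so a shader dispatched this way normally needs to tolerate or skip out-of-range positions.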

The compute shader looks like this.

void main()
{
	uvec2 pos = gl_GlobalInvocationID.xy;

	// Sampler is set up to do min reduction, so this computes the minimum depth of a 2x2 texel quad
	float depth = texture(inImage, (vec2(pos) + vec2(0.5)) / imageSize).x;

	imageStore(outImage, ivec2(pos), vec4(depth));
}

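The real trick is the sampler of the texture. Here we use a common extension that computes the minimum of a 2x2 texel quad instead of averaging it, as a LINEAR filter normally would. The same sampler is also used in the culling shader, and it is created like this.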
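To make the effect of the min-reduction sampler concrete, here is a CPU sketch of what one downsample step computes. It assumes the source level is exactly twice the size of the destination in both dimensions, so that sampling at the center of a destination texel touches exactly one 2x2 quad of the source. reduceMin2x2 is a hypothetical helper, not part of the engine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reduces a row-major width x height depth level to half size in each dimension
// by keeping the minimum of every 2x2 quad. Assumes even width and height.
std::vector<float> reduceMin2x2(const std::vector<float>& src,
                                std::size_t width, std::size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == width * height);

    const std::size_t outW = width / 2;
    const std::size_t outH = height / 2;
    std::vector<float> dst(outW * outH);

    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            const float a = src[(2 * y) * width + 2 * x];
            const float b = src[(2 * y) * width + 2 * x + 1];
            const float c = src[(2 * y + 1) * width + 2 * x];
            const float d = src[(2 * y + 1) * width + 2 * x + 1];
            dst[y * outW + x] = std::min({a, b, c, d});
        }
    }
    return dst;
}
```

Applying this repeatedly until the level is 1x1 produces the whole pyramid. On the GPU the sampler hardware does the four fetches and the min in a single texture() call.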

	VkSamplerCreateInfo createInfo = {};

	//fill the normal stuff
	createInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
	createInfo.magFilter = VK_FILTER_LINEAR;
	createInfo.minFilter = VK_FILTER_LINEAR;
	createInfo.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
	createInfo.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
	createInfo.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
	createInfo.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
	createInfo.minLod = 0;
	createInfo.maxLod = 16.f;

	//add an extension struct to enable Min mode
	VkSamplerReductionModeCreateInfoEXT createInfoReduction = {};

	createInfoReduction.sType = VK_STRUCTURE_TYPE_SAMPLER_REDUCTION_MODE_CREATE_INFO_EXT;
	createInfoReduction.reductionMode = VK_SAMPLER_REDUCTION_MODE_MIN;
	createInfo.pNext = &createInfoReduction;


	VK_CHECK(vkCreateSampler(_device, &createInfo, 0, &_depthSampler));

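This is an extension (VK_EXT_sampler_filter_minmax) that has to be enabled, but it is supported pretty much everywhere, even on the Switch. In Vulkan 1.2 it was promoted to core, with support reported through the samplerFilterMinmax feature. See the Vulkan specification for details.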

With the depth pyramid written, we can finally use it in the culling shader.

	//frustum stuff from before

	visible = visible || cullData.cullingEnabled == 0;

	//flip Y because we access depth texture that way
	center.y *= -1;

	if(visible && cullData.occlusionEnabled != 0)
	{
		//project the cull sphere into screenspace coordinates
		vec4 aabb;
		if (projectSphere(center, radius, cullData.znear, cullData.P00, cullData.P11, aabb))
		{
			float width = (aabb.z - aabb.x) * cullData.pyramidWidth;
			float height = (aabb.w - aabb.y) * cullData.pyramidHeight;

			//find the mipmap level that will match the screen size of the sphere
			float level = floor(log2(max(width, height)));

			//sample the depth pyramid at that specific level
			float depth = textureLod(depthPyramid, (aabb.xy + aabb.zw) * 0.5, level).x;

			float depthSphere = cullData.znear / (center.z - radius);

			//if the depth of the sphere is in front of the depth pyramid value, then the object is visible
			visible = visible && depthSphere >= depth;
		}
	}

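We find the AABB that covers the sphere in screen space, then sample the depth pyramid at the center of that AABB, in the mipmap where the AABB is roughly the size of one texel.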
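The level selection and the final depth comparison are plain arithmetic, so they can be checked on the CPU. The sketch below mirrors those two steps only; projectSphere is not reproduced. ScreenAabb, selectPyramidLevel and passesOcclusionTest are hypothetical names, and the box is assumed to be in 0..1 UV space as in the shader.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Screen-space box in UV coordinates, laid out like aabb.xyzw in the shader.
struct ScreenAabb { float minX, minY, maxX, maxY; };

// Picks the mip level whose texels are about as large as the box, using the
// same floor(log2(max(width, height))) formula as the culling shader.
float selectPyramidLevel(const ScreenAabb& aabb, float pyramidWidth, float pyramidHeight)
{
    float width  = (aabb.maxX - aabb.minX) * pyramidWidth;
    float height = (aabb.maxY - aabb.minY) * pyramidHeight;
    return std::floor(std::log2(std::max(width, height)));
}

// Depth is stored as znear / z, so larger values are closer to the camera.
// The sphere survives if its nearest point is at or in front of the occluder depth.
bool passesOcclusionTest(float znear, float centerZ, float radius, float pyramidDepth)
{
    assert(centerZ - radius > 0.0f);
    float depthSphere = znear / (centerZ - radius);
    return depthSphere >= pyramidDepth;
}
```

For example, a box about 102 texels wide on a 1024-wide pyramid selects level 6, where one texel spans 64 base texels, so the sampler's 2x2 footprint covers the whole box.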

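This depth pyramid logic is very similar, if not nearly identical, to the culling system used in Unreal Engine. They do not have indirect drawing there, so they do the culling in a shader and output into an array of visible objects. That array is then read back from the CPU to know which objects are visible.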

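With this last piece of the puzzle, the engine can now render large scenes at very high performance, since it only renders what is visible on screen, with no round trips to the CPU.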

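The depth pyramid has one frame of latency, but this can be worked around by making the culling spheres slightly larger to make up for the disconnect, as Unreal Engine does, where culling has 3-4 frames of latency. Another possible implementation is to reproject the previous frame's depth and combine it with a few very large objects from the new frame. Assassin's Creed covered this in their presentation, and Dragon Age: Inquisition does something similar.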

Culling and transparency sorting

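Sadly, doing culling and sorting at the same time is a very hard problem, so in the engine we simply do not sort transparent objects. The culling uses GPU atomics, and the order in which they execute depends on how threads happen to be scheduled on the GPU hardware. This means the output has no stable order at all, and the final render layout can differ from run to run.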

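Even so, there are workarounds. If, instead of instanced indirect draws, we used one draw command per object and set its instance count to 0 when it is culled, we could keep the order. But doing that leaves us with zero-sized draws, which still cost performance, so it is not a great solution.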
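The order-preserving workaround can be sketched as follows. DrawCommand repeats the field layout of VkDrawIndexedIndirectCommand so the example compiles without the Vulkan headers, and cullInPlace is a hypothetical CPU stand-in for what the culling shader would do with one invocation per command.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same fields, in the same order, as VkDrawIndexedIndirectCommand.
struct DrawCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// One command per object, already sorted (for example back to front).
// Culling only toggles instanceCount, so the sorted order is never disturbed.
void cullInPlace(std::vector<DrawCommand>& sortedDraws, const std::vector<bool>& visible)
{
    assert(sortedDraws.size() == visible.size());
    for (std::size_t i = 0; i < sortedDraws.size(); ++i) {
        sortedDraws[i].instanceCount = visible[i] ? 1u : 0u;
    }
}
```

No atomics are involved because every invocation writes only its own command, which is exactly why the order stays stable, and also why the culled commands remain in the buffer as empty draws.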

Another possibility is to sort on the GPU itself, but GPU sorting is a nontrivial operation, so it was left out of the tutorial as out of scope.

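The last possibility is order-independent transparency. With it, transparent objects need no sorting at all, at the cost of significantly more expensive rendering.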