GPU instanced frustum culling performance

Grimmigbeisser · July 26, 2024, 9:23am

Hi guys,

I am using HgiMetal and HdStorm to render 100k instances of one cube. I set up a prototype containing the geometry and an instancer containing the transforms for the 100k cubes.
I can only render this at around 8 fps, which seems quite low. So I profiled the app with the Xcode Metal profiler and noticed that more than 90% of the time is spent in the compute kernel for GPU frustum culling.
I think this is because the kernel uses only one thread, which in turn is because it uses HdSt_PipelineDrawBatch::_drawItemInstances.size() for the x dimension when dispatching the compute commands. This is 1, not 100k. The actual number of instances could be read with:

    uint32_t const instanceCount =
        _GetInstanceCount(drawItemInstance,
                          dc.instanceIndexBar,
                          traits.instanceIndexWidth);

Could this be a bug, or do I misunderstand the setup of GPU instanced frustum culling? I think the number of threads in the compute kernel should match the number of instances in order to use the compute waves efficiently. With a single thread, the kernel ALU inefficiency is nearly 100%.
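
To illustrate the difference, here is a minimal CPU-side sketch of the two ways to choose the dispatch width. The function names are made up for illustration and are not part of Storm; the per-draw-item instance counts are assumed to be what _GetInstanceCount() would return.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Behaviour as described above: one thread per draw item instance entry.
// (Hypothetical helper, not a Storm function.)
inline uint32_t DispatchWidthPerDrawItem(std::vector<uint32_t> const &instanceCounts)
{
    return static_cast<uint32_t>(instanceCounts.size());
}

// Suggested alternative: one thread per instance, summed over all draw items.
// (Hypothetical helper, not a Storm function.)
inline uint32_t DispatchWidthPerInstance(std::vector<uint32_t> const &instanceCounts)
{
    uint64_t total = 0;
    for (uint32_t count : instanceCounts) {
        total += count;
    }
    // The sketch returns a 32-bit width, so make sure the sum fits.
    assert(total <= std::numeric_limits<uint32_t>::max());
    return static_cast<uint32_t>(total);
}
```

For my scene, with one draw item holding 100k instances, the first function gives a width of 1 and the second gives 100000.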

Cheers,
Robert

davidgyu · August 1, 2024, 5:49pm

That’s an interesting observation! Storm has two code paths for GPU view frustum culling. The HgiGL implementation uses an instanced points draw call, such that each instanced bound is processed by a separate vertex shader invocation.
The HgiMetal and HgiVulkan implementations use a compute shader, and currently a single compute shader invocation processes all instances of an instanced draw item. This is certainly work that could be distributed better across multiple kernel invocations.
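
One possible way to distribute that work is to launch one invocation per instance and let each invocation look up which draw item it belongs to. The following is only a CPU-side sketch of that index mapping under assumed inputs (a list of per-draw-item instance counts); the names are hypothetical and this is not how Storm is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Exclusive prefix sum of per-draw-item instance counts.
// offsets[i] is the flat index of draw item i's first instance;
// offsets.back() is the total number of instances.
inline std::vector<uint32_t> BuildInstanceOffsets(std::vector<uint32_t> const &instanceCounts)
{
    std::vector<uint32_t> offsets(instanceCounts.size() + 1, 0);
    for (std::size_t i = 0; i < instanceCounts.size(); ++i) {
        offsets[i + 1] = offsets[i] + instanceCounts[i];
    }
    return offsets;
}

// Map a flat thread index to (draw item index, instance index within that item).
inline std::pair<uint32_t, uint32_t> ThreadToInstance(std::vector<uint32_t> const &offsets,
                                                      uint32_t threadIndex)
{
    assert(offsets.size() >= 2 && threadIndex < offsets.back());
    // The first offset strictly greater than threadIndex marks the next draw item.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), threadIndex);
    uint32_t const drawItem = static_cast<uint32_t>(it - offsets.begin()) - 1;
    return {drawItem, threadIndex - offsets[drawItem]};
}
```

In a real kernel the same lookup would run on the GPU against a buffer of offsets; draw items with zero instances simply receive no threads in this scheme.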

Grimmigbeisser · August 2, 2024, 5:28am

Hi David,

Thanks again for the info! After a bit more testing, I observed that the bottleneck goes away when using more than one prototype. That is because the kernel then gets invoked more than once, so in most scenarios this bottleneck shouldn’t become visible anyway. Still, I think increasing the thread count for the kernel invocation to min(64, number of instances) for instanced GPU culling in the Metal/Vulkan path would be better.
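
As a rough sketch of what I have in mind, the dispatch size could be computed like this. I am reading min(64, number of instances) as the threadgroup size and adding a group count so that all instances are covered; the struct and function names are made up and the default of 64 is just my suggestion, not a value taken from Storm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: how many threads per group and how many groups.
struct DispatchSize {
    uint32_t threadsPerGroup;
    uint32_t groupCount;
};

// Hypothetical helper: pick a dispatch size for culling instanceCount instances.
inline DispatchSize ComputeCullingDispatch(uint32_t instanceCount,
                                           uint32_t maxThreadsPerGroup = 64)
{
    assert(maxThreadsPerGroup > 0);
    if (instanceCount == 0) {
        return {0, 0};  // nothing to dispatch
    }
    uint32_t const threadsPerGroup = std::min(maxThreadsPerGroup, instanceCount);
    // Ceiling division without overflow: enough groups to cover every instance.
    uint32_t const groupCount = instanceCount / threadsPerGroup
                              + (instanceCount % threadsPerGroup != 0 ? 1u : 0u);
    return {threadsPerGroup, groupCount};
}
```

For 100k instances this gives 64 threads per group and 1563 groups. The last group is only partly filled, so the kernel would need to return early for thread indices at or beyond the instance count.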

Cheers,
Robert
