Hi guys,
I am using HgiMetal and HdStorm to render 100k instances of a single cube. I set up a prototype containing the geometry and an instancer containing the transforms for the 100k cubes.
I can only render this at around 8 fps, which seems quite low. So I profiled the app with the Xcode Metal profiler and noticed that more than 90% of the time is spent in the compute kernel for GPU frustum culling.
I think this is because the kernel runs on only one thread, which in turn is because HdSt_PipelineDrawBatch::_drawItemInstances.size() is used for the x dimension when dispatching the compute commands. In my scene this is 1, not 100k. The actual number of instances could be read with:
    uint32_t const instanceCount =
        _GetInstanceCount(drawItemInstance,
                          dc.instanceIndexBar,
                          traits.instanceIndexWidth);
Could this be a bug, or am I misunderstanding how GPU frustum culling is set up for instanced draws? I would expect the number of threads in the compute kernel to match the number of instances, so that the compute waves are used efficiently. With a single thread, the kernel's ALU inefficiency is nearly 100%.
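To make the difference concrete, here is a minimal standalone sketch of the two ways the dispatch size could be computed. This is not Storm code: DrawItemSketch, ThreadsPerDrawItem, ThreadsPerInstance and ThreadgroupCount are names I made up for illustration, and the threadgroup size of 64 used in the note below is only an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one draw item instance; NOT an actual Storm type.
struct DrawItemSketch {
    uint32_t instanceCount;  // number of instances drawn by this item
};

// What I observe: the dispatch width equals the number of draw items.
uint32_t ThreadsPerDrawItem(const std::vector<DrawItemSketch>& items) {
    return static_cast<uint32_t>(items.size());
}

// What I expected: the dispatch width equals the total number of instances.
uint32_t ThreadsPerInstance(const std::vector<DrawItemSketch>& items) {
    uint32_t total = 0;
    for (const DrawItemSketch& item : items) {
        total += item.instanceCount;
    }
    return total;
}

// Number of threadgroups needed to cover threadCount threads (rounded up).
uint32_t ThreadgroupCount(uint32_t threadCount, uint32_t threadgroupSize) {
    assert(threadgroupSize > 0);
    return (threadCount + threadgroupSize - 1) / threadgroupSize;
}
```

For my scene (one draw item with 100000 instances), the first variant gives a dispatch width of 1 and the second gives 100000. With an assumed threadgroup size of 64, that is 1 threadgroup with a single busy lane versus 1563 fully occupied threadgroups.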
Cheers,
Robert