We have USD files containing meshes with over 100,000 points. This causes usdview to render slowly, especially when accessed over a shared network drive, and we’re unable to achieve a steady 30fps playback.
We’ve observed that after usdview renders a frame once, subsequent renders of the same frame are much faster. Profiling shows that the main bottleneck is in HdStMesh::_UpdateRepr, particularly in HdSceneIndexAdapterSceneDelegate::GetMeshTopology. Since there are no dirties on the second render, it is about 5x faster than the initial one. Similarly, HdStVBOMemoryManager::_StripedBufferArrayRange::CopyData takes significantly longer on the first render.
Based on these findings, to ensure artists can view the USD file at 30fps, we decided to call stageView.renderSinglePass for every frame when opening a stage in usdview. This approach works, but introduces a new issue: the process takes too long. For files with more than 10,000 time samples (we use value clips), it takes over 15 minutes to render the entire stage, forcing users to wait.
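Conceptually, the pre-render pass is just the loop below (a simplified Python sketch; `render_frame` is a placeholder for whatever actually draws one frame — in our case the call into stageView.renderSinglePass — and the progress reporting is illustrative, not a usdview API):

```python
def prerender_range(start, end, render_frame, report_every=100):
    """Render each frame in the inclusive range [start, end] once so the
    renderer's caches are warm before interactive playback begins.

    render_frame: callable taking a frame number; stands in for the real
    per-frame draw call.
    """
    total = end - start + 1
    for i, frame in enumerate(range(start, end + 1)):
        render_frame(frame)
        # Periodic progress feedback so the user isn't staring at a
        # frozen window during a long pre-render.
        if (i + 1) % report_every == 0 or i + 1 == total:
            print(f"pre-rendered {i + 1}/{total} frames")
    return total
```

With 10,000+ time samples this loop is what ends up taking over 15 minutes, which is the latency problem described below.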
I considered rendering multiple frames concurrently: even though HdRenderIndex does a lot of parallel computation in SyncAll, usdview doesn't seem to fully utilize CPU or GPU resources. However, when I tried to add concurrency in UsdImagingGLEngine, I found that Hydra is not thread-safe when marking dirty state or retrieving render parameters.
So, is there any way to render multiple frames concurrently, or another way to pre-cache topology and similar data before playback? We don't need to edit the stage, only review it, which I hope simplifies the problem. Any suggestions are appreciated. Thank you!
Note:
I also tried fetching all time-varying attributes before playback, but as mentioned above, the main bottleneck seems to be in Hydra, so this didn’t help much.
Sending this over to the Hydra category. There is definitely the option of adding frame-caching to usdview for play-blasting, so that playthrough after the first time will be fast, but that would not address the “15 minutes to render all frames” latency.
There isn’t currently a way to ask Hydra to render frames concurrently, but I agree with spiff that adding frame-caching for play-blasting to usdview would be helpful to achieve guaranteed playback rates for heavy scenes.
I assume you are already presenting progress frames to the user so they are getting some visual feedback while the frame sequence is pre-rendering.
Storm is specifically designed to cache mesh topologies, so if only points are moving after the first frame, the time spent processing mesh topologies should be minimal for subsequent frames. I think that aligns with the speedup you are seeing after the first frame has been processed.
If the data starts on network storage then it is easy for storage access time to be the overall bottleneck regardless of the renderer, but that should also improve significantly once the data is present in a local disk cache.
SyncAll is only one of the parallel data-processing points in Hydra Storm. With lots of dynamic mesh point data, you might also see parallel CPU tasks and GPU compute tasks during resource registry Commit.
Thanks to spiff and davidgyu for clarifying that Hydra currently cannot render frames concurrently. Your feedback saved me from spending more time pursuing that path.
While using usdview, I noticed something surprising: even after closing usdview, the cache still seems to remain if I reopen the same file. This means that if I have already played back the entire stage once, I don't need to re-render it the next time. I'm not exactly sure why this happens (I'm on Windows 10 with an NVIDIA GPU), and I also haven't figured out when exactly the cache gets cleared; the only thing I'm certain of is that it is cleared if I restart the computer or render several other files.
This gave me a new idea to address our earlier performance problem.
I wrote a standalone program that creates a GarchGLDebugWindow (similar to the tests inside USD) and instantiates an isolated UsdImagingGLEngine. I then pass it the same stage that usdview currently has open — and it works! If the standalone “renderer” renders the entire stage, the cache is also valid inside usdview.
Because this standalone renderer is independent of usdview, we can simply use multiple processes to start more than one renderer, each responsible for a different frame range of the same stage. In our case, running four processes fully utilizes the GPU and achieves up to 2× faster pre-rendering compared to the single-process approach.
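The driver logic for this is simple. Here is a minimal Python sketch of how we split the stage's frame range across worker processes; `prerender_worker.py` is a hypothetical wrapper script around the standalone renderer described above, not something that ships with USD:

```python
import subprocess
import sys

def split_range(start, end, n_workers):
    """Divide the inclusive frame range [start, end] into up to n_workers
    contiguous, near-equal chunks."""
    total = end - start + 1
    base, extra = divmod(total, n_workers)
    ranges = []
    lo = start
    for i in range(n_workers):
        # The first `extra` chunks get one additional frame each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((lo, lo + size - 1))
        lo += size
    return ranges

def launch_workers(usd_file, start, end, n_workers=4):
    """Start one renderer process per chunk; the caller waits on the
    returned process handles."""
    procs = []
    for lo, hi in split_range(start, end, n_workers):
        procs.append(subprocess.Popen(
            [sys.executable, "prerender_worker.py",
             usd_file, str(lo), str(hi)]))
    return procs
```

For example, `split_range(1, 10, 4)` yields `[(1, 3), (4, 6), (7, 8), (9, 10)]`, so each of the four processes pre-renders roughly a quarter of the stage.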
I’m not sure whether this will work on other systems or hardware, but I hope this idea might help others facing similar playback performance issues.