Performance issues with prim reparenting and creation due to UsdStage::_Recompose

Hi there, I am working on integrating USD/Hydra into an application via the provided C++ libraries, to visualise and edit a scene made of various 3D assets. That scene is created in memory via the USD library using DefinePrim. We are using a couple of UsdGeomXforms for each asset, and the project is still in a very early state of development. We are still learning a lot.

One critical issue we are currently facing with USD is that changing the stage structure (reparenting, creating, and deleting prims) is incredibly slow. I tested creating 90k prims in the root layer via UsdGeomXform::Define(), doing some heavy reparenting on the nodes, and then deleting them. That takes nearly 2 minutes on my machine, while our old render engine does the equivalent roughly 1000 times faster.

I already optimised the reparents with batching (collecting reparents in an SdfBatchNamespaceEdit until we need to query into the stage, then executing them with SdfLayer::Apply()). I know that I should also do this with the Defines (use the Sdf functions in conjunction with an SdfChangeBlock). However, executing Apply() on the collected heavy reparentings still takes around 8 seconds. That is way too much for 10k reparentings, and we need to reparent and make stage graph changes frequently. I profiled this and saw that the time is spent in UsdStage::_Recompose(), probably recomposing the complete scene, which accounts for the 8 seconds.
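
For illustration, the batching looks roughly like this (a minimal sketch in Python rather than our actual C++ code; the paths are placeholders):

from pxr import Sdf, Usd, UsdGeom

stage = Usd.Stage.CreateInMemory()
layer = stage.GetRootLayer()
for path in ("/old_parent", "/new_parent", "/old_parent/child"):
    UsdGeom.Xform.Define(stage, path)

# Collect as many moves as possible, then apply them to the layer at once,
# so the stage recomposes once instead of once per reparent.
edit = Sdf.BatchNamespaceEdit()
edit.Add("/old_parent/child", "/new_parent/child")
if not layer.Apply(edit):
    print("batch namespace edit could not be applied")
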
I have already seen some older and newer posts suggesting that stage recomposition is simply slow and that one should avoid changing the stage frequently. However, in our case we need to change the stage frequently, in real time.

I learned from another thread that NVIDIA and Pixar already use custom implementations to handle performance problems like this, for example NVIDIA's UsdRT and Fabric. But those don't seem to be freely available for integration outside of Omniverse. So I guess the solution for us is either to wait for an official USD integration of UsdRT/Fabric (or similar optimisations) or to build one ourselves. As we currently have very little manpower on this project, we would like to avoid the second option.

So it would be good to know whether such an integration is coming in the near future, in which case we could accept the bad performance for now and optimise later, once it is available.

It would also be good to know whether we are missing something essential here and there is a solution to our performance problem that we haven't seen yet.

Also: are there any numbers on what performance we can expect from reparenting/creating/deleting prims with USD? I only saw one thread where someone from VFX wrote that recomposing their stage (probably millions of prims) takes 30 seconds.

Kind regards,

Robert

There is a lot left unsaid here, but I can throw out a few numbers that you can reproduce yourself by downloading Houdini Apprentice…

In Houdini, I used a Primitive LOP to create 10,000 Xform prims of the form:
/parent_i/child_j
for i in range(1…100) and j in range(1…100). This took 0.05 seconds to run, with 0.001 seconds of that time spent composing the stage. Most of the 0.05 is probably overhead of Houdini generating then parsing the string with the 10K prim names in it.
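
(Outside Houdini, a roughly equivalent pure-Python test is sketched below; timings will of course vary by machine, and the prim names are just placeholders.)

import time
from pxr import Sdf, Usd

stage = Usd.Stage.CreateInMemory()
layer = stage.GetRootLayer()

start = time.perf_counter()
with Sdf.ChangeBlock():
    for i in range(1, 101):
        for j in range(1, 101):
            # Ancestor /parent_i specs are created implicitly (as overs).
            spec = Sdf.CreatePrimInLayer(layer, "/parent_%d/child_%d" % (i, j))
            spec.specifier = Sdf.SpecifierDef
            spec.typeName = "Xform"
print("10,000 Xform prims in %.3f s" % (time.perf_counter() - start))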

Now, these prims are all empty Xforms… If each of them contained a reference, things would get more expensive. So I added a Python LOP which adds a reference to each of these 10K prims using this code:

from pxr import Sdf

node = hou.pwd()              # this Python LOP node ('hou' is built in to Houdini)
layer = node.editableLayer()  # the active layer this LOP edits

with Sdf.ChangeBlock():
    for path in node.input(0).lastModifiedPrims():
        spec = layer.GetPrimAtPath(path)
        spec.referenceList.Prepend(Sdf.Reference(hou.text.expandString('$HFS') + '/houdini/usd/assets/rubbertoy/rubbertoy.usd'))

This took 0.1 seconds to run, and about 0.03 seconds to compose (the asset being referenced is only 10 prims).

Given that, 8 seconds sounds like a lot for composition unless your assets have a huge prim count. So I suspect there are efficiencies to be gained by using raw Sdf APIs and SdfChangeBlocks?

I’m making an assumption here that you can structure your USD stage as basically a single layer referencing a bunch of assets, but I think that’s reasonable given your description of what you’re doing.

I also don’t know what sort of namespace rearranging you are doing, but if creating the whole scene from scratch can be done in 0.11 seconds, then even in the worst case where every single prim has to change its path, you could always rebuild the stage from scratch each time, should editing the existing stage prove impossible to make faster?

I will also say that for a lot of use cases, even 0.1 seconds is a “long time” and so if people are saying recomposing is slow, it’s partly a matter of perspective. I have seen multi-million prim scenes compose in less than 10 seconds (very machine dependent because composition is highly threaded). Is that slow? I guess it depends on your use case? But I’m always very wary of sweeping statements about USD being “fast” or “slow” at certain tasks.

Hope that helps,
Mark

A couple of further things to add to Mark’s great points…

  1. There are a couple of remaining, difficult-to-eradicate quadratic behaviors in authoring (or mutating) sibling-prim lists. So a good rule of thumb for OpenUSD scene structuring, which probably should be in our FAQ, is to limit the number of child prims any prim has to hundreds, rather than thousands or higher orders of magnitude, using intermediate prims to introduce a more hierarchical structure (see the sketch after this list). Not sure whether this applies to your experiments, but it is good to keep in mind.
  2. It sounds like you are already using the most efficient available mechanism for performing your reparenting in a single batch, and I am also a little surprised that your renames are taking 8 seconds in a scene of 100K prims, if your machine has a dozen or more cores…
  3. In any case, we are actively working on our UsdStage-level Namespace Editing project which, in addition to making reparenting much more robust and able to operate on scenes composed of many layers, will also focus on making reparenting efficient both internally to the USD core, and for downstream clients (like Hydra imaging). We suspect that your profile reveals further that most of the time inside _Recompose() is actually spent inside Pcp (the composition engine) recomputing PcpPrimIndices, and upgrading that code to benefit from knowledge that we are reparenting is likely one of the trickiest pieces to attack, and we probably won’t attempt it until after the first deliverable.
  4. I’m not sure Fabric addresses reparenting in a way that can be robustly transferred back to a UsdStage (and if it does, it would presumably incur the same costs when doing so) - though I could definitely be wrong. We’re also not aware of any plans for Fabric to be contributed back to OpenUSD.
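
To make point 1 concrete, here is a minimal sketch of that bucketing idea (the bucket size, paths, and names are arbitrary):

from pxr import Sdf

layer = Sdf.Layer.CreateAnonymous()
BUCKET = 256  # keep sibling counts in the hundreds

with Sdf.ChangeBlock():
    for i in range(10000):
        # /root/group_NN/item_NNNNN instead of 10k siblings directly under /root
        spec = Sdf.CreatePrimInLayer(layer, "/root/group_%02d/item_%05d" % (i // BUCKET, i))
        spec.specifier = Sdf.SpecifierDef
        spec.typeName = "Xform"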

Cheers!

Thank you a lot for providing the numbers and insights; that helps a lot! We may have triggered one of those quadratic behaviours, which slowed down the composition process. In the meantime I did a second optimisation pass, in which I also collected property changes, creations, and deletions and executed them via Sdf layer functions inside an SdfChangeBlock. That sped up the overall process by roughly 100x: to around 2.3 seconds with 3 recompositions, and to around half a second with everything done in a single composition. What the test does exactly: 10,000 times, create 2 x 5 UsdGeomXforms parented like 0 <= (1 <= (2, 3), 4); then, 10,000 times, reparent the root of the first 5 Xforms to the root of the second 5. After that a visibility attribute is set on 10k nodes, and then all prims are deleted (also via batch edit).

I also confirmed that the time is indeed spent in Pcp, as you suggested. So I am not really sure what suddenly made the reparenting part 8 times faster (it was already done as a batch edit before); it must be some optimising side effect of the batched creations and property edits. We have 2 x 10k sibling nodes parented directly under the scene root and reparent them onto each other. That probably triggers the quadratic behaviour and may explain the difference from Mark's 0.03 sec. When we do further optimisations I will try splitting up that hierarchy as you suggested, but for now the performance we get from the second optimisation pass is enough for a start.
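
For reference, the batched property editing I mean looks roughly like this (a minimal Python sketch with placeholder paths, not our actual C++ code):

from pxr import Sdf, Usd, UsdGeom

stage = Usd.Stage.CreateInMemory()
layer = stage.GetRootLayer()

with Sdf.ChangeBlock():
    for i in range(10000):
        spec = Sdf.CreatePrimInLayer(layer, "/node_%d" % i)
        spec.specifier = Sdf.SpecifierDef
        spec.typeName = "Xform"
        # Author the visibility attribute directly as an Sdf attribute spec.
        attr = Sdf.AttributeSpec(spec, UsdGeom.Tokens.visibility, Sdf.ValueTypeNames.Token)
        attr.default = UsdGeom.Tokens.invisible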

Cheers,
Robert

Hello,

I am glad to see this topic being discussed. I recently needed to create prims in bulk, and I went back to the old SIGGRAPH paper from 2019, which mentioned using SdfChangeBlock and the Sdf API for better performance. I have written simple performance tests that create 250K override prims, in both C++ and Python. I was surprised that the performance difference between C++ and Python wasn't large when I used SdfChangeBlock and SdfJustCreatePrimInLayer in both cases. I got the following results:

250K Override Prims: 1.603457 sec for C++ and 1.918840837 sec for Python 3.9

1M Override Prims: 6.943886 sec for C++ and 8.470245437 sec for Python 3.9

For simple cases like this, it is still possible to go through downstream libraries like UsdStage and use, for example, stage->OverridePrim(), but performance will be worse.

Without SdfChangeBlock, creating prims in bulk is very slow. For example, I got around 125 seconds for 250K override prims in C++ and about 108 minutes in Python 3.9 (I decided not to run it multiple times, so maybe something was wrong with my machine at the time).
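
A rough Python-side illustration of the comparison (a sketch only; it uses Sdf.CreatePrimInLayer, which like SdfJustCreatePrimInLayer authors "over" prim specs, and timings will differ from mine, especially if a UsdStage is listening to the layer):

import time
from pxr import Sdf

def make_overs(n, batched):
    layer = Sdf.Layer.CreateAnonymous()
    start = time.perf_counter()
    if batched:
        # One change block around the whole loop batches all notifications.
        with Sdf.ChangeBlock():
            for i in range(n):
                Sdf.CreatePrimInLayer(layer, "/over_%d" % i)
    else:
        for i in range(n):
            Sdf.CreatePrimInLayer(layer, "/over_%d" % i)
    return time.perf_counter() - start

print("batched:   %.3f s" % make_overs(250000, True))
print("unbatched: %.3f s" % make_overs(250000, False))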

Here is my code, in case someone wants to test it:
https://github.com/zbyshek/USD_performance_tests/blob/main/src/main.cpp

Cheers,
Alexey