I’d like to fetch the sizes of VtArrays stored in a usdc file without reading the actual array content from disk, and especially without the memory allocation and decompression work.
It seems that _ReadPossiblyCompressedArray is the only place that currently reads the VtArray size, but it resizes the VtArray on the same line (e.g. https://github.com/PixarAnimationStudios/OpenUSD/blob/10b62439e9242a55101cf8b200f2c7e02420e1b0/pxr/usd/usd/crateFile.cpp#L1955).
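To illustrate the kind of fast path being asked for, here is a minimal sketch in Python. It assumes a toy record layout (a 64-bit element count followed by an int32 payload), which is not the actual crate file format; `write_array` and `peek_array_size` are hypothetical helpers, not OpenUSD functions.

```python
import io
import struct


def write_array(stream, values):
    """Write a toy record: uint64 element count, then an int32 payload."""
    stream.write(struct.pack("<Q", len(values)))
    stream.write(struct.pack(f"<{len(values)}i", *values))


def peek_array_size(stream, offset):
    """Return the element count stored at `offset` without reading the payload."""
    stream.seek(offset)
    (count,) = struct.unpack("<Q", stream.read(8))
    return count


buf = io.BytesIO()
write_array(buf, list(range(1000)))
size = peek_array_size(buf, 0)
bytes_consumed = buf.tell()  # only the 8-byte count was read
```

The point of the sketch is that the size query touches only the fixed-size header, so no payload-sized buffer is allocated or decompressed.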
I’m currently using the Usd API to access the stage, and it would be nice to have a new API at the UsdAttribute level to read array sizes, that would expose a fast path to array metadata.
N.B. this would be similar to another potential API to retrieve the hash of the array content, but that will be for another time!
Note that the array is only allocated and fully processed if it is compressed, which applies only to integer arrays and a small class of floating-point arrays (ones that are determined to contain only integer values, which comes up in some of our primvars). So you should be getting the “zero copy, no memory mapped or allocated” behavior when fetching the size of most arrays.
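As a rough illustration of what "contains only integer values" means for a floating-point array, here is a small Python sketch. The function name is hypothetical, and the real test in crateFile.cpp may apply further criteria (for example limits on the value range), so treat this as an assumption about the general idea only.

```python
def holds_only_integers(values):
    """True if every float in `values` is a whole number (inf and nan are not)."""
    return all(float(v).is_integer() for v in values)


whole = holds_only_integers([0.0, 1.0, -3.0])
fractional = holds_only_integers([0.5, 1.0])
```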
We have considered adding GetArraySize() methods to UsdAttribute in the past, but the addition of zero-copy arrays for crate made it not worth the API concerns and plumbing, for us. If this is something you’d like to explore adding, @aloysbaillet, let us know and we can gather our thoughts.
Indeed not all arrays are affected, but every mesh has at least 2 large int arrays and we’re tackling scenes with 500k+ non-instanced meshes.
Windows is definitely the worst as the system allocator for large arrays is very slow and locks like crazy. Jemalloc makes this much less of an issue on Linux.
If I recall correctly, even calling ValueMightBeTimeVarying() on these int array attributes would cause them to allocate and decompress.
My use case is quite niche: we’re iterating in parallel over large scenes to potentially merge parts of mesh hierarchies, and we gather the sizes of everything in a first pass, which ideally would not cause too much locking (which means no large allocs on Windows). Then we do the merging at a later time with pre-allocated buffers of the right sizes. We also call ValueMightBeTimeVarying on most attributes to check if they can be trivially merged.
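The two-pass pattern described here can be sketched as follows in Python. `query_size` is a hypothetical stand-in for a cheap size query (in this toy it is just `len`), and plain lists stand in for the mesh arrays; none of this is OpenUSD API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def query_size(array):
    """Stand-in for a cheap size query that does not load the array content."""
    return len(array)


def merge_with_preallocation(arrays):
    # Pass 1: gather every size in parallel; no large allocation happens here.
    with ThreadPoolExecutor() as pool:
        sizes = list(pool.map(query_size, arrays))

    # A prefix sum gives each input its destination range in the merged buffer.
    offsets = [0, *accumulate(sizes)]

    # Pass 2: one allocation of the final size, then copy each input into place.
    merged = [0] * offsets[-1]
    for i, array in enumerate(arrays):
        merged[offsets[i]:offsets[i + 1]] = array
    return merged, offsets


merged, offsets = merge_with_preallocation([[1, 2], [3], [4, 5, 6]])
```

Because the destination ranges are disjoint, the copy in the second pass could also be split across threads; it is kept serial here for simplicity.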
Creating a new GetArraySize would definitely be useful in this case, but I can see how big the change would be from an API point of view, just to read an int value at the bottom…
Being able to hook our own allocator might be a more worthwhile engineering investment.
I mostly wanted to make sure I didn’t miss anything.
It’s unexpected to me that ValueMightBeTimeVarying() would cause int-arrays to be decompressed, because that function was designed specifically to only consider existence (count) of timeSamples, rather than comparing any sample values - if you’re observing that, a repro and/or stacktrace would be interesting…
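A minimal Python sketch of that design intent: the decision depends only on how many samples exist, never on their values. The function and the lazily loaded sample table below are hypothetical and are not the OpenUSD implementation.

```python
from itertools import islice


def might_be_time_varying(sample_times):
    """Decide from the sample count alone; stop after seeing a second sample."""
    return len(list(islice(sample_times, 2))) > 1


loaded = []


def expensive_value():
    """Stand-in for reading and decompressing a large array."""
    loaded.append(True)
    return [0] * 1_000_000


# Hypothetical table: time -> loader that would materialize the value.
samples = {1.0: expensive_value, 2.0: expensive_value}
varying = might_be_time_varying(samples)  # iterates the times only
constant = might_be_time_varying({1.0: expensive_value})
```

In this sketch no loader is ever called, which is the behavior one would expect from a count-only query.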
This was quite unexpected indeed. I found a trace with the incriminating stacktrace here:
This was sampled with a highly parallel call to this function, which caused debilitating perf on Windows with the default allocator: 10-40ms to query if an attribute has time samples seemed excessive… We have since worked around this in our code, but it seemed worth reporting.
I’ll try and provide a clear repro step with file and code as a GitHub issue soon.
Oof, look at all those threads lined up waiting for a serialized kick at the heap.
Interestingly I’ve been trying to repro this stack in USD on the dev branch, but I can’t find a way to get UsdStage::_GetResolvedValueImpl to call
The use case that used to trigger this was a fairly complex customer data scene, with indexed primvars causing UsdGeomPrimvar::ValueMightBeTimeVarying to unpack the indices int array for no good reason.
@aloysbaillet, so are you saying that in the dev branch, performance is good and thread contention is low, on the same dataset that was problematic in NVIDIA’s OpenUSD release?
Also, at the risk of taking this thread into fraught territory, does the Windows dev and runtime environment provide no way to configure/select a non-serial memory allocator without building and distributing your own? That seems difficult to believe if so!
jemalloc and tcmalloc both work on Windows, but to my knowledge jemalloc has no built-in way to just magically work there; last time I checked, it involved link shenanigans that interfere with link optimization of every kind.
tcmalloc, on the other hand, does allow easy replacement via a linker flag, such that it replaces malloc and free for all DLLs loaded in a process.
Legend has it that jemalloc outperforms tcmalloc in a concurrency-heavy scenario, but the difficulty of creating a crash-free setup might infinitely outweigh any other consideration. tcmalloc definitely outperforms the standard C allocator single-threaded, but that isn’t the interesting case here.
Maybe someone has explored jemalloc and tcmalloc on Windows at a deeper level and has more up-to-date information.