Does the same usda file translate into the same usdc data?

Hello everyone, I’m trying to find out if a layer has changed by looking at its usdc data, but I’m finding that for the same content (the same usda) I can get two different sets of usdc bytes.
I’ve put together this little script that creates two layers containing the same prim. If I Clear() the second layer, recreate the same prim, and save it, I get two different hashes even though the content of the two layers is identical.

import hashlib
from pxr import Sdf

# Create a layer with a single prim and save it.
filepath1 = 'file1.usdc'
layer1 = Sdf.Layer.CreateNew(filepath1)
Sdf.CreatePrimInLayer(layer1, '/world')
layer1.Save()

# Hash the raw bytes on disk.
with open(filepath1, 'rb') as f:
    data = f.read()
print(layer1.ExportToString())
print(data)
print(hashlib.md5(data).hexdigest())

# Create a second layer with the same prim, then clear it,
# recreate the same prim, and save again.
filepath2 = 'file2.usdc'
layer2 = Sdf.Layer.CreateNew(filepath2)
Sdf.CreatePrimInLayer(layer2, '/world')
layer2.Save()
layer2.Clear()
Sdf.CreatePrimInLayer(layer2, '/world')
layer2.Save()

# The text export matches file1, but the byte hash differs.
with open(filepath2, 'rb') as f:
    data2 = f.read()
print(layer2.ExportToString())
print(data2)
print(hashlib.md5(data2).hexdigest())

If I don’t Clear(), CreatePrimInLayer(), and Save() again, the hashes are identical.
Looking at the documentation, the Clear() method is undo-able, so does it store extra data or timestamp info? https://openusd.org/release/api/class_sdf_layer.html#a9013e716d1676f98b48ab913031e6d01

Thanks for the help

Crate files don’t necessarily store their data in the exact same order every time. I think you can get lucky and they might match up, but as far as I know, there’s no such guarantee in the crate writer.

As for your specific example, I think it’s small enough that it should be identical. I’ll look into it because I’m curious, but again, I don’t think you should expect repeatable file layout.

So I like using ImHex to investigate and diff files, and it gives me a byte-level diff of the two files.

To my eye that is showing that there’s a difference in where some padding is going between writes, which is changing where the offsets point to.
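If you don’t have a hex editor handy, a quick Python sketch can surface the same information (a hypothetical byte_diff helper, nothing USD-specific; byte values print as ints):

import itertools

def byte_diff(path_a, path_b):
    # Read both files and report every offset where the bytes disagree,
    # including any length difference at the tail (None past the end).
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        a, b = fa.read(), fb.read()
    for offset, (x, y) in enumerate(itertools.zip_longest(a, b)):
        if x != y:
            print(f'0x{offset:08x}: {x} != {y}')

byte_diff('file1.usdc', 'file2.usdc')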

Otherwise, when parsed, this is the dictionary output of the two files, which is identical:

file1.usdc

{'/': (PseudoRoot, {'primChildren': TokenVector:['world']}),
 '/world': (Prim, {})}

file2.usdc

{'/': (PseudoRoot, {'primChildren': TokenVector:['world']}),
 '/world': (Prim, {})}

@alexmohr might have some more thoughts here, but I think it’s just down to the ordering shifts between writes.

Right – in addition to padding bits we use hash tables in the .usdc implementation, and don’t take pains (or the perf cost) to sort everything when we write, so just doing usdcat on a .usdc file is liable to produce results that differ bitwise, but not content-wise.

Even if we did ensure that usdcat with a .usdc always produced bitwise-identical results, you can still run into trouble because an incremental Save() of a .usdc does not in general rewrite the whole file. It’s sort of “journaled”. So an edited and Save()d .usdc file would still differ bitwise from that same file run through usdcat.
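To make that concrete, here’s a small sketch using the files from the original script: Export() writes a fresh, complete serialization of the layer’s current content, so comparing it byte-wise against the incrementally Save()d file is liable to show a mismatch even though the content is the same.

from pxr import Sdf

# file2.usdc was edited and Save()d incrementally; Export() rewrites
# the same content from scratch into a new file.
layer = Sdf.Layer.FindOrOpen('file2.usdc')
layer.Export('file2_rewritten.usdc')

with open('file2.usdc', 'rb') as f1, open('file2_rewritten.usdc', 'rb') as f2:
    # Liable to be False: same content, different bytes.
    print(f1.read() == f2.read())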

I think if we want content fingerprint hashing, it would be best to do it at the SdfLayer level, so that it would work automatically and identically for every file format that USD understands, and would avoid any issues with .usdc or any other “database-esque” formats. And you could do things like freely flip/flop your .usd file between text and binary without changing the content fingerprint hash.

Of course that’s a small project. Today I think the best way to get consistent results is to usdcat the .usd file to .usda and hash the output.
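The Python equivalent of that usdcat round-trip, sketched with the same ExportToString() call from the original script (content_fingerprint is a hypothetical helper; it hashes the text serialization rather than the bytes on disk):

import hashlib
from pxr import Sdf

def content_fingerprint(path):
    # Open the layer through Sdf (any registered file format works) and
    # hash its .usda text serialization instead of the on-disk bytes.
    layer = Sdf.Layer.FindOrOpen(path)
    return hashlib.md5(layer.ExportToString().encode('utf-8')).hexdigest()

# The two crate files from the original script fingerprint identically.
print(content_fingerprint('file1.usdc') == content_fingerprint('file2.usdc'))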


Thanks for the additional info, Alex. Yeah, I forgot that Save() updates the file in place.

I think usddiff might help here but I haven’t looked at its implementation.

In short, usddiff just usdcats to .usda and runs your diff tool of choice.
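Something like this rough Python equivalent, if you want it programmatically (usd_text_diff is a hypothetical helper, with difflib standing in for your diff tool of choice):

import difflib
from pxr import Sdf

def usd_text_diff(path_a, path_b):
    # Serialize both layers to .usda text and run a unified diff,
    # which is roughly what usddiff does via usdcat.
    a = Sdf.Layer.FindOrOpen(path_a).ExportToString().splitlines()
    b = Sdf.Layer.FindOrOpen(path_b).ExportToString().splitlines()
    return '\n'.join(difflib.unified_diff(a, b, path_a, path_b))

print(usd_text_diff('file1.usdc', 'file2.usdc') or 'no content differences')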

Ah, that’s useful but different from what I assumed.

Might be a cool future project to have an Sdf-level diff and merge tool. Could be handy for merge conflicts etc. in git for crate files.
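For the git side specifically, wiring usdcat in as a textconv driver might already get you readable diffs for crate files (a sketch, assuming usdcat is on your PATH; with no output file, usdcat prints .usda text to stdout):

# .gitattributes
*.usd diff=usd
*.usdc diff=usd

# .git/config (or ~/.gitconfig)
[diff "usd"]
    textconv = usdcat

With that, git diff shows content-level .usda diffs instead of binary noise, though merging would still need a proper structural tool.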

Definitely – usddiff plus usdedit is almost there – usdedit converts your layer to .usda, pops you into your editor of choice, and when you’re done writes it back in the original format, so you don’t have to think about what type of file you’re dealing with.

A USD-specific merge tool would be interesting, since it could operate not only at the bare textual level, but also at the higher structural content level. Something like a user-guided selective “flattening” where the prim hierarchy and listops and dictionaries and so on are understood.

Yeah exactly. I think being able to do it at a structural level would mean you’d not be dependent on the ordering of prims for textual diffs, and you’d also be able to handle crate files.

We’ve done something similar for other structural data types in our pipelines and it’s super handy.
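As a toy version of that idea, a structural comparison could map each prim path to the fields authored on it, which sidesteps byte layout and prim ordering entirely (prim_structure is a hypothetical sketch over Sdf prim specs; a real tool would compare field values, listops, and so on too):

from pxr import Sdf

def prim_structure(layer):
    # Map every prim path in the layer to the set of fields authored
    # on it; dict/set comparison is independent of authoring order.
    result = {}
    def walk(prim):
        result[prim.path] = set(prim.ListInfoKeys())
        for child in prim.nameChildren:
            walk(child)
    for root in layer.rootPrims:
        walk(root)
    return result

a = prim_structure(Sdf.Layer.FindOrOpen('file1.usdc'))
b = prim_structure(Sdf.Layer.FindOrOpen('file2.usdc'))
print(a == b)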