Hash verification/regeneration APIs #5867
Comments
Akio found updateHashValues in the code, but this will regenerate ALL hash values. An API endpoint to verify existing checksums and populate empty checksum fields would be a home run. Red Sox land seems like a good place to ask for those ;)
Ah, I see what you mean. updateHashValues is documented at http://guides.dataverse.org/en/4.14/installation/config.html#filefixitychecksumalgorithm and it looks like it was added by @qqmyers in pull request #5035. Over at #4131 (comment) I see that @landreev wrote "Recalculated and added the missing MD5s." @donsizemore it sounds like you want a little more control over the process, maybe updating one file at a time or all the files that don't have a checksum. And, like you said, a read-only "tell me if the checksum still matches" API endpoint.
@pdurbin not so much about control as integrity: checksums are generated at upload, and to test for, say, bitrot on our Dataverse storage we wouldn't want to regenerate existing checksums.
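To illustrate the distinction being drawn here, a minimal Python sketch of a read-only check follows. It recomputes a digest from the bytes on disk and compares it with the stored value, without ever overwriting that value. The function names `file_digest` and `verify_checksum` are hypothetical and are not part of Dataverse; the MD5 default is only an assumption based on the MD5s mentioned in this thread.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksum(path, stored_value, algorithm="md5"):
    """Return True if the file still matches the stored checksum.

    Hypothetical helper: it only compares and never changes the stored
    value, so a mismatch (possible bitrot) stays visible.
    """
    return file_digest(path, algorithm) == stored_value.strip().lower()
```

A mismatch reported this way leaves the upload-time checksum untouched, which is the point: regenerating it would silently bless the damaged file.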
FWIW: #5035 didn't expose a separate verify endpoint, but it does verify the existing hash before calculating one with a different algorithm - it should be possible to reuse that code in a verifyHashes endpoint...
@donsizemore @pdurbin and @qqmyers, thanks for the discussion here. Just so I understand, a deliverable here would be an API endpoint that could be passed a specific file in order to verify the hash and another API endpoint that could be passed a specific file for which to regenerate a hash?
@djbrooke that would be my optimal scenario: one endpoint to verify a match or report a mismatch, another endpoint to regenerate and/or populate a NULL.
Thanks, I retitled this. I'm OK if we split this into two or if we deliver these both together.
This was delivered in pull request #6228 as part of Dataverse 4.17 and is documented at http://guides.dataverse.org/en/4.17/api/native-api.html#id15. I suspect it says "id15" instead of a normal anchor because of a conflict with this older anchor: http://guides.dataverse.org/en/4.17/api/native-api.html#datafile-integrity. I guess I'll close this, but it would be nice to fix up that anchor at some point.
In comparing our Postgres checksumvalues to the checksums contained in our iRODS preservation instances, I found that the majority of our file metadata, having been imported into ~4.6 back in 2016, have empty checksumvalue fields.
I could write some Python to manually calculate and correct blank entries, but I'd love a checksum answer to, say, the datafile integrity or dataset integrity endpoints: one to populate missing checksumvalues, another to verify existing checksumvalues against the filesystem.
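For reference, the "populate missing checksumvalues" half could be sketched in Python roughly as below. This is only an illustration under stated assumptions: the records are plain dicts with hypothetical `path` and `checksumvalue` keys standing in for rows pulled from Postgres, `populate_missing` is a made-up name, and nothing here touches the database or any Dataverse API.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def populate_missing(records, algorithm="md5"):
    """Fill in blank checksums only; return the paths that were updated.

    `records` is a list of dicts with hypothetical keys 'path' and
    'checksumvalue'. Entries that already have a value are left alone,
    so existing checksums are never regenerated.
    """
    updated = []
    for rec in records:
        if rec.get("checksumvalue"):
            continue  # keep the checksum recorded at upload time
        rec["checksumvalue"] = file_digest(rec["path"], algorithm)
        updated.append(rec["path"])
    return updated
```

Skipping non-empty values is the key design choice: it keeps this step separate from verification, so a corrupted file with an existing checksum is never quietly re-hashed.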