Checksum (MD5) calculation for local upload is slow #9166
Comments
Hi @qqmyers, thanks for trying this so quickly. I believe the change made a small performance improvement (30MB/s -> 50MB/s). However, in my testing, another change is needed. Changing this:
to this:
gave an order-of-magnitude performance increase (330MB/s). It's not as fast as the Linux `md5sum` utility.
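The two code blocks this comment refers to weren't preserved. As a hedged sketch only: assuming the change being described is the size of the read buffer fed to `MessageDigest.update()` (the sizes and names below are illustrative, not the actual diff), it would look something like:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumBufferSketch {
    // Same shape as a typical digest loop; the only thing changing is the
    // read-buffer size. 1024 matches the "1KB at a time" behavior described
    // in this issue; 1024 * 1024 is illustrative.
    public static byte[] digest(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // before (roughly): byte[] dataBytes = new byte[1024];
        byte[] dataBytes = new byte[1024 * 1024]; // after: far fewer read()/update() calls
        int nread;
        while ((nread = in.read(dataBytes)) != -1) {
            md.update(dataBytes, 0, nread);
        }
        return md.digest();
    }
}
```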
@jgara - awesome! Thanks for testing to find the best buffer size. Did you test increasing the size of the BufferedInputStream as well? I think it defaults to 8K, so I was guessing something larger might help (hopefully not all the way to 1MB as I have now, but perhaps 32/64K?). If you decide to test that, let me know. Otherwise I may just pick 32K and go ahead and put in the pull request. (Or feel free to submit a PR yourself - you've done the work to find/fix this.)
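For reference, testing the BufferedInputStream size suggested here would look something like the following sketch (32K is one of the sizes floated above; the file path argument is a placeholder):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedSizeSketch {
    public static void main(String[] args) throws IOException {
        // BufferedInputStream's internal buffer defaults to 8K (8192 bytes);
        // the second constructor argument overrides it.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]), 32 * 1024)) {
            // ... feed `in` to the checksum routine being measured ...
        }
    }
}
```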
@qqmyers - I was going to attempt submitting a PR, but I realized part of my initial analysis was wrong. First, on the BufferedInputStream: I found it had no beneficial effect, no matter the size. I also read somewhere that, by design, anything larger than 8K creates additional overhead (it was over my head) -- but I did a bunch of tests and in fact nothing helped. I also tried FileChannels (the NIO stuff), based on code snippets I found on the interwebs, but there was no benefit. On my system, it's CPU-bound, so I'm pretty convinced that a faster cryptographic library would be needed to go beyond 330MB/s on my system. But -- here's the issue. I tested the original code (I'm embarrassed to admit I didn't before) and it's plenty fast (300MB/s). The original code I'm talking about (from FileUtil.java):
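The snippet didn't come through here. Based on how it's described in this issue (1KB reads, no buffered IO), the original routine would have looked roughly like this; a reconstruction from the issue text, not a verbatim copy of FileUtil.java:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OriginalChecksumSketch {
    public static String calculateChecksum(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] dataBytes = new byte[1024]; // the 1KB-at-a-time reads described above
        int nread;
        while ((nread = in.read(dataBytes)) != -1) {
            md.update(dataBytes, 0, nread);
        }
        // hex-encode the digest
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

And the FileChannel experiment mentioned above would have been along these lines (again a sketch; the buffer size is illustrative). As noted, it gave no benefit here because the workload is CPU-bound:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.security.MessageDigest;

public class NioChecksumSketch {
    public static byte[] digest(Path path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel ch = FileChannel.open(path)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
            while (ch.read(buf) != -1) {
                buf.flip();
                md.update(buf); // MessageDigest accepts a ByteBuffer directly
                buf.clear();
            }
        }
        return md.digest();
    }
}
```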
After @donsizemore dropped your update into place and I tested again, I thought maybe it went a little faster -- but it wasn't much of a change. At that point, I set up an isolated test of the routine performing the checksum, and that's when I found we needed to add a buffer. In any case, this makes me think the checksum itself isn't the whole story.
What I observe:
I've been assuming that step #6 is calculating the checksum (that gets reported back to DVUploader). Could something else be happening here? I'm going to ask @donsizemore after the TG break to revert our test Dataverse system so I can test again (more carefully, hopefully).
Hmm. Right after the checksum, there are several checks of the file to see if it is a tabular or other special file type. Most of those would read just a few bytes, but it's possible that some read through more of the file. I can't think of a good way to test that without doing something like logging timestamps before/after each check, though. FileUtil.determineFileType is where this happens.
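One hedged way to do that timestamp logging (the check names here are placeholders, not actual Dataverse methods):

```java
import java.util.function.BooleanSupplier;
import java.util.logging.Logger;

public class TimedChecks {
    private static final Logger logger = Logger.getLogger(TimedChecks.class.getName());

    // Wrap each file-type check in determineFileType with this helper to log
    // how long it takes, which would show whether any check is re-reading a
    // large part of the file.
    public static boolean timed(String name, BooleanSupplier check) {
        long start = System.nanoTime();
        boolean result = check.getAsBoolean();
        logger.info(name + " took " + (System.nanoTime() - start) / 1_000_000 + " ms");
        return result;
    }
}
```

Each check could then be invoked as something like `TimedChecks.timed("tabular-check", () -> someCheck(file))`, where `someCheck` stands in for whichever test is being measured.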
What steps does it take to reproduce the issue?
Using the DVUploader, upload a large file (> 10GB) to Dataverse (local file-system, not S3).
When does this issue occur?
Every time I upload a file to local storage.
Which page(s) does it occur on?
I'm exclusively using DVUploader.
What happens?
On our DV host, the file first lands in `/tmp`. It is then copied to `/usr/local/dv-temp/temp`, at which point it is unzipped into that same directory (we double-zip). At this point there are three copies of the file on local storage. Next, `iostat` shows a long-running ~30MB/s read operation, which I believe corresponds to DV calculating an MD5 checksum. Running the `md5sum` Linux utility on the same file proceeds at 500MB/s -- so there is decent available performance on our DV host. Looking at the code (https://github.com/IQSS/dataverse/blob/1435dcca970ee524ec32506f1d8d50c81026fe86/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java), it appears that the checksum is calculated 1KB at a time, without using buffered IO -- which could explain the suboptimal performance I think I am seeing.
To whom does it occur (all users, curators, superusers)?
Anyone performing an upload.
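To separate the checksum cost from the rest of the ingest pipeline described above, a small isolated benchmark along these lines could be used for comparison against `md5sum` on the same file (a sketch; the file path and buffer size are command-line arguments, not Dataverse code):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class Md5Bench {
    public static void main(String[] args) throws Exception {
        String path = args[0];
        // default of 1024 mimics the 1KB reads described above
        int bufSize = args.length > 1 ? Integer.parseInt(args[1]) : 1024;
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[bufSize];
        long bytes = 0;
        long start = System.nanoTime();
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
                bytes += n;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2f s = %.1f MB/s%n",
                bytes, seconds, bytes / 1e6 / seconds);
    }
}
```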
What did you expect to happen?
I would expect the MD5 checksum calculation performance to be similar to the performance achieved by the `md5sum` Linux utility.
Which version of Dataverse are you using?
5.10.1
Any related open or closed issues to this bug report?
I don't believe so.