
Checksum (MD5) calculation for local upload is slow #9166

Open
jgara opened this issue Nov 14, 2022 · 5 comments
Labels
Feature: File Upload & Handling · Type: Bug · User Role: Depositor

Comments

jgara commented Nov 14, 2022

What steps does it take to reproduce the issue?
Using the DVUploader, upload a large file (> 10GB) to Dataverse (local file-system, not S3).

  • When does this issue occur?
    Every time I upload a file to local storage.

  • Which page(s) does it occur on?
    I'm exclusively using DVUploader.

  • What happens?
    On our DV host, the file first lands in /tmp. It is then copied to /usr/local/dv-temp/temp, at which point it is unzipped into that same directory (we double-zip). At this point there are three copies of the file on local storage. Next, iostat shows a long-running ~30MB/s read operation, which I believe corresponds to DV calculating an MD5 checksum. Running the md5sum Linux utility on the same file proceeds at 500MB/s, so there is decent available performance on our DV host. Looking at the code (https://github.com/IQSS/dataverse/blob/1435dcca970ee524ec32506f1d8d50c81026fe86/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java), it appears that the checksum is calculated 1KB at a time, without using buffered IO, which could explain the suboptimal performance I think I am seeing (a rough sketch of what I mean is below, after this list).

  • To whom does it occur (all users, curators, superusers)?
    Anyone performing an upload.

  • What did you expect to happen?
    I would expect the MD5 checksum calculation performance to be similar to what the md5sum Linux utility achieves.
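
To illustrate what I mean by "1KB at a time, without using buffered IO", here is a rough sketch of that kind of digest loop (my own illustration, not the exact Dataverse code):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    // Rough sketch of the kind of loop described above: the digest is fed
    // 1 KB at a time from an unbuffered FileInputStream, so every iteration
    // is a small read against the underlying file.
    public class Md5OneKbAtATime {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new FileInputStream(args[0])) { // path to a large test file
                byte[] buf = new byte[1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }

Timing this against the same file with a larger buffer is a quick way to check whether the read size is the limiting factor.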

Which version of Dataverse are you using?
5.10.1

Any related open or closed issues to this bug report?
I don't believe so.


donsizemore (Contributor) commented Nov 14, 2022

qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 14, 2022
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 14, 2022
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 15, 2022
jgara (Author) commented Nov 18, 2022

Hi @qqmyers, thanks for trying this so quickly. I believe the change gave a small performance improvement (30MB/s -> 50MB/s). However, in my testing, there is another change that is needed:

Changing this:

            while (dis.read() != -1);

to this:

            byte[] bytes = new byte[8192];
            while (dis.read(bytes) != -1);

gave an order of magnitude performance increase (330MB/s). It's not as fast as the Linux md5sum utility (500MB/s) but probably Java's crypto libraries aren't as optimized as they could be. With the above change, the addition of the BufferedInputStream appears to have no effect. I played around with different byte buffer sizes (1k -> 1MB) and 8k seemed to be the sweet spot.
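
For anyone following along, here is a self-contained sketch of the pattern I mean (my illustration with made-up class names, not the code in the branch):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    // Drain the DigestInputStream with an 8 KB scratch buffer instead of one
    // byte per read() call; reading the stream updates the digest as a side effect.
    public class DigestStreamExample {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
                 DigestInputStream dis = new DigestInputStream(in, md)) {
                byte[] bytes = new byte[8192];
                while (dis.read(bytes) != -1) {
                    // no-op: the digest is updated by the read itself
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }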

qqmyers (Member) commented Nov 21, 2022

@jgara - awesome! Thanks for testing to find the best buffer size. Did you test increasing the size of the BufferedInputStream as well? I think it defaults to 8K, so I was guessing something larger might help (hopefully not all the way to 1MB as I have now, but perhaps 32/64K?).

If you decide to test that, let me know. Otherwise I may just pick 32K and go ahead and put in the pull request. (Or feel free to submit a PR yourself - you've done the work to find/fix this.)

jgara (Author) commented Nov 22, 2022

@qqmyers - I was going to attempt submitting a PR, but I realized I may have my initial diagnosis wrong. First, on the BufferedInputStream: I find it has no beneficial effect no matter the size. I also read somewhere that, by design, anything larger than 8K creates additional overhead (it was over my head), but I ran a bunch of tests and in fact nothing helped. I also tried FileChannels (the NIO stuff), based on code snippets I found online, and there was no benefit either; on my system, it's CPU bound (a rough sketch of the NIO variant is below). So I'm pretty convinced that a faster cryptographic library would be needed to go beyond 330MB/s on my system.
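
Roughly what I mean by the FileChannel attempt (a reconstructed sketch, not my exact test code):

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    // Read the file through a FileChannel into a direct ByteBuffer and feed
    // the digest from it. In my testing this was no faster than the plain
    // 8 KB stream loop -- the MD5 computation itself appears to be the bottleneck.
    public class Md5ViaFileChannel {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            try (FileChannel ch = FileChannel.open(Path.of(args[0]))) {
                while (ch.read(buf) != -1) {
                    buf.flip();
                    md.update(buf);   // MessageDigest.update(ByteBuffer) consumes the buffer
                    buf.clear();
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }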

But -- here's the issue. I tested the original code (I'm embarrassed to admit I didn't before) and it's plenty fast (300MB/s). The original code I'm talking about (from FileUtil.java):

    // from MD5Checksum.java
    public static String calculateChecksum(InputStream in, ChecksumType checksumType) {
        MessageDigest md = null;
        try {
            // Use "SHA-1" (toString) rather than "SHA1", for example.
            md = MessageDigest.getInstance(checksumType.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }

        byte[] dataBytes = new byte[1024];

        int nread;
        try {
            while ((nread = in.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        } finally {
            try {
                in.close();
            } catch (Exception e) {
            }
        }

        return checksumDigestToString(md.digest());
    }

After @donsizemore dropped your update into place and I tested again, I thought maybe it went a little faster -- but it wasn't much of a change. At this point, I set myself up to do an isolated test of the routine performing the checksum and that's when I found we needed to add a buffer to the dis.read() routine (-> dis.read(bytes)).
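
The isolated test was just a small harness along these lines (a hypothetical reconstruction; it assumes the Dataverse classes are on the classpath and that the MD5 constant lives in DataFile.ChecksumType):

    import java.io.File;
    import java.io.FileInputStream;

    import edu.harvard.iq.dataverse.DataFile;
    import edu.harvard.iq.dataverse.util.FileUtil;

    // Time FileUtil.calculateChecksum over a large local file and report MB/s.
    // calculateChecksum closes the stream itself, so no try-with-resources here.
    public class ChecksumTimer {
        public static void main(String[] args) throws Exception {
            File f = new File(args[0]);
            long start = System.nanoTime();
            String sum = FileUtil.calculateChecksum(new FileInputStream(f), DataFile.ChecksumType.MD5);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%s  %.0f MB/s%n", sum, f.length() / 1e6 / seconds);
        }
    }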

In any case, this makes me think that:

  1. I'm wrong about a checksum being calculated when I'm observing the 30-50MB/s operation.
  2. Or, the code responsible for the checksum calculation is somewhere else. I went hunting through the codebase and didn't find anything obvious (I'm not a programmer, though, so I easily could have missed something).

What I observe:

  1. Using DVUploader, I upload a double-zipped large file (~ 12GB).
  2. I've got iftop and iostat running in different windows and I'm watching the temp directories.
  3. I see the file xfer over the wire and hit the first temp area (/tmp on my system).
  4. I see it copied to the second temp area (/usr/local/dv-temp/temp on my system).
  5. I see it unzipped (also to /usr/local/dv-temp/temp).
  6. Then I see iostat hold steady (30-50MB/s) for the right amount of time to iterate over the file.
  7. Finally, the file is copied to local storage (/mnt/dvn/dv-content on my system).

I've been assuming that step #6 is calculating the checksum (that gets reported back to DVUploader). Could something else be happening here?

I'm going to ask @donsizemore after the TG break to revert our test Dataverse system so I can test again (more carefully hopefully).

qqmyers (Member) commented Nov 22, 2022

Hmm. Right after the checksum, there are several checks of the file to see if it is a tabular or other special file type. Most of those would read just a few bytes, but it's possible that some read through more of the file. I can't think of a good way to test that without doing something like logging timestamps before/after each check, though. FileUtil.determineFileType is where this happens.
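
Something along these lines around each detector call would show which check, if any, reads through the whole file (a rough sketch; the check itself is just a placeholder):

    import java.io.File;
    import java.util.function.Predicate;
    import java.util.logging.Logger;

    // Hypothetical helper for the timestamp-logging idea: wrap each candidate
    // check in FileUtil.determineFileType with a timer so a slow one stands out.
    class TypeCheckTiming {
        private static final Logger logger = Logger.getLogger(TypeCheckTiming.class.getName());

        static boolean timedCheck(String label, File f, Predicate<File> check) {
            long start = System.nanoTime();
            boolean result = check.test(f);
            logger.info(String.format("%s took %d ms", label, (System.nanoTime() - start) / 1_000_000));
            return result;
        }
        // usage: timedCheck("someDetector", file, f -> someDetectorCheck(f))  // placeholder detector name
    }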

pdurbin added the Feature: File Upload & Handling, Type: Bug, and User Role: Depositor labels on Oct 8, 2023