
Checksum (MD5) calculation for local upload is slow #9166

Open
jgara opened this issue Nov 14, 2022 · 5 comments
Labels
Feature: File Upload & Handling · Type: Bug · User Role: Depositor

Comments

jgara commented Nov 14, 2022

What steps does it take to reproduce the issue?
Using the DVUploader, upload a large file (> 10GB) to Dataverse (local file-system, not S3).

  • When does this issue occur?
    Every time I upload a file to local storage.

  • Which page(s) does it occur on?
    I'm exclusively using DVUploader.

  • What happens?
    On our DV host, the file first lands in /tmp. It is then copied to /usr/local/dv-temp/temp, at which point it is unzipped into that same directory (we double-zip). At this point there are three copies of the file on local storage. Next, iostat shows a long-running ~30MB/s read operation, which I believe corresponds to DV calculating an MD5 checksum. Running the md5sum Linux utility on the same file proceeds at 500MB/s, so there is decent available performance on our DV host. Looking at the code (https://github.com/IQSS/dataverse/blob/1435dcca970ee524ec32506f1d8d50c81026fe86/src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java), it appears that the checksum is calculated 1KB at a time, without using buffered IO, which could explain the suboptimal performance I think I am seeing (a rough sketch of what I mean is below, after this list).

  • To whom does it occur (all users, curators, superusers)?
    Anyone performing an upload.

  • What did you expect to happen?
    I would expect the MD5 checksum calculation performance to be similar to what the md5sum Linux utility achieves.
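
To illustrate what I mean by "1KB at a time, without using buffered IO", here is a rough sketch of that kind of digest loop (my own illustration, not the exact Dataverse code):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    // Rough sketch of the kind of loop described above: the digest is fed
    // 1 KB at a time from an unbuffered FileInputStream, so every iteration
    // is a small read against the underlying file.
    public class Md5OneKbAtATime {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new FileInputStream(args[0])) { // path to a large test file
                byte[] buf = new byte[1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }

Timing this against the same file with a larger buffer is a quick way to check whether the read size is the limiting factor.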

Which version of Dataverse are you using?
5.10.1

Any related open or closed issues to this bug report?
I don't believe so.


donsizemore (Contributor) commented Nov 14, 2022

qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 14, 2022
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 14, 2022
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Nov 15, 2022
jgara (Author) commented Nov 18, 2022

Hi @qqmyers, thanks for trying this so quickly. I believe the change gave a small performance improvement (30MB/s -> 50MB/s). However, in my testing, there is another change that is needed:

Changing this:

            while (dis.read() != -1);

to this:

            byte[] bytes = new byte[8192];
            while (dis.read(bytes) != -1);

gave an order of magnitude performance increase (330MB/s). It's not as fast as the Linux md5sum utility (500MB/s) but probably Java's crypto libraries aren't as optimized as they could be. With the above change, the addition of the BufferedInputStream appears to have no effect. I played around with different byte buffer sizes (1k -> 1MB) and 8k seemed to be the sweet spot.
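
For anyone following along, here is a self-contained sketch of the pattern I mean (my illustration with made-up class names, not the code in the branch):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    // Drain the DigestInputStream with an 8 KB scratch buffer instead of one
    // byte per read() call; reading the stream updates the digest as a side effect.
    public class DigestStreamExample {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
                 DigestInputStream dis = new DigestInputStream(in, md)) {
                byte[] bytes = new byte[8192];
                while (dis.read(bytes) != -1) {
                    // no-op: the digest is updated by the read itself
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }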

qqmyers (Member) commented Nov 21, 2022

@jgara - awesome! Thanks for testing to find the best buffer size. Did you test increasing the size of the BufferedInputStream as well? I think it defaults to 8K, so I was guessing something larger might help (hopefully not all the way to 1MB as I have now, but perhaps 32/64K?).

If you decide to test that, let me know. Otherwise I may just pick 32K and go ahead and put in the pull request. (Or feel free to submit a PR yourself - you've done the work to find/fix this.)

jgara (Author) commented Nov 22, 2022

@qqmyers - I was going to attempt submitting a PR, but I realized I may have my initial diagnosis wrong. First, on the BufferedInputStream: I find it has no beneficial effect no matter the size. I also read somewhere that, by design, anything larger than 8K creates additional overhead (it was over my head), but I ran a bunch of tests and in fact nothing helped. I also tried FileChannels (the NIO stuff), based on code snippets I found online, and there was no benefit either; on my system, it's CPU bound (a rough sketch of the NIO variant is below). So I'm pretty convinced that a faster cryptographic library would be needed to go beyond 330MB/s on my system.
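
Roughly what I mean by the FileChannel attempt (a reconstructed sketch, not my exact test code):

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    // Read the file through a FileChannel into a direct ByteBuffer and feed
    // the digest from it. In my testing this was no faster than the plain
    // 8 KB stream loop -- the MD5 computation itself appears to be the bottleneck.
    public class Md5ViaFileChannel {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            try (FileChannel ch = FileChannel.open(Path.of(args[0]))) {
                while (ch.read(buf) != -1) {
                    buf.flip();
                    md.update(buf);   // MessageDigest.update(ByteBuffer) consumes the buffer
                    buf.clear();
                }
            }
            System.out.printf("%032x%n", new java.math.BigInteger(1, md.digest()));
        }
    }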

But -- here's the issue. I tested the original code (I'm embarrassed to admit I didn't before) and it's plenty fast (300MB/s). The original code I'm talking about (from FileUtil.java):

    // from MD5Checksum.java
    public static String calculateChecksum(InputStream in, ChecksumType checksumType) {
        MessageDigest md = null;
        try {
            // Use "SHA-1" (toString) rather than "SHA1", for example.
            md = MessageDigest.getInstance(checksumType.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }

        byte[] dataBytes = new byte[1024];

        int nread;
        try {
            while ((nread = in.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        } finally {
            try {
                in.close();
            } catch (Exception e) {
            }
        }

        return checksumDigestToString(md.digest());
    }

After @donsizemore dropped your update into place and I tested again, I thought maybe it went a little faster -- but it wasn't much of a change. At this point, I set myself up to do an isolated test of the routine performing the checksum and that's when I found we needed to add a buffer to the dis.read() routine (-> dis.read(bytes)).
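
The isolated test was just a small harness along these lines (a hypothetical reconstruction; it assumes the Dataverse classes are on the classpath and that the MD5 constant lives in DataFile.ChecksumType):

    import java.io.File;
    import java.io.FileInputStream;

    import edu.harvard.iq.dataverse.DataFile;
    import edu.harvard.iq.dataverse.util.FileUtil;

    // Time FileUtil.calculateChecksum over a large local file and report MB/s.
    // calculateChecksum closes the stream itself, so no try-with-resources here.
    public class ChecksumTimer {
        public static void main(String[] args) throws Exception {
            File f = new File(args[0]);
            long start = System.nanoTime();
            String sum = FileUtil.calculateChecksum(new FileInputStream(f), DataFile.ChecksumType.MD5);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%s  %.0f MB/s%n", sum, f.length() / 1e6 / seconds);
        }
    }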

In any case, this makes me think that:

  1. I'm wrong about a checksum being calculated when I'm observing the 30-50MB/s operation.
  2. Or, the code responsible for the checksum calculation is somewhere else. I went hunting through the codebase and didn't find anything obvious (I'm not a programmer, though, so I easily could have missed something).

What I observe:

  1. Using DVUploader, I upload a double-zipped large file (~ 12GB).
  2. I've got iftop and iostat running in different windows and I'm watching the temp directories.
  3. I see the file xfer over the wire and hit the first temp area (/tmp on my system).
  4. I see it copied to the second temp area (/usr/local/dv-temp/temp on my system).
  5. I see it unzipped (also to /usr/local/dv-temp/temp).
  6. Then I see iostat hold steady (30-50MB/s) for the right amount of time to iterate over the file.
  7. Finally, the file is copied to local storage (/mnt/dvn/dv-content on my system).

I've been assuming that step #6 is calculating the checksum (that gets reported back to DVUploader). Could something else be happening here?

I'm going to ask @donsizemore after the TG break to revert our test Dataverse system so I can test again (more carefully hopefully).

qqmyers (Member) commented Nov 22, 2022

Hmm. Right after the checksum, there are several checks of the file to see if it is a tabular or other special file type. Most of those would read just a few bytes, but it's possible that some read through more of the file. I can't think of a good way to test that without doing something like logging timestamps before/after each check, though. FileUtil.determineFileType is where this happens.
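
Something along these lines around each detector call would show which check, if any, reads through the whole file (a rough sketch; the check itself is just a placeholder):

    import java.io.File;
    import java.util.function.Predicate;
    import java.util.logging.Logger;

    // Hypothetical helper for the timestamp-logging idea: wrap each candidate
    // check in FileUtil.determineFileType with a timer so a slow one stands out.
    class TypeCheckTiming {
        private static final Logger logger = Logger.getLogger(TypeCheckTiming.class.getName());

        static boolean timedCheck(String label, File f, Predicate<File> check) {
            long start = System.nanoTime();
            boolean result = check.test(f);
            logger.info(String.format("%s took %d ms", label, (System.nanoTime() - start) / 1_000_000));
            return result;
        }
        // usage: timedCheck("someDetector", file, f -> someDetectorCheck(f))  // placeholder detector name
    }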

pdurbin added the Feature: File Upload & Handling, Type: Bug, and User Role: Depositor labels on Oct 8, 2023