[Blob Storage] Issues with input/output stream uploads, especially large streams. #5275

Closed
cdraeger opened this issue Sep 6, 2019 · 14 comments · Fixed by #7067
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • MQ-Storage: Storage "Milestone Quality" investments.
  • needs-author-feedback: Workflow: more information is needed from the author to address the issue.
  • no-recent-activity: There has been no recent activity on this issue.
  • Service Attention: Workflow: this issue is the responsibility of the Azure service team.
  • Storage: Storage Service (Queues, Blobs, Files)

Comments

cdraeger commented Sep 6, 2019

Query/Question
As figured out in issue #5221, I am able to upload large amounts of data from files to block blobs via the BlockBlobClient#uploadFromFile(filePath) method.

However, I usually need stream operations, and those are giving me trouble. Example: a client uploads data to our backend as an octet-stream. The Spring backend maps this to a Java InputStream, which I have to channel directly to the cloud storage.

I tried the following:

  • BlockBlobClient#upload(inputStream, length): results in a RequestBodyTooLarge error when the content is larger than the API limit of 256 MB for a single request. How can I solve this? I would have expected the SDK to handle the chunking for me, as it does for a file upload, which can already be many gigabytes with one SDK method call. (A sketch of the manual workaround I have in mind follows this list.)
  • I tried writing to the output stream of a blob directly via BlockBlobClient#getBlobOutputStream(), channeling an incoming InputStream on the fly via org.springframework.util.StreamUtils#copy(inputStream, outputStream). This operation
    ◦ is very slow compared to the file or input stream upload: with the latter I get about 7 Mb/s on my connection, while the output stream write shows only ~140 KiB/s, and
    ◦ results in completely wrong blob sizes in storage: when I tried with a random 3 MB input stream, the blob size shown in the Azure portal is just 4 KiB.
    ◦ Even if it were faster and produced a correct size, I am also not sure whether it would work for large files.
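
For illustration, this is roughly the kind of manual chunking I imagine as a workaround for the first point: staging fixed-size blocks and committing the block list. I have not verified it against the preview API, and the block size and helper name are just placeholders.

private void uploadInBlocks(final InputStream inputStream, final BlockBlobClient blockBlobClient) throws IOException
{
    final int blockSize = 4 * 1024 * 1024; // stage the stream in 4 MB blocks
    final byte[] buffer = new byte[blockSize];
    final List<String> blockIds = new ArrayList<>();
    int read;
    while ((read = inputStream.read(buffer)) != -1)
    {
        // Block IDs must be Base64-encoded and of equal length within one blob.
        final String blockId = Base64.getEncoder()
                .encodeToString(String.format("%08d", blockIds.size()).getBytes(StandardCharsets.UTF_8));
        blockIds.add(blockId);
        blockBlobClient.stageBlock(blockId, new ByteArrayInputStream(buffer, 0, read), read);
    }
    blockBlobClient.commitBlockList(blockIds); // makes the staged blocks visible as one blob
}

This keeps memory bounded to one block at a time, but it feels like something the SDK should be doing for me.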

Generally, I cannot buffer the data in memory or on disk on the host machines due to resource constraints; it wouldn't scale. The data can be potentially large (several gigabytes): buffering it in memory would quickly cause out-of-memory errors even for a single client, and there may be a large number of parallel uploads from thousands of different clients. The same applies to disk space when using temp files as a workaround, so that is not an option.

Code snippets

public void test()
{
    final Path file = Paths.get("~/Downloads/random-file.zip");
    try(final RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
        final InputStream is = Files.newInputStream(file))
    {
        final long largeFileSize = 1024 * 1024 * 1024 * 10L; // 10 GB
        final long smallFileSize = 1024 * 1024 * 3L; // 3 MB
        raf.setLength(largeFileSize);
        //raf.setLength(smallFileSize); // -> use for trying out writing to output stream
        cloudStorageClient.upload("random-file.zip", is, largeFileSize); // -> fails, I would expect to be able to use it with any content length
        
        // writeToOutputStream(is); // -> this is very slow and produces false blob sizes.
    }
    catch (final IOException | CloudStorageException e)
    {
        LOG.error("Test failed", e);
    }
    finally
    {
        try
        {
            Files.deleteIfExists(file);
        }
        catch (final IOException e)
        {
            LOG.error("Test clean-up failed");
        }
    }
}

private void writeToOutputStream(final InputStream inputStream)
{
    try(final OutputStream outputStream = containerClient.getBlockBlobClient("random-file.zip").getBlobOutputStream())
    {
        LOG.debug("Channeling content from input stream to blob storage...");
        org.springframework.util.StreamUtils.copy(inputStream, outputStream);
    }
    catch (final IOException | StorageException e)
    {
        LOG.error("Test failed", e);
    }
}

Why is this not a Bug or a Feature Request?
The input stream upload seems to adhere to the API limit of a 256 MB request body, so it's technically not a bug. But I am not sure whether the SDK should be handling this itself in its implementation, which is why I am not filing a feature request for now.

The issue with writing to the output stream may be a bug, but I am not sure yet.

Setup (please complete the following information if applicable):

  • OS: Mac OS 10.14 (Mojave)
  • IDE: IntelliJ
  • Library version: 12.0.0-preview.2
rickle-msft added the Client, customer-reported, and Storage labels Sep 6, 2019
The triage-new-issues bot removed the triage label Sep 6, 2019
rickle-msft (Contributor) commented:

@cdraeger thank you for opening this issue and for the detailed information. We are intending to reevaluate the internals of BlobOutputStream and run some perf tests on it before GA. The numbers you are sharing are indeed concerning and make the stream pretty unusable. I can't speak too much to why the file size wouldn't be correct, but I'll defer that until we reevaluate the internals, which will hopefully fix any bugs in the process.

As for the questions about the APIs, BlobClient.upload is meant to reflect the REST API's Put Blob operation, which has a max size of 256 MB. On the async client, there is an upload overload that accepts parameters to configure a bounded amount of buffering (not the whole payload at once) and will do the right thing, as in the file upload. The intention is that BlobOutputStream fills this purpose in the sync case. If you are interested, my hope is that we can actually change the BlobOutputStream implementation to be based on this upload method on the async client, so I can guide you through what could be a temporary workaround: use the async client to write out your stream data and see if that better suits your performance constraints. Of course, if that's too much of a hassle, you can also just wait for us to implement it fully and test it.
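
To give you a concrete idea, below is a rough sketch of that workaround. Treat it as illustrative only: the exact ParallelTransferOptions signature and the option values have shifted between previews, and chunking the InputStream into a Flux is just one way to adapt it.

private void uploadViaAsyncClient(final InputStream inputStream, final BlobAsyncClient blobAsyncClient)
{
    final int chunkSize = 4 * 1024 * 1024; // feed the upload in 4 MB chunks
    final Flux<ByteBuffer> data = Flux.generate(sink ->
    {
        // Allocate a fresh buffer per chunk; downstream may hold on to the ByteBuffer.
        final byte[] buffer = new byte[chunkSize];
        try
        {
            final int read = inputStream.read(buffer);
            if (read == -1)
            {
                sink.complete();
            }
            else
            {
                sink.next(ByteBuffer.wrap(buffer, 0, read));
            }
        }
        catch (final IOException e)
        {
            sink.error(e);
        }
    });

    // Buffer at most two chunks at a time so memory stays bounded regardless of stream length.
    final ParallelTransferOptions options = new ParallelTransferOptions(chunkSize, 2, null);
    blobAsyncClient.upload(data, options).block(); // block() only to keep this sample synchronous
}

Once BlobOutputStream is reworked on top of this method, the synchronous path should get the same behavior without the Reactor plumbing.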

kurtzeborn (Member) commented:

@jianghaolu, since this is streaming/netty related, @alzimmermsft and I figured you're the best to take a look at this first.

cdraeger (Author) commented Sep 9, 2019

Hi @rickle-msft, thank you for the detailed information. Regarding the upload method overload: if you have an example snippet, I will gladly try it out. My implementation already has the async client available as well, so I could just give it a try. I'll get back to it next week.

Otherwise I'm of course eagerly looking forward to the output stream reworks in general, thanks!

rickle-msft (Contributor) commented:

@cdraeger The first thing you can try is cloning the repo as it is now and trying the BlobOutputStream (or just trying the release that should be coming this week). There's a slightly different implementation from what's in Preview 2 that may have a different performance profile. The other option we can try will actually also depend on a new feature coming in preview 3, and I can go into that further if this first option doesn't help. We'll play around with these different options on our side and ultimately stabilize on one that has the best performance before GA.

Edit:
I'm actually heading out for a couple weeks tomorrow, and the other option hasn't had any work towards it yet, so I'll follow up with you when I get back if the first option isn't sufficient. Just a heads up that I may be slow to respond.

rickle-msft (Contributor) commented:

@cdraeger Any updates here? Do you still need more support on this?

cdraeger (Author) commented Oct 4, 2019

Hi @rickle-msft, unfortunately no update from my side yet! We have been dealing with storage issues in general and are working on those. I still have to retest with Preview 3 and see how to go from there; I hope to be able to give you more feedback very soon (hopefully early next week).

rickle-msft (Contributor) commented:

Sounds good! No rush here. We just wanted to make sure you had the support you need. Follow up whenever you guys are ready.

cdraeger (Author) commented:

Hi @rickle-msft, I was now able to test writing to the output stream again with preview 4:

  1. It seems like the content length issue is fixed: I tested with random files of up to 50 MB and the blob showed the correct size in the portal afterwards.
  2. However, the upload speed was still very slow: ~250 kB/s. In comparison, the input stream upload managed > 50 Mbit/s, but it requires me to specify the content length beforehand, which I don't know when channeling streams on the fly, and it is also subject to the content length limit.

I basically copied a file InputStream to the blob OutputStream via standard stream utils copy methods. While this can be slower than direct input stream upload, it shouldn't be this slow?

rickle-msft (Contributor) commented:

@cdraeger Thank you for following up. I'm glad that at least the content length is correct now. For the upload speed, are those numbers based on the total time of the operation, or are you purely measuring your network throughput? I just want to be sure that it's indeed a problem with the internals of the BlobOutputStream and that the cost of copying between streams isn't playing a part.
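
If it helps, something along these lines would separate the two. The wrapper below is just a sketch for measurement and is not part of the SDK; the names are illustrative.

// Wrap the blob OutputStream so the time spent inside its write() calls can be
// compared against the total wall-clock time of the copy.
final class TimedOutputStream extends FilterOutputStream
{
    private long nanosInWrites;

    TimedOutputStream(final OutputStream delegate)
    {
        super(delegate);
    }

    @Override
    public void write(final byte[] b, final int off, final int len) throws IOException
    {
        final long start = System.nanoTime();
        out.write(b, off, len); // delegate directly; avoid FilterOutputStream's byte-by-byte default
        nanosInWrites += System.nanoTime() - start;
    }

    long millisInWrites()
    {
        return nanosInWrites / 1_000_000;
    }
}

// Usage: wrap getBlobOutputStream(), run the copy, then compare the two numbers.
// final TimedOutputStream timed = new TimedOutputStream(blockBlobClient.getBlobOutputStream());
// final long start = System.nanoTime();
// org.springframework.util.StreamUtils.copy(inputStream, timed);
// timed.close(); // close() commits the blob, so its cost shows up in the total only
// LOG.debug("total={} ms, inside blob writes={} ms",
//         (System.nanoTime() - start) / 1_000_000, timed.millisInWrites());

If most of the total time is spent outside the write() calls, the stream copy is the bottleneck; if it is spent inside them, the problem is in BlobOutputStream itself.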

joshfree (Member) commented:

@rickle-msft I'm reassigning this over to you, as you're following up on #6005 and the v8/v12 design differences.

rickle-msft (Contributor) commented:

@cdraeger We put some time before GA into cleaning up a few inefficiencies and boosting perf. Can you try with the latest version and let us know if you are still hitting these slow speeds?

alzimmermsft added the MQ-Storage label Dec 10, 2019
gapra-msft reopened this Jan 3, 2020
gapra-msft (Member) commented:

@cdraeger I just merged a new version of BlockBlobOutputStream in #7067 that should help improve performance; feel free to close the issue if you find that it helps.

Petermarcu added the needs-author-feedback and Service Attention labels Jul 14, 2020
ghost commented Jul 14, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

ghost added the no-recent-activity label Jul 21, 2020
ghost commented Jul 21, 2020

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost closed this as completed Aug 5, 2020
github-actions bot locked and limited conversation to collaborators Apr 12, 2023