[Blob Storage] Issues with input/output stream uploads, especially large streams. #5275

Closed
cdraeger opened this issue Sep 6, 2019 · 14 comments · Fixed by #7067
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • MQ-Storage: Storage "Milestone Quality" investments.
  • needs-author-feedback: Workflow: more information is needed from the author to address the issue.
  • no-recent-activity: There has been no recent activity on this issue.
  • Service Attention: Workflow: this issue is the responsibility of the Azure service team.
  • Storage: Storage Service (Queues, Blobs, Files)

Comments

cdraeger commented Sep 6, 2019

Query/Question
As figured out in issue #5221, I am able to upload large amounts of data from files to block blobs via the BlockBlobClient#uploadFromFile(filePath) method.

However, I usually need stream operations, and those are giving me trouble. Example: a client uploads data to our backend as an octet-stream. The Spring backend maps this to a Java InputStream, which I have to channel directly to the cloud storage.

I tried the following:

  • BlockBlobClient#upload(inputStream, length): results in a RequestBodyTooLarge error when the content is larger than the API limit of 256 MB for a single request. How can I solve this? I would have expected the SDK to handle the chunking for me, as it does for a file upload, which can already be many gigabytes with one SDK method call. (A sketch of the manual workaround I have in mind follows this list.)
  • I tried writing to the output stream of a blob directly via BlockBlobClient#getBlobOutputStream(), channeling an incoming InputStream on the fly via org.springframework.util.StreamUtils#copy(inputStream, outputStream). This operation
    ◦ is very slow compared to the file or input stream upload: with the latter I get about 7 Mb/s on my connection, while the output stream write shows only ~140 KiB/s, and
    ◦ results in completely wrong blob sizes in storage: when I tried with a random 3 MB input stream, the blob size shown in the Azure portal is just 4 KiB.
    ◦ Even if it were faster and produced a correct size, I am also not sure whether it would work for large files.
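
For illustration, this is roughly the kind of manual chunking I imagine as a workaround for the first point: staging fixed-size blocks and committing the block list. I have not verified it against the preview API, and the block size and helper name are just placeholders.

private void uploadInBlocks(final InputStream inputStream, final BlockBlobClient blockBlobClient) throws IOException
{
    final int blockSize = 4 * 1024 * 1024; // stage the stream in 4 MB blocks
    final byte[] buffer = new byte[blockSize];
    final List<String> blockIds = new ArrayList<>();
    int read;
    while ((read = inputStream.read(buffer)) != -1)
    {
        // Block IDs must be Base64-encoded and of equal length within one blob.
        final String blockId = Base64.getEncoder()
                .encodeToString(String.format("%08d", blockIds.size()).getBytes(StandardCharsets.UTF_8));
        blockIds.add(blockId);
        blockBlobClient.stageBlock(blockId, new ByteArrayInputStream(buffer, 0, read), read);
    }
    blockBlobClient.commitBlockList(blockIds); // makes the staged blocks visible as one blob
}

This keeps memory bounded to one block at a time, but it feels like something the SDK should be doing for me.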

Generally, I cannot buffer the data in memory or on disk on the host machines due to resource constraints; it wouldn't scale. The data can be potentially large (several gigabytes): buffering it in memory would quickly cause out-of-memory errors even for a single client, and there may be a large number of parallel uploads from thousands of different clients. The same applies to disk space when using temp files as a workaround, so that is not an option.

Code snippets

public void test()
{
    final Path file = Paths.get("~/Downloads/random-file.zip");
    try(final RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
        final InputStream is = Files.newInputStream(file))
    {
        final long largeFileSize = 1024 * 1024 * 1024 * 10L; // 10 GB
        final long smallFileSize = 1024 * 1024 * 3L; // 3 MB
        raf.setLength(largeFileSize);
        //raf.setLength(smallFileSize); // -> use for trying out writing to output stream
        cloudStorageClient.upload("random-file.zip", is, largeFileSize); // -> fails, I would expect to be able to use it with any content length
        
        // writeToOutputStream(is); // -> this is very slow and produces false blob sizes.
    }
    catch (final IOException | CloudStorageException e)
    {
        LOG.error("Test failed", e);
    }
    finally
    {
        try
        {
            Files.deleteIfExists(file);
        }
        catch (final IOException e)
        {
            LOG.error("Test clean-up failed");
        }
    }
}

private void writeToOutputStream(final InputStream inputStream)
{
    try(final OutputStream outputStream = containerClient.getBlockBlobClient("random-file.zip").getBlobOutputStream())
    {
        LOG.debug("Channeling content from input stream to blob storage...");
        org.springframework.util.StreamUtils.copy(inputStream, outputStream);
    }
    catch (final IOException | StorageException e)
    {
        LOG.error("Test failed", e);
    }
}

Why is this not a Bug or a Feature Request?
The input stream upload seems to adhere to the API limit of a 256 MB request body, so it's technically not a bug. But I am not sure whether the SDK should be handling this itself in its implementation, which is why I am not filing a feature request for now.

The issue with writing to the output stream may be a bug, but I am not sure yet.

Setup (please complete the following information if applicable):

  • OS: Mac OS 10.14 (Mojave)
  • IDE: IntelliJ
  • Library version: 12.0.0-preview.2
rickle-msft added the Client, customer-reported, and Storage labels Sep 6, 2019
The triage-new-issues bot removed the triage label Sep 6, 2019
rickle-msft (Contributor) commented:

@cdraeger thank you for opening this issue and for the detailed information. We are intending to reevaluate the internals of BlobOutputStream and run some perf tests on it before GA. The numbers you are sharing are indeed concerning and make the stream pretty unusable. I can't speak too much to why the file size wouldn't be correct, but I'll defer that until we reevaluate the internals, which will hopefully fix any bugs in the process.

As for the questions about the APIs, BlobClient.upload is meant to reflect the REST API's Put Blob operation, which has a max size of 256 MB. On the async client, there is an upload overload that accepts parameters to configure a bounded amount of buffering (not the whole payload at once) and will do the right thing, as in the file upload. The intention is that BlobOutputStream fills this purpose in the sync case. If you are interested, my hope is that we can actually change the BlobOutputStream implementation to be based on this upload method on the async client, so I can guide you through what could be a temporary workaround: use the async client to write out your stream data and see if that better suits your performance constraints. Of course, if that's too much of a hassle, you can also just wait for us to implement it fully and test it.
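
To give you a concrete idea, below is a rough sketch of that workaround. Treat it as illustrative only: the exact ParallelTransferOptions signature and the option values have shifted between previews, and chunking the InputStream into a Flux is just one way to adapt it.

private void uploadViaAsyncClient(final InputStream inputStream, final BlobAsyncClient blobAsyncClient)
{
    final int chunkSize = 4 * 1024 * 1024; // feed the upload in 4 MB chunks
    final Flux<ByteBuffer> data = Flux.generate(sink ->
    {
        // Allocate a fresh buffer per chunk; downstream may hold on to the ByteBuffer.
        final byte[] buffer = new byte[chunkSize];
        try
        {
            final int read = inputStream.read(buffer);
            if (read == -1)
            {
                sink.complete();
            }
            else
            {
                sink.next(ByteBuffer.wrap(buffer, 0, read));
            }
        }
        catch (final IOException e)
        {
            sink.error(e);
        }
    });

    // Buffer at most two chunks at a time so memory stays bounded regardless of stream length.
    final ParallelTransferOptions options = new ParallelTransferOptions(chunkSize, 2, null);
    blobAsyncClient.upload(data, options).block(); // block() only to keep this sample synchronous
}

Once BlobOutputStream is reworked on top of this method, the synchronous path should get the same behavior without the Reactor plumbing.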

kurtzeborn (Member) commented:

@jianghaolu, since this is streaming/netty related, @alzimmermsft and I figured you're the best to take a look at this first.

cdraeger (Author) commented Sep 9, 2019

Hi @rickle-msft, thank you for the detailed information. Regarding the upload method overload: if you have an example snippet, I will gladly try it out. My implementation already has the async client available as well, so I could just give it a try. I'll get back to it next week.

Otherwise I'm of course eagerly looking forward to the output stream reworks in general, thanks!

rickle-msft (Contributor) commented:

@cdraeger The first thing you can try is cloning the repo as it is now and trying the BlobOutputStream (or just trying the release that should be coming this week). There's a slightly different implementation from what's in Preview 2 that may have a different performance profile. The other option we can try will actually also depend on a new feature coming in preview 3, and I can go into that further if this first option doesn't help. We'll play around with these different options on our side and ultimately stabilize on one that has the best performance before GA.

Edit:
I'm actually heading out for a couple weeks tomorrow, and the other option hasn't had any work towards it yet, so I'll follow up with you when I get back if the first option isn't sufficient. Just a heads up that I may be slow to respond.

rickle-msft (Contributor) commented:

@cdraeger Any updates here? Do you still need more support on this?

cdraeger (Author) commented Oct 4, 2019

Hi @rickle-msft, unfortunately no update from my side yet! We have been dealing with storage issues in general and are working on those. I still have to retest with Preview 3 and see how to go from there; I hope to be able to give you more feedback very soon (hopefully early next week).

rickle-msft (Contributor) commented:

Sounds good! No rush here. We just wanted to make sure you had the support you need. Follow up whenever you guys are ready.

cdraeger (Author) commented:

Hi @rickle-msft, I was now able to test writing to the output stream again with preview 4:

  1. It seems like the content length issue is fixed: I tested with random files of up to 50 MB and the blob showed the correct size in the portal afterwards.
  2. However, the upload speed was still very slow: ~250 kB/s. In comparison, the input stream upload managed > 50 Mbit/s, but it requires me to specify the content length beforehand, which I don't know when channeling streams on the fly, and it is also subject to the content length limit.

I basically copied a file InputStream to the blob OutputStream via standard stream utils copy methods. While this can be slower than direct input stream upload, it shouldn't be this slow?

rickle-msft (Contributor) commented:

@cdraeger Thank you for following up. I'm glad that at least the content length is correct now. For the upload speed, are those numbers based on the total time of the operation, or are you purely measuring your network throughput? I just want to be sure that it's indeed a problem with the internals of the BlobOutputStream and that the cost of copying between streams isn't playing a part.
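
If it helps, something along these lines would separate the two. The wrapper below is just a sketch for measurement and is not part of the SDK; the names are illustrative.

// Wrap the blob OutputStream so the time spent inside its write() calls can be
// compared against the total wall-clock time of the copy.
final class TimedOutputStream extends FilterOutputStream
{
    private long nanosInWrites;

    TimedOutputStream(final OutputStream delegate)
    {
        super(delegate);
    }

    @Override
    public void write(final byte[] b, final int off, final int len) throws IOException
    {
        final long start = System.nanoTime();
        out.write(b, off, len); // delegate directly; avoid FilterOutputStream's byte-by-byte default
        nanosInWrites += System.nanoTime() - start;
    }

    long millisInWrites()
    {
        return nanosInWrites / 1_000_000;
    }
}

// Usage: wrap getBlobOutputStream(), run the copy, then compare the two numbers.
// final TimedOutputStream timed = new TimedOutputStream(blockBlobClient.getBlobOutputStream());
// final long start = System.nanoTime();
// org.springframework.util.StreamUtils.copy(inputStream, timed);
// timed.close(); // close() commits the blob, so its cost shows up in the total only
// LOG.debug("total={} ms, inside blob writes={} ms",
//         (System.nanoTime() - start) / 1_000_000, timed.millisInWrites());

If most of the total time is spent outside the write() calls, the stream copy is the bottleneck; if it is spent inside them, the problem is in BlobOutputStream itself.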

joshfree (Member) commented:

@rickle-msft I'm reassigning this over to you, as you're following up on #6005 and the v8/v12 design differences.

rickle-msft (Contributor) commented:

@cdraeger We put some time before GA into cleaning up a few inefficiencies and boosting perf. Can you try with the latest version and let us know if you are still hitting these slow speeds?

alzimmermsft added the MQ-Storage label Dec 10, 2019
gapra-msft reopened this Jan 3, 2020
gapra-msft (Member) commented:

@cdraeger I just merged a new version of BlockBlobOutputStream in #7067 that should help improve performance; feel free to close the issue if you find that it helps.

Petermarcu added the needs-author-feedback and Service Attention labels Jul 14, 2020
ghost commented Jul 14, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

ghost added the no-recent-activity label Jul 21, 2020
ghost commented Jul 21, 2020

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost closed this as completed Aug 5, 2020
github-actions bot locked and limited conversation to collaborators Apr 12, 2023