[Blob Storage] Issues with input/output stream uploads, especially large streams. #5275
Comments
@cdraeger thank you for opening this issue and for the detailed information. We intend to reevaluate the internals of BlobOutputStream and run some perf tests on it before GA. The numbers you are sharing are indeed concerning and make the stream pretty unusable. I can't speak to why the file size wouldn't be correct, but I'll defer that until we reevaluate the internals, which will hopefully fix any bugs in the process. As for the questions about the APIs: BlobClient.upload is meant to reflect the REST API Put Blob operation, which has a maximum size of 256 MB. In the async client there is an overload of upload that accepts parameters configuring a bounded amount of buffering (not the whole payload at once) and will just do the right thing, as with file upload. The intention is that BlobOutputStream fills this purpose in the sync case. My hope is that we can actually change the BlobOutputStream implementation to be based on this upload method on the async client. If you are interested, I can guide you through a temporary workaround that uses the async client to write out your stream data, and you can see whether that better suits your performance constraints. Of course, if that's too much of a hassle, you can also just wait for us to finish and test it fully.
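[Editor's note: for readers wanting to try that route, a rough sketch of the async-client workaround might look like the following. It assumes a recent azure-storage-blob 12.x release (newer than the previews discussed in this thread); the connection string handling, container name, blob name, and tuning values are placeholders, not taken from the thread.]

```java
// Sketch: bridging a Java InputStream to the async client's buffered upload.
// Assumes azure-storage-blob 12.x; "mycontainer"/"myblob" are placeholders.
import com.azure.core.util.FluxUtil;
import com.azure.storage.blob.BlobAsyncClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.ParallelTransferOptions;
import reactor.core.publisher.Flux;

import java.io.InputStream;
import java.nio.ByteBuffer;

public class AsyncStreamUploadWorkaround {
    public static void upload(InputStream source, String connectionString) {
        BlobAsyncClient blobClient = new BlobServiceClientBuilder()
                .connectionString(connectionString)
                .buildAsyncClient()
                .getBlobContainerAsyncClient("mycontainer") // placeholder
                .getBlobAsyncClient("myblob");              // placeholder

        // Wrap the InputStream as a Flux<ByteBuffer>. The SDK then buffers at
        // most roughly blockSize * maxConcurrency bytes, not the whole payload.
        Flux<ByteBuffer> data = FluxUtil.toFluxByteBuffer(source);

        ParallelTransferOptions options = new ParallelTransferOptions()
                .setBlockSizeLong(4L * 1024 * 1024) // stage 4 MiB blocks
                .setMaxConcurrency(4);              // up to 4 blocks in flight

        // block() only to keep the sketch synchronous; a reactive caller
        // would subscribe instead of blocking.
        blobClient.upload(data, options).block();
    }
}
```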
@jianghaolu, since this is streaming/netty related, @alzimmermsft and I figured you're the best to take a look at this first.
Hi @rickle-msft, thank you for the detailed information. Regarding the upload method overload: if you have an example snippet, I can gladly try it out. I passed the async client into my implementation as well, so I could just give it a try. I'll get back to it next week. Otherwise I'm of course eagerly looking forward to the output stream reworks in general, thanks!
@cdraeger The first thing you can try is cloning the repo as it is now and trying the BlobOutputStream (or just trying the release that should be coming this week). There's a slightly different implementation from what's in Preview 2 that may have a different performance profile. The other option we can try will also depend on a new feature coming in Preview 3, and I can go into that further if this first option doesn't help. We'll play around with these different options on our side and ultimately stabilize on the one that has the best performance before GA.
@cdraeger Any updates here? Do you still need more support on this?
Hi @rickle-msft, unfortunately no update from my side yet! We have been dealing with storage issues in general and are working on those. I still have to retest with Preview 3 and see how to go from there; I hope to be able to give you more feedback very soon (hopefully early next week).
Sounds good! No rush here. We just wanted to make sure you had the support you need. Follow up whenever you guys are ready.
Hi @rickle-msft, I was now able to test writing to the output stream again with Preview 4: I basically copied a file
@cdraeger Thank you for following up. I'm glad that at least the content length is correct now. For the upload speed, are those numbers based on the total time of the operation, or are you purely measuring your network throughput? I just want to be sure that it's indeed a problem with the internals of the BlobOutputStream and that the cost of copying between streams isn't playing a part.
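[Editor's note: one way to tell those two measurements apart is sketched below. The helper class and buffer size are invented for illustration; the idea is to time the copy loop itself and derive throughput from the bytes actually written, so stream-copy overhead can be compared against end-to-end operation time.]

```java
// Sketch: isolating stream-copy cost from end-to-end upload time.
// All names here are illustrative, not from the thread.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class ThroughputProbe {
    static void copyAndReport(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long bytes = 0;
        long start = System.nanoTime();
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read); // e.g. a BlobOutputStream
            bytes += read;
        }
        out.flush();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("copied %d bytes in %.2fs (%.1f KiB/s)%n",
                bytes, seconds, bytes / 1024.0 / seconds);
    }
}
```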
@rickle-msft I'm reassigning this over to you, as you're following up on #6005 and the v8/v12 design differences.
@cdraeger We put some time before GA into cleaning up a few inefficiencies and boosting perf. Can you try with the latest version and let us know if you are still hitting these slow speeds?
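[Editor's note: a hedged sketch of what such a retest could look like with the GA client follows. The container and blob names are placeholders, and `transferTo` requires JDK 9+.]

```java
// Sketch: retesting the sync output-stream path with the GA 12.x client.
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.blob.specialized.BlockBlobClient;

import java.io.InputStream;
import java.io.OutputStream;

public class GaRetest {
    public static void copyToBlob(InputStream in, String connectionString) throws Exception {
        BlockBlobClient blob = new BlobContainerClientBuilder()
                .connectionString(connectionString)
                .containerName("mycontainer") // placeholder
                .buildClient()
                .getBlobClient("myblob")      // placeholder
                .getBlockBlobClient();

        // close() matters: it flushes remaining buffered data and commits the
        // staged blocks, so skipping it can leave a truncated blob.
        try (OutputStream out = blob.getBlobOutputStream()) {
            in.transferTo(out); // equivalent of StreamUtils#copy on JDK 9+
        }
    }
}
```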
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment, the issue will be automatically closed. Thank you!
Query/Question
As just figured out in issue #5221, I am able to upload large amounts of data from files to block blobs via the `BlockBlobClient#uploadFromFile(filePath)` method.

However, I usually need stream operations, and with those I have issues. Example: a client uploads data to our backend via octet-stream. The Spring backend already maps this to a Java `InputStream`, which I have to channel directly to the cloud storage.

I tried the following:

1. `BlockBlobClient#upload(inputStream, length)`: results in a `RequestBodyTooLarge` error when the file size is over the API limit of 256 MB. How can I solve this? I would have expected the SDK to do the magic here like it does with a file upload, which can already be many gigabytes with one SDK method call.
2. `BlockBlobClient#getBlobOutputStream()`, channeling an incoming `InputStream` on the fly via `org.springframework.util.StreamUtils#copy(inputStream, outputStream)`. This operation
   - is very slow compared to file or input stream upload: with the latter I see speeds of 7 Mb/s on my connection, while the output stream write shows only ~140 KiB/s, and
   - results in completely wrong blob sizes on the storage: when I tried with a random 3 MB input stream, the blob size on the Azure portal is shown as just 4 KiB.
   - Also, even if it were faster and produced a correct file size, I am not sure whether it would work for large files.
Generally, I cannot buffer in memory or on disk of the host machines due to resource constraints; this wouldn't scale. The data can be potentially large (several gigabytes): buffering it in memory would quickly cause out-of-memory exceptions for even a single client, and there may be a large number of parallel uploads by thousands of different clients. The same applies to disk space when using temp files as a workaround, so that's not an option either.
Code snippets
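[Editor's note: a minimal sketch of the two attempts described above, assuming the 12.x `BlockBlobClient` API; variable and method names are illustrative.]

```java
// Sketch of the two approaches described in this issue.
import com.azure.storage.blob.specialized.BlobOutputStream;
import com.azure.storage.blob.specialized.BlockBlobClient;
import org.springframework.util.StreamUtils;

import java.io.InputStream;

public class StreamAttempts {
    // Attempt 1: direct upload. Fails with RequestBodyTooLarge when
    // length exceeds the 256 MB Put Blob limit.
    static void uploadDirect(BlockBlobClient client, InputStream in, long length) {
        client.upload(in, length);
    }

    // Attempt 2: channel the incoming stream through BlobOutputStream.
    // This is the path that showed ~140 KiB/s and wrong blob sizes.
    static void uploadViaOutputStream(BlockBlobClient client, InputStream in) throws Exception {
        try (BlobOutputStream out = client.getBlobOutputStream()) {
            StreamUtils.copy(in, out);
        }
    }
}
```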
Why is this not a Bug or a feature Request?
The input stream upload seems to adhere to the API limit of a 256 MB request body, so it's technically not a bug. But I am not sure whether the SDK shouldn't handle this properly in its implementation, which is why I am not filing a feature request for now.
The issue with writing to the output stream may be a bug, but I am not sure yet.
Setup (please complete the following information if applicable):