[BEAM-6807] Implement an Azure blobstore filesystem for Python SDK #12492
Conversation
R: @pabloem
retest this please
(the previous comment was to start running automated tests)
https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/5464/console - trailing whitespace in blobstorageio ; )
Some formatting complaints - https://ci-beam.apache.org/job/beam_PreCommit_PythonFormatter_Commit/3163/console - you can run the autoformatter to fix these.
https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/14420/#showFailuresLink - the dependency is missing because you did not add it to be installed. You can add the azure dependency in BeamModulePlugin.groovy and tox.ini like in this PR: https://github.com/apache/beam/pull/11149/files, and skip the tests whenever the dependency is missing, like here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/aws/s3filesystem_test.py#L37-L44
Force-pushed from 38fde3f to b2e0ac4
Run Python2_PVR_Flink PreCommit
import future.tests.base  # pylint: disable=unused-import
import mock

from apache_beam.io.azure import blobstorageio
the import error occurs here, so you should move this import to line 40
import logging
import unittest

from apache_beam.io.azure import blobstorageio
there's also an import error happening here. you need to catch it and skip the test
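For reference, a minimal sketch of the catch-and-skip pattern used in the linked s3filesystem_test.py; the class and test names here are illustrative, not the PR's actual code:

```python
import logging
import unittest

# Protect against environments where the azure library is not available:
# fall back to None so the tests below are skipped instead of erroring.
try:
  from apache_beam.io.azure import blobstorageio
except ImportError:
  blobstorageio = None


@unittest.skipIf(blobstorageio is None, 'azure dependency is not installed')
class BlobStorageIOTest(unittest.TestCase):
  def test_placeholder(self):
    # Placeholder body for illustration only.
    self.assertIsNotNone(blobstorageio)
```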
This change generally looks good. I just added one question about scalability.
I have three things I'm concerned about:
- Authentication. Normally users are expected to authenticate using PipelineOptions. Can you ensure we pass a pipeline option with a connection string for users to connect? (See the sketch after this list.)
- Furthermore, can you create a JIRA issue to improve the authentication story? (passing a connection string in pipeline options is not a very good option, but it 'just works', so we need to track later improvements)
- I would like to get integration tests with Azurite merged as well. Can you share if you've looked at that as well?
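To illustrate the PipelineOptions route: a minimal sketch, assuming authentication via a connection string passed through pipeline options. `AzureOptions` and `--azure_connection_string` are hypothetical names for illustration, not this PR's actual code:

```python
from apache_beam.options.pipeline_options import PipelineOptions


class AzureOptions(PipelineOptions):
  """Hypothetical options class exposing an Azure connection string."""

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument(
        '--azure_connection_string',
        default=None,
        help='Connection string to authenticate with Azure Blob Storage.')


# Usage sketch: read the option from the pipeline's options and hand it to
# the Blob Storage client instead of relying on environment variables.
# options = pipeline_options.view_as(AzureOptions)
# client = BlobServiceClient.from_connection_string(
#     options.azure_connection_string)
```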
# The temporary file is deleted immediately after the operation.
with open(self._temporary_file.name, "rb") as f:
  self._blob_to_upload.upload_blob(
      f.read(), overwrite=True, content_settings=self._content_settings)
I recall an issue related to very large files. What happens when we're trying to upload a large file?
@pabloem Let's see:
- Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.
- Integration tests with Azurite. These are practically ready; the only thing left is to define a function in build.gradle that starts and stops Azurite. (You can find the branch here: https://github.com/AldairCoronel/beam/commits/azurite).
@pabloem What happens when we're trying to upload a large file?
Azure complains when you try to upload large files, although the documentation states: "Calls to write a blob, write a block, or write a page are permitted 10 minutes per megabyte to complete. If an operation is taking longer than 10 minutes per megabyte on average, it will time out."
Refer to this issue as well: https://github.com/Azure/azure-sdk-for-python/issues/12166
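For context, a hedged sketch of one common way around single-request timeouts: stage the file in blocks and commit the block list. `stage_block`, `commit_block_list`, and `BlobBlock` are real azure-storage-blob APIs, but `upload_in_blocks` is a hypothetical helper, not code from this PR:

```python
import base64
import uuid

from azure.storage.blob import BlobBlock


def upload_in_blocks(blob_client, path, block_size=4 * 1024 * 1024):
  """Hypothetical helper: upload a large file as individually staged blocks."""
  block_ids = []
  with open(path, 'rb') as f:
    while True:
      chunk = f.read(block_size)
      if not chunk:
        break
      # Block IDs must be base64-encoded strings of uniform length.
      block_id = base64.b64encode(uuid.uuid4().bytes).decode('utf-8')
      blob_client.stage_block(block_id=block_id, data=chunk)
      block_ids.append(BlobBlock(block_id=block_id))
  # Commit the staged blocks as the final blob in one small request.
  blob_client.commit_block_list(block_ids)
```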
Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.
Okay, this is not acceptable. We need to enable authentication via pipeline options, as we already discussed privately. This PR is ready to go, but we need to enable PipelineOptions-based authentication in a follow-up, okay?
Also, please address the comments by @epicfaace to catch PartialBatchErrorException, and then we can move forward to merge this change.
also, fwiw, sorry about the delay in reviewing this : (
# We intentionally do not decorate this method with a retry, since the
# underlying copy and delete operations are already idempotent operations
# protected by retry decorators.
def delete_paths(self, paths):
@AldairCoronel not sure if you've faced this issue when testing yourself, but when I tried using this code in my own project, I ran into this error: Azure/azure-sdk-for-python#13183
I had to work around it by calling delete_blob instead of delete_blobs: codalab/codalab-worksheets@1e3dd30.
Not sure if you faced a similar issue, but adding this here in case it's helpful.
@epicfaace It did not give me problems when testing with my Azure account. The only drawback was when testing with Azurite (emulator), because delete_blobs is not implemented yet.
I will make the change from delete_blobs to delete_blob in another PR when I add the tests with Azurite.
Thank you very much!
Interesting. To be clear, I think using delete_blobs would be ideal, since we would only require a single batch request, rather than having to call delete_blob over and over again (which is just a workaround for the error I mentioned above). If it's not supported by Azurite, though, it might be fine to just change it to use the delete_blob workaround.
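A minimal sketch of the per-blob fallback being discussed, assuming an azure.storage.blob ContainerClient; `delete_blobs_one_by_one` is a hypothetical helper, not code from this PR or the codalab commit:

```python
from azure.core.exceptions import ResourceNotFoundError


def delete_blobs_one_by_one(container_client, blob_names):
  """Hypothetical fallback: issue one delete_blob call per blob."""
  results = {}
  for name in blob_names:
    try:
      container_client.delete_blob(name)
      results[name] = 202  # Azure returns 202 Accepted on delete.
    except ResourceNotFoundError as e:
      results[name] = e.status_code
  return results
```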
FYI, it appears that Microsoft might have fixed the delete_blobs issue: Azure/azure-sdk-for-python#13183
for blob, error in zip(blobs, response):
  results[(container, blob)] = error.status_code

except BlobStorageError as e:
I think you should handle both BlobStorageError and PartialBatchErrorException on all blob storage operations (PartialBatchErrorException is raised in, for example, Azure/azure-sdk-for-python#13183) -- otherwise, what ends up happening is that only the status code from PartialBatchErrorException is retrieved, but the message is silenced and not logged at all.
Codecov Report
@@ Coverage Diff @@
## master #12492 +/- ##
==========================================
- Coverage 34.47% 34.28% -0.19%
==========================================
Files 684 699 +15
Lines 81483 82775 +1292
Branches 9180 9361 +181
==========================================
+ Hits 28090 28382 +292
- Misses 52972 53970 +998
- Partials 421 423 +2
Continue to review full report at Codecov.
Run Spotless PreCommit
Thanks @AldairCoronel ! To conclude:
Thanks for taking a look @epicfaace ! : )
This is some great functionality in this PR; is it expected to be in a release soon?
@tanya-borisova This change should be included in the next Beam release (2.25.0), which will begin a week from today, and will probably be finalized some weeks after. https://beam.apache.org/contribute/#when-will-my-change-show-up-in-an-apache-beam-release
Please add a meaningful description for your change here

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention reviewer(s) in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.