
[BEAM-6807] Implement an Azure blobstore filesystem for Python SDK #12492

Merged: 130 commits into apache:master on Aug 28, 2020

Conversation

@AldairCoronel (Contributor) commented Aug 7, 2020

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[CI badge table omitted: build-status badges for the Go, Java, Python, and XLang SDKs across the Dataflow, Flink, Samza, Spark, and Twister2 runners.]

Pre-Commit Tests Status (on master branch)

[CI badge table omitted: non-portable and portable pre-commit badges for Java, Python, Go, and Website.]

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels

See CI.md for more information about GitHub Actions CI.

@AldairCoronel (Contributor Author)

R: @pabloem

@pabloem (Member) commented Aug 10, 2020

retest this please

@pabloem (Member) commented Aug 10, 2020

(the previous comment was to start running automated tests)

@pabloem (Member) commented Aug 10, 2020

https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/5464/console - trailing whitespace in blobstorageio ; )

@pabloem (Member) commented Aug 10, 2020

Some formatting complaints - https://ci-beam.apache.org/job/beam_PreCommit_PythonFormatter_Commit/3163/console - you can run tox -e py3-yapf to fix them automatically.

@pabloem (Member) commented Aug 10, 2020

https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/14420/#showFailuresLink - the dependency is missing because it was never added to the install list. You can add the azure dependency in BeamModulePlugin.groovy and tox.ini as in this PR: https://github.com/apache/beam/pull/11149/files

And skip the tests whenever the dependency is missing, like here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/aws/s3filesystem_test.py#L37-L44
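The skip-on-missing-dependency pattern referenced above can be sketched roughly like this (the module path matches this PR; the test body is a hypothetical placeholder, not a real test from the change):

```python
import unittest

# Attempt the optional import; fall back to None when the Azure extras
# are not installed, mirroring the s3filesystem_test.py pattern.
try:
  from apache_beam.io.azure import blobstorageio
except ImportError:
  blobstorageio = None


@unittest.skipIf(blobstorageio is None, 'Azure dependencies are not installed')
class BlobStorageIOTest(unittest.TestCase):
  def test_placeholder(self):
    self.assertIsNotNone(blobstorageio)
```

With this guard, environments without the `azure` extras skip the suite instead of failing at import time.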

@pabloem (Member) commented Aug 11, 2020

Run Python2_PVR_Flink PreCommit

import future.tests.base # pylint: disable=unused-import
import mock

from apache_beam.io.azure import blobstorageio
Review comment (Member):

the import error occurs here, so you should move this import to line 40

import logging
import unittest

from apache_beam.io.azure import blobstorageio
Review comment (Member):

there's also an import error happening here. you need to catch it and skip the test

@pabloem (Member) left a review comment:

This change generally looks good. I just added one question about the scalability.

I have three things I'm concerned about:

  • Authentication. Normally users are expected to authenticate using PipelineOptions. Can you ensure we pass a pipeline option with a connection string for users to connect?
  • Furthermore, can you create a JIRA issue to improve the authentication story? (passing a connection string in pipeline options is not a very good option, but it 'just works', so we need to track later improvements)
  • I would like to get integration tests with Azurite merged as well. Can you share if you've looked at that as well?
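A minimal sketch of what such a pipeline option could look like, using Beam's standard `_add_argparse_args` hook; the class and option names (`AzureOptions`, `--azure_connection_string`) are illustrative assumptions, not the final API:

```python
import argparse

try:
  from apache_beam.options.pipeline_options import PipelineOptions
except ImportError:
  PipelineOptions = object  # fallback so the sketch runs without Beam installed


class AzureOptions(PipelineOptions):
  """Hypothetical options group carrying the Blob Storage connection string."""

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument(
        '--azure_connection_string',
        default=None,
        help='Connection string used to authenticate to Azure Blob Storage.')
```

With something like this in place, `pipeline_options.view_as(AzureOptions).azure_connection_string` would make the credential reachable from the filesystem code instead of only from environment variables.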

# The temporary file is deleted immediately after the operation.
with open(self._temporary_file.name, "rb") as f:
  self._blob_to_upload.upload_blob(
      f.read(), overwrite=True, content_settings=self._content_settings)
Review comment (Member):

I recall an issue related to very large files. What happens when we're trying to upload a large file?

Reply (Contributor Author):

@pabloem Let's see:

  • Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.
  • Integration tests with Azurite. Integration tests with Azurite are practically ready. The only thing left is to define a function in build.gradle that runs and stops Azurite. (You can find the branch here: https://github.com/AldairCoronel/beam/commits/azurite).

Reply (Contributor Author):

@pabloem What happens when we're trying to upload a large file?

Azure complains when you try to upload large files, although the documentation states: "Calls to write a blob, write a block, or write a page are permitted 10 minutes per megabyte to complete. If an operation is taking longer than 10 minutes per megabyte on average, it will time out."

Refer to this issue as well: https://github.com/Azure/azure-sdk-for-python/issues/12166

Reply (Member):

Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.

Okay, this is not acceptable. We need to enable authentication via pipeline options, as we already discussed privately. This PR is ready to go, but we need to enable PipelineOptions-based authentication in a follow-up, okay?

Also, please address comments by @epicfaace to catch PartialBatchErrorException, and then we can move forward to merge this change.

@pabloem (Member) commented Aug 25, 2020

also, fwiw, sorry about the delay in reviewing this : (

# We intentionally do not decorate this method with a retry, since the
# underlying copy and delete operations are already idempotent operations
# protected by retry decorators.
def delete_paths(self, paths):
Review comment (Contributor):

@AldairCoronel not sure if you've faced this issue when testing yourself, but when I tried using this code in my own project, I ran into this error: Azure/azure-sdk-for-python#13183

I had to work around it by calling delete_blob instead of delete_blobs: codalab/codalab-worksheets@1e3dd30.

Not sure if you faced a similar issue, but adding this here in case it's helpful.

Reply (Contributor Author):

@epicfaace It did not give me problems when testing with my Azure account. The only drawback was when testing with Azurite (emulator) because delete_blobs is not implemented yet.

I will make the changes from delete_blobs to delete_blob in another PR when I add the tests with Azurite.

Thank you very much!

Reply (Contributor):

Interesting. To be clear, I think using delete_blobs would be ideal, since we would only require a single batch request, rather than having to call delete_blob over and over again (which is just a workaround for the error I mentioned above). If it's not supported by Azurite, though, it might be fine to just change it to use the delete_blob workaround.
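The per-blob fallback discussed here can be sketched as follows; `container_client` is assumed to expose the SDK's `delete_blob(name)` method, and the status-code bookkeeping is illustrative:

```python
def delete_blobs_one_by_one(container_client, blob_names):
  """Workaround for Azure/azure-sdk-for-python#13183: issue one delete_blob
  call per blob instead of a single delete_blobs batch request."""
  results = {}
  for name in blob_names:
    try:
      container_client.delete_blob(name)
      results[name] = 202  # the Blob service answers a delete with 202 Accepted
    except Exception as e:  # real code should catch the specific SDK errors
      results[name] = getattr(e, 'status_code', None)
  return results
```

The trade-off is exactly the one raised above: one HTTP request per blob instead of a single batch round-trip.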

Reply (Contributor):

FYI, it appears that Microsoft might have fixed the delete_blobs issue: Azure/azure-sdk-for-python#13183

      for blob, error in zip(blobs, response):
        results[(container, blob)] = error.status_code

    except BlobStorageError as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should handle both BlobStorageError and PartialBatchErrorException on all blob storage operations (PartialBatchErrorException is raised in, for example, Azure/azure-sdk-for-python#13183) -- otherwise, what ends up happening is that only the status code from PartialBatchErrorException is retrieved, but the message is silenced and not logged at all.
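A rough shape for the suggested handling, so the batch error message is logged rather than silenced. The import path assumes the Azure SDK v12 export of `PartialBatchErrorException`; a stub stands in when the SDK is absent, and `BlobStorageError` (Beam's own wrapper in blobstorageio) would get a sibling except clause in the real code:

```python
import logging

try:
  from azure.storage.blob import PartialBatchErrorException
except ImportError:
  class PartialBatchErrorException(Exception):
    """Stub standing in when the Azure SDK is not installed."""
    message = ''
    parts = ()


def delete_blobs_logged(container_client, blob_names):
  """Batch-delete blobs, logging partial failures instead of dropping them."""
  try:
    container_client.delete_blobs(*blob_names)
  except PartialBatchErrorException as e:
    # Surface the message in addition to the per-part status codes.
    logging.error('Batch delete partially failed: %s', e.message)
    raise
```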

@codecov (bot) commented Aug 26, 2020

Codecov Report

Merging #12492 into master will decrease coverage by 0.18%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12492      +/-   ##
==========================================
- Coverage   34.47%   34.28%   -0.19%     
==========================================
  Files         684      699      +15     
  Lines       81483    82775    +1292     
  Branches     9180     9361     +181     
==========================================
+ Hits        28090    28382     +292     
- Misses      52972    53970     +998     
- Partials      421      423       +2     
Impacted Files Coverage Δ
typehints/typecheck_test_py3.py 31.54% <0.00%> (-16.00%) ⬇️
typehints/typecheck.py 29.44% <0.00%> (-6.18%) ⬇️
utils/interactive_utils.py 30.95% <0.00%> (-2.39%) ⬇️
runners/worker/opcounters.py 33.81% <0.00%> (-0.87%) ⬇️
pipeline.py 22.04% <0.00%> (-0.28%) ⬇️
dataframe/transforms_test.py 25.00% <0.00%> (-0.21%) ⬇️
io/gcp/bigquery_test.py 27.39% <0.00%> (-0.18%) ⬇️
io/filesystems.py 55.00% <0.00%> (-0.18%) ⬇️
options/pipeline_options.py 52.99% <0.00%> (-0.16%) ⬇️
transforms/ptransform_test.py 18.37% <0.00%> (-0.09%) ⬇️
... and 37 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 086b985...3884527.

@pabloem (Member) commented Aug 28, 2020

Run Spotless PreCommit

@pabloem pabloem merged commit bae1e7b into apache:master Aug 28, 2020
@pabloem (Member) commented Aug 28, 2020

Thanks @AldairCoronel ! To conclude:

  • Let's figure out the authentication story via pipeline options
  • Let's set up integration tests using Azurite

@pabloem (Member) commented Aug 28, 2020

Thanks for taking a look @epicfaace ! : )

@tanya-borisova commented:
This PR adds some great functionality; is it expected to be in a release soon?

@ibzib (Contributor) commented Sep 16, 2020

@tanya-borisova This change should be included in the next Beam release (2.25.0), which will begin a week from today, and will probably be finalized some weeks after.

https://beam.apache.org/contribute/#when-will-my-change-show-up-in-an-apache-beam-release
