
[BEAM-6807] Implement an Azure blobstore filesystem for Python SDK #12492

Merged: 130 commits into apache:master on Aug 28, 2020

Conversation

@AldairCoronel (Contributor) commented Aug 7, 2020

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[CI badge table omitted: build-status badges for the Go, Java, Python, and XLang SDKs across the Dataflow, Flink, Samza, Spark, and Twister2 runners.]

Pre-Commit Tests Status (on master branch)

[CI badge table omitted: non-portable and portable pre-commit badges for Java, Python, Go, and Website.]

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels

See CI.md for more information about GitHub Actions CI.

@AldairCoronel (Contributor Author)

R: @pabloem

@pabloem (Member) commented Aug 10, 2020

retest this please

@pabloem (Member) commented Aug 10, 2020

(the previous comment was to start running automated tests)

@pabloem (Member) commented Aug 10, 2020

https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/5464/console - trailing whitespace in blobstorageio ; )

@pabloem (Member) commented Aug 10, 2020

Some formatting complaints - https://ci-beam.apache.org/job/beam_PreCommit_PythonFormatter_Commit/3163/console - you can run tox -e py3-yapf to fix them automatically.

@pabloem (Member) commented Aug 10, 2020

https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/14420/#showFailuresLink - the dependency is missing because it was never added to the install list. You can add the azure dependency in BeamModulePlugin.groovy and tox.ini as in this PR: https://github.com/apache/beam/pull/11149/files

And skip the tests whenever the dependency is missing, like here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/aws/s3filesystem_test.py#L37-L44
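The skip-on-missing-dependency pattern referenced above can be sketched roughly like this (the module path matches this PR; the test body is a hypothetical placeholder, not a real test from the change):

```python
import unittest

# Attempt the optional import; fall back to None when the Azure extras
# are not installed, mirroring the s3filesystem_test.py pattern.
try:
  from apache_beam.io.azure import blobstorageio
except ImportError:
  blobstorageio = None


@unittest.skipIf(blobstorageio is None, 'Azure dependencies are not installed')
class BlobStorageIOTest(unittest.TestCase):
  def test_placeholder(self):
    self.assertIsNotNone(blobstorageio)
```

With this guard, environments without the `azure` extras skip the suite instead of failing at import time.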

@pabloem (Member) commented Aug 11, 2020

Run Python2_PVR_Flink PreCommit

import future.tests.base # pylint: disable=unused-import
import mock

from apache_beam.io.azure import blobstorageio
Review comment (Member):

the import error occurs here, so you should move this import to line 40

import logging
import unittest

from apache_beam.io.azure import blobstorageio
Review comment (Member):

there's also an import error happening here. you need to catch it and skip the test

@pabloem (Member) left a review comment:

This change generally looks good. I just added one question about the scalability.

I have three things I'm concerned about:

  • Authentication. Normally users are expected to authenticate using PipelineOptions. Can you ensure we pass a pipeline option with a connection string for users to connect?
  • Furthermore, can you create a JIRA issue to improve the authentication story? (passing a connection string in pipeline options is not a very good option, but it 'just works', so we need to track later improvements)
  • I would like to get integration tests with Azurite merged as well. Can you share if you've looked at that as well?
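A minimal sketch of what such a pipeline option could look like, using Beam's standard `_add_argparse_args` hook; the class and option names (`AzureOptions`, `--azure_connection_string`) are illustrative assumptions, not the final API:

```python
import argparse

try:
  from apache_beam.options.pipeline_options import PipelineOptions
except ImportError:
  PipelineOptions = object  # fallback so the sketch runs without Beam installed


class AzureOptions(PipelineOptions):
  """Hypothetical options group carrying the Blob Storage connection string."""

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument(
        '--azure_connection_string',
        default=None,
        help='Connection string used to authenticate to Azure Blob Storage.')
```

With something like this in place, `pipeline_options.view_as(AzureOptions).azure_connection_string` would make the credential reachable from the filesystem code instead of only from environment variables.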

# The temporary file is deleted immediately after the operation.
with open(self._temporary_file.name, "rb") as f:
  self._blob_to_upload.upload_blob(
      f.read(), overwrite=True, content_settings=self._content_settings)
Review comment (Member):

I recall an issue related to very large files. What happens when we're trying to upload a large file?

Reply (Contributor Author):

@pabloem Let's see:

  • Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.
  • Integration tests with Azurite. Integration tests with Azurite are practically ready. The only thing left is to define a function in build.gradle that runs and stops Azurite. (You can find the branch here: https://github.com/AldairCoronel/beam/commits/azurite).

Reply (Contributor Author):

@pabloem What happens when we're trying to upload a large file?

Azure complains when you try to upload large files, although the documentation states: "Calls to write a blob, write a block, or write a page are permitted 10 minutes per megabyte to complete. If an operation is taking longer than 10 minutes per megabyte on average, it will time out."

Refer to this issue as well: https://github.com/Azure/azure-sdk-for-python/issues/12166

Reply (Member):

Authentication. At the moment the only way to authenticate is with a connection string obtained from environment variables.

Okay, this is not acceptable. We need to enable authentication via pipeline options, as we already discussed privately. This PR is ready to go, but we need to enable PipelineOptions-based authentication in a follow-up, okay?

Also, please address comments by @epicfaace to catch PartialBatchErrorException, and then we can move forward to merge this change.

@pabloem (Member) commented Aug 25, 2020

also, fwiw, sorry about the delay in reviewing this : (

# We intentionally do not decorate this method with a retry, since the
# underlying copy and delete operations are already idempotent operations
# protected by retry decorators.
def delete_paths(self, paths):
Review comment (Contributor):

@AldairCoronel not sure if you've faced this issue when testing yourself, but when I tried using this code in my own project, I ran into this error: Azure/azure-sdk-for-python#13183

I had to work around it by calling delete_blob instead of delete_blobs: codalab/codalab-worksheets@1e3dd30.

Not sure if you faced a similar issue, but adding this here in case it's helpful.

Reply (Contributor Author):

@epicfaace It did not give me problems when testing with my Azure account. The only drawback was when testing with Azurite (emulator) because delete_blobs is not implemented yet.

I will make the changes from delete_blobs to delete_blob in another PR when I add the tests with Azurite.

Thank you very much!

Reply (Contributor):

Interesting. To be clear, I think using delete_blobs would be ideal, since we would only require a single batch request, rather than having to call delete_blob over and over again (which is just a workaround for the error I mentioned above). If it's not supported by Azurite, though, it might be fine to just change it to use the delete_blob workaround.
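The per-blob fallback discussed here can be sketched as follows; `container_client` is assumed to expose the SDK's `delete_blob(name)` method, and the status-code bookkeeping is illustrative:

```python
def delete_blobs_one_by_one(container_client, blob_names):
  """Workaround for Azure/azure-sdk-for-python#13183: issue one delete_blob
  call per blob instead of a single delete_blobs batch request."""
  results = {}
  for name in blob_names:
    try:
      container_client.delete_blob(name)
      results[name] = 202  # the Blob service answers a delete with 202 Accepted
    except Exception as e:  # real code should catch the specific SDK errors
      results[name] = getattr(e, 'status_code', None)
  return results
```

The trade-off is exactly the one raised above: one HTTP request per blob instead of a single batch round-trip.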

Reply (Contributor):

FYI, it appears that Microsoft might have fixed the delete_blobs issue: Azure/azure-sdk-for-python#13183

      for blob, error in zip(blobs, response):
        results[(container, blob)] = error.status_code

    except BlobStorageError as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should handle both BlobStorageError and PartialBatchErrorException on all blob storage operations (PartialBatchErrorException is raised in, for example, Azure/azure-sdk-for-python#13183) -- otherwise, what ends up happening is that only the status code from PartialBatchErrorException is retrieved, but the message is silenced and not logged at all.
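A rough shape for the suggested handling, so the batch error message is logged rather than silenced. The import path assumes the Azure SDK v12 export of `PartialBatchErrorException`; a stub stands in when the SDK is absent, and `BlobStorageError` (Beam's own wrapper in blobstorageio) would get a sibling except clause in the real code:

```python
import logging

try:
  from azure.storage.blob import PartialBatchErrorException
except ImportError:
  class PartialBatchErrorException(Exception):
    """Stub standing in when the Azure SDK is not installed."""
    message = ''
    parts = ()


def delete_blobs_logged(container_client, blob_names):
  """Batch-delete blobs, logging partial failures instead of dropping them."""
  try:
    container_client.delete_blobs(*blob_names)
  except PartialBatchErrorException as e:
    # Surface the message in addition to the per-part status codes.
    logging.error('Batch delete partially failed: %s', e.message)
    raise
```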

@codecov (bot) commented Aug 26, 2020

Codecov Report

Merging #12492 into master will decrease coverage by 0.18%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12492      +/-   ##
==========================================
- Coverage   34.47%   34.28%   -0.19%     
==========================================
  Files         684      699      +15     
  Lines       81483    82775    +1292     
  Branches     9180     9361     +181     
==========================================
+ Hits        28090    28382     +292     
- Misses      52972    53970     +998     
- Partials      421      423       +2     
Impacted Files Coverage Δ
typehints/typecheck_test_py3.py 31.54% <0.00%> (-16.00%) ⬇️
typehints/typecheck.py 29.44% <0.00%> (-6.18%) ⬇️
utils/interactive_utils.py 30.95% <0.00%> (-2.39%) ⬇️
runners/worker/opcounters.py 33.81% <0.00%> (-0.87%) ⬇️
pipeline.py 22.04% <0.00%> (-0.28%) ⬇️
dataframe/transforms_test.py 25.00% <0.00%> (-0.21%) ⬇️
io/gcp/bigquery_test.py 27.39% <0.00%> (-0.18%) ⬇️
io/filesystems.py 55.00% <0.00%> (-0.18%) ⬇️
options/pipeline_options.py 52.99% <0.00%> (-0.16%) ⬇️
transforms/ptransform_test.py 18.37% <0.00%> (-0.09%) ⬇️
... and 37 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 086b985...3884527.

@pabloem (Member) commented Aug 28, 2020

Run Spotless PreCommit

@pabloem pabloem merged commit bae1e7b into apache:master Aug 28, 2020
@pabloem (Member) commented Aug 28, 2020

Thanks @AldairCoronel ! To conclude:

  • Let's figure out the authentication story via pipeline options
  • Let's set up integration tests using Azurite

@pabloem (Member) commented Aug 28, 2020

Thanks for taking a look @epicfaace ! : )

@tanya-borisova commented:
This PR adds some great functionality; is it expected to be in a release soon?

@ibzib (Contributor) commented Sep 16, 2020

@tanya-borisova This change should be included in the next Beam release (2.25.0), which will begin a week from today, and will probably be finalized some weeks after.

https://beam.apache.org/contribute/#when-will-my-change-show-up-in-an-apache-beam-release
