-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove from _KNOWN_TABLES on 404 insert to allow table re-creation for Python SDK #25325
Remove from _KNOWN_TABLES on 404 insert to allow table re-creation for Python SDK #25325
Conversation
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225 * Add check during `BigQueryWriteFn._flush_batch` such that if an insert fails with HttpError 404 and reason 'notFound', we remove the table_reference from _KNOWN_TABLES so that on subsequent calls to `BigQueryWriteFn._create_table_if_needed` the table may be recreated (depending on create_disposition)
… actually raised and errors return isn't populated
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225 * Add check during bigquery insert such that if an insert fails with code 404, we remove the table_reference from _KNOWN_TABLES so that on subsequent calls to `BigQueryWriteFn._create_table_if_needed` will recreate the table if needed (subject to create_disposition)
Codecov Report
@@ Coverage Diff @@
## master #25325 +/- ##
==========================================
- Coverage 72.96% 72.90% -0.07%
==========================================
Files 743 746 +3
Lines 99037 99273 +236
==========================================
+ Hits 72266 72378 +112
- Misses 25405 25529 +124
Partials 1366 1366
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 18 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Assigning reviewers. If you would like to opt out of this review, comment R: @jrmccluskey for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, thank you for these changes and congrats on opening your first beam PR :)
Co-authored-by: Ahmed Abualsaud <65791736+ahmedabu98@users.noreply.github.com>
… trigger removing destination from _KNOWN_TABLES
Run Python 3.7 PostCommit |
Run Python 3.10 PostCommit |
Python 3.7 postcommits failing due to #25359 |
@tomlynchRNA this LGTM but before we can merge this, can you write up a test to confirm these changes work as intended? |
Also please fix formatting errors; you can go to |
Hey, @ahmedabu98 I've tried running the commands to check for linting changes that need to be made, however, I get the below error: tox -e py3-yapf && ../../gradlew lint
.pkg: _optional_hooks> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: get_requires_for_build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: prepare_metadata_for_build_wheel> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: install_package> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz
Processing ./target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/setup.py", line 180, in <module>
generate_protos_first()
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/setup.py", line 150, in generate_protos_first
gen_protos.generate_proto_files()
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/gen_protos.py", line 476, in generate_proto_files
raise RuntimeError(error_msg)
RuntimeError: Not in apache git tree, unable to find proto definitions.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
py3-yapf: exit 1 (0.61 seconds) /Users/ragy/Documents/RNA/GitHub/beam/sdks/python> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz pid=80407
.pkg: _exit> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: FAIL code 1 (1.78 seconds)
evaluation failed :( (1.85 seconds) I checked the issues and it seems that this was a problem in apache-beam 2.43 and was fixed. Please see the issue: #23227 Not sure how to get past this |
I never got tox yapf working locally. I am doing this to fix lint:
hopefully it works |
hi, @Abacn, unfortunately, got the same error still. I've tried running the org/apache/beam/model/interactive/v1/beam_interactive_api.proto:36:1: warning: Import google/protobuf/timestamp.proto is unused.
Writing mypy to org/apache/beam/model/pipeline/v1/external_transforms_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/schema_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/metrics_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/endpoints_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/beam_runner_api_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/standard_window_fns_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_artifact_api_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_expansion_api_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_job_api_pb2.pyi
Writing mypy to org/apache/beam/model/fn_execution/v1/beam_fn_api_pb2.pyi
Writing mypy to org/apache/beam/model/fn_execution/v1/beam_provision_api_pb2.pyi
Writing mypy to org/apache/beam/model/interactive/v1/beam_interactive_api_pb2.pyi
Traceback (most recent call last):
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 561, in <module>
generate_proto_files(force=True)
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 555, in generate_proto_files
generate_urn_files(proto_package, PYTHON_OUTPUT_PATH)
File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 126, in generate_urn_files
import google.protobuf.pyext._message as pyext_message
ImportError: dlopen(/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so, 0x0002): tried: '/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (no such file), '/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')) |
Yeah it's a known issue. See this comment to fix: @ahmedabu98 uses linux and it's fine. I have an M1 mac experiencing same issue. The problem is that protobuf installed from wheel missing some internal methods used by beam |
@ahmedabu98 as you suggested i tried running the below but still got the same error pip install --no-binary :all: grpcio --ignore-installed
pip install --no-binary :all: grpcio-tools --ignore-installed |
@Abacn @ahmedabu98 tox -e py3-yapf
.pkg: _optional_hooks> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: get_requires_for_build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: prepare_metadata_for_build_wheel> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: install_package> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/16/apache-beam-2.46.0.dev0.tar.gz
py3-yapf: commands_pre[0]> python --version
Python 3.10.9
py3-yapf: commands_pre[1]> pip --version
pip 23.0 from /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/lib/python3.10/site-packages/pip (python 3.10)
py3-yapf: commands_pre[2]> pip check
No broken requirements found.
py3-yapf: commands_pre[3]> bash /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/scripts/run_tox_cleanup.sh
py3-yapf: commands[0]> yapf --version
yapf 0.29.0
py3-yapf: commands[1]> time yapf --in-place --parallel --recursive apache_beam
11.89 real 125.12 user 1.01 sys
py3-yapf: commands_post[0]> bash /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/scripts/run_tox_cleanup.sh
.pkg: _exit> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: OK (17.31=setup[4.82]+cmd[0.01,0.14,0.36,0.03,0.04,11.89,0.03] seconds)
congratulations :) (17.36 seconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad tox is working now. You can run lint checks with ../../gradlew lint
, which outputs something like:
18:34:26 ************* Module apache_beam.io.gcp.bigquery
18:34:26 apache_beam/io/gcp/bigquery.py:1572:0: C0301: Line too long (113/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1573:0: C0301: Line too long (116/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1574:0: C0301: Line too long (104/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1572:23: W1202: Use lazy % or % formatting in logging functions (logging-format-interpolation)
18:34:26 apache_beam/io/gcp/bigquery.py:411:2: C0412: Imports from package apache_beam are not grouped (ungrouped-imports)
_LOGGER.info("Table {} was not found. Table will be removed from _KNOWN_TABLES and bundle will retry. " | ||
"This sometimes occurs due to the table being deleted while a streaming job is running and " | ||
"the destination was previously added to the _KNOWN_TABLES".format(destination)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint check says these lines are too long, try breaking them down to more lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it's recommended to use printf-style %s
. ie. _LOGGER.info("Table %s was ...", destination)
This optimizes by postponing the formatting to when the message is outputted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this exception warrants a _LOGGER.warning
level log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahmedabu98 addressed all the above in the latest commit 88205fa
Hopefully, that should do it
Reminder, please take a look at this pr: @jrmccluskey @ahmedabu98 |
Thanks for the contribution but I'm not sure I agree with the assumptions of the corresponding issue. Added a comment here: #25225 (comment) |
Reminder, please take a look at this pr: @jrmccluskey @ahmedabu98 |
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @tvalentyn for label python. Available commands:
|
.waiting on author |
Hey @tomlynchRNA, my first two change requests still need to be addressed. |
Co-authored-by: Ahmed Abualsaud <65791736+ahmedabu98@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tomlynchRNA, changes LGTM. We should have a more accurate test though
|
||
@mock.patch('time.sleep') | ||
@mock.patch('google.cloud.bigquery.Client.insert_rows_json') | ||
def test_insert_all_table_not_found_errors( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this actually tests the changes you're making. It mocks insert_rows_json
but doesn't return a 404 error.
An integration test is probably more appropriate for these changes (instead of writing to a dummy table). And we should have a check at the beginning of the test to make sure we're running with TestDataflowRunner:
if not isinstance(self.test_pipeline.runner, TestDataflowRunner):
self.skipTest("This test will invoke a bundle retry, which needs TestDataflowRunner")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach you take in this test looks good (adding to KNOWN_TABLES before the write), let's do that with a real write to BQ. I suggest moving this test to BigQueryStreamingInsertTransformIntegrationTests
and make use of the output table created in setUp
:
beam/sdks/python/apache_beam/io/gcp/bigquery_test.py
Lines 1606 to 1620 in 92abdf6
class BigQueryStreamingInsertTransformIntegrationTests(unittest.TestCase): | |
BIG_QUERY_DATASET_ID = 'python_bq_streaming_inserts_' | |
def setUp(self): | |
self.test_pipeline = TestPipeline(is_integration_test=True) | |
self.runner_name = type(self.test_pipeline.runner).__name__ | |
self.project = self.test_pipeline.get_option('project') | |
self.dataset_id = '%s%d%s' % ( | |
self.BIG_QUERY_DATASET_ID, int(time.time()), secrets.token_hex(3)) | |
self.bigquery_client = bigquery_tools.BigQueryWrapper() | |
self.bigquery_client.get_or_create_dataset(self.project, self.dataset_id) | |
self.output_table = "%s.output_table" % (self.dataset_id) | |
_LOGGER.info( | |
"Created dataset %s in project %s", self.dataset_id, self.project) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. you can use BigqueryFullResultMatcher
to make sure the data arrived at the table correctly. Example below:
beam/sdks/python/apache_beam/io/gcp/bigquery_test.py
Lines 1657 to 1661 in 92abdf6
BigqueryFullResultMatcher( | |
project=self.project, | |
query="SELECT name, language FROM %s" % output_table_1, | |
data=[(d['name'], d['language']) for d in _ELEMENTS | |
if 'language' in d]), |
Reminder, please take a look at this pr: @tvalentyn @chamikaramj |
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label python. Available commands:
|
waiting on author |
Hi @tomlynchRNA, just checking - do you need any further assistance on this PR or the feedback is clear. Thanks for your contribution! |
Hi @tvalentyn @ahmedabu98 yeah can you let us know what tests need to be written so we can complete this PR. |
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Ref #25225
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.UpdateCHANGES.md
with noteworthy changes.If this contribution is large, please file an Apache Individual Contributor License Agreement.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.
cc @ragyibrahim