Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove from _KNOWN_TABLES on 404 insert to allow table re-creation for Python SDK #25325

Conversation

tomlynchRNA
Copy link

Ref #25225


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

cc @ragyibrahim

tomlynchRNA and others added 6 commits February 2, 2023 12:58
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225

* Add check during `BigQueryWriteFn._flush_batch` such that if an insert fails with HttpError 404 and reason 'notFound',  we remove the table_reference from _KNOWN_TABLES so that on subsequent calls to `BigQueryWriteFn._create_table_if_needed` the table may be recreated (depending on create_disposition)
… actually raised and errors return isn't populated
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225

* Add check during bigquery insert such that if an insert fails with code 404, we remove the table_reference from _KNOWN_TABLES so that on subsequent calls to `BigQueryWriteFn._create_table_if_needed` will recreate the table if needed (subject to create_disposition)
@codecov
Copy link

codecov bot commented Feb 6, 2023

Codecov Report

Merging #25325 (88205fa) into master (067e4db) will decrease coverage by 0.07%.
The diff coverage is 80.00%.

❗ Current head 88205fa differs from pull request most recent head 6e07f5e. Consider uploading reports for the commit 6e07f5e to get more accurate results

@@            Coverage Diff             @@
##           master   #25325      +/-   ##
==========================================
- Coverage   72.96%   72.90%   -0.07%     
==========================================
  Files         743      746       +3     
  Lines       99037    99273     +236     
==========================================
+ Hits        72266    72378     +112     
- Misses      25405    25529     +124     
  Partials     1366     1366              
Flag Coverage Δ
python 82.38% <80.00%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py 71.41% <80.00%> (+0.06%) ⬆️

... and 18 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2023

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, thank you for these changes and congrats on opening your first beam PR :)

sdks/python/apache_beam/io/gcp/bigquery.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/io/gcp/bigquery.py Outdated Show resolved Hide resolved
ragyabraham and others added 2 commits February 7, 2023 09:46
… trigger removing destination from _KNOWN_TABLES
@ahmedabu98
Copy link
Contributor

Run Python 3.7 PostCommit

@ahmedabu98
Copy link
Contributor

Run Python 3.10 PostCommit

@ahmedabu98
Copy link
Contributor

Python 3.7 postcommits failing due to #25359

@ahmedabu98
Copy link
Contributor

@tomlynchRNA this LGTM but before we can merge this, can you write up a test to confirm these changes work as intended?

@ahmedabu98
Copy link
Contributor

Also please fix formatting errors; you can go to sdks/python and run tox -e py3-yapf && ../../gradlew lint and see what needs fixing. Check Python Tips for more details

@ragyabraham
Copy link

Hey, @ahmedabu98 I've tried running the commands to check for linting changes that need to be made, however, I get the below error:

tox -e py3-yapf && ../../gradlew lint
.pkg: _optional_hooks> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: get_requires_for_build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: prepare_metadata_for_build_wheel> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: install_package> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz
Processing ./target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz
  Preparing metadata (setup.py) ... error

  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/setup.py", line 180, in <module>
          generate_protos_first()
        File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/setup.py", line 150, in generate_protos_first
          gen_protos.generate_proto_files()
        File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/tmp/pip-req-build-seoqvdlv/gen_protos.py", line 476, in generate_proto_files
          raise RuntimeError(error_msg)
      RuntimeError: Not in apache git tree, unable to find proto definitions.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
py3-yapf: exit 1 (0.61 seconds) /Users/ragy/Documents/RNA/GitHub/beam/sdks/python> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/4/apache-beam-2.46.0.dev0.tar.gz pid=80407
.pkg: _exit> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
  py3-yapf: FAIL code 1 (1.78 seconds)
  evaluation failed :( (1.85 seconds)

I checked the issues and it seems that this was a problem in apache-beam 2.43 and was fixed. Please see the issue: #23227

Not sure how to get past this

@Abacn
Copy link
Contributor

Abacn commented Feb 8, 2023

I never got tox yapf working locally. I am doing this to fix lint:

# Run from root beam repo dir
pip install yapf==0.29.0
git diff master --name-only | grep "\.py$" | xargs yapf --in-place

hopefully it works

@ragyabraham
Copy link

hi, @Abacn, unfortunately, got the same error still.

I've tried running the gen_protos.py file and I get the below error message (FYI im running an M2 mac)

org/apache/beam/model/interactive/v1/beam_interactive_api.proto:36:1: warning: Import google/protobuf/timestamp.proto is unused.
Writing mypy to org/apache/beam/model/pipeline/v1/external_transforms_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/schema_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/metrics_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/endpoints_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/beam_runner_api_pb2.pyi
Writing mypy to org/apache/beam/model/pipeline/v1/standard_window_fns_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_artifact_api_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_expansion_api_pb2.pyi
Writing mypy to org/apache/beam/model/job_management/v1/beam_job_api_pb2.pyi
Writing mypy to org/apache/beam/model/fn_execution/v1/beam_fn_api_pb2.pyi
Writing mypy to org/apache/beam/model/fn_execution/v1/beam_provision_api_pb2.pyi
Writing mypy to org/apache/beam/model/interactive/v1/beam_interactive_api_pb2.pyi
Traceback (most recent call last):
  File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 561, in <module>
    generate_proto_files(force=True)
  File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 555, in generate_proto_files
    generate_urn_files(proto_package, PYTHON_OUTPUT_PATH)
  File "/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/gen_protos.py", line 126, in generate_urn_files
    import google.protobuf.pyext._message as pyext_message
ImportError: dlopen(/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so, 0x0002): tried: '/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (no such file), '/Users/ragy/Documents/RNA/GitHub/beam/sdks/python/.venv/lib/python3.10/site-packages/google/protobuf/pyext/_message.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))

@Abacn
Copy link
Contributor

Abacn commented Feb 9, 2023

Yeah it's a known issue. See this comment to fix:

#22742 (comment)

@ahmedabu98 uses linux and it's fine. I have an M1 mac experiencing same issue. The problem is that protobuf installed from wheel missing some internal methods used by beam

@ragyabraham
Copy link

@ahmedabu98 as you suggested i tried running the below but still got the same error

pip install --no-binary :all: grpcio --ignore-installed
pip install --no-binary :all: grpcio-tools --ignore-installed

@ragyabraham
Copy link

@Abacn @ahmedabu98
Finally worked. See output below:

tox -e py3-yapf
.pkg: _optional_hooks> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: get_requires_for_build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: prepare_metadata_for_build_wheel> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: build_sdist> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
py3-yapf: install_package> target/.tox/py3-yapf/bin/python /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/bin/pip install --retries 10 --force-reinstall --no-deps /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/.tmp/package/16/apache-beam-2.46.0.dev0.tar.gz
py3-yapf: commands_pre[0]> python --version
Python 3.10.9
py3-yapf: commands_pre[1]> pip --version
pip 23.0 from /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/target/.tox/py3-yapf/lib/python3.10/site-packages/pip (python 3.10)
py3-yapf: commands_pre[2]> pip check
No broken requirements found.
py3-yapf: commands_pre[3]> bash /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/scripts/run_tox_cleanup.sh
py3-yapf: commands[0]> yapf --version
yapf 0.29.0
py3-yapf: commands[1]> time yapf --in-place --parallel --recursive apache_beam
       11.89 real       125.12 user         1.01 sys
py3-yapf: commands_post[0]> bash /Users/ragy/Documents/RNA/GitHub/beam/sdks/python/scripts/run_tox_cleanup.sh
.pkg: _exit> python /opt/homebrew/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
  py3-yapf: OK (17.31=setup[4.82]+cmd[0.01,0.14,0.36,0.03,0.04,11.89,0.03] seconds)
  congratulations :) (17.36 seconds)

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad tox is working now. You can run lint checks with ../../gradlew lint, which outputs something like:

18:34:26 ************* Module apache_beam.io.gcp.bigquery
18:34:26 apache_beam/io/gcp/bigquery.py:1572:0: C0301: Line too long (113/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1573:0: C0301: Line too long (116/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1574:0: C0301: Line too long (104/80) (line-too-long)
18:34:26 apache_beam/io/gcp/bigquery.py:1572:23: W1202: Use lazy % or % formatting in logging functions (logging-format-interpolation)
18:34:26 apache_beam/io/gcp/bigquery.py:411:2: C0412: Imports from package apache_beam are not grouped (ungrouped-imports)

https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/17748/console#:~:text=18%3A34%3A26,grouped%20(ungrouped%2Dimports)

Comment on lines 1572 to 1574
_LOGGER.info("Table {} was not found. Table will be removed from _KNOWN_TABLES and bundle will retry. "
"This sometimes occurs due to the table being deleted while a streaming job is running and "
"the destination was previously added to the _KNOWN_TABLES".format(destination))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint check says these lines are too long, try breaking them down to more lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it's recommended to use printf-style %s. ie. _LOGGER.info("Table %s was ...", destination)

This optimizes by postponing the formatting to when the message is outputted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this exception warrants a _LOGGER.warning level log.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahmedabu98 addressed all the above in the latest commit 88205fa
Hopefully, that should do it

@chamikaramj chamikaramj self-requested a review February 13, 2023 05:58
@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @jrmccluskey @ahmedabu98

@chamikaramj
Copy link
Contributor

Thanks for the contribution but I'm not sure I agree with the assumptions of the corresponding issue. Added a comment here: #25225 (comment)

@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @jrmccluskey @ahmedabu98

@github-actions
Copy link
Contributor

github-actions bot commented Mar 2, 2023

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@tvalentyn
Copy link
Contributor

.waiting on author

@ahmedabu98
Copy link
Contributor

Hey @tomlynchRNA, my first two change requests still need to be addressed.

Co-authored-by: Ahmed Abualsaud <65791736+ahmedabu98@users.noreply.github.com>
Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tomlynchRNA, changes LGTM. We should have a more accurate test though


@mock.patch('time.sleep')
@mock.patch('google.cloud.bigquery.Client.insert_rows_json')
def test_insert_all_table_not_found_errors(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this actually tests the changes you're making. It mocks insert_rows_json but doesn't return a 404 error.

An integration test is probably more appropriate for these changes (instead of writing to a dummy table). And we should have a check at the beginning of the test to make sure we're running with TestDataflowRunner:

if not isinstance(self.test_pipeline.runner, TestDataflowRunner):
      self.skipTest("This test will invoke a bundle retry, which needs TestDataflowRunner")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach you take in this test looks good (adding to KNOWN_TABLES before the write), let's do that with a real write to BQ. I suggest moving this test to BigQueryStreamingInsertTransformIntegrationTests and make use of the output table created in setUp:

class BigQueryStreamingInsertTransformIntegrationTests(unittest.TestCase):
BIG_QUERY_DATASET_ID = 'python_bq_streaming_inserts_'
def setUp(self):
self.test_pipeline = TestPipeline(is_integration_test=True)
self.runner_name = type(self.test_pipeline.runner).__name__
self.project = self.test_pipeline.get_option('project')
self.dataset_id = '%s%d%s' % (
self.BIG_QUERY_DATASET_ID, int(time.time()), secrets.token_hex(3))
self.bigquery_client = bigquery_tools.BigQueryWrapper()
self.bigquery_client.get_or_create_dataset(self.project, self.dataset_id)
self.output_table = "%s.output_table" % (self.dataset_id)
_LOGGER.info(
"Created dataset %s in project %s", self.dataset_id, self.project)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. you can use BigqueryFullResultMatcher to make sure the data arrived at the table correctly. Example below:

BigqueryFullResultMatcher(
project=self.project,
query="SELECT name, language FROM %s" % output_table_1,
data=[(d['name'], d['language']) for d in _ELEMENTS
if 'language' in d]),

@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @tvalentyn @chamikaramj

@github-actions
Copy link
Contributor

github-actions bot commented Apr 5, 2023

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.
R: @pabloem for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@ahmedabu98
Copy link
Contributor

waiting on author

@tvalentyn
Copy link
Contributor

Hi @tomlynchRNA, just checking - do you need any further assistance on this PR or the feedback is clear. Thanks for your contribution!

@ragyabraham
Copy link

Hi @tvalentyn @ahmedabu98 yeah can you let us know what tests need to be written so we can complete this PR.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 6, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 6, 2023
@github-actions
Copy link
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants