[Feature Request]: Mechanism to re-create deleted table if it is already in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition #25225
Comments
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225 * Add check during `BigQueryWriteFn._flush_batch` such that if an insert fails with HttpError 404 and reason 'notFound', we remove the table_reference from _KNOWN_TABLES so that on subsequent calls to `BigQueryWriteFn._create_table_if_needed` the table may be recreated (depending on create_disposition)
…ady in _KNOWN_TABLES, according to CREATE_IF_NEEDED create disposition apache#25225 * Add check during bigquery insert such that if an insert fails with code 404, we remove the table_reference from _KNOWN_TABLES so that subsequent calls to `BigQueryWriteFn._create_table_if_needed` will recreate the table if needed (subject to create_disposition)
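For illustration, here is a minimal, self-contained sketch of the behaviour those commits describe. The names `flush_batch`, `insert_rows`, and `TableNotFound` are illustrative stand-ins, not the actual Beam internals: when a streaming insert fails because the table no longer exists, the table reference is evicted from the known-tables cache so the next pass through the create-if-needed step can recreate it.

```python
# Illustrative sketch only; not the apache_beam source.
_KNOWN_TABLES = set()  # stand-in for Beam's cache of already-created tables


class TableNotFound(Exception):
    """Stand-in for an HTTP 404 / 'notFound' error from the BigQuery API."""


def flush_batch(table_reference, rows, insert_rows):
    """Insert rows; on a 404, forget the table so it can be recreated later."""
    try:
        insert_rows(table_reference, rows)
    except TableNotFound:
        # The table was deleted after we cached it. Evict it so the
        # CREATE_IF_NEEDED path runs again on the next bundle / retry.
        _KNOWN_TABLES.discard(table_reference)
        raise
```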
Hey @tomlynchRNA, can you provide a stack trace of the error you're seeing here?
stack trace, quite big (collapsed details)
I think in general, for Beam source/sink I/O, we assume read/write data store resources are not deleted by third parties while the pipeline is running. Trying to add this as a feature to Beam I/O in general will probably need a lot of re-work (even though we might be able to fix this instance). Also, I think CREATE_IF_NEEDED (as it's defined for the Beam BigQuery sink right now) means that tables will be created once per pipeline, not that tables will be re-created if they are deleted at any stage of the pipeline.
That said, I'm OK with getting this fix in if we are clear that it does not modify the guarantees offered by CREATE_IF_NEEDED (i.e. the pipeline may still fail or get stuck if output tables get deleted by third parties during execution).
What would you like to happen?
Hello,
I am running an Apache Beam pipeline and streaming inserts into Google BigQuery with the Python SDK.
I am facing an issue where, because Beam only creates tables once before storing their names in a list called `_KNOWN_TABLES`, if a table is deleted while the pipeline is running (after Beam has already created it), further inserts will error out with a 404 and Beam will not attempt to re-create the table. You can see here where the `_create_table_if_needed` method returns early if the table is already in `_KNOWN_TABLES`: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1463
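To make the early-return behaviour concrete, here is a toy version of that pattern (simplified names, not the Beam source), assuming the known-tables cache is a plain set and `create_table` issues the actual BigQuery call: once a table reference has been recorded, the create step is skipped for the rest of the pipeline, so a later deletion is never noticed here.

```python
# Illustrative toy of the caching behaviour described above; not the apache_beam source.
_KNOWN_TABLES = set()


def create_table_if_needed(table_reference, create_table):
    if table_reference in _KNOWN_TABLES:
        # Early return: the table was created (or observed) once already,
        # so a deletion after this point never triggers re-creation.
        return
    create_table(table_reference)
    _KNOWN_TABLES.add(table_reference)
```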
I have the pipeline's create disposition set to the default, CREATE_IF_NEEDED, so I expect that if the table does not exist (and is needed for streaming inserts), it will be created.
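For context, the sink is configured roughly like the sketch below (the project, dataset, table, schema, and rows are placeholders I made up, not from the original report; CREATE_IF_NEEDED is already the default and is only spelled out here for clarity):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"id": 1, "name": "example"}])  # placeholder rows
        | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.my_table",   # placeholder table spec
            schema="id:INTEGER,name:STRING",          # placeholder schema
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```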
I propose that a mechanism be implemented to allow this behaviour, and I would be willing to make the changes and open a pull request.
Looking forward to your thoughts,
Tom
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components