Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery: ensure that KeyboardInterrupt during to_dataframeno longer hangs. #7698

Merged
merged 2 commits into from
Apr 22, 2019

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Apr 12, 2019

I noticed in manually testing to_dataframe that it would stop
the current cell when I hit Ctrl-C, but data kept on downloading in the
background. Trying to exit the Python shell, I'd notice that it would
hang until I pressed Ctrl-C a few more times.

Rather than get the DataFrame for each stream in one big chunk, loop
through each block and exit if the function needs to quit early. This
follows the pattern at https://stackoverflow.com/a/29237343/101923

This depends on:

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 12, 2019
@tswast
Copy link
Contributor Author

tswast commented Apr 16, 2019

✅ Note to self: Need to update the minimum google-cloud-bigquery-storage dependency version after the release at #7716 is complete.

@tswast tswast force-pushed the bqstorage-exit-early branch from 0971781 to 6fc4af2 Compare April 17, 2019 16:17
@tswast tswast marked this pull request as ready for review April 17, 2019 16:58
@tswast tswast requested a review from crwilcox as a code owner April 17, 2019 16:58
@tswast tswast requested review from a team and shollyman April 17, 2019 16:58
@tswast tswast force-pushed the bqstorage-exit-early branch from fa7bb05 to 9016469 Compare April 17, 2019 22:10
@tswast tswast requested a review from plamut April 17, 2019 22:18
@tseaver tseaver changed the title fix: KeyboardInterrupt during to_dataframe (with BQ Storage API) no longer hangs BigQuery Storage: ensure that KeyboardInterrupt during to_dataframeno longer hangs. Apr 17, 2019
@tswast tswast added api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. labels Apr 17, 2019
@tswast tswast force-pushed the bqstorage-exit-early branch 3 times, most recently from 8593692 to 44186b6 Compare April 18, 2019 00:16
@tseaver tseaver changed the title BigQuery Storage: ensure that KeyboardInterrupt during to_dataframeno longer hangs. BigQuery: ensure that KeyboardInterrupt during to_dataframeno longer hangs. Apr 18, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 19, 2019
return rowstream.to_dataframe(session, dtypes=dtypes)
# Use _to_dataframe_finished to notify worker threads when to quit.
# See: https://stackoverflow.com/a/29237343/101923
self._to_dataframe_finished = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I'm trying to reason about scope here. This is a worker pool, but is it possible we're generating multiple independent dataframes that would share the same access to _to_dataframe_finished? Do we need to key the workers to specific invocations, or is the nature of the access always blocking so that this isn't an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple reasons that I don't think it's an issue:

  • Yes, to_dataframe() is a blocking call, so you wouldn't really have multiple going at once.
  • RowIterator isn't really something you'd want to use across threads or even more than once anyway. Because of how the pagination state works, once you loop through all the rows once, to_dataframe() returns an empty DataFrame, even in the case where BQ Storage API isn't used.

tswast added 2 commits April 19, 2019 10:40
…no longer hangs

I noticed in manually testing `to_dataframe` that it would stop
the current cell when I hit Ctrl-C, but data kept on downloading in the
background. Trying to exit the Python shell, I'd notice that it would
hang until I pressed Ctrl-C a few more times.

Rather than get the DataFrame for each stream in one big chunk, loop
through each block and exit if the function needs to quit early. This
follows the pattern at https://stackoverflow.com/a/29237343/101923

Update tests to ensure multiple progress interval loops.
@tswast tswast force-pushed the bqstorage-exit-early branch from 44186b6 to bf73284 Compare April 19, 2019 17:41
@tswast tswast merged commit ee804a1 into googleapis:master Apr 22, 2019
@tswast tswast deleted the bqstorage-exit-early branch April 22, 2019 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. cla: yes This human has signed the Contributor License Agreement. 🚨 This issue needs some love.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants