-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: ensure that KeyboardInterrupt
during to_dataframe
no longer hangs.
#7698
Conversation
✅ Note to self: Need to update the minimum |
0971781
to
6fc4af2
Compare
fa7bb05
to
9016469
Compare
KeyboardInterrupt
during to_dataframe
(with BQ Storage API) no longer hangsKeyboardInterrupt
during to_dataframe
no longer hangs.
8593692
to
44186b6
Compare
KeyboardInterrupt
during to_dataframe
no longer hangs.KeyboardInterrupt
during to_dataframe
no longer hangs.
return rowstream.to_dataframe(session, dtypes=dtypes) | ||
# Use _to_dataframe_finished to notify worker threads when to quit. | ||
# See: https://stackoverflow.com/a/29237343/101923 | ||
self._to_dataframe_finished = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly I'm trying to reason about scope here. This is a worker pool, but is it possible we're generating multiple independent dataframes that would share the same access to _to_dataframe_finished? Do we need to key the workers to specific invocations, or is the nature of the access always blocking so that this isn't an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple reasons that I don't think it's an issue:
- Yes,
to_dataframe()
is a blocking call, so you wouldn't really have multiple going at once. RowIterator
isn't really something you'd want to use across threads or even more than once anyway. Because of how the pagination state works, once you loop through all the rows once,to_dataframe()
returns an empty DataFrame, even in the case where BQ Storage API isn't used.
…no longer hangs I noticed in manually testing `to_dataframe` that it would stop the current cell when I hit Ctrl-C, but data kept on downloading in the background. Trying to exit the Python shell, I'd notice that it would hang until I pressed Ctrl-C a few more times. Rather than get the DataFrame for each stream in one big chunk, loop through each block and exit if the function needs to quit early. This follows the pattern at https://stackoverflow.com/a/29237343/101923 Update tests to ensure multiple progress interval loops.
44186b6
to
bf73284
Compare
I noticed in manually testing
to_dataframe
that it would stopthe current cell when I hit Ctrl-C, but data kept on downloading in the
background. Trying to exit the Python shell, I'd notice that it would
hang until I pressed Ctrl-C a few more times.
Rather than get the DataFrame for each stream in one big chunk, loop
through each block and exit if the function needs to quit early. This
follows the pattern at https://stackoverflow.com/a/29237343/101923
This depends on:
Add page iterator to ReadRowsStream #7680, new.pages
feature in BQ Storage client