Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery: to_dataframe respects progress_bar_type when used with BQ Storage API #7697

Merged
merged 3 commits into from
Apr 23, 2019

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Apr 12, 2019

When the BigQuery Storage API was used in to_dataframe, the progress_bar_type argument was ignored. This PR fixes that by creating a concurrent queue that worker threads can send progress updates to.

A fake queue is created for the case with no progress bar to prevent filling a queue but never reading from it.

This fix depends on:

Closes #7654

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label Apr 12, 2019
@googleapis googleapis deleted a comment from googlebot Apr 12, 2019
@tswast tswast changed the title fix: KeyboardInterrupt during to_dataframe (with BQ Storage API) no longer hangs fix: to_dataframe respects progress_bar_type with BQ Storage API Apr 12, 2019
@sduskis sduskis added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Apr 15, 2019
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from 0d96dbd to 9812563 Compare April 17, 2019 17:39
@tswast
Copy link
Contributor Author

tswast commented Apr 17, 2019

Screenshots of this progress bar in action:

Terminal (IPython):

bqstorage_tqdm

Notebook:

bqstorage_tqdm_notebook

@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch 4 times, most recently from ce0a302 to 6ed43f0 Compare April 17, 2019 23:01
@tswast tswast changed the title fix: to_dataframe respects progress_bar_type with BQ Storage API BigQuery: to_dataframe respects progress_bar_type when used with BQ Storage API Apr 18, 2019
@tswast tswast added api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. labels Apr 18, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 19, 2019
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from 6ed43f0 to b82eb6d Compare April 22, 2019 18:27
@tswast tswast requested a review from shollyman April 22, 2019 18:30
@tswast tswast marked this pull request as ready for review April 22, 2019 18:30
@tswast tswast requested a review from crwilcox as a code owner April 22, 2019 18:30
@@ -1274,6 +1275,16 @@ def __repr__(self):
return "Row({}, {})".format(self._xxx_values, f2i)


class _FakeQueue(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be more closely named according to it's use?

@@ -1408,7 +1426,23 @@ def _to_dataframe_bqstorage_stream(
# the end using manually-parsed schema.
return pandas.concat(frames)[columns]

def _to_dataframe_bqstorage(self, bqstorage_client, dtypes):
def _process_progress_updates(self, progress_queue, progress_bar):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work well for large tables as the number of updates grows, since you're potentially walking it every _PROGRESS_INTERVAL seconds?

Copy link
Contributor Author

@tswast tswast Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decreased _PROGRESS_INTERVAL to account for the fact that we get way too many updates in a second, but you're right that with large tables (many streams) this becomes worse.

Originally I tried having a constant loop of updates for tqdm but there were some locking issues with writing to stderr/stdout outside of the main thread.

Right now I handle this by very likely dropping updates when the queue fills up, meaning the progress bar is grossly inaccurate for large tables. Maybe what I should do instead is have 2 queues. 🤔 One queue used by the workers and a "sum" thread to add up and send to main thread every _PROGRESS_INTERVAL via a different queue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the problem is also bounded by the max dataframe users can fit in ram. Our truly large tables aren't really going to be pulled into dataframes without excessive projection or filtering. Perhaps this is a non issue for now.

tswast added 2 commits April 22, 2019 15:24
* Add unit test for progress bar.

* Add test for full queue.
The worker queue runs in a background thread, so it's more likely to be
able to keep up with the other workers that are adding to the worker
queue.
@tswast tswast force-pushed the issue7654-bqstorage-progress-bar branch from b82eb6d to 71112b0 Compare April 23, 2019 00:30
@tswast tswast merged commit 4dc8c36 into googleapis:master Apr 23, 2019
@tswast tswast deleted the issue7654-bqstorage-progress-bar branch April 23, 2019 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. cla: yes This human has signed the Contributor License Agreement. 🚨 This issue needs some love.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BigQuery: Implement progress_bar_type support when bqstorage_client is used
5 participants