BigQueryReadClient.create_read_session returning multiple empty streams #733
Comments
I have seen this in the streaming reads that I get in Go. I think it's one of those things where concurrent requests provide high speed but downstream complexity.
I have to run multiple stream handlers, append them to a local array/slice, then on completion sort the array.
Thanks @RobHumphris. Yes, that is in essence what I am doing here with the different streams returned. For each stream, I farm it off to a thread which slaps the data into a queue. On completion, we concatenate and sort by time value.
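The fan-out pattern described above can be sketched as follows. This is a minimal illustration using simulated in-memory streams rather than real BigQuery Storage streams; `drain_streams` and the `"time"` key are illustrative names, not part of the library.

```python
import queue
import threading

def drain_streams(streams):
    """Read every stream on its own thread, collect rows into a shared
    queue, then concatenate and sort by time value on completion."""
    results: queue.Queue = queue.Queue()

    def worker(stream):
        # In the real client each item would be a decoded row message;
        # here each row is just a dict with a "time" key.
        for row in stream:
            results.put(row)

    threads = [threading.Thread(target=worker, args=(s,)) for s in streams]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so the queue can be drained safely.
    rows = []
    while not results.empty():
        rows.append(results.get())
    return sorted(rows, key=lambda r: r["time"])

# Simulated streams: one populated, two empty (mirroring the issue).
streams = [
    [{"time": 3, "v": "c"}, {"time": 1, "v": "a"}, {"time": 2, "v": "b"}],
    [],
    [],
]
print([r["time"] for r in drain_streams(streams)])  # -> [1, 2, 3]
```

Note that the empty streams simply contribute nothing to the queue, which is why this pattern would not surface an empty stream as a warning or error.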
Ah ok, so you're also seeing empty streams in Go land? Or would this go under the radar in your approach and not be surfaced as a warning/error?
// Data is assigned to each stream such that roughly the same number of
// rows can be read from each stream. Because the server-side unit for
// assigning data is collections of rows, the API does not guarantee that
// each stream will return the same number of rows. Additionally, the
// limits are enforced based on the number of pre-filtered rows, so some
// filters can lead to lopsided assignments.

This suggests that while we expect lopsided assignments, it's not outside the realms of possibility that some streams will be empty.
Thank you @frederickmannings for raising this issue and @RobHumphris for the helpful discussion, especially the reference in Go! I do feel like if there's only one row to be returned, it makes sense to have only one stream populated. But indeed the client could take this situation into account and handle it more gracefully.
Thanks for triaging @Linchin |
It looks like the reader is trying to create an empty dataframe when the stream is empty, with the same schema as defined in the session (the stream itself doesn't have schema info). However, the stream doesn't know its parent session.
I don't see a straightforward way for the stream to access its parent session info. @tswast, is there any way we can do that? We may also just return an empty dataframe, which can be directly concatenated with other non-empty dataframes.
It's quite tricky to add the session info, or to get the schema info here. Technically, the first returned message should contain the table schema, but with this empty stream we don't get any message at all.
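The empty-dataframe idea above can be sketched with plain pandas. This assumes the caller has the session schema in hand; the `session_schema` dict and column names here are illustrative, not what the library actually produces.

```python
import pandas as pd

# Schema as it might be carried alongside the read session
# (column-name/dtype pairs here are illustrative placeholders).
session_schema = {"ts": "int64", "payload": "object"}

def empty_frame(schema):
    """Build an empty DataFrame matching the session schema, so it can be
    concatenated directly with the frames from non-empty streams."""
    return pd.DataFrame(
        {col: pd.Series(dtype=dt) for col, dt in schema.items()}
    )

populated = pd.DataFrame({"ts": [1, 2], "payload": ["a", "b"]})
frames = [populated, empty_frame(session_schema), empty_frame(session_schema)]
combined = pd.concat(frames, ignore_index=True).sort_values("ts")
print(len(combined))  # -> 2
```

Because the empty frames share the schema, `pd.concat` merges them without introducing extra columns or NaN rows.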
I will close the issue for now, but please leave a comment here, or open a new issue if you have any further suggestions. |
Thanks @Linchin - noted. I suggest, at a minimum, that the docs be updated so that the affected parameter is not marked as deprecated.
I think it's not shown as deprecated here? It appears to be deprecated elsewhere.
Sorry @Linchin, I should have been more specific. As you advise, when I provide the stream, I no longer get the AttributeError, so the workaround fixes the issue. I suggest that a note be added to this deprecation, or that it be removed entirely, as the reason for deprecating is no longer completely true.
Thanks @frederickmannings for clarifying. I just created a PR to fix this. |
Summary
The BigQueryReadClient returns multiple 'empty' streams when instantiating a read session via create_read_session. Attempting to invoke to_pandas() on the stream reader yields an AttributeError.
Firstly, is this behaviour abnormal? If not, I will just wrap the method in a try/except and plow on.
Environment details
google-cloud-bigquery-storage version: 2.24.0

Steps to reproduce
Stack trace
The relevant line:
python-bigquery-storage/google/cloud/bigquery_storage_v1/reader.py
Line 422 in fe09e3b
So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are empty.
Detail
The emergence of this problem is specific to the table that I am accessing, and the combination of filtering and the type of the requested field. The minimal case where this occurs is when querying a single BYTES type field. The approximate size of this field is 0.1 MB.

The issue persists when querying one row. I can query 1 row of just this BYTES field from the BigQuery table, and I will get some 13 empty streams and 1 populated stream.

If I try/except over the streams, I am able to successfully grab the data from the one stream.
Am I doing something wrong here, or is this normal?