BigQueryReadClient.create_read_session returning multiple empty streams #733

Closed
frederickmannings opened this issue Jan 15, 2024 · 14 comments · Fixed by #765
Labels
api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@frederickmannings

Summary

The BigQueryReadClient returns multiple 'empty' streams when instantiating a read session via create_read_session. Attempting to invoke to_dataframe() on the stream reader yields an AttributeError.

Firstly, is this behaviour abnormal? If not, I will just wrap the method in a try/catch and plow on.

Environment details

  • OS type and version: Linux - WSL - Ubuntu 22.04.3 LTS
  • Python version: 3.9.18
  • pip version: 23.3.1
  • package manager: poetry@1.7.1
  • google-cloud-bigquery-storage version: 2.24.0

Steps to reproduce

    from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types

    client = BigQueryReadClient()
    requested_session = types.ReadSession()
    requested_session.table = "projects/<project>/datasets/<dataset>/tables/<table>"

    requested_session.data_format = types.DataFormat.AVRO 

    requested_session.read_options.selected_fields = <some_fields>

    requested_session.read_options.row_restriction = <some_row_restriction>

    parent = "projects/<project_id>"
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
    )

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        sub_df = reader.to_dataframe() # < error raised here, for all but 1 of the streams: 'NoneType' object has no attribute '_parse_avro_schema'

        dfs.append(sub_df)
        ...

Stack trace

Exception has occurred: AttributeError
'NoneType' object has no attribute '_parse_avro_schema'
  File "reader.py", line 424, in to_dataframe
    self._stream_parser._parse_avro_schema()
  File "reader.py", line 299, in to_dataframe
    return self.rows(read_session=read_session).to_dataframe(dtypes=dtypes)
AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'

The relevant line:

self._stream_parser._parse_avro_schema()

So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are empty.

Detail

This problem is specific to the table that I am accessing, and to the combination of filtering and the type of the requested field. The minimal case where it occurs is when querying a single BYTES-type field. The approximate size of this field is 0.1 MB.

The issue persists even when querying a single row: I can query 1 row of just this BYTES field from the BigQuery table and still get some 13 empty streams and 1 populated stream.

If I wrap each stream read in a try/except, I am able to successfully grab the data from the one populated stream.

Am I doing something wrong here, or is this normal?
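
A minimal sketch of that try/except approach, reusing the client and session objects from the snippet above (the final concatenation step is illustrative, not part of the original report):

    import pandas as pd

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        try:
            # Empty streams never receive a schema, so to_dataframe() raises
            # AttributeError; skip those streams and keep the populated ones.
            dfs.append(reader.to_dataframe())
        except AttributeError:
            continue

    df = pd.concat(dfs) if dfs else pd.DataFrame()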

@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. label Jan 15, 2024
@RobHumphris

I have seen this in the streaming reads that I get in Go. I think it's one of those things where concurrent requests provide high speed but add downstream complexity.

@RobHumphris

I have to run multiple stream handlers, append their output to a local array/slice, then sort the array on completion.

@frederickmannings
Author

Thanks @RobHumphris .

Yes, that is in essence what I am doing here with the different streams returned. For each stream, I farm it off to a thread which pushes the data into a queue. On completion, we concatenate and sort by time value.
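
A rough sketch of that fan-out pattern, assuming a thread pool and a placeholder time column named event_time for the final sort (both the worker count and the column name are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    def read_stream(stream_name):
        # One worker per stream; empty streams are skipped by catching the
        # AttributeError described above and returning an empty frame.
        reader = client.read_rows(stream_name)
        try:
            return reader.to_dataframe()
        except AttributeError:
            return pd.DataFrame()

    with ThreadPoolExecutor(max_workers=8) as pool:
        frames = list(pool.map(read_stream, [s.name for s in session.streams]))

    # Concatenate and sort by the (hypothetical) time column.
    df = pd.concat(frames).sort_values("event_time")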

@frederickmannings
Author

I have seen this in the streaming reads that I get in Go. I think it's one of those things where concurrent requests provide high speed but add downstream complexity.

Ah ok, so you're also seeing empty streams in Go land? Or would this go under the radar in your approach and not be surfaced as a warning/error?

@RobHumphris

Ah ok, so you're also seeing empty streams in Go land? Or would this go under the radar in your approach and not be surfaced as a warning/error?
If there is nothing in the stream, it looks like the Go client's ReadRows(...) function handles it gracefully.
Looking in the client code:

// Data is assigned to each stream such that roughly the same number of
// rows can be read from each stream. Because the server-side unit for
// assigning data is collections of rows, the API does not guarantee that
// each stream will return the same number or rows. Additionally, the
// limits are enforced based on the number of pre-filtered rows, so some
// filters can lead to lopsided assignments.

Which suggests that while we expect lopsided assignments, it's not outside the realm of possibility that some will be empty.
I also set the maximum number of streams in the CreateReadSessionRequest: MaxStreamCount: 20.
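
For reference, the Python client's create_read_session takes an analogous max_stream_count argument, so the stream fan-out can be capped there too. A sketch based on the reproduction above; the value is illustrative:

    # Capping the number of streams in the Python client, analogous to
    # MaxStreamCount in Go. max_stream_count=1 forces a single stream,
    # at the cost of read parallelism.
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=1,
    )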

@Linchin
Contributor

Linchin commented Jan 31, 2024

Thank you @frederickmannings for raising this issue and @RobHumphris for the helpful discussion, especially the reference in Go! I do feel that if there's only one row to be returned, it makes sense to have only one stream populated. But the client could indeed take this situation into account and handle it more gracefully.

@Linchin Linchin added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jan 31, 2024
@frederickmannings
Author

Thanks for triaging @Linchin

@Linchin Linchin added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 23, 2024
@Linchin Linchin self-assigned this Feb 26, 2024
@Linchin
Contributor

Linchin commented Feb 27, 2024

It looks like the reader is trying to create an empty dataframe when the stream is empty, with the same schema as defined in the session (the stream itself doesn't have schema info). However, the stream doesn't know its parent session, so the _stream_parser attribute stays None and the _parse_avro_schema call fails. An interesting workaround is to pass the session info manually:

sub_df = reader.to_dataframe(read_session=session)

I don't see a straightforward way for the stream to access its parent session info, @tswast is there any way we can do that? We may also just return an empty dataframe, which can be directly concatenated with other non-empty dataframes.
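
Applied to the reproduction above, the loop with that workaround would look roughly like this (a sketch, not a definitive fix):

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        # Passing the parent session lets the reader fall back to the session's
        # schema and build an empty dataframe when the stream has no messages.
        sub_df = reader.to_dataframe(read_session=session)
        dfs.append(sub_df)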

@Linchin
Contributor

Linchin commented Feb 28, 2024

It's quite tricky to add the session info or get the schema info here. Technically, the first returned message should contain the table schema, but with this empty stream we don't get any message at all - the generator self.pages() here doesn't yield anything. We may file a bug with the backend team, but for now I think the workaround in #733 (comment) is much simpler than reworking the client to include the session info.

@Linchin
Contributor

Linchin commented Mar 5, 2024

I will close the issue for now, but please leave a comment here, or open a new issue if you have any further suggestions.

@Linchin Linchin closed this as completed Mar 5, 2024
@frederickmannings
Author

frederickmannings commented Mar 29, 2024

Thanks @Linchin - noted that the to_dataframe method exits gracefully if it is passed the parent session and no data is returned.

At a minimum, I suggest the docs be updated so that the read_session argument is not marked as deprecated.

@Linchin
Contributor

Linchin commented Mar 29, 2024

I think it's not shown as deprecated here? It appears to be deprecated with ReadRowsIterable, though.

@frederickmannings
Author

Sorry @Linchin, I should have been more specific. In the to_dataframe method of the ReadRowsStream class, it is listed as deprecated:
https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.reader.ReadRowsStream#google_cloud_bigquery_storage_v1_reader_ReadRowsStream_to_dataframe

But as you advise, when I provide the session, I no longer get the AttributeError, so the workaround fixes the issue. I suggest that a note be added to this deprecation, or that it be removed entirely, as the reason for deprecating no longer fully holds.

@Linchin
Contributor

Linchin commented Apr 3, 2024

Thanks @frederickmannings for clarifying. I just created a PR to fix this.
