BigQueryReadClient.create_read_session returning multiple empty streams #733

Closed
frederickmannings opened this issue Jan 15, 2024 · 14 comments · Fixed by #765
Labels
api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@frederickmannings

Summary

The BigQueryReadClient returns multiple 'empty' streams when instantiating a read session via create_read_session. Attempting to invoke to_dataframe() on the stream reader yields an AttributeError.

Firstly, is this behaviour abnormal? If not, I will just wrap the method in a try/catch and plow on.

Environment details

  • OS type and version: Linux - WSL - Ubuntu 22.04.3 LTS
  • Python version: 3.9.18
  • pip version: 23.3.1
  • package manager: poetry@1.7.1
  • google-cloud-bigquery-storage version: 2.24.0

Steps to reproduce

    from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types

    client = BigQueryReadClient()
    requested_session = types.ReadSession()
    requested_session.table = "projects/<project>/datasets/<dataset>/tables/<table>"

    requested_session.data_format = types.DataFormat.AVRO 

    requested_session.read_options.selected_fields = <some_fields>

    requested_session.read_options.row_restriction = <some_row_restriction>

    parent = "projects/<project_id>"
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
    )

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        sub_df = reader.to_dataframe() # < error raised here, for all but 1 of the streams: 'NoneType' object has no attribute '_parse_avro_schema'

        dfs.append(sub_df)
        ...

Stack trace

Exception has occurred: AttributeError
'NoneType' object has no attribute '_parse_avro_schema'
  File "reader.py", line 424, in to_dataframe
    self._stream_parser._parse_avro_schema()
  File "reader.py", line 299, in to_dataframe
    return self.rows(read_session=read_session).to_dataframe(dtypes=dtypes)
AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'

The relevant line:

self._stream_parser._parse_avro_schema()

So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are empty.

Detail

This problem is specific to the table that I am accessing, and to the combination of filtering and the type of the requested field. The minimal case where it occurs is when querying a single BYTES-type field. The approximate size of this field is 0.1 MB.

The issue persists even when querying a single row: I can query 1 row of just this BYTES field from the BigQuery table and still get some 13 empty streams and 1 populated stream.

If I wrap each stream read in a try/except, I am able to successfully grab the data from the one populated stream.

Am I doing something wrong here, or is this normal?
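
A minimal sketch of that try/except approach, reusing the client and session objects from the snippet above (the final concatenation step is illustrative, not part of the original report):

    import pandas as pd

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        try:
            # Empty streams never receive a schema, so to_dataframe() raises
            # AttributeError; skip those streams and keep the populated ones.
            dfs.append(reader.to_dataframe())
        except AttributeError:
            continue

    df = pd.concat(dfs) if dfs else pd.DataFrame()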

@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. label Jan 15, 2024
@RobHumphris

I have seen this in the streaming reads that I get in Go. I think it's one of those things where concurrent requests provide high speed but add downstream complexity.

@RobHumphris

I have to run multiple stream handlers, append their output to a local array/slice, then sort the array on completion.

@frederickmannings
Author

Thanks @RobHumphris .

Yes, that is in essence what I am doing here with the different streams returned. For each stream, I farm it off to a thread which pushes the data into a queue. On completion, we concatenate and sort by time value.
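
A rough sketch of that fan-out pattern, assuming a thread pool and a placeholder time column named event_time for the final sort (both the worker count and the column name are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    def read_stream(stream_name):
        # One worker per stream; empty streams are skipped by catching the
        # AttributeError described above and returning an empty frame.
        reader = client.read_rows(stream_name)
        try:
            return reader.to_dataframe()
        except AttributeError:
            return pd.DataFrame()

    with ThreadPoolExecutor(max_workers=8) as pool:
        frames = list(pool.map(read_stream, [s.name for s in session.streams]))

    # Concatenate and sort by the (hypothetical) time column.
    df = pd.concat(frames).sort_values("event_time")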

@frederickmannings
Author

I have seen this in the streaming reads that I get in Go. I think it's one of those things where concurrent requests provide high speed but add downstream complexity.

Ah ok, so you're also seeing empty streams in Go land? Or would this go under the radar in your approach and not be surfaced as a warning/error?

@RobHumphris

Ah ok, so you're also seeing empty streams in Go land? Or would this go under the radar in your approach and not be surfaced as a warning/error?
If there is nothing in the stream, it looks like the Go client's ReadRows(...) function handles it gracefully.
Looking in the client code:

// Data is assigned to each stream such that roughly the same number of
// rows can be read from each stream. Because the server-side unit for
// assigning data is collections of rows, the API does not guarantee that
// each stream will return the same number or rows. Additionally, the
// limits are enforced based on the number of pre-filtered rows, so some
// filters can lead to lopsided assignments.

Which suggests that while we expect lopsided assignments, it's not outside the realm of possibility that some will be empty.
I also set the maximum number of streams in the CreateReadSessionRequest: MaxStreamCount: 20.
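
For reference, the Python client's create_read_session takes an analogous max_stream_count argument, so the stream fan-out can be capped there too. A sketch based on the reproduction above; the value is illustrative:

    # Capping the number of streams in the Python client, analogous to
    # MaxStreamCount in Go. max_stream_count=1 forces a single stream,
    # at the cost of read parallelism.
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=1,
    )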

@Linchin
Contributor

Linchin commented Jan 31, 2024

Thank you @frederickmannings for raising this issue and @RobHumphris for the helpful discussion, especially the reference in Go! I do feel that if there's only one row to be returned, it makes sense to have only one stream populated. But the client could indeed take this situation into account and handle it more gracefully.

@Linchin Linchin added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jan 31, 2024
@frederickmannings
Author

Thanks for triaging @Linchin

@Linchin Linchin added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 23, 2024
@Linchin Linchin self-assigned this Feb 26, 2024
@Linchin
Contributor

Linchin commented Feb 27, 2024

It looks like the reader is trying to create an empty dataframe when the stream is empty, with the same schema as defined in the session (the stream itself doesn't have schema info). However, the stream doesn't know its parent session, so the _stream_parser attribute stays None and the _parse_avro_schema call fails. An interesting workaround is to pass the session info manually:

sub_df = reader.to_dataframe(read_session=session)

I don't see a straightforward way for the stream to access its parent session info, @tswast is there any way we can do that? We may also just return an empty dataframe, which can be directly concatenated with other non-empty dataframes.
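
Applied to the reproduction above, the loop with that workaround would look roughly like this (a sketch, not a definitive fix):

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        # Passing the parent session lets the reader fall back to the session's
        # schema and build an empty dataframe when the stream has no messages.
        sub_df = reader.to_dataframe(read_session=session)
        dfs.append(sub_df)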

@Linchin
Contributor

Linchin commented Feb 28, 2024

It's quite tricky to add the session info or get the schema info here. Technically, the first returned message should contain the table schema, but with this empty stream we don't get any message at all - the generator self.pages() here doesn't yield anything. We may file a bug with the backend team, but for now I think the workaround in #733 (comment) is much simpler than reworking the client to include the session info.

@Linchin
Contributor

Linchin commented Mar 5, 2024

I will close the issue for now, but please leave a comment here, or open a new issue if you have any further suggestions.

@Linchin Linchin closed this as completed Mar 5, 2024
@frederickmannings
Author

frederickmannings commented Mar 29, 2024

Thanks @Linchin - noted that the to_dataframe method exits gracefully if it is passed the parent session and no data is returned.

At a minimum, I suggest the docs be updated so that the read_session argument is not marked as deprecated.

@Linchin
Contributor

Linchin commented Mar 29, 2024

I think it's not shown as deprecated here? It appears to be deprecated with ReadRowsIterable, though.

@frederickmannings
Author

Sorry @Linchin, I should have been more specific. In the to_dataframe method of the ReadRowsStream class, it is listed as deprecated:
https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.reader.ReadRowsStream#google_cloud_bigquery_storage_v1_reader_ReadRowsStream_to_dataframe

But as you advise, when I provide the session, I no longer get the AttributeError, so the workaround fixes the issue. I suggest that a note be added to this deprecation, or that it be removed entirely, as the reason for deprecating no longer fully holds.

@Linchin
Contributor

Linchin commented Apr 3, 2024

Thanks @frederickmannings for clarifying. I just created a PR to fix this.
