-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support RANGE in queries Part 2: Arrow #1868
Conversation
tests/unit/test_table.py
Outdated
@@ -3503,7 +3503,10 @@ def test_to_dataframe_no_tqdm_no_progress_bar(self): | |||
user_warnings = [ | |||
warning for warning in warned if warning.category is UserWarning | |||
] | |||
self.assertEqual(len(user_warnings), 0) | |||
# Note: number of warnings is inconsistent across python versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the assertion for number of warnings, because we get different number of them with different python versions. Same for line 3540
tests/unit/test_table.py
Outdated
@@ -3534,7 +3537,10 @@ def test_to_dataframe_no_tqdm(self): | |||
user_warnings = [ | |||
warning for warning in warned if warning.category is UserWarning | |||
] | |||
self.assertEqual(len(user_warnings), 1) | |||
# Note: number of warnings is inconsistent across python versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the assertion for number of warnings, because we get different number of them with different python versions.
Adding a csv file for system test due to RANGE is not supported in JSON with load jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together. Mostly minor comments and questions.
@@ -142,6 +142,12 @@ def bq_to_arrow_struct_data_type(field): | |||
return pyarrow.struct(arrow_fields) | |||
|
|||
|
|||
def bq_to_arrow_range_data_type(field): | |||
element_type = field.element_type.upper() | |||
arrow_element_type = _pyarrow_helpers.bq_to_arrow_scalars(element_type)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do validation here? None-check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point, I will add a None-check here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it, as well as the unit tests in test__pandas_helpers.py
.
@@ -274,6 +286,22 @@ def types_mapper(arrow_data_type): | |||
elif time_dtype is not None and pyarrow.types.is_time(arrow_data_type): | |||
return time_dtype | |||
|
|||
elif pyarrow.types.is_struct(arrow_data_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle structs more generally here, or is that logic elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! Indeed, our types mapper function doesn't seem to do any conversion for STRUCT or ARRAY. This function is used as the parameter types_mapper
to pyarrow's Table.to_pandas()
, allowing for customizable type mapping from pyarrow to pandas. I'm not entirely sure why we didn't provide struct/array mapping options, but I think it might be related to the fact that the fields of a struct can be of any type, so the conversion can become quite complicated.
google/cloud/bigquery/table.py
Outdated
# only supports upto pandas 1.3. If pandas.ArrowDtype is not | ||
# present, we raise a warning and set range_date_dtype to None. | ||
msg = ( | ||
"Unable ro find class ArrowDtype in pandas, setting " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ro/to/ here and in the other two msgs below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this!
@@ -103,6 +103,7 @@ class Model(proto.Message): | |||
|
|||
class ModelType(proto.Enum): | |||
r"""Indicates the type of the Model.""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these formatting changes are benign, but it's not clear why they're in this PR. Maybe pull out unrelated cleanups into a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I will revert these changes
df = row_iterator.to_dataframe(create_bqstorage_client=False) | ||
|
||
user_warnings = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with these tests, but are we losing some existing signal here? Or did test expectations change somehow? Its not clear how this related to the PR intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we are losing some signals here - we no longer validate the number of warnings here. With this PR we will have different length of warnings depending on the version of pandas. I deleted the warnings check, because I didn't want to have pandas version hard-coded in our tests (Should I be concerned about this?). Another alternative I can think of is to check self.assertTrue(len(user_warnings) in {length_with_older_pandas, length_with_newer_pandas})
- this way we will lose some signal but less. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all the supported versions emit at least one UserWarning we could still do an assert on the count being nonzero, but if we don't think this is a useful validation I'm fine removing it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some followup comments and a very minor nit, otherwise LGTM. Thanks!
google/cloud/bigquery/_helpers.py
Outdated
else: | ||
raise ValueError(f"Unsupported range field type: {value}") | ||
raise ValueError(f"Unsupported range field type: {field.element_type}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we change this to indicate the element type is unsupported, rather than "field type"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I'll change it to be consistent with the name of the field.
df = row_iterator.to_dataframe(create_bqstorage_client=False) | ||
|
||
user_warnings = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all the supported versions emit at least one UserWarning we could still do an assert on the count being nonzero, but if we don't think this is a useful validation I'm fine removing it as well.
Indeed, I added back the test, and it will pass when |
This PR supports getting RANGE values in queries as Arrow. This is the second part of the RANGE queries PR, and should not be merged until part 1 is merged. Part 1 #1884 supports RANGE as JSON.
Part of #1724🦕