feat: default to DATETIME type when loading timezone-naive datetimes from Pandas #1061
if detected_type == "TIMESTAMP":
    valid_item = _first_array_valid(dataframe[field.name])
    if isinstance(valid_item, datetime) and valid_item.tzinfo is None:
        detected_type = "DATETIME"
I was thinking of doing this check for all detected TIMESTAMP values, but it turned out it's only necessary for datetimes inside an array, because that's when we need to use pyarrow to help.
For datetime values outside of arrays, we can already distinguish between naive and aware ones based on Pandas dtypes, meaning that we do not even enter augment_schema() for them.
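For illustration, the array-only check in the diff above boils down to something like the following. This is a minimal, standard-library sketch: `refine_detected_type` is a hypothetical helper name (not a function in the library), and `valid_item` stands in for whatever `_first_array_valid` returns.

```python
from datetime import datetime, timezone


def refine_detected_type(detected_type, valid_item):
    """Switch TIMESTAMP to DATETIME when the sampled array item is timezone-naive.

    Hypothetical helper for illustration only; it mirrors the condition in the diff.
    """
    if detected_type == "TIMESTAMP":
        # A naive datetime carries no tzinfo, so it should map to DATETIME.
        if isinstance(valid_item, datetime) and valid_item.tzinfo is None:
            return "DATETIME"
    return detected_type


naive = datetime(2021, 11, 16, 12, 0)
aware = datetime(2021, 11, 16, 12, 0, tzinfo=timezone.utc)

print(refine_detected_type("TIMESTAMP", naive))
print(refine_detected_type("TIMESTAMP", aware))
```

Any other detected type, or a missing sample item (`None`), passes through unchanged.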
# Valid item is None because all items in the "valid" array are invalid. Try
# to find a true valid array manually.
for array in islice(series, first_valid_index + 1, None):
I was not sure whether slicing the series results in an unnecessary copy (Pandas docs say it's context-dependent), so I played it safe and just used islice.
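A rough sketch of that fallback search, using a plain list of lists in place of a Pandas series so it runs with the standard library only. The function name `first_valid_item` and the `is not None` validity test are assumptions made for this example, not the actual implementation.

```python
from itertools import islice


def first_valid_item(arrays):
    """Return the first non-None element found in any of the arrays, or None.

    Illustrative stand-in: `arrays` is a list whose rows are lists or None.
    """
    # Find the first row that is not None.
    first_valid_index = next(
        (i for i, array in enumerate(arrays) if array is not None), None
    )
    if first_valid_index is None:
        return None

    valid_item = next(
        (item for item in arrays[first_valid_index] if item is not None), None
    )
    if valid_item is not None:
        return valid_item

    # The first non-None row held only None items, so scan the remaining rows.
    # islice iterates lazily instead of building a sliced copy of the input.
    for array in islice(arrays, first_valid_index + 1, None):
        try:
            array_iter = iter(array)
        except TypeError:
            continue  # Not iterable (e.g. None), so skip this row.
        valid_item = next((item for item in array_iter if item is not None), None)
        if valid_item is not None:
            break

    return valid_item


print(first_valid_item([None, [None, None], None, [None, 7]]))
```

With a list, `arrays[first_valid_index + 1:]` would always copy; `islice` avoids having to reason about whether a copy happens at all.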
Status checks got stuck...
Thanks! Good catch in identifying the additional logic needed for arrays of DATETIME.
deps!: BigQuery Storage and pyarrow are required dependencies (#776)
fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (#786)
feat!: destination tables are no-longer removed by `create_job` (#891)
feat!: In `to_dataframe`, use `dbdate` and `dbtime` dtypes from db-dtypes package for BigQuery DATE and TIME columns (#972)
fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (#972)
feat!: mark the package as type-checked (#1058)
feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (#1061)
feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (#967)
fix: improve type annotations for mypy validation (#1081)
feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (#1117)
docs: Add migration guide from version 2.x to 3.x (#1027)
Release-As: 3.0.0
Closes #985.
This proved to be trickier than expected, because manual introspection is needed when augmenting the schema: pyarrow attaches the UTC timezone to naive datetimes, making it problematic to distinguish these from timezone-aware datetimes.
PR checklist: