[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion #27358
Conversation
…that are timestamp types
cc @HyukjinKwon @viirya please take a look, thanks!
@@ -165,22 +165,6 @@ def _check_series_localize_timestamps(s, timezone):
    return s


def _check_dataframe_localize_timestamps(pdf, timezone):
Better to just remove this; it was only used in the one place.
    require_minimum_pandas_version()

    for column, series in pdf.iteritems():
        pdf[column] = _check_series_localize_timestamps(series, timezone)
The problem is that pandas stores the DataFrame data in blocks internally, and assigning a series back to the DataFrame causes the blocks to be reallocated.
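To make the copy concrete, here is a minimal standalone sketch (plain pandas, not the Spark code; the column names are made up) of how reassigning every Series back into a DataFrame can leave a column backed by newly allocated memory:

import numpy as np
import pandas as pd

# Hypothetical DataFrame with no timestamp columns at all.
pdf = pd.DataFrame({"a": np.arange(3), "b": np.arange(3.0)})
before = pdf["a"].values  # view into the DataFrame's internal block

# The old conversion path reassigned every column, even unchanged ones.
for column, series in pdf.items():
    pdf[column] = series

after = pdf["a"].values
# Depending on the pandas version this can print False, because the
# column-by-column assignment forced the block holding "a" to be reallocated.
print(np.shares_memory(before, after))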
        # If the given column is a date type column, creates a series of datetime.date directly
        # instead of creating datetime64[ns] as intermediate data to avoid overflow caused by
        # datetime64[ns] type handling.
        s = arrow_column.to_pandas(date_as_object=True)

        s = _check_series_localize_timestamps(s, self._timezone)
I don't know if this was causing the same issue, but it's easy enough to just check the column type and only convert if necessary.
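As a rough sketch of that idea (the wrapper function is illustrative, not the exact PR code; _check_series_localize_timestamps is the existing helper shown above), the per-column conversion could guard the localization call with a pyarrow type check:

import pyarrow as pa

def arrow_column_to_series(arrow_column, timezone):
    # Dates become datetime.date objects directly, avoiding datetime64[ns]
    # overflow, as in the existing code above.
    s = arrow_column.to_pandas(date_as_object=True)
    # Only timestamp columns need timezone localization; skipping the helper
    # for every other type avoids touching (and possibly copying) its data.
    if pa.types.is_timestamp(arrow_column.type):
        s = _check_series_localize_timestamps(s, timezone)
    return s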
Test build #117382 has finished for PR 27358 at commit
This is a pretty minor change, so I'm gonna go ahead and merge
        return _check_dataframe_localize_timestamps(pdf, timezone)
        for field in self.schema:
            if isinstance(field.dataType, TimestampType):
                pdf[field.name] = \
Is it different? Doesn't this also assign the series back to the DataFrame?
Yeah, for the case of timestamps, making a copy is unavoidable. This is just to prevent non-timestamp columns from also causing a copy when assigned back to the DataFrame.
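For reference, a hedged sketch of that DataFrame-level loop (the wrapper name is hypothetical; in the PR the loop lives in the serializer and uses self.schema and self._timezone, and _check_series_localize_timestamps is the existing helper):

from pyspark.sql.types import TimestampType

def _localize_timestamp_columns(pdf, schema, timezone):
    # Only timestamp columns are localized and written back; every other
    # column in pdf is left untouched, so its data is never copied here.
    for field in schema:
        if isinstance(field.dataType, TimestampType):
            pdf[field.name] = \
                _check_series_localize_timestamps(pdf[field.name], timezone)
    return pdf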
ok. looks good then. thanks!
Thanks @viirya !
LGTM. Sorry for the late response.
Thanks @HyukjinKwon !
What changes were proposed in this pull request?
Prevent unnecessary copies of data during conversion from Arrow to Pandas.
Why are the changes needed?
During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for the local timezone. If the data contains no timestamp types, unnecessary copies of the data can still be made. This is most prevalent when checking columns of a pandas DataFrame, where each series is assigned back to the DataFrame regardless of whether it contains timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.
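For context, a minimal usage sketch of the Arrow-backed toPandas() path this change optimizes (the configuration key is the Spark 3.x name; the session setup is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# No timestamp columns here, so with this change the Arrow-to-pandas
# conversion should not reassign (and copy) any column of the result.
df = spark.range(1000).selectExpr("id", "cast(id as double) as value")
pdf = df.toPandas()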
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests