
[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion #27358

Conversation

@BryanCutler (Member)

What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

@BryanCutler BryanCutler changed the title [SPARK-30640][PYTHON[SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion Jan 24, 2020
@BryanCutler (Member Author)

cc @HyukjinKwon @viirya please take a look, thanks!

@@ -165,22 +165,6 @@ def _check_series_localize_timestamps(s, timezone):
    return s


def _check_dataframe_localize_timestamps(pdf, timezone):

@BryanCutler (Member Author):

Better to just remove this, it was only used in the one place

    require_minimum_pandas_version()

    for column, series in pdf.iteritems():
        pdf[column] = _check_series_localize_timestamps(series, timezone)

@BryanCutler (Member Author):

The problem is that pandas stores the DataFrame data in blocks internally, and assigning a series back to the DataFrame can cause those blocks to be reallocated (i.e. copied).
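A minimal sketch of the fix pattern in plain pandas (the helper names here are hypothetical, not Spark's actual functions): convert a series only if it is a timestamp column, and assign it back to the DataFrame only when the conversion actually produced a new object, so non-timestamp columns never touch the block layout.

```python
import pandas as pd

def _localize_if_timestamp(series, timezone):
    # Hypothetical stand-in for Spark's _check_series_localize_timestamps:
    # convert tz-aware timestamps to the given timezone and drop the tz info.
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        return series.dt.tz_convert(timezone).dt.tz_localize(None)
    return series  # non-timestamp columns are returned untouched

def localize_dataframe(pdf, timezone):
    for column, series in pdf.items():
        converted = _localize_if_timestamp(series, timezone)
        if converted is not series:
            # Only assign back when a conversion happened; assigning even an
            # unchanged series can reallocate pandas' internal blocks.
            pdf[column] = converted
    return pdf
```

The identity check (`converted is not series`) is what keeps non-timestamp columns out of the assignment path entirely.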


# If the given column is a date type column, creates a series of datetime.date directly
# instead of creating datetime64[ns] as intermediate data to avoid overflow caused by
# datetime64[ns] type handling.
s = arrow_column.to_pandas(date_as_object=True)

s = _check_series_localize_timestamps(s, self._timezone)

@BryanCutler (Member Author):

I don't know if this was causing the same issue, but it's easy enough to just check the column type and only convert if necessary.
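As background on the `date_as_object` comment in the hunk above: pandas' `datetime64[ns]` dtype stores nanoseconds in a signed 64-bit integer, so it can only represent timestamps up to 2262-04-11, while keeping dates as Python `datetime.date` objects has no such limit. A small illustration:

```python
import datetime
import pandas as pd

# The nanosecond-resolution dtype tops out in the year 2262.
assert pd.Timestamp.max.year == 2262

# A far-future date would overflow datetime64[ns], but survives fine as a
# plain Python object (which is what date_as_object=True preserves).
s = pd.Series([datetime.date(9999, 12, 31)], dtype=object)
assert s.iloc[0] == datetime.date(9999, 12, 31)
```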


SparkQA commented Jan 24, 2020

Test build #117382 has finished for PR 27358 at commit 3a61dd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

This is a pretty minor change, so I'm gonna go ahead and merge

return _check_dataframe_localize_timestamps(pdf, timezone)
for field in self.schema:
    if isinstance(field.dataType, TimestampType):
        pdf[field.name] = \

Reviewer (Member):

Is it different? Doesn't this also assign the series back to the DataFrame?


@BryanCutler (Member Author):

Yeah, for the case of timestamps, making a copy is unavoidable. This just prevents non-timestamp columns from also causing a copy when assigned back to the DataFrame.


Reviewer (Member):

ok. looks good then. thanks!


@BryanCutler (Member Author):

Thanks @viirya !


@HyukjinKwon (Member) left a comment:

LGTM. sorry for late response.

@BryanCutler (Member Author)

Thanks @HyukjinKwon !

@BryanCutler BryanCutler deleted the pyspark-pandas-timestamp-copy-fix-SPARK-30640 branch January 28, 2020 18:27