
[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion #27358

Conversation

@BryanCutler (Member)

What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

@BryanCutler BryanCutler changed the title [SPARK-30640][PYTHON[SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion Jan 24, 2020
@BryanCutler (Member Author)

cc @HyukjinKwon @viirya please take a look, thanks!

@@ -165,22 +165,6 @@ def _check_series_localize_timestamps(s, timezone):
    return s


def _check_dataframe_localize_timestamps(pdf, timezone):

@BryanCutler (Member Author):

Better to just remove this, it was only used in the one place

    require_minimum_pandas_version()

    for column, series in pdf.iteritems():
        pdf[column] = _check_series_localize_timestamps(series, timezone)

@BryanCutler (Member Author):

The problem is that pandas stores the DataFrame data in blocks internally, and assigning a series back to the DataFrame can cause those blocks to be reallocated (i.e. copied).
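A minimal sketch of the fix pattern in plain pandas (the helper names here are hypothetical, not Spark's actual functions): convert a series only if it is a timestamp column, and assign it back to the DataFrame only when the conversion actually produced a new object, so non-timestamp columns never touch the block layout.

```python
import pandas as pd

def _localize_if_timestamp(series, timezone):
    # Hypothetical stand-in for Spark's _check_series_localize_timestamps:
    # convert tz-aware timestamps to the given timezone and drop the tz info.
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        return series.dt.tz_convert(timezone).dt.tz_localize(None)
    return series  # non-timestamp columns are returned untouched

def localize_dataframe(pdf, timezone):
    for column, series in pdf.items():
        converted = _localize_if_timestamp(series, timezone)
        if converted is not series:
            # Only assign back when a conversion happened; assigning even an
            # unchanged series can reallocate pandas' internal blocks.
            pdf[column] = converted
    return pdf
```

The identity check (`converted is not series`) is what keeps non-timestamp columns out of the assignment path entirely.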


# If the given column is a date type column, creates a series of datetime.date directly
# instead of creating datetime64[ns] as intermediate data to avoid overflow caused by
# datetime64[ns] type handling.
s = arrow_column.to_pandas(date_as_object=True)

s = _check_series_localize_timestamps(s, self._timezone)

@BryanCutler (Member Author):

I don't know if this was causing the same issue, but it's easy enough to just check the column type and only convert if necessary.
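As background on the `date_as_object` comment in the hunk above: pandas' `datetime64[ns]` dtype stores nanoseconds in a signed 64-bit integer, so it can only represent timestamps up to 2262-04-11, while keeping dates as Python `datetime.date` objects has no such limit. A small illustration:

```python
import datetime
import pandas as pd

# The nanosecond-resolution dtype tops out in the year 2262.
assert pd.Timestamp.max.year == 2262

# A far-future date would overflow datetime64[ns], but survives fine as a
# plain Python object (which is what date_as_object=True preserves).
s = pd.Series([datetime.date(9999, 12, 31)], dtype=object)
assert s.iloc[0] == datetime.date(9999, 12, 31)
```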


SparkQA commented Jan 24, 2020

Test build #117382 has finished for PR 27358 at commit 3a61dd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

This is a pretty minor change, so I'm gonna go ahead and merge

return _check_dataframe_localize_timestamps(pdf, timezone)
for field in self.schema:
    if isinstance(field.dataType, TimestampType):
        pdf[field.name] = \

Reviewer (Member):

Is it different? Doesn't this also assign the series back to the DataFrame?


@BryanCutler (Member Author):

Yeah, for the case of timestamps, making a copy is unavoidable. This just prevents non-timestamp columns from also causing a copy when assigned back to the DataFrame.


Reviewer (Member):

ok. looks good then. thanks!


@BryanCutler (Member Author):

Thanks @viirya !


@HyukjinKwon (Member) left a comment:

LGTM. sorry for late response.

@BryanCutler (Member Author)

Thanks @HyukjinKwon !

@BryanCutler BryanCutler deleted the pyspark-pandas-timestamp-copy-fix-SPARK-30640 branch January 28, 2020 18:27