
[SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 #42793

Closed
wants to merge 33 commits into from

Conversation


@itholic itholic commented Sep 4, 2023

What changes were proposed in this pull request?

This PR proposes to support pandas 2.1.0 for PySpark. See What's new in 2.1.0 for more detail.

Why are the changes needed?

We should follow the latest version of pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing CI should pass with Pandas 2.1.0.

Was this patch authored or co-authored using generative AI tooling?

No.

@itholic itholic changed the title [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 Sep 4, 2023

itholic commented Sep 4, 2023

Since many features are deprecated in Pandas 2.1.0, let me investigate whether there are any corresponding features in the Pandas API on Spark while we're here.

@github-actions github-actions bot added the SQL label Sep 5, 2023
Comment on lines -314 to +321
psdf = psdf.reset_index(level=should_drop_index, drop=True)
drop = not any(
[
isinstance(func_or_funcs[gkey.name], list)
for gkey in self._groupkeys
if gkey.name in func_or_funcs
]
)
psdf = psdf.reset_index(level=should_drop_index, drop=drop)

Bug fixed in Pandas: pandas-dev/pandas#52849.
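To see the intent of the patched line, here is a minimal sketch of the new drop decision in isolation (the function name and signature are assumptions for illustration, not the actual PySpark helper):

```python
def should_drop_groupkey_index(func_or_funcs, groupkey_names):
    """Drop the reset index unless some group key is aggregated with a
    list of functions (a list aggregation yields MultiIndex result
    columns, where keeping the index matches pandas' fixed behavior)."""
    return not any(
        isinstance(func_or_funcs[name], list)
        for name in groupkey_names
        if name in func_or_funcs
    )

# Scalar aggregation on the group key: safe to drop the index.
print(should_drop_groupkey_index({"A": "max", "B": ["min", "max"]}, ["A"]))  # True
# List aggregation on the group key: keep the index.
print(should_drop_groupkey_index({"A": ["min", "max"]}, ["A"]))  # False
```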

Comment on lines -276 to +282
pdf = makeMissingDataframe(0.3, 42)
pdf = pd.DataFrame(
index=[
"".join(
np.random.choice(
list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"), 10
)
)
for _ in range(30)
],
columns=list("ABCD"),
dtype="float64",
)

The testing util makeMissingDataframe is removed.
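A runnable sketch of the replacement pattern, assuming a test also wants to emulate the NaN injection that the removed `makeMissingDataframe(0.3, 42)` helper performed (the mask step below is an assumption about how a test might do that, not part of the diff):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
chars = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

# Random 10-character string index, float64 columns A..D, as in the diff.
pdf = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=["".join(rng.choice(chars, 10)) for _ in range(30)],
    columns=list("ABCD"),
)

# Emulate the old helper's ratio-based missing-data injection (~30% NaN).
mask = rng.random(pdf.shape) < 0.3
pdf = pdf.mask(mask)
print(pdf.isna().to_numpy().mean())
```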

@@ -487,23 +487,23 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
... pass
>>> inferred = infer_return_type(func)
>>> inferred.dtypes
[dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
[dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)]

Added dtype of categories is added to __repr__: pandas-dev/pandas#52179.
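The repr change is easy to reproduce directly (the exact trailing `categories_dtype=int64` part appears only on pandas >= 2.1):

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=[3, 4, 5], ordered=False)
# On pandas >= 2.1 this prints:
# CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)
print(repr(dtype))
```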

m 2.0 NaN
dog kg NaN 3.0
m 4.0 NaN
>>> df_multi_level_cols2.stack().sort_index()
@itholic commented Sep 8, 2023

Column ordering bug is fixed in Pandas: pandas-dev/pandas#53786.
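A hedged illustration with toy data (not the doctest itself) of why appending `.sort_index()` keeps the doctest stable across pandas versions:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("weight", "kg"), ("height", "m")])
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=["cat", "dog"], columns=cols
)

# pandas-dev/pandas#53786 changed the row ordering produced by stack();
# sorting afterwards yields the same output on both old and new pandas.
print(df.stack().sort_index())
```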


itholic commented Sep 13, 2023

Many tests are failing due to the PyArrow upgrade in CI.

#42897 is fixing this issue, so let me rebase this PR after that fix is merged.

Manually cherry-pick #42897 to fix the CI failure.

@github-actions github-actions bot added the INFRA label Sep 13, 2023
@github-actions github-actions bot removed the INFRA label Sep 13, 2023
@itholic itholic changed the title [WIP][SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 Sep 14, 2023
@itholic itholic marked this pull request as ready for review September 14, 2023 00:04
@zhengruifeng (Contributor) commented:

not related to this PR itself, what is the policy to upgrade the minimum version of dependencies listed here ?


itholic commented Sep 15, 2023

@zhengruifeng AFAIK, there is no separate policy for minimum versions. We may change the minimum version of a particular package when an older version no longer works properly with Spark, or when the community for that package no longer maintains a particular older version, etc.

@HyukjinKwon (Member) commented:

Let's probably upgrade them, since we're going ahead with the 4.0.0 major version bump.

@dongjoon-hyun (Member) commented:

Could you resolve the conflict, @itholic ?

@dongjoon-hyun (Member) left a review:

+1, LGTM (Pending CIs)

0 1.000000 4.494400
1 11.262736 20.857489
"""
return self.applymap(func=func)
A reviewer (Member) commented:

This call will show a deprecation warning from applymap?

I guess we should call return self._apply_series_op(lambda psser: psser.apply(func)) here and applymap should call map instead?

@itholic replied:

Oh, yeah we shouldn't call applymap here.

Just applied the suggestion. Thanks!
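For reference, a minimal sketch of the rename discussed above (`DataFrame.map` is the pandas >= 2.1 name; the `hasattr` guard is an illustrative way to stay compatible with older pandas, not code from this PR):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# pandas 2.1 renamed DataFrame.applymap to DataFrame.map;
# applymap still works there but emits a deprecation warning.
elementwise = df.map if hasattr(df, "map") else df.applymap
print(elementwise(lambda x: x * 2))
```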

@@ -42,6 +42,8 @@ Upgrading from PySpark 3.5 to 4.0
* In Spark 4.0, ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
* In Spark 4.0, ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark, use ``show_counts`` instead.
* In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.
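The `show_counts` replacement mirrors plain pandas; a minimal sketch of the migration (plain pandas shown here for brevity, the pandas-on-Spark call is analogous):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})
buf = io.StringIO()

# null_counts was removed; show_counts is the replacement parameter.
df.info(buf=buf, show_counts=True)
print(buf.getvalue())
```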
A reviewer (Contributor) commented:

Can we add a line here telling users to install pandas 2.1.0 for Spark 4.0?
The only way now to find which pandas version to install is to check the Dockerfile in dev/infra:

https://github.com/jupyter/docker-stacks/blob/52a999a554fe42951e017f7be132d808695a1261/images/pyspark-notebook/Dockerfile#L69

@itholic replied:

Good idea. Related information has been added to the top of the migration guide. Thanks!


itholic commented Sep 18, 2023

@dongjoon-hyun (Member) left a review:

+1, LGTM again

The StreamingQueryListenerSuite failure is unrelated to this PR.

Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun (Member) commented:

Thank you, @itholic and all!


itholic commented Sep 19, 2023

Thanks all!

@itholic itholic deleted the pandas_2.1.0 branch November 20, 2023 01:36