[SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 #42793
Conversation
Since many features are deprecated in Pandas 2.1.0, let me investigate whether there are any corresponding features in the Pandas API on Spark while we're here.
-        psdf = psdf.reset_index(level=should_drop_index, drop=True)
+        drop = not any(
+            [
+                isinstance(func_or_funcs[gkey.name], list)
+                for gkey in self._groupkeys
+                if gkey.name in func_or_funcs
+            ]
+        )
+        psdf = psdf.reset_index(level=should_drop_index, drop=drop)
Bug fixed in Pandas: pandas-dev/pandas#52849.
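For context, here is a standalone sketch of how that drop flag behaves; the aggregation dict and key names below are made up for illustration and are not from the PR's tests:

```python
# Standalone sketch of the drop computation above, with made-up inputs
# (the real code iterates self._groupkeys inside the groupby aggregation).
func_or_funcs = {"A": ["min", "max"], "B": "sum"}  # hypothetical agg spec
groupkey_names = ["A"]                             # hypothetical grouping keys

drop = not any(
    isinstance(func_or_funcs[name], list)
    for name in groupkey_names
    if name in func_or_funcs
)

# drop is False here: "A" is aggregated with a list of functions, so the
# subsequent reset_index keeps the index values as regular columns.
print(drop)
```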
-        pdf = makeMissingDataframe(0.3, 42)
+        pdf = pd.DataFrame(
+            index=[
+                "".join(
+                    np.random.choice(
+                        list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"), 10
+                    )
+                )
+                for _ in range(30)
+            ],
+            columns=list("ABCD"),
+            dtype="float64",
+        )
The testing util makeMissingDataframe was removed.
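For tests that still need NaN-sprinkled data, a rough stand-in for the removed util could look like the sketch below. This helper is an assumption for illustration only (not part of the PR), and its density semantics only approximate the old pandas util:

```python
import numpy as np
import pandas as pd

# Illustrative replacement for the removed pandas testing util; here
# "density" is assumed to mean the approximate fraction of NaN cells.
def make_missing_dataframe(density: float = 0.3, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    chars = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
    index = ["".join(rng.choice(chars, 10)) for _ in range(30)]
    pdf = pd.DataFrame(
        rng.standard_normal((30, 4)), index=index, columns=list("ABCD"), dtype="float64"
    )
    mask = rng.random(pdf.shape) < density  # True -> replace the cell with NaN
    return pdf.mask(mask)

pdf = make_missing_dataframe(0.3, 42)
print(pdf.isna().mean())  # roughly 30% missing values per column
```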
@@ -487,23 +487,23 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
     ... pass
     >>> inferred = infer_return_type(func)
     >>> inferred.dtypes
-    [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
+    [dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)]
The dtype of the categories is now added to __repr__: pandas-dev/pandas#52179.
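A quick way to see the difference locally (assumes pandas 2.1.0 or later is installed):

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=[3, 4, 5], ordered=False)

# The repr gained a categories_dtype field in pandas 2.1
# (pandas-dev/pandas#52179), which is why the doctest output changed:
print(repr(dtype))
# pandas 2.0.x: CategoricalDtype(categories=[3, 4, 5], ordered=False)
# pandas 2.1.0: CategoricalDtype(categories=[3, 4, 5], ordered=False, categories_dtype=int64)
```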
         m   2.0  NaN
    dog  kg  NaN  3.0
         m   4.0  NaN
    >>> df_multi_level_cols2.stack().sort_index()
The column ordering bug is fixed in Pandas: pandas-dev/pandas#53786.
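A rough reconstruction of the doctest fixture, assumed to mirror the pandas documentation example rather than copied from this PR:

```python
import pandas as pd

# Assumed fixture: MultiIndex columns whose second levels do not overlap,
# which is the stack() case touched by the column-ordering fix.
multicol2 = pd.MultiIndex.from_tuples([("weight", "kg"), ("height", "m")])
df_multi_level_cols2 = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=["cat", "dog"], columns=multicol2
)

# sort_index() keeps the row order deterministic for the doctest; the
# column ordering itself is what pandas-dev/pandas#53786 fixed.
print(df_multi_level_cols2.stack().sort_index())
```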
Not related to this PR itself, but what is the policy for upgrading the minimum versions of the dependencies listed here?
@zhengruifeng AFAIK, there is no separate policy for the minimum version. We may change the minimum version of a particular package if an older version no longer works properly with Spark, or if the community for that package no longer maintains a particular older version, etc.
Let's probably upgrade them since we're going ahead with the 4.0.0 major version bump.
Could you resolve the conflict, @itholic?
+1, LGTM (Pending CIs)
python/pyspark/pandas/frame.py
        0   1.000000   4.494400
        1  11.262736  20.857489
        """
        return self.applymap(func=func)
This call will show a deprecation warning from applymap? I guess we should call return self._apply_series_op(lambda psser: psser.apply(func)) here, and applymap should call map instead?
Oh, yeah, we shouldn't call applymap here. Just applied the suggestion. Thanks!
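For reference, the user-facing deprecation being avoided is easy to reproduce with plain pandas, reusing the sample values from the docstring above (assumes pandas 2.1.0 or later):

```python
import pandas as pd

pdf = pd.DataFrame([[1.0, 2.12], [3.356, 4.567]])

# pandas 2.1 deprecates DataFrame.applymap in favor of DataFrame.map, so
# pdf.applymap(lambda x: x ** 2) would emit a FutureWarning here.
print(pdf.map(lambda x: x ** 2))
#            0          1
# 0   1.000000   4.494400
# 1  11.262736  20.857489
```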
@@ -42,6 +42,8 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark.
 * In Spark 4.0, ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark, use ``show_counts`` instead.
 * In Spark 4.0, the result of ``MultiIndex.append`` does not keep the index names from pandas API on Spark.
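As a hedged usage sketch of the show_counts bullet (the column names and data below are made up):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"int_col": [1, 2, None], "str_col": ["a", None, "c"]})

# Spark 4.0: DataFrame.info no longer accepts null_counts; show_counts is
# the replacement named in the migration note above.
psdf.info(show_counts=True)
```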
Can we add a line here where we tell users to have pandas version 2.1.0 installed for Spark 4.0? The only way now to find which pandas version to install is to check the Dockerfile in dev/infra.
Good idea. Related information has been added to the top of the migration guide. Thanks!
+1, LGTM again
The failure in StreamingQueryListenerSuite is irrelevant to this PR.
Merged to master for Apache Spark 4.0.0.
Thank you, @itholic and all!
Thanks all!
What changes were proposed in this pull request?
This PR proposes to support pandas 2.1.0 for PySpark. See What's new in 2.1.0 for more detail.
Why are the changes needed?
We should follow the latest version of pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing CI should pass with Pandas 2.1.0.
Was this patch authored or co-authored using generative AI tooling?
No.