[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) #20487
Conversation
@ueshin, @icexelloss and @cloud-fan, would you mind taking a look please?
Test build #86990 has finished for PR 20487 at commit
Will fix the test soon tomorrow ..
@@ -1923,6 +1923,9 @@ def toPandas(self):
        0    2  Alice
        1    5    Bob
        """
        from pyspark.sql.utils import require_minimum_pandas_version
`toPandas` already seems to fail when the result includes `TimestampType` columns with an old Pandas version:
```
>>> import datetime
>>> spark.createDataFrame([[datetime.datetime.now()]]).toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1978, in toPandas
    _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1775, in _check_series_convert_timestamps_local_tz
    return _check_series_convert_timestamps_localize(s, None, timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1750, in _check_series_convert_timestamps_localize
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 128, in require_minimum_pandas_version
    "your version was %s." % (minimum_pandas_version, pandas.__version__))
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.16.0.
```
Since we declare the supported version, I think we had better explicitly require it. Let me know if anyone thinks differently ..
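A minimal sketch of what the explicit check at the top of `toPandas()` looks like with this change (only the import and helper name come from the diff above; the rest of the body is elided):

```python
def toPandas(self):
    """Returns the contents of this DataFrame as a Pandas DataFrame (body elided)."""
    # Check the Pandas requirement up front so the user gets a clear message
    # instead of a failure deep inside the conversion path.
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()

    import pandas as pd
    # ... conversion logic ...
```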
this is already called here though https://github.com/apache/spark/pull/20487/files#diff-6fc344560230bf0ef711bb9b5573f1faL1939
am I missing something?
Ah, that one is for PyArrow whereas this one is for Pandas. I wanted to produce a proper message before `import pandas as pd` runs :-).
The case above (https://github.com/apache/spark/pull/20487/files#r165714499) is when Pandas is older than 0.19.2. When Pandas is missing entirely, it shows something like:
```
>>> spark.range(1).toPandas()
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1975, in toPandas
    import pandas as pd
ImportError: No module named pandas
```

After:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1927, in toPandas
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 125, in require_minimum_pandas_version
    "it was not found." % minimum_pandas_version)
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.
```
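For reference, a minimal sketch of the version-gate helper in `pyspark/sql/utils.py` that produces both messages above (the error strings match the tracebacks; the `LooseVersion` comparison is how I'd sketch the version check):

```python
def require_minimum_pandas_version():
    """ Raise ImportError if minimum version of Pandas is not installed. """
    minimum_pandas_version = "0.19.2"

    from distutils.version import LooseVersion
    try:
        import pandas
    except ImportError:
        # Pandas is missing entirely.
        raise ImportError("Pandas >= %s must be installed; however, "
                          "it was not found." % minimum_pandas_version)
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
        # Pandas is present but too old.
        raise ImportError("Pandas >= %s must be installed; however, "
                          "your version was %s." % (minimum_pandas_version,
                                                    pandas.__version__))
```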
We should also add `require_minimum_pandas_version()` to line 649 in session.py? https://github.com/apache/spark/blob/a0e4b166f71f9bb5f3e5af7843a03c11658892fd/python/pyspark/sql/session.py#L643-L653
Yup, let me give it a shot and clean up there too.
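A sketch of what the added check in `SparkSession.createDataFrame` ends up looking like, mirroring the diff that appears later in this thread (surrounding code elided):

```python
if has_pandas and isinstance(data, pandas.DataFrame):
    # Require a supported Pandas before converting the input DataFrame.
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()
    # ... continue with the Pandas -> Spark conversion ...
```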
Test build #86994 has finished for PR 20487 at commit
@HyukjinKwon just to clarify: it seems these PySpark tests are already skipped when the required pyarrow and pandas are not found, and this PR refactors the error messages to make that cleaner. Is that correct?
Test build #86997 has finished for PR 20487 at commit
Ah, if the PyArrow version is lower than we claim, for example 0.7.0, the tests fail. Will clarify it in the PR description.
- with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'):
+ with self.assertRaisesRegexp(
+         ImportError,
+         '(Pandas >= .* must be installed|No module named pandas)'):
If the installed Pandas is older than required, it throws `Pandas >= .* must be installed`. If Pandas is not installed, `import pandas as pd` in the test throws an exception, "No module named pandas".
- @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
- def test_to_pandas_old(self):
+ @unittest.skipIf(_have_pandas, "Required Pandas was found.")
+ def test_to_pandas_required_pandas_not_found(self):
Now this also tests the case when Pandas is missing.
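A minimal sketch of the rewritten test, assuming the `_have_pandas` flag and the `_to_pandas` helper shown elsewhere in this diff (the body here is illustrative):

```python
@unittest.skipIf(_have_pandas, "Required Pandas was found.")
def test_to_pandas_required_pandas_not_found(self):
    # Covers both failure modes: Pandas too old, or Pandas missing entirely.
    with self.assertRaisesRegexp(
            ImportError,
            '(Pandas >= .* must be installed|No module named pandas)'):
        self._to_pandas()
```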
_pyarrow_requirement_message = _exception_message(e)

_have_pandas = _pandas_requirement_message is None
_have_pyarrow = _pyarrow_requirement_message is None
Here is the logic I used: `_pyarrow_requirement_message` contains the error message for the PyArrow requirement if PyArrow is missing or its version does not match. If `_pyarrow_requirement_message` contains a message, `_have_pyarrow` becomes `False`.
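Concretely, a minimal sketch of that module-level detection in tests.py, assuming `_exception_message` from `pyspark.util` (the surrounding structure is illustrative):

```python
from pyspark.util import _exception_message

_pandas_requirement_message = None
try:
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()
except ImportError as e:
    # If Pandas is missing or too old, keep the message as the skip reason.
    _pandas_requirement_message = _exception_message(e)

_pyarrow_requirement_message = None
try:
    from pyspark.sql.utils import require_minimum_pyarrow_version
    require_minimum_pyarrow_version()
except ImportError as e:
    # Same treatment for PyArrow.
    _pyarrow_requirement_message = _exception_message(e)

# A None message means the requirement is satisfied.
_have_pandas = _pandas_requirement_message is None
_have_pyarrow = _pyarrow_requirement_message is None
```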
        import pandas
    except ImportError:
        raise ImportError("Pandas >= %s must be installed; however, "
                          "it was not found." % minimum_pandas_version)
I catch `ImportError` here just to make the error message nicer.
Test build #87016 has finished for PR 20487 at commit
LGTM
python/pyspark/sql/utils.py (Outdated)
def require_minimum_pyarrow_version():
    """ Raise ImportError if minimum version of pyarrow is not installed
    """
    minimum_pyarrow_version = "0.8.0"
Maybe add a comment in https://github.com/apache/spark/blob/master/pom.xml#L188 to point here; otherwise it's hard to remember to change both places.
Sure.
@@ -2794,7 +2792,6 @@ def count_bucketed_cols(names, table="pyspark_bucket"):

    def _to_pandas(self):
        from datetime import datetime, date
        import numpy as np
This import seems unused in this function.
@@ -185,6 +185,10 @@
    <paranamer.version>2.8</paranamer.version>
    <maven-antrun.version>1.8</maven-antrun.version>
    <commons-crypto.version>1.0.0</commons-crypto.version>
    <!--
    If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
    ./python/run-tests.py and ./python/setup.py too.
`./python/run-tests.py` is not there yet; it's part of #20473.
@@ -100,6 +100,11 @@ def _supports_symlinks():
          file=sys.stderr)
    exit(-1)

# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
Should we add a similar comment to each `*.py` file, not only `setup.py`, so they refer to one another? And we should also add one for Pandas in each `*.py` file.
Hmmmm .. I thought the proper place to upgrade the versions is `setup.py` and `pom.xml`, so if we happen to update PyArrow (`pom.xml` / `setup.py`) or Pandas (`setup.py`), we would look at either place first. To be honest, I don't quite like writing down specific paths in those comments, because if we happen to move files we'd have to update all the comments ..
I see. I agree.
Yes, true - though I think just the file name is OK; they are distinct enough to find.
Let me just keep them .. maybe I care about this too much, but `*/*/pom.xml`, `[./dev/run-tests|./python/run-tests|./python/run-tests.py]` and `[util.py|utils.py]` might be confusing ..
Test build #87057 has finished for PR 20487 at commit
Test build #87066 has finished for PR 20487 at commit
Looks like this PR doesn't skip the "old Pandas" tests but rewrites them?
Ah, yup. There are a few tests for old Pandas that ran only when the Pandas version was lower, and I rewrote them to run both when the Pandas version is lower and when Pandas is missing. Let me clarify the title and description.
LGTM too
Test build #87134 has finished for PR 20487 at commit
retest this please
Test build #87139 has finished for PR 20487 at commit
@@ -646,6 +646,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
        except Exception:
            has_pandas = False
        if has_pandas and isinstance(data, pandas.DataFrame):
            from pyspark.sql.utils import require_minimum_pandas_version
            require_minimum_pandas_version()
Just curious, do you have a list of the places where we do this version check for Pandas and PyArrow?
I don't think I know all the places exactly. For now, I can think of: createDataFrame with a Pandas DataFrame input, toPandas, and pandas_udf for the APIs, plus some places in session.py / types.py for internal methods like the `_check*` family or `*arrow*` / `*pandas*`. I was thinking of putting those into a single module (file) after 2.3.0. Will cc you and @ueshin there.
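For illustration only, a hypothetical sketch of such a consolidated module (this file and its names do not exist in the PR; they are invented here to show the idea):

```python
# Hypothetical module, e.g. pyspark/sql/requirements.py -- illustration only.
# Centralizes the minimum-version constants and checks that are currently
# spread across utils.py, setup.py and a pom.xml comment.
from distutils.version import LooseVersion

MINIMUM_PANDAS_VERSION = "0.19.2"
MINIMUM_PYARROW_VERSION = "0.8.0"


def _require(package_name, minimum_version):
    """Raise ImportError if package_name is missing or older than minimum_version."""
    try:
        package = __import__(package_name)
    except ImportError:
        raise ImportError("%s >= %s must be installed; however, it was not found."
                          % (package_name, minimum_version))
    if LooseVersion(package.__version__) < LooseVersion(minimum_version):
        raise ImportError("%s >= %s must be installed; however, your version was %s."
                          % (package_name, minimum_version, package.__version__))


def require_minimum_pandas_version():
    _require("pandas", MINIMUM_PANDAS_VERSION)


def require_minimum_pyarrow_version():
    _require("pyarrow", MINIMUM_PYARROW_VERSION)
```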
LGTM
Merged to master.
## What changes were proposed in this pull request?

This is a followup PR of #20487. When importing a module that doesn't exist, the error message differs slightly between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So, one test that checks for an import error fails in Python 3 without pandas. This PR fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20538 from ueshin/issues/SPARK-23319/fup1.
## What changes were proposed in this pull request?
This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test.
We declared the extra dependencies in spark/python/setup.py, line 204 at b8bfce5: https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
In case of PyArrow:
Currently we only check whether pyarrow is installed, without checking the version, and tests already fail to run. For example, if PyArrow 0.7.0 is installed:
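```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.
----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```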
In case of Pandas:
There are a few tests for old Pandas that ran only when the Pandas version was lower, and I rewrote them to run both when the Pandas version is lower and when Pandas is missing.
## How was this patch tested?
Manually tested by modifying the condition:
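```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```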