Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) #20487

Closed
wants to merge 3 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Feb 2, 2018

What changes were proposed in this pull request?

This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test.

We declared the extra dependencies:

'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']

In case of PyArrow:

Currently we only check if pyarrow is installed or not without checking the version. It already fails to run tests. For example, if PyArrow 0.7.0 is installed:

======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)

In case of Pandas:

There are few tests for old Pandas which were tested only when Pandas version was lower, and I rewrote them to be tested when both Pandas version is lower and missing.

How was this patch tested?

Manually tested by modifying the condition:

test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'

@HyukjinKwon
Copy link
Member Author

@ueshin, @icexelloss and @cloud-fan, would you mind taking a look please?

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86990 has finished for PR 20487 at commit 08b42f8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Will fix the test soon tomorrow ..

@@ -1923,6 +1923,9 @@ def toPandas(self):
0 2 Alice
1 5 Bob
"""
from pyspark.sql.utils import require_minimum_pandas_version
Copy link
Member Author

@HyukjinKwon HyukjinKwon Feb 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toPandas seems already failed when it includes types TimestampType with old Pandas version:

>>> import datetime
>>> spark.createDataFrame([[datetime.datetime.now()]]).toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1978, in toPandas
    _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1775, in _check_series_convert_timestamps_local_tz
    return _check_series_convert_timestamps_localize(s, None, timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1750, in _check_series_convert_timestamps_localize
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 128, in require_minimum_pandas_version
    "your version was %s." % (minimum_pandas_version, pandas.__version__))
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.16.0.

Since we set the supported version, I think we should better explicitly require the version. Let me know if anyone thinks differently ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's pyarrow vs this one is pandas. Wanted to produce a proper message before import pandas as pd before :-).

Above case (https://github.com/apache/spark/pull/20487/files#r165714499) is when Pandas is lower than 0.19.2. When Pandas is missing, it shows sth like:

>>> spark.range(1).toPandas()

before:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1975, in toPandas
    import pandas as pd
ImportError: No module named pandas

after:

  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1927, in toPandas
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 125, in require_minimum_pandas_version
    "it was not found." % minimum_pandas_version)
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, let me give a shot to clean up there too.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86994 has finished for PR 20487 at commit 1606070.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss
Copy link
Contributor

@HyukjinKwon just to clarify, seems like these PySpark tests are already skipped when required pyarrow and pandas are not found, this PR refactors the error message to make that cleaner, is that correct?

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86997 has finished for PR 20487 at commit 67efc40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Feb 3, 2018

Ah, If PyArrow version is lower then we claim, for example, 0.7.0, seems tests go failed:

======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)

Will clarify it in PR description.

with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'):
with self.assertRaisesRegexp(
ImportError,
'(Pandas >= .* must be installed|No module named pandas)'):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Pandas is lower then we have, it throws Pandas >= .* must be installed. It Pandas is not installed import pandas as pd in the test throws an exception, "No module named pandas".

@unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
def test_to_pandas_old(self):
@unittest.skipIf(_have_pandas, "Required Pandas was found.")
def test_to_pandas_required_pandas_not_found(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, this also test when Pandas is missing too.

_pyarrow_requirement_message = _exception_message(e)

_have_pandas = _pandas_requirement_message is None
_have_pyarrow = _pyarrow_requirement_message is None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the logic I used:

_pyarrow_requirement_message contains error message for PyArrow requirement if missing or version is not matched.

if _pyarrow_requirement_message contains the message, _have_pyarrow becomes False.

import pandas
except ImportError:
raise ImportError("Pandas >= %s must be installed; however, "
"it was not found." % minimum_pandas_version)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I catch ImportError here just to make the error message nicer.

@SparkQA
Copy link

SparkQA commented Feb 3, 2018

Test build #87016 has finished for PR 20487 at commit 6403198.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM



def require_minimum_pyarrow_version():
""" Raise ImportError if minimum version of pyarrow is not installed
"""
minimum_pyarrow_version = "0.8.0"
Copy link
Member

@felixcheung felixcheung Feb 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment in https://github.com/apache/spark/blob/master/pom.xml#L188
to point here. otherwise it's hard to remember to change both places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -1923,6 +1923,9 @@ def toPandas(self):
0 2 Alice
1 5 Bob
"""
from pyspark.sql.utils import require_minimum_pandas_version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -2794,7 +2792,6 @@ def count_bucketed_cols(names, table="pyspark_bucket"):

def _to_pandas(self):
from datetime import datetime, date
import numpy as np
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import seems not used in this function.

@@ -185,6 +185,10 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<!--
If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
./python/run-tests.py and ./python/setup.py too.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

./python/run-tests.py is not there yet. It's a part of #20473.

@@ -100,6 +100,11 @@ def _supports_symlinks():
file=sys.stderr)
exit(-1)

# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -185,6 +185,10 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<!--
If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
./python/run-tests.py and ./python/setup.py too.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add the similar comment to each *.py file, not only setup.py, to refer one another? And also we should add for Pandas in each *.py file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm .. I thought the proper place to upgrade the versions should be in setup.py and pom.xml so if we happen to update PyArrow (pom.xml / setup.py) or Pandas (setup.py), I thought we are going to take a look for either place first.

To be honest, I actually don't quite like to write down specific paths in those comments because if we happen to move, we should update all the comments ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I agree with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, true - though I think just the file name is ok - they are distinct enough to find

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just keep them .. maybe I am too much caring about this but */*/pom.xml, [./dev/run-tests|./python/run-tests|./python/run-tests.py] and [util.py|utils.py] might be confusing ..

@SparkQA
Copy link

SparkQA commented Feb 5, 2018

Test build #87057 has finished for PR 20487 at commit a0e4b16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 5, 2018

Test build #87066 has finished for PR 20487 at commit 873b4b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

looks like this PR doesn't skip the "old Pandas" tests, but rewrite them?

@HyukjinKwon
Copy link
Member Author

Ah, yup. There are few tests for old Pandas which were tested only when Pandas version was lower, and I rewrote them to be tested when both Pandas version is lower and missing. Let me clarify the title and description.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

@HyukjinKwon HyukjinKwon changed the title [SPARK-23319][TESTS] Explicitly skips PySpark tests for old Pandas and PyArrow [SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) Feb 7, 2018
@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87134 has finished for PR 20487 at commit b7a940d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 7, 2018

Test build #87139 has finished for PR 20487 at commit b7a940d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -646,6 +646,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
except Exception:
has_pandas = False
if has_pandas and isinstance(data, pandas.DataFrame):
from pyspark.sql.utils import require_minimum_pandas_version
require_minimum_pandas_version()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for curious, do you have a list of the places that we do this version check for pandas and pyarrow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I exactly know all the places exactly. For now, I can think of: createDataFrame with Pandas DataFrame input, toPandas and pandas_udf for APIs, and some places in session.py / types.py for internal methods like _check* family or *arrow* or *pandas*.

I was thinking of working on putting those into a single module (file) after 2.3.0. Will cc you and @ueshin there.

@cloud-fan
Copy link
Contributor

LGTM

@HyukjinKwon
Copy link
Member Author

Merged to master.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Feb 7, 2018
…n PySpark tests (to skip or test)

This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test.

We declared the extra dependencies:

https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204

In case of PyArrow:

Currently we only check if pyarrow is installed or not without checking the version. It already fails to run tests. For example, if PyArrow 0.7.0 is installed:

```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```

In case of Pandas:

There are few tests for old Pandas which were tested only when Pandas version was lower, and I rewrote them to be tested when both Pandas version is lower and missing.

Manually tested by modifying the condition:

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20487 from HyukjinKwon/pyarrow-pandas-skip.

(cherry picked from commit 71cfba0)

Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
@asfgit asfgit closed this in 71cfba0 Feb 7, 2018
HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Feb 8, 2018
## What changes were proposed in this pull request?

This is a followup pr of apache#20487.

When importing module but it doesn't exists, the error message is slightly different between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So, one test to check an import error fails in Python 3 without pandas.

This pr fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#20538 from ueshin/issues/SPARK-23319/fup1.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
## What changes were proposed in this pull request?

This is a followup pr of #20487.

When importing module but it doesn't exists, the error message is slightly different between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So, one test to check an import error fails in Python 3 without pandas.

This pr fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20538 from ueshin/issues/SPARK-23319/fup1.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
…w versions in PySpark tests (to skip or test)

This PR backports #20487 to branch-2.3.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20534 from HyukjinKwon/PR_TOOL_PICK_PR_20487_BRANCH-2.3.
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
…n PySpark tests (to skip or test)

## What changes were proposed in this pull request?

This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test.

We declared the extra dependencies:

https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204

In case of PyArrow:

Currently we only check if pyarrow is installed or not without checking the version. It already fails to run tests. For example, if PyArrow 0.7.0 is installed:

```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```

In case of Pandas:

There are few tests for old Pandas which were tested only when Pandas version was lower, and I rewrote them to be tested when both Pandas version is lower and missing.

## How was this patch tested?

Manually tested by modifying the condition:

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20487 from HyukjinKwon/pyarrow-pandas-skip.
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
## What changes were proposed in this pull request?

This is a followup pr of apache#20487.

When importing module but it doesn't exists, the error message is slightly different between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So, one test to check an import error fails in Python 3 without pandas.

This pr fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#20538 from ueshin/issues/SPARK-23319/fup1.
@HyukjinKwon HyukjinKwon deleted the pyarrow-pandas-skip branch October 16, 2018 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants