[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) #20487
Conversation
@ueshin, @icexelloss and @cloud-fan, would you mind taking a look please?
Test build #86990 has finished for PR 20487 at commit
Will fix the test soon tomorrow ..
@@ -1923,6 +1923,9 @@ def toPandas(self):
        0    2  Alice
        1    5    Bob
        """
        from pyspark.sql.utils import require_minimum_pandas_version
`toPandas` already seems to fail when the result includes `TimestampType` columns with an old Pandas version:
```
>>> import datetime
>>> spark.createDataFrame([[datetime.datetime.now()]]).toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1978, in toPandas
    _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1775, in _check_series_convert_timestamps_local_tz
    return _check_series_convert_timestamps_localize(s, None, timezone)
  File "/.../spark/python/pyspark/sql/types.py", line 1750, in _check_series_convert_timestamps_localize
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 128, in require_minimum_pandas_version
    "your version was %s." % (minimum_pandas_version, pandas.__version__))
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.16.0.
```
Since we declare the supported version, I think we had better explicitly require it. Let me know if anyone thinks differently ..
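A minimal sketch of what the explicit check at the top of `toPandas()` looks like with this change (only the import and helper name come from the diff above; the rest of the body is elided):

```python
def toPandas(self):
    """Returns the contents of this DataFrame as a Pandas DataFrame (body elided)."""
    # Check the Pandas requirement up front so the user gets a clear message
    # instead of a failure deep inside the conversion path.
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()

    import pandas as pd
    # ... conversion logic ...
```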
this is already called here though https://github.com/apache/spark/pull/20487/files#diff-6fc344560230bf0ef711bb9b5573f1faL1939
am I missing something?
Ah, that one is for PyArrow whereas this one is for Pandas. I wanted to produce a proper message before `import pandas as pd` runs :-).
The case above (https://github.com/apache/spark/pull/20487/files#r165714499) is when Pandas is older than 0.19.2. When Pandas is missing entirely, it shows something like:
```
>>> spark.range(1).toPandas()
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1975, in toPandas
    import pandas as pd
ImportError: No module named pandas
```

After:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 1927, in toPandas
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 125, in require_minimum_pandas_version
    "it was not found." % minimum_pandas_version)
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.
```
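For reference, a minimal sketch of the version-gate helper in `pyspark/sql/utils.py` that produces both messages above (the error strings match the tracebacks; the `LooseVersion` comparison is how I'd sketch the version check):

```python
def require_minimum_pandas_version():
    """ Raise ImportError if minimum version of Pandas is not installed. """
    minimum_pandas_version = "0.19.2"

    from distutils.version import LooseVersion
    try:
        import pandas
    except ImportError:
        # Pandas is missing entirely.
        raise ImportError("Pandas >= %s must be installed; however, "
                          "it was not found." % minimum_pandas_version)
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
        # Pandas is present but too old.
        raise ImportError("Pandas >= %s must be installed; however, "
                          "your version was %s." % (minimum_pandas_version,
                                                    pandas.__version__))
```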
We should also add `require_minimum_pandas_version()` to line 649 in session.py? https://github.com/apache/spark/blob/a0e4b166f71f9bb5f3e5af7843a03c11658892fd/python/pyspark/sql/session.py#L643-L653
Yup, let me give it a shot and clean up there too.
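A sketch of what the added check in `SparkSession.createDataFrame` ends up looking like, mirroring the diff that appears later in this thread (surrounding code elided):

```python
if has_pandas and isinstance(data, pandas.DataFrame):
    # Require a supported Pandas before converting the input DataFrame.
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()
    # ... continue with the Pandas -> Spark conversion ...
```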
Test build #86994 has finished for PR 20487 at commit
@HyukjinKwon just to clarify: it seems these PySpark tests are already skipped when the required pyarrow and pandas are not found, and this PR refactors the error messages to make that cleaner. Is that correct?
Test build #86997 has finished for PR 20487 at commit
Ah, if the PyArrow version is lower than we claim, for example 0.7.0, the tests fail. Will clarify it in the PR description.
- with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'):
+ with self.assertRaisesRegexp(
+         ImportError,
+         '(Pandas >= .* must be installed|No module named pandas)'):
If the installed Pandas is older than required, it throws `Pandas >= .* must be installed`. If Pandas is not installed, `import pandas as pd` in the test throws an exception, "No module named pandas".
- @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
- def test_to_pandas_old(self):
+ @unittest.skipIf(_have_pandas, "Required Pandas was found.")
+ def test_to_pandas_required_pandas_not_found(self):
Now this also tests the case when Pandas is missing.
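A minimal sketch of the rewritten test, assuming the `_have_pandas` flag and the `_to_pandas` helper shown elsewhere in this diff (the body here is illustrative):

```python
@unittest.skipIf(_have_pandas, "Required Pandas was found.")
def test_to_pandas_required_pandas_not_found(self):
    # Covers both failure modes: Pandas too old, or Pandas missing entirely.
    with self.assertRaisesRegexp(
            ImportError,
            '(Pandas >= .* must be installed|No module named pandas)'):
        self._to_pandas()
```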
_pyarrow_requirement_message = _exception_message(e)

_have_pandas = _pandas_requirement_message is None
_have_pyarrow = _pyarrow_requirement_message is None
Here is the logic I used: `_pyarrow_requirement_message` contains the error message for the PyArrow requirement if PyArrow is missing or its version does not match. If `_pyarrow_requirement_message` contains a message, `_have_pyarrow` becomes `False`.
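Concretely, a minimal sketch of that module-level detection in tests.py, assuming `_exception_message` from `pyspark.util` (the surrounding structure is illustrative):

```python
from pyspark.util import _exception_message

_pandas_requirement_message = None
try:
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()
except ImportError as e:
    # If Pandas is missing or too old, keep the message as the skip reason.
    _pandas_requirement_message = _exception_message(e)

_pyarrow_requirement_message = None
try:
    from pyspark.sql.utils import require_minimum_pyarrow_version
    require_minimum_pyarrow_version()
except ImportError as e:
    # Same treatment for PyArrow.
    _pyarrow_requirement_message = _exception_message(e)

# A None message means the requirement is satisfied.
_have_pandas = _pandas_requirement_message is None
_have_pyarrow = _pyarrow_requirement_message is None
```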
        import pandas
    except ImportError:
        raise ImportError("Pandas >= %s must be installed; however, "
                          "it was not found." % minimum_pandas_version)
I catch `ImportError` here just to make the error message nicer.
Test build #87016 has finished for PR 20487 at commit
LGTM
python/pyspark/sql/utils.py (Outdated)
def require_minimum_pyarrow_version():
    """ Raise ImportError if minimum version of pyarrow is not installed
    """
    minimum_pyarrow_version = "0.8.0"
Maybe add a comment in https://github.com/apache/spark/blob/master/pom.xml#L188 to point here; otherwise it's hard to remember to change both places.
Sure.
@@ -2794,7 +2792,6 @@ def count_bucketed_cols(names, table="pyspark_bucket"):

    def _to_pandas(self):
        from datetime import datetime, date
        import numpy as np
This import seems unused in this function.
@@ -185,6 +185,10 @@
    <paranamer.version>2.8</paranamer.version>
    <maven-antrun.version>1.8</maven-antrun.version>
    <commons-crypto.version>1.0.0</commons-crypto.version>
    <!--
    If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
    ./python/run-tests.py and ./python/setup.py too.
`./python/run-tests.py` is not there yet; it's part of #20473.
@@ -100,6 +100,11 @@ def _supports_symlinks():
          file=sys.stderr)
    exit(-1)

# If you are changing the versions here, please also change ./python/pyspark/sql/utils.py and
# ./python/run-tests.py. In case of Arrow, you should also check ./pom.xml.
Should we add a similar comment to each `*.py` file, not only `setup.py`, so they refer to one another? And we should also add one for Pandas in each `*.py` file.
Hmmmm .. I thought the proper place to upgrade the versions is `setup.py` and `pom.xml`, so if we happen to update PyArrow (`pom.xml` / `setup.py`) or Pandas (`setup.py`), we would look at either place first. To be honest, I don't quite like writing down specific paths in those comments, because if we happen to move files we'd have to update all the comments ..
I see. I agree.
Yes, true - though I think just the file name is OK; they are distinct enough to find.
Let me just keep them .. maybe I care about this too much, but `*/*/pom.xml`, `[./dev/run-tests|./python/run-tests|./python/run-tests.py]` and `[util.py|utils.py]` might be confusing ..
Test build #87057 has finished for PR 20487 at commit
Test build #87066 has finished for PR 20487 at commit
Looks like this PR doesn't skip the "old Pandas" tests but rewrites them?
Ah, yup. There are a few tests for old Pandas that ran only when the Pandas version was lower, and I rewrote them to run both when the Pandas version is lower and when Pandas is missing. Let me clarify the title and description.
LGTM too
Test build #87134 has finished for PR 20487 at commit
retest this please
Test build #87139 has finished for PR 20487 at commit
@@ -646,6 +646,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
        except Exception:
            has_pandas = False
        if has_pandas and isinstance(data, pandas.DataFrame):
            from pyspark.sql.utils import require_minimum_pandas_version
            require_minimum_pandas_version()
Just curious, do you have a list of the places where we do this version check for Pandas and PyArrow?
I don't think I know all the places exactly. For now, I can think of: createDataFrame with a Pandas DataFrame input, toPandas, and pandas_udf for the APIs, plus some places in session.py / types.py for internal methods like the `_check*` family or `*arrow*` / `*pandas*`. I was thinking of putting those into a single module (file) after 2.3.0. Will cc you and @ueshin there.
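For illustration only, a hypothetical sketch of such a consolidated module (this file and its names do not exist in the PR; they are invented here to show the idea):

```python
# Hypothetical module, e.g. pyspark/sql/requirements.py -- illustration only.
# Centralizes the minimum-version constants and checks that are currently
# spread across utils.py, setup.py and a pom.xml comment.
from distutils.version import LooseVersion

MINIMUM_PANDAS_VERSION = "0.19.2"
MINIMUM_PYARROW_VERSION = "0.8.0"


def _require(package_name, minimum_version):
    """Raise ImportError if package_name is missing or older than minimum_version."""
    try:
        package = __import__(package_name)
    except ImportError:
        raise ImportError("%s >= %s must be installed; however, it was not found."
                          % (package_name, minimum_version))
    if LooseVersion(package.__version__) < LooseVersion(minimum_version):
        raise ImportError("%s >= %s must be installed; however, your version was %s."
                          % (package_name, minimum_version, package.__version__))


def require_minimum_pandas_version():
    _require("pandas", MINIMUM_PANDAS_VERSION)


def require_minimum_pyarrow_version():
    _require("pyarrow", MINIMUM_PYARROW_VERSION)
```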
LGTM
Merged to master.
## What changes were proposed in this pull request?

This is a followup PR of #20487. When importing a module that doesn't exist, the error message differs slightly between Python 2 and 3.

E.g., in Python 2:

```
No module named pandas
```

in Python 3:

```
No module named 'pandas'
```

So, one test that checks for an import error fails in Python 3 without pandas. This PR fixes it.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20538 from ueshin/issues/SPARK-23319/fup1.
## What changes were proposed in this pull request?
This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test.
We declared the extra dependencies in spark/python/setup.py, line 204 at b8bfce5: https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
In case of PyArrow:
Currently we only check whether pyarrow is installed, without checking the version, and tests already fail to run. For example, if PyArrow 0.7.0 is installed:
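```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.
----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```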
In case of Pandas:
There are a few tests for old Pandas that ran only when the Pandas version was lower, and I rewrote them to run both when the Pandas version is lower and when Pandas is missing.
## How was this patch tested?
Manually tested by modifying the condition:
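```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```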