[SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation #29640
Conversation
Hi @HyukjinKwon, as per your suggestion, I have made the necessary changes. Please have a look at it.
Please let me know your thoughts, and apologies for making you review this document multiple times.
ok to test
Test build #128271 has finished for PR 29640 at commit
Installation
============

The official release channel is to download it from `the Apache Spark website <https://spark.apache.org/downloads.html>`_.
"Official releases are available from ..."
conda activate pyspark_env

In lower Conda version, the following command might be used:
In earlier Conda versions ... should be used:
(earlier than what?)
Hi @srowen,
Based on the conda documentation the answer is "Conda version before 4.4".
The source for this info is:
conda activate: The logic and mechanisms underlying environment activation have been reworked. With conda 4.4, conda activate and conda deactivate are now the preferred commands for activating and deactivating environments. You'll find they are much more snappy than the source activate and source deactivate commands from previous conda versions. The conda activate command also has advantages of (1) being universal across all OSes, shells, and platforms, and (2) not having path collisions with scripts from other packages like Python virtualenv's activate script.
Right, I'm suggesting saying that here.
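As a sketch of what the final wording could show, both commands side by side (the environment name mirrors the one in the page; the Python version is an arbitrary example):

```bash
# Create and activate a Conda environment (conda 4.4 or later)
conda create -n pyspark_env python=3.8
conda activate pyspark_env

# In Conda versions earlier than 4.4, use instead:
source activate pyspark_env
```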
@HyukjinKwon and @srowen, thanks for your review. @HyukjinKwon, I have updated all the changes you suggested. I have submitted a commit with some of the changes. I will update the other changes (corresponding to the comments) once both of you give the green signal.
Test build #128318 has finished for PR 29640 at commit
Using Conda
-----------

Conda is an open-source package management and environment management system which is a part of `Anaconda <https://docs.continuum.io/anaconda/>`_ distribution. It is both cross-platform and language agnostic.
part of the
conda activate pyspark_env

In Conda version earlier than 4.4, the following command might be used:
should be used
source activate pyspark_env

PySpark installation using ``pip`` under Conda environment is official.
What do you mean 'official' here?
pip install pyspark

`PySpark at Conda <https://anaconda.org/conda-forge/pyspark>`_ is not the official release.
Same, I don't think this matters
I would just say, for example:
Note that `PySpark at Conda <https://anaconda.org/conda-forge/pyspark>`_ is available but not necessarily synced with PySpark release cycle because it is maintained by the community separately.
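In that spirit, the pip path inside the activated environment stays the same either way; the version pin below is only an illustration:

```bash
# Install the official PySpark release from PyPI inside the active Conda environment
pip install pyspark

# Or pin a specific release (the version shown is an example, not a recommendation)
pip install pyspark==3.0.1
```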
Ensure the ``SPARK_HOME`` environment variable points to the directory where the code has been extracted.
Define ``PYTHONPATH`` such that it can find the PySpark and
Py4J under ``$SPARK_HOME/python/lib``, one example of doing this is shown below:
(By the way I think you need just single back-ticks?)
Start a new sentence at "One example"
oh fyi double backticks here make it like a code block. single backtick makes italic for some reason.
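For reference, the example under discussion looks roughly like this; the install path is hypothetical, and the Py4J version matches the one listed in the dependency table:

```bash
# Point SPARK_HOME at the extracted release (path is an example)
export SPARK_HOME=/opt/spark-3.0.1-bin-hadoop2.7

# Make PySpark and the bundled Py4J importable
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
```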
`Py4J`        0.10.9                    Required
============= ========================= ==========================================================================

**Note**: A prerequisite for PySpark installation is the availability of JAVA 8 or later and ``JAVA_HOME`` properly set.
JAVA -> Java
Nit: you can avoid a lot of these passive sentences. "PySpark requires Java 8 or later, with JAVA_HOME
properly set" for example.
============= ========================= ==========================================================================

**Note**: A prerequisite for PySpark installation is the availability of JAVA 8 or later and ``JAVA_HOME`` properly set.
For using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features and refer to `Downloading <https://spark.apache.org/docs/latest/#downloading>`_
If using Java 11, set ...
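One way to pass that flag, sketched here with ``spark-submit`` (``app.py`` is a placeholder application):

```bash
# Enable Arrow-related features on Java 11 by setting the Netty flag on driver and executors
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  app.py
```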
============= ========================= ==========================================================================
`pandas`      0.23.2                    Optional for SQL component
`NumPy`       1.7                       Required for ML component (Optional in PySpark if ML component is not used)
`pyarrow`     0.15.1                    Optional
pyarrow is also currently only Optional for SQL.
============= ========================= ==========================================================================
Package       Minimum supported version Note
============= ========================= ==========================================================================
`pandas`      0.23.2                    Optional for SQL component
I would just write like:
Optional for SQL
Required for ML
Optional for SQL
Seems too long. Sorry for a bit of back and forth here.
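If it helps keep the table terse, recent PySpark releases also expose these optional dependencies as pip extras, so the notes can stay short; the extras below assume the ``setup.py`` definitions at the time and exact pins may differ:

```bash
# Pull in the optional SQL dependencies (pandas, pyarrow) alongside PySpark
pip install "pyspark[sql]"

# Pull in the ML dependency (numpy)
pip install "pyspark[ml]"
```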
Test build #128469 has finished for PR 29640 at commit
@HyukjinKwon and @srowen, I have made the suggested changes. Thanks for your review.
Merged to master
@srowen @rohitmishr1484 seems this commit is failing the Python linter:
Weird - it passed the test run above. Let me take a look and patch it.
This simply fixes an .rst generation error in #29640 Closes #29735 from srowen/SPARK-32180.2. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
What changes were proposed in this pull request?
This PR proposes to add the Installation page under Getting Started to the new PySpark documentation.
Why are the changes needed?
Better documentation.
Does this PR introduce any user-facing change?
No. Documentation only.
How was this patch tested?
Generated the documents locally.