[DOC] Add new download page with Spark Connect description and instructions for multiple distributions

 - Add new download page with Spark Connect description
 - Add decision matrix
 - Add instructions for multiple distributions
 - Add instructions for PySpark install
<img width="1715" alt="Screenshot 2025-02-21 at 1 17 35 PM" src="https://github.com/user-attachments/assets/3517392a-5ab8-4f24-994d-fb79714e66b7" />
<img width="1360" alt="Screenshot 2025-02-21 at 1 17 53 PM" src="https://github.com/user-attachments/assets/291761e4-1463-4b71-8ba0-b0c48c729dbf" />
<img width="1353" alt="Screenshot 2025-02-21 at 1 18 06 PM" src="https://github.com/user-attachments/assets/d3f4ee26-4239-4b98-8a9d-543e2a8ac6bc" />

Author: Jules Damji <dmatrix@comast.net>

Closes #591 from dmatrix/br_jsd_new_download_page.
Jules Damji authored and srowen committed Feb 25, 2025
1 parent 70cc6c9 commit 8185433
Showing 3 changed files with 174 additions and 16 deletions.
6 changes: 3 additions & 3 deletions Gemfile.lock
@@ -4,15 +4,15 @@ GEM
addressable (2.8.7)
public_suffix (>= 2.0.2, < 7.0)
colorator (1.1.0)
concurrent-ruby (1.3.4)
concurrent-ruby (1.3.5)
em-websocket (0.5.3)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0)
eventmachine (1.2.7)
ffi (1.17.1)
forwardable-extended (2.6.0)
http_parser.rb (0.8.0)
i18n (1.14.6)
i18n (1.14.7)
concurrent-ruby (~> 1.0)
jekyll (4.2.0)
addressable (~> 2.4)
@@ -48,7 +48,7 @@ GEM
rb-fsevent (0.11.2)
rb-inotify (0.11.1)
ffi (~> 1.0)
rexml (3.4.0)
rexml (3.4.1)
rouge (3.26.0)
safe_yaml (1.0.5)
sassc (2.4.0)
62 changes: 55 additions & 7 deletions downloads.md
@@ -16,6 +16,34 @@ window.onload = function () {
}
</script>

## Introduction

Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: _classic_ and _connect_. As the names suggest, the _classic_ Spark version is the usual distribution you would expect for any new Spark release. The _connect_ distribution, in contrast, is the version with [Spark Connect](https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html) enabled by default. Which one should you download?

Select the _connect_ version if your workloads only use standard [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) and [Spark SQL](https://spark.apache.org/docs/latest/api/sql/) APIs. Choose the _classic_ version for traditional workloads requiring access to [RDD APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis), [SparkContext APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis), JVM properties, and custom catalyst rules/plans.

If you are not familiar with Spark Connect, its primary benefit is a stable client API that decouples the client from the Spark Driver. This makes Spark projects much easier to maintain over time, because you can update the Spark Driver and server-side dependencies without having to update the client. To learn more about its architecture and benefits, see the [Spark Connect architecture](https://spark.apache.org/spark-connect/) page.
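
For illustration only, here is a minimal PySpark sketch of that decoupling in action. It assumes a Spark Connect server is already running and reachable at `sc://localhost:15002`; that endpoint is a placeholder for this example, not something this page prescribes.

    # Minimal sketch: a thin client talking to a remote Spark Connect server.
    # Assumption: a Spark Connect server is listening at sc://localhost:15002.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .remote("sc://localhost:15002")  # client connects over gRPC
        .getOrCreate()
    )

    # Standard DataFrame and Spark SQL APIs work unchanged through the client.
    spark.range(5).selectExpr("id", "id * 2 AS doubled").show()

    spark.stop()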

## Selection Matrix for Spark Distributions

This table helps you choose between the two distributions based on the type of your Spark workloads.

| Workload Types | Spark Distribution and PySpark Package Mode | Spark Config Change |
|-----------------------------------------------------------------------------------------------------|--------------------------------------------|---------------------------------------------|
| - Only use standard DataFrame and Spark SQL APIs | _connect_ | None |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Use of thin client to access Spark cluster from non-JVM languages | | |
||||
| - Access to RDD APIs | _classic_ | None |
| - Access to SparkContext API and properties | | |
| - Access to standard DataFrame and Spark SQL APIs | | |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Access to JVM properties | | |
| - Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans | | |
||||
| - Ability to switch between classic and connect modes (see the sketch after the table) | _classic_ | `spark.api.mode = {classic or connect}` |
||||
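
As a concrete illustration of the last row of the matrix, the sketch below shows one way to flip the _classic_ distribution into connect mode through `spark.api.mode`. Treat the exact builder invocation as an assumption and check the Spark 4.0 configuration documentation for the authoritative details.

    # Sketch (assumption): switching a classic installation into connect mode
    # by setting spark.api.mode when the session is created.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("api-mode-demo")
        .config("spark.api.mode", "connect")  # "classic" keeps the traditional behavior
        .getOrCreate()
    )

    # With connect mode active, only DataFrame and Spark SQL APIs are available.
    spark.range(3).show()
    spark.stop()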

## Download Apache Spark&trade;

1. Choose a Spark release:
@@ -27,19 +27,55 @@
3. Download Spark: <span id="spanDownloadLink"></span>

4. Verify this release using the <span id="sparkDownloadVerify"></span> and [project release KEYS](https://downloads.apache.org/spark/KEYS) by following these [procedures](https://www.apache.org/info/verification.html).
classic

Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.
Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.

### Link with Spark
### Link with Spark ###
Spark artifacts are [hosted in Maven Central](https://search.maven.org/search?q=g:org.apache.spark). You can add a Maven dependency with the following coordinates:

groupId: org.apache.spark
artifactId: spark-core_2.12
version: 3.5.4
artifactId: spark-core_2.13
version: 4.0.0

### Installing with PyPI ###
Mirroring the two distributions described above, PyPI also offers two PySpark packages. The default _classic_ package is __pyspark__, while the _connect_ package is __pyspark-connect__, which depends on __pyspark__.

Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.

### Installing PySpark Connect ###

Since the __pyspark-connect__ package depends on __pyspark__, installing __pyspark-connect__ automatically installs __pyspark__ for you. The __pyspark-connect__ package is mostly empty; it merely sets the Spark config `spark.api.mode` to _connect_ in the underlying __pyspark__ package.

`pip install pyspark-connect==4.0.0`

Thereafter, follow the Spark Connect [quickstart guide](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html) on how to use SparkSession.
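
As a quick smoke test after installing, something like the following should work. It assumes that with __pyspark-connect__ installed, a plain `SparkSession.builder.getOrCreate()` transparently runs the session in connect mode, as described in the quickstart guide.

    # Smoke test after `pip install pyspark-connect==4.0.0`.
    # Assumption: the package makes getOrCreate() return a Spark Connect session.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("connect-smoke-test").getOrCreate()

    df = spark.createDataFrame([(1, "spark"), (2, "connect")], ["id", "name"])
    df.filter("id > 1").show()

    # Note: in connect mode the session does not expose a SparkContext on the client.
    spark.stop()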

### Installing PySpark Classic ###

Simply run `pip install pyspark==4.0.0`.
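
For completeness, here is a minimal sketch of a classic-mode session, where the SparkContext and RDD APIs remain available on the client. The `local[*]` master is only an assumption for a quick local check.

    # Minimal local check for the classic package (pip install pyspark==4.0.0).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: run locally just to verify the install
        .appName("classic-check")
        .getOrCreate()
    )

    # Classic mode exposes the SparkContext and RDD APIs.
    rdd = spark.sparkContext.parallelize(range(10))
    print(rdd.sum())  # 45

    spark.stop()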

### Installing PySpark Client ###

### Installing with PyPi
<a href="https://pypi.org/project/pyspark/">PySpark</a> is now available in pypi. To install just run `pip install pyspark`.
Alternatively, if you only want a pure-Python thin library with Spark Connect capabilities, install the _pyspark-client_ package: `pip install pyspark-client`.
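
A hedged sketch of using the thin client follows. It assumes a Spark Connect server is reachable and that the `SPARK_REMOTE` environment variable points at it; the `sc://spark-cluster.example.com:15002` endpoint below is a placeholder.

    # Thin-client sketch for pyspark-client: no JVM or local Spark runtime needed.
    # Assumption: SPARK_REMOTE points at a reachable Spark Connect server.
    import os
    from pyspark.sql import SparkSession

    os.environ.setdefault("SPARK_REMOTE", "sc://spark-cluster.example.com:15002")

    spark = SparkSession.builder.getOrCreate()  # picks up SPARK_REMOTE
    spark.sql("SELECT 1 AS ok").show()
    spark.stop()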

For more detailed examples of Apache Spark 4.0 features, check the [PySpark User Guide](https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html) and [PySpark installation](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html).

### Installing with Docker

@@ -58,4 +106,4 @@ but they are still available at [Spark release archives](https://archive.apache.

**NOTE**: Previous releases of Spark may be affected by security issues. Please consult the
[Security](security.html) page for a list of known issues that may affect the version you download
before deciding to use it.
before deciding to use it.
122 changes: 116 additions & 6 deletions site/downloads.html
@@ -161,6 +161,95 @@
}
</script>

<h2 id="introduction">Introduction</h2>

<p>Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: <em>classic</em> and <em>connect</em>. As the names suggest, the <em>classic</em> Spark version is the usual distribution you would expect for any new Spark release. The <em>connect</em> distribution, in contrast, is the version with <a href="https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html">Spark Connect</a> enabled by default. Which one should you download?</p>

<p>Select the <em>connect</em> version if your workloads only use standard <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html">DataFrame</a> and <a href="https://spark.apache.org/docs/latest/api/sql/">Spark SQL</a> APIs. Choose the <em>classic</em> version for traditional workloads requiring access to <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis">RDD APIs</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis">SparkContext APIs</a>, JVM properties, and custom catalyst rules/plans.</p>

<p>If you are not familiar with Spark Connect, its primary benefit is a stable client API that decouples the client from the Spark Driver. This makes Spark projects much easier to maintain over time, because you can update the Spark Driver and server-side dependencies without having to update the client. To learn more about its architecture and benefits, see the <a href="https://spark.apache.org/spark-connect/">Spark Connect architecture</a> page.</p>

<h2 id="selection-matrix-for-spark-distributions">Selection Matrix for Spark Distributions</h2>

<p>This table helps you choose between the two distributions based on the type of your Spark workloads.</p>

<table>
<thead>
<tr>
<th>Workload Types</th>
<th>Spark Distribution and PySpark Package Mode</th>
<th>Spark Config Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>- Only use standard DataFrame and Spark SQL APIs</td>
<td><em>connect</em></td>
<td>None</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Use of thin client to access Spark cluster from non-JVM languages</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to RDD APIs</td>
<td><em>classic</em></td>
<td>None</td>
</tr>
<tr>
<td>- Access to SparkContext API and properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to standard DataFrame and Spark SQL APIs</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to JVM properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Able to switch between classic and connect</td>
<td><em>classic</em></td>
<td><code class="language-plaintext highlighter-rouge">spark.api.mode = {classic or connect}</code></td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

<h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>

<ol>
@@ -176,22 +265,43 @@ <h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>
<p>Download Spark: <span id="spanDownloadLink"></span></p>
</li>
<li>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.</p>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.
classic</p>
</li>
</ol>

<p>Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.</p>
<p>Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.</p>

<h3 id="link-with-spark">Link with Spark</h3>
<p>Spark artifacts are <a href="https://search.maven.org/search?q=g:org.apache.spark">hosted in Maven Central</a>. You can add a Maven dependency with the following coordinates:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>groupId: org.apache.spark
artifactId: spark-core_2.12
version: 3.5.4
artifactId: spark-core_2.13
version: 4.0.0
</code></pre></div></div>

<h3 id="installing-with-pypi">Installing with PyPi</h3>
<p><a href="https://pypi.org/project/pyspark/">PySpark</a> is now available in pypi. To install just run <code class="language-plaintext highlighter-rouge">pip install pyspark</code>.</p>
<h3 id="installing-with-pypi">Installing with PyPI</h3>
<p>Mirroring the two distributions described above, PyPI also offers two PySpark packages. The default <em>classic</em> package is <strong>pyspark</strong>, while the <em>connect</em> package is <strong>pyspark-connect</strong>, which depends on <strong>pyspark</strong>.</p>

<p>Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.</p>

<h3 id="installing-pyspark-connect">Installing PySpark Connect</h3>

<p>Since the <strong>pyspark-connect</strong> package depends on <strong>pyspark</strong>, installing <strong>pyspark-connect</strong> automatically installs <strong>pyspark</strong> for you. The <strong>pyspark-connect</strong> package is mostly empty; it merely sets the Spark config <code class="language-plaintext highlighter-rouge">spark.api.mode</code> to <em>connect</em> in the underlying <strong>pyspark</strong> package.</p>

<p><code class="language-plaintext highlighter-rouge">pip install pyspark-connect==4.0.0</code></p>

<p>Thereafter, follow the Spark Connect <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html">quickstart guide</a> on how to use SparkSession.</p>

<h3 id="installing-pyspark-classic">Installing PySpark Classic</h3>

<p>Simply run <code class="language-plaintext highlighter-rouge">pip install pyspark==4.0.0</code>.</p>

<h3 id="installing-pyspark-client">Installing PySpark Client</h3>

<p>Alternatively, if you only want a pure-Python thin library with Spark Connect capabilities, install the <em>pyspark-client</em> package: <code class="language-plaintext highlighter-rouge">pip install pyspark-client</code>.</p>

<p>For more detailed examples of Apache Spark 4.0 features, check the <a href="https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html">PySpark User Guide</a> and <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html">PySpark installation</a>.</p>

<h3 id="installing-with-docker">Installing with Docker</h3>

