[DOC] Add new download page with Spark Connect description and instructions for multiple distributions

 - Add new download page with Spark Connect description
 - Add decision matrix
 - Add instructions for multiple distributions
 - Add instructions for PySpark install
<img width="1715" alt="Screenshot 2025-02-21 at 1 17 35 PM" src="https://github.com/user-attachments/assets/3517392a-5ab8-4f24-994d-fb79714e66b7" />
<img width="1360" alt="Screenshot 2025-02-21 at 1 17 53 PM" src="https://github.com/user-attachments/assets/291761e4-1463-4b71-8ba0-b0c48c729dbf" />
<img width="1353" alt="Screenshot 2025-02-21 at 1 18 06 PM" src="https://github.com/user-attachments/assets/d3f4ee26-4239-4b98-8a9d-543e2a8ac6bc" />

Author: Jules Damji <dmatrix@comast.net>

Closes #591 from dmatrix/br_jsd_new_download_page.
Jules Damji authored and srowen committed Feb 25, 2025
1 parent 70cc6c9 commit 8185433
Showing 3 changed files with 174 additions and 16 deletions.
6 changes: 3 additions & 3 deletions Gemfile.lock
@@ -4,15 +4,15 @@ GEM
addressable (2.8.7)
public_suffix (>= 2.0.2, < 7.0)
colorator (1.1.0)
concurrent-ruby (1.3.4)
concurrent-ruby (1.3.5)
em-websocket (0.5.3)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0)
eventmachine (1.2.7)
ffi (1.17.1)
forwardable-extended (2.6.0)
http_parser.rb (0.8.0)
i18n (1.14.6)
i18n (1.14.7)
concurrent-ruby (~> 1.0)
jekyll (4.2.0)
addressable (~> 2.4)
@@ -48,7 +48,7 @@ GEM
rb-fsevent (0.11.2)
rb-inotify (0.11.1)
ffi (~> 1.0)
rexml (3.4.0)
rexml (3.4.1)
rouge (3.26.0)
safe_yaml (1.0.5)
sassc (2.4.0)
62 changes: 55 additions & 7 deletions downloads.md
@@ -16,6 +16,34 @@ window.onload = function () {
}
</script>

## Introduction

Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: _classic_ and _connect_. As the names suggest, the _classic_ Spark version is the usual distribution you would expect for any new Spark release. The _connect_ distribution, in contrast, is the version with [Spark Connect](https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html) enabled by default. Which one should you download?

Select the _connect_ version if your workloads only use standard [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) and [Spark SQL](https://spark.apache.org/docs/latest/api/sql/) APIs. Choose the _classic_ version for traditional workloads requiring access to [RDD APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis), [SparkContext APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis), JVM properties, and custom catalyst rules/plans.

If you are not familiar with Spark Connect, its primary benefit is a stable client API that decouples the client from the Spark Driver. This makes Spark projects much easier to maintain over time, because you can update the Spark Driver and server-side dependencies without having to update the client. To learn more about its architecture and benefits, see the [Spark Connect architecture](https://spark.apache.org/spark-connect/) page.
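
For illustration only, here is a minimal PySpark sketch of that decoupling in action. It assumes a Spark Connect server is already running and reachable at `sc://localhost:15002`; that endpoint is a placeholder for this example, not something this page prescribes.

    # Minimal sketch: a thin client talking to a remote Spark Connect server.
    # Assumption: a Spark Connect server is listening at sc://localhost:15002.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .remote("sc://localhost:15002")  # client connects over gRPC
        .getOrCreate()
    )

    # Standard DataFrame and Spark SQL APIs work unchanged through the client.
    spark.range(5).selectExpr("id", "id * 2 AS doubled").show()

    spark.stop()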

## Selection Matrix for Spark Distributions

This table helps you choose between the two distributions based on the type of your Spark workloads.

| Workload Types | Spark Distribution and PySpark Package Mode | Spark Config Change |
|-----------------------------------------------------------------------------------------------------|--------------------------------------------|---------------------------------------------|
| - Only use standard DataFrame and Spark SQL APIs | _connect_ | None |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Use of thin client to access Spark cluster from non-JVM languages | | |
||||
| - Access to RDD APIs | _classic_ | None |
| - Access to SparkContext API and properties | | |
| - Access to standard DataFrame and Spark SQL APIs | | |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Access to JVM properties | | |
| - Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans | | |
||||
| - Ability to switch between classic and connect modes (see the sketch after the table) | _classic_ | `spark.api.mode = {classic or connect}` |
||||
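
As a concrete illustration of the last row of the matrix, the sketch below shows one way to flip the _classic_ distribution into connect mode through `spark.api.mode`. Treat the exact builder invocation as an assumption and check the Spark 4.0 configuration documentation for the authoritative details.

    # Sketch (assumption): switching a classic installation into connect mode
    # by setting spark.api.mode when the session is created.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("api-mode-demo")
        .config("spark.api.mode", "connect")  # "classic" keeps the traditional behavior
        .getOrCreate()
    )

    # With connect mode active, only DataFrame and Spark SQL APIs are available.
    spark.range(3).show()
    spark.stop()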

## Download Apache Spark&trade;

1. Choose a Spark release:
@@ -27,19 +27,55 @@
3. Download Spark: <span id="spanDownloadLink"></span>

4. Verify this release using the <span id="sparkDownloadVerify"></span> and [project release KEYS](https://downloads.apache.org/spark/KEYS) by following these [procedures](https://www.apache.org/info/verification.html).
classic

Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.
Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.

### Link with Spark
### Link with Spark ###
Spark artifacts are [hosted in Maven Central](https://search.maven.org/search?q=g:org.apache.spark). You can add a Maven dependency with the following coordinates:

groupId: org.apache.spark
artifactId: spark-core_2.12
version: 3.5.4
artifactId: spark-core_2.13
version: 4.0.0

### Installing with PyPI ###
Mirroring the two distributions described above, PyPI also offers two PySpark packages. The default _classic_ package is __pyspark__, while the _connect_ package is __pyspark-connect__, which depends on __pyspark__.

Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.

### Installing PySpark Connect ###

Since the __pyspark-connect__ package depends on __pyspark__, installing __pyspark-connect__ automatically installs __pyspark__ for you. The __pyspark-connect__ package is mostly empty; it merely sets the Spark config `spark.api.mode` to _connect_ in the underlying __pyspark__ package.

`pip install pyspark-connect==4.0.0`

Thereafter, follow the Spark Connect [quickstart guide](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html) on how to use SparkSession.
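
As a quick smoke test after installing, something like the following should work. It assumes that with __pyspark-connect__ installed, a plain `SparkSession.builder.getOrCreate()` transparently runs the session in connect mode, as described in the quickstart guide.

    # Smoke test after `pip install pyspark-connect==4.0.0`.
    # Assumption: the package makes getOrCreate() return a Spark Connect session.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("connect-smoke-test").getOrCreate()

    df = spark.createDataFrame([(1, "spark"), (2, "connect")], ["id", "name"])
    df.filter("id > 1").show()

    # Note: in connect mode the session does not expose a SparkContext on the client.
    spark.stop()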

### Installing PySpark Classic ###

Simply run `pip install pyspark==4.0.0`.
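
For completeness, here is a minimal sketch of a classic-mode session, where the SparkContext and RDD APIs remain available on the client. The `local[*]` master is only an assumption for a quick local check.

    # Minimal local check for the classic package (pip install pyspark==4.0.0).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: run locally just to verify the install
        .appName("classic-check")
        .getOrCreate()
    )

    # Classic mode exposes the SparkContext and RDD APIs.
    rdd = spark.sparkContext.parallelize(range(10))
    print(rdd.sum())  # 45

    spark.stop()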

### Installing PySpark Client ###

### Installing with PyPi
<a href="https://pypi.org/project/pyspark/">PySpark</a> is now available in pypi. To install just run `pip install pyspark`.
Alternatively, if you only want a pure-Python thin library with Spark Connect capabilities, install the _pyspark-client_ package: `pip install pyspark-client`.
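
A hedged sketch of using the thin client follows. It assumes a Spark Connect server is reachable and that the `SPARK_REMOTE` environment variable points at it; the `sc://spark-cluster.example.com:15002` endpoint below is a placeholder.

    # Thin-client sketch for pyspark-client: no JVM or local Spark runtime needed.
    # Assumption: SPARK_REMOTE points at a reachable Spark Connect server.
    import os
    from pyspark.sql import SparkSession

    os.environ.setdefault("SPARK_REMOTE", "sc://spark-cluster.example.com:15002")

    spark = SparkSession.builder.getOrCreate()  # picks up SPARK_REMOTE
    spark.sql("SELECT 1 AS ok").show()
    spark.stop()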

For more detailed examples of Apache Spark 4.0 features, check the [PySpark User Guide](https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html) and [PySpark installation](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html).

### Installing with Docker

@@ -58,4 +106,4 @@ but they are still available at [Spark release archives](https://archive.apache.

**NOTE**: Previous releases of Spark may be affected by security issues. Please consult the
[Security](security.html) page for a list of known issues that may affect the version you download
before deciding to use it.
before deciding to use it.
122 changes: 116 additions & 6 deletions site/downloads.html
@@ -161,6 +161,95 @@
}
</script>

<h2 id="introduction">Introduction</h2>

<p>Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: <em>classic</em> and <em>connect</em>. As the names suggest, the <em>classic</em> Spark version is the usual distribution you would expect for any new Spark release. The <em>connect</em> distribution, in contrast, is the version with <a href="https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html">Spark Connect</a> enabled by default. Which one should you download?</p>

<p>Select the <em>connect</em> version if your workloads only use standard <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html">DataFrame</a> and <a href="https://spark.apache.org/docs/latest/api/sql/">Spark SQL</a> APIs. Choose the <em>classic</em> version for traditional workloads requiring access to <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis">RDD APIs</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis">SparkContext APIs</a>, JVM properties, and custom catalyst rules/plans.</p>

<p>If you are not familiar with Spark Connect, its primary benefit is a stable client API that decouples the client from the Spark Driver. This makes Spark projects much easier to maintain over time, because you can update the Spark Driver and server-side dependencies without having to update the client. To learn more about its architecture and benefits, see the <a href="https://spark.apache.org/spark-connect/">Spark Connect architecture</a> page.</p>

<h2 id="selection-matrix-for-spark-distributions">Selection Matrix for Spark Distributions</h2>

<p>This table helps you choose between the two distributions based on the type of your Spark workloads.</p>

<table>
<thead>
<tr>
<th>Workload Types</th>
<th>Spark Distribution and PySpark Package Mode</th>
<th>Spark Config Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>- Only use standard DataFrame and Spark SQL APIs</td>
<td><em>connect</em></td>
<td>None</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Use of thin client to access Spark cluster from non-JVM languages</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to RDD APIs</td>
<td><em>classic</em></td>
<td>None</td>
</tr>
<tr>
<td>- Access to SparkContext API and properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to standard DataFrame and Spark SQL APIs</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to JVM properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Able to switch between classic and connect</td>
<td><em>classic</em></td>
<td><code class="language-plaintext highlighter-rouge">spark.api.mode = {classic or connect}</code></td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

<h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>

<ol>
@@ -176,22 +265,43 @@ <h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>
<p>Download Spark: <span id="spanDownloadLink"></span></p>
</li>
<li>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.</p>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.
classic</p>
</li>
</ol>

<p>Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.</p>
<p>Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.</p>

<h3 id="link-with-spark">Link with Spark</h3>
<p>Spark artifacts are <a href="https://search.maven.org/search?q=g:org.apache.spark">hosted in Maven Central</a>. You can add a Maven dependency with the following coordinates:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>groupId: org.apache.spark
artifactId: spark-core_2.12
version: 3.5.4
artifactId: spark-core_2.13
version: 4.0.0
</code></pre></div></div>

<h3 id="installing-with-pypi">Installing with PyPi</h3>
<p><a href="https://pypi.org/project/pyspark/">PySpark</a> is now available in pypi. To install just run <code class="language-plaintext highlighter-rouge">pip install pyspark</code>.</p>
<h3 id="installing-with-pypi">Installing with PyPI</h3>
<p>Mirroring the two distributions described above, PyPI also offers two PySpark packages. The default <em>classic</em> package is <strong>pyspark</strong>, while the <em>connect</em> package is <strong>pyspark-connect</strong>, which depends on <strong>pyspark</strong>.</p>

<p>Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.</p>

<h3 id="installing-pyspark-connect">Installing PySpark Connect</h3>

<p>Since the <strong>pyspark-connect</strong> package depends on <strong>pyspark</strong>, installing <strong>pyspark-connect</strong> automatically installs <strong>pyspark</strong> for you. The <strong>pyspark-connect</strong> package is mostly empty; it merely sets the Spark config <code class="language-plaintext highlighter-rouge">spark.api.mode</code> to <em>connect</em> in the underlying <strong>pyspark</strong> package.</p>

<p><code class="language-plaintext highlighter-rouge">pip install pyspark-connect==4.0.0</code></p>

<p>Thereafter, follow the Spark Connect <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html">quickstart guide</a> on how to use SparkSession.</p>

<h3 id="installing-pyspark-classic">Installing PySpark Classic</h3>

<p>Simply run <code class="language-plaintext highlighter-rouge">pip install pyspark==4.0.0</code>.</p>

<h3 id="installing-pyspark-client">Installing PySpark Client</h3>

<p>Alternatively, if you only want a pure-Python thin library with Spark Connect capabilities, install the <em>pyspark-client</em> package: <code class="language-plaintext highlighter-rouge">pip install pyspark-client</code>.</p>

<p>For more detailed examples of Apache Spark 4.0 features, check the <a href="https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html">PySpark User Guide</a> and <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html">PySpark installation</a>.</p>

<h3 id="installing-with-docker">Installing with Docker</h3>

