Revert "[DOC] Add new download page with Spark Connect description an… #592

Closed
wants to merge 1 commit into from
6 changes: 3 additions & 3 deletions Gemfile.lock
@@ -4,15 +4,15 @@ GEM
addressable (2.8.7)
public_suffix (>= 2.0.2, < 7.0)
colorator (1.1.0)
concurrent-ruby (1.3.5)
concurrent-ruby (1.3.4)
em-websocket (0.5.3)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0)
eventmachine (1.2.7)
ffi (1.17.1)
forwardable-extended (2.6.0)
http_parser.rb (0.8.0)
i18n (1.14.7)
i18n (1.14.6)
concurrent-ruby (~> 1.0)
jekyll (4.2.0)
addressable (~> 2.4)
@@ -48,7 +48,7 @@ GEM
rb-fsevent (0.11.2)
rb-inotify (0.11.1)
ffi (~> 1.0)
rexml (3.4.1)
rexml (3.4.0)
rouge (3.26.0)
safe_yaml (1.0.5)
sassc (2.4.0)
62 changes: 7 additions & 55 deletions downloads.md
@@ -16,34 +16,6 @@ window.onload = function () {
}
</script>

## Introduction

Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: _classic_ and _connect_. As the names suggest, the _classic_ Spark version is the usual distribution you would expect for any new Spark release. The _connect_ distribution, in contrast, is the version with [Spark Connect](https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html) enabled by default. Which one should you download?

Select the _connect_ version if your workloads only use standard [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) and [Spark SQL](https://spark.apache.org/docs/latest/api/sql/) APIs. Choose the _classic_ version for traditional workloads requiring access to [RDD APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis), [SparkContext APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis), JVM properties, and custom catalyst rules/plans.

If you are not familiar with Spark Connect, its primary benefit is a stable client API that decouples the client from the Spark Driver. This makes Spark projects much easier to maintain over time, since you can update the Spark Driver and server-side dependencies without having to update the client. To learn more about Spark Connect and its architecture and benefits, see [Spark Connect architecture](https://spark.apache.org/spark-connect/).
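
For illustration only, here is a minimal PySpark sketch of that decoupling, assuming a Spark Connect server is already running at the hypothetical address `sc://spark-server.example.com:15002`:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server; only the thin client runs locally.
# The host and port below are placeholders for your own deployment.
spark = SparkSession.builder.remote("sc://spark-server.example.com:15002").getOrCreate()

# Standard DataFrame and Spark SQL APIs work unchanged over the connection.
spark.range(10).selectExpr("sum(id) AS total").show()

spark.stop()
```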

## Selection Matrix for Spark Distributions

This table guides you to which of the two distributions to select based on the type of Spark workloads.

| Workload Types | Spark Distribution and PySpark Package Mode | Spark Config Change |
|-----------------------------------------------------------------------------------------------------|--------------------------------------------|---------------------------------------------|
| - Only use standard DataFrame and Spark SQL APIs | _connect_ | None |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Use of thin client to access Spark cluster from non-JVM languages | | |
||||
| - Access to RDD APIs | _classic_ | None |
| - Access to SparkContext API and properties | | |
| - Access to standard DataFrame and Spark SQL APIs | | |
| - Ability to access and debug Spark from IDE or interact in notebooks | | |
| - Access to JVM properties | | |
| - Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans | | |
||||
| - Able to switch between classic and connect | _classic_ | `spark.api.mode = {classic or connect}` |
||||

## Download Apache Spark&trade;

1. Choose a Spark release:
@@ -55,39 +27,19 @@ This table guides you to which of the two distributions to select based on the t
3. Download Spark: <span id="spanDownloadLink"></span>

4. Verify this release using the <span id="sparkDownloadVerify"></span> and [project release KEYS](https://downloads.apache.org/spark/KEYS) by following these [procedures](https://www.apache.org/info/verification.html).
classic

Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.
Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.

### Link with Spark ###
### Link with Spark
Spark artifacts are [hosted in Maven Central](https://search.maven.org/search?q=g:org.apache.spark). You can add a Maven dependency with the following coordinates:

groupId: org.apache.spark
artifactId: spark-core_2.13
version: 4.0.0

### Installing with PyPI ###
Like the two distributions mentioned above, PyPI also offers two PySpark packages. The default is the _classic_ __pyspark__, while the _connect_ version, __pyspark-connect__, depends on __pyspark__.

Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.

### Installing PySpark Connect ###

Since the __pyspark-connect__ package depends on __pyspark__, installing __pyspark-connect__ automatically installs __pyspark__ for you. The __pyspark-connect__ package is mostly empty; it merely sets the Spark config `spark.api.mode` to _connect_ in the underlying __pyspark__ package.

`pip install pyspark-connect==4.0.0`

Thereafter, follow the Spark Connect [quickstart guide](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html) to learn how to use SparkSession.
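
As a rough sketch of the post-install workflow (assuming `pyspark-connect==4.0.0` is installed as above), standard DataFrame code should run unchanged, since the package simply switches the underlying `pyspark` into connect mode:

```python
from pyspark.sql import SparkSession

# With pyspark-connect installed, spark.api.mode=connect is enabled in the
# underlying pyspark package, so a plain builder call is expected to yield a
# connect-mode session.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

spark.stop()
```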

### Installing PySpark Classic ###

Simply run `pip install pyspark==4.0.0`.
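
As a brief illustration of why the _classic_ package is needed for RDD-style workloads, the sketch below (assuming `pyspark==4.0.0` is installed locally) drops down to `SparkContext` and the RDD API, which the selection matrix above routes to the _classic_ distribution:

```python
from pyspark.sql import SparkSession

# Classic mode: the JVM-backed SparkContext and RDD API are available in-process.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A small RDD computation of the kind that requires the classic distribution.
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # 50

spark.stop()
```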

### Installing PySpark Client ###
artifactId: spark-core_2.12
version: 3.5.4

Alternatively, if you only want a pure Python thin library with Spark Connect capabilities, install the _pyspark-client_ package: `pip install pyspark-client`.
### Installing with PyPI
<a href="https://pypi.org/project/pyspark/">PySpark</a> is now available on PyPI. To install it, just run `pip install pyspark`.

For more detailed examples of Apache Spark 4.0 features, check the [PySpark User Guide](https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html) and [PySpark installation](https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html).

### Installing with Docker

@@ -106,4 +58,4 @@ but they are still available at [Spark release archives](https://archive.apache.

**NOTE**: Previous releases of Spark may be affected by security issues. Please consult the
[Security](security.html) page for a list of known issues that may affect the version you download
before deciding to use it.
before deciding to use it.
122 changes: 6 additions & 116 deletions site/downloads.html
@@ -161,95 +161,6 @@
}
</script>

<h2 id="introduction">Introduction</h2>

<p>Unlike previous Apache Spark™ releases, Spark 4.0 has two distinct distributions: <em>classic</em> and <em>connect</em>. As the names suggest, the <em>classic</em> Spark version is the usual distribution you would expect for any new Spark release. The <em>connect</em> distribution, in contrast, is the version with <a href="https://spark.apache.org/docs/4.0.0-preview2/spark-connect-overview.html">Spark Connect</a> enabled by default. Which one should you download?</p>

<p>Select the <em>connect</em> version if your workloads only use standard <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html">DataFrame</a> and <a href="https://spark.apache.org/docs/latest/api/sql/">Spark SQL</a> APIs. Choose the <em>classic</em> version for traditional workloads requiring access to <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis">RDD APIs</a>, <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis">SparkContext APIs</a>, JVM properties, and custom catalyst rules/plans.</p>

<p>If you are not familiar with Spark Connect, the primary benefit is that it provides a stable client API, decoupling the client from the Spark Driver. This makes Spark projects much easier to maintain over time, allowing you to update the Spark Driver and server-side dependencies without having to update the client. To learn more about Spark Connect, and explore its architecture details and benefits, visit here: <a href="https://spark.apache.org/spark-connect/">Spark Connect architecture</a>.</p>

<h2 id="selection-matrix-for-spark-distributions">Selection Matrix for Spark Distributions</h2>

<p>This table guides you to which of the two distributions to select based on the type of Spark workloads.</p>

<table>
<thead>
<tr>
<th>Workload Types</th>
<th>Spark Distribution and PySpark Package Mode</th>
<th>Spark Config Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>- Only use standard DataFrame and Spark SQL APIs</td>
<td><em>connect</em></td>
<td>None</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Use of thin client to access Spark cluster from non-JVM languages</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to RDD APIs</td>
<td><em>classic</em></td>
<td>None</td>
</tr>
<tr>
<td>- Access to SparkContext API and properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to standard DataFrame and Spark SQL APIs</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Ability to access and debug Spark from IDE or interact in notebooks</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to JVM properties</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Access to private catalyst APIs: custom analyzer/optimizer rules, custom query plans</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>- Able to switch between classic and connect</td>
<td><em>classic</em></td>
<td><code class="language-plaintext highlighter-rouge">spark.api.mode = {classic or connect}</code></td>
</tr>
<tr>
<td>&#160;</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

<h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>

<ol>
@@ -265,43 +176,22 @@ <h2 id="download-apache-spark">Download Apache Spark&#8482;</h2>
<p>Download Spark: <span id="spanDownloadLink"></span></p>
</li>
<li>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.
classic</p>
<p>Verify this release using the <span id="sparkDownloadVerify"></span> and <a href="https://downloads.apache.org/spark/KEYS">project release KEYS</a> by following these <a href="https://www.apache.org/info/verification.html">procedures</a>.</p>
</li>
</ol>

<p>Note that Spark 4 is pre-built with Scala 2.13 in general, and Spark 3.5+ provides additional pre-built distribution with Scala 2.13.</p>
<p>Note that Spark 3 is pre-built with Scala 2.12 in general and Spark 3.2+ provides additional pre-built distribution with Scala 2.13.</p>

<h3 id="link-with-spark">Link with Spark</h3>
<p>Spark artifacts are <a href="https://search.maven.org/search?q=g:org.apache.spark">hosted in Maven Central</a>. You can add a Maven dependency with the following coordinates:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>groupId: org.apache.spark
artifactId: spark-core_2.13
version: 4.0.0
artifactId: spark-core_2.12
version: 3.5.4
</code></pre></div></div>

<h3 id="installing-with-pypi">Installing with PyPI</h3>
<p>Like the two distributions mentioned above, PyPI will also have two PySpark package versions. The default is the <em>classic</em> <strong>pyspark</strong>, while the <em>connect</em> version is <strong>pyspark-connect</strong> and is dependent on <strong>pyspark</strong>.</p>

<p>Use the decision matrix above to select which PyPI PySpark package to use for your Spark workloads. Both <a href="https://pypi.org/project/pyspark/">PySpark</a> package versions are available on PyPI.</p>

<h3 id="installing-pyspark-connect">Installing PySpark Connect</h3>

<p>Since the <strong>pyspark-connect</strong> package depends on <strong>pyspark</strong>, installing <strong>pyspark-connect</strong> automatically installs <strong>pyspark</strong> for you. The <strong>pyspark-connect</strong> package is mostly empty; it merely sets the Spark config <code class="language-plaintext highlighter-rouge">spark.api.mode</code> to <em>connect</em> in the underlying <strong>pyspark</strong> package.</p>

<p><code class="language-plaintext highlighter-rouge">pip install pyspark-connect==4.0.0</code></p>

<p>Thereafter, follow the Spark Connect <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/quickstart_connect.html">quickstart guide</a> on how to use SparkSession.</p>

<h3 id="installing-pyspark-classic">Installing PySpark Classic</h3>

<p>Simply use <code class="language-plaintext highlighter-rouge">pip install pyspark==4.0.0</code></p>

<h3 id="installing-pyspark-client">Installing PySpark Client</h3>

<p>Alternatively, if you only want a pure Python thin library with Spark Connect capabilities, install the <em>pyspark-client</em> package: <code class="language-plaintext highlighter-rouge">pip install pyspark-client</code>.</p>

<p>For more detailed examples of Apache Spark 4.0 features, check the <a href="https://turbo-adventure-1pg35k5.pages.github.io/01-preface.html">PySpark User Guide</a> and <a href="https://spark.apache.org/docs/4.0.0-preview2/api/python/getting_started/install.html">PySpark installation</a>.</p>
<h3 id="installing-with-pypi">Installing with PyPi</h3>
<p><a href="https://pypi.org/project/pyspark/">PySpark</a> is now available in pypi. To install just run <code class="language-plaintext highlighter-rouge">pip install pyspark</code>.</p>

<h3 id="installing-with-docker">Installing with Docker</h3>
