
[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect #49107

Closed
wants to merge 5 commits into apache:master from HyukjinKwon:api-mode-draft-3

Conversation

@HyukjinKwon (Member) commented Dec 9, 2024:

What changes were proposed in this pull request?

This PR proposes to add a new configuration called spark.api.mode (default classic) that allows existing Spark Classic applications to easily use Spark Connect.

Why are the changes needed?

To let users easily try Spark Connect with their existing Spark Classic applications.

Does this PR introduce any user-facing change?

It adds a new configuration spark.api.mode without changing the default behaviour.
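
For illustration, a minimal PySpark sketch of what opting in looks like for an existing application, mirroring the builder call exercised in the tests discussed below; the app name and master URL are placeholders:

```python
from pyspark.sql import SparkSession

# Existing Spark Classic code stays unchanged; only the configuration differs.
# With spark.api.mode=connect the session is backed by Spark Connect; with the
# default "classic" the behaviour is unchanged.
spark = (
    SparkSession.builder
    .appName("testApp")                   # placeholder app name
    .master("local[*]")                   # placeholder master URL
    .config("spark.api.mode", "connect")  # the configuration added by this PR
    .getOrCreate()
)

spark.range(10).show()
spark.stop()
```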

How was this patch tested?

For PySpark applications, added unit tests for Spark submissions on YARN and Kubernetes, plus manual tests.
For Scala applications, it is difficult to add a test because SBT automatically picks up the compiled jars onto the classpath (whereas we can easily control this in production environments).

For this case, I manually tested as below:

```bash
git clone https://github.com/HyukjinKwon/spark-connect-example
cd spark-connect-example
build/sbt package
cd ..
git clone https://github.com/apache/spark.git
cd spark
build/sbt package
# sbin/start-connect-server.sh
bin/spark-submit --name "testApp" --master "local[*]" --conf spark.api.mode=connect --class com.hyukjinkwon.SparkConnectExample ../spark-connect-example/target/scala-2.13/spark-connect-example_2.13-0.0.1.jar
```

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon marked this pull request as draft December 9, 2024 04:53
@HyukjinKwon HyukjinKwon force-pushed the api-mode-draft-3 branch 2 times, most recently from 9dd1ec7 to 6bdbebc on December 9, 2024 05:54
@@ -620,8 +620,18 @@ object SparkConnectClient {
* Configure the builder using the env SPARK_REMOTE environment variable.
*/
def loadFromEnvironment(): Builder = {
lazy val isAPIModeConnect = Option(System.getProperty("spark.api.mode"))
@HyukjinKwon (Member, Author) commented:

This is needed because we directly execute ConnectRepl (which is the same as the Python or R shells).

@github-actions github-actions bot added the DOCS label Dec 18, 2024
@HyukjinKwon HyukjinKwon changed the title from "[DO-NOT-MERGE] Support SQL API mode, prototype 3" to "[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect" on Dec 18, 2024
@HyukjinKwon HyukjinKwon marked this pull request as ready for review December 18, 2024 04:18
This is similar to using `--conf spark.api.mode=connect` with `--master ...`. However, note that
`spark.remote` and `--remote` are limited to `local*` values, while `--conf spark.api.mode=connect`
with `--master ...` supports additional cluster URLs, such as spark://, for broader compatibility with
Spark Classic.
@HyukjinKwon (Member, Author) commented:

@hvanhovell this part is slightly different from what we discussed offline. My logic here is to restrict it first instead of allowing it: --remote and spark.remote will keep supporting only local*, as before.
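
For comparison, a hedged PySpark sketch of the two entry points described in the documentation excerpt above; the cluster URL is a placeholder, and the local* restriction is the one stated there:

```python
from pyspark.sql import SparkSession

# Alternative A: Spark Connect via spark.remote / --remote, which the excerpt
# above limits to local* values, e.g.:
#   spark = SparkSession.builder.remote("local").getOrCreate()

# Alternative B: the new API mode keeps an ordinary cluster master URL
# (spark://, yarn, k8s://). The host below is a placeholder.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # placeholder cluster URL
    .config("spark.api.mode", "connect")
    .getOrCreate()
)
```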

@cloud-fan (Contributor) commented:

how is it going?

@HyukjinKwon (Member, Author) commented:

Needs some review from @hvanhovell.

@HyukjinKwon (Member, Author) commented:

Rebased.

HyukjinKwon added a commit that referenced this pull request Feb 4, 2025
[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect

Closes #49107 from HyukjinKwon/api-mode-draft-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 1bc4d15)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@Kimahriman (Contributor) commented Feb 5, 2025:

Questions based on the mailing list thread:

  • Is this supposed to work with cluster deploy mode?
  • If so, how would this work on YARN? It looks like it only uses an ephemeral port when running tests; I think YARN would also need to use an ephemeral port?
  • And if so, how is the connect server being secured? If this is running in a secured but shared environment, the connect server may be accessible by anyone else using the cluster.

@HyukjinKwon (Member, Author) commented:

> Is this supposed to work with cluster deploy mode?

Yes. For --master ..., it will run a local Spark Connect server where the Spark driver is running.

> If so, how would this work on YARN? It looks like it only uses an ephemeral port when running tests; I think YARN would also need to use an ephemeral port?

Yes, it will launch a Spark Connect server locally.

> And if so, how is the connect server being secured? If this is running in a secured but shared environment, the connect server may be accessible by anyone else using the cluster.

This is the same as the Py4J server. It does not enable the auth feature by default. We could add that to Spark Connect too, but I would prefer to do that separately.

@HyukjinKwon (Member, Author) commented:

I am going to merge this if that sounds fine to you, @Kimahriman.

@Kimahriman (Contributor) commented:

> If so, how would this work on YARN? It looks like it only uses an ephemeral port when running tests; I think YARN would also need to use an ephemeral port?
>
> Yes, it will launch a Spark Connect server locally.

But if it uses the default 15002 port, it will hit issues when multiple drivers are running on the same node, unless it uses an ephemeral port.

> This is the same as the Py4J server. It does not enable the auth feature by default. We could add that to Spark Connect too, but I would prefer to do that separately.

Where is auth not enabled for Py4J? It looks like it always generates a secret that the client uses to connect to it:

`private[spark] val secret: String = Utils.createSecret(sparkConf)`

@HyukjinKwon (Member, Author) commented Feb 6, 2025:

Oops, sorry, you're correct. We did enable it in Spark specifically, yes.

> But if it uses the default 15002 port, it will hit issues when multiple drivers are running on the same node, unless it uses an ephemeral port.

Yes, we should also use an ephemeral port. We are already doing that when running locally in some cases, e.g., the pyspark shell with --remote local (see DefaultChannelBuilder.default_port).
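
For context, an ephemeral port is usually obtained by binding to port 0 and reading back the OS-assigned port; a generic Python sketch of that idea (not the actual DefaultChannelBuilder code):

```python
import socket

def pick_ephemeral_port() -> int:
    """Ask the OS for a currently free port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 => the OS chooses a free port
        return s.getsockname()[1]  # the port the OS actually assigned

# A locally launched Connect server could listen on such a port instead of the
# fixed default 15002, avoiding clashes when several drivers share a node.
print(pick_ephemeral_port())
```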

@HyukjinKwon (Member, Author) commented:

I think we can address them all anyway, because we already have Py4J, which works exactly like the Spark Connect server.

@Kimahriman (Contributor) commented:

> Oops, sorry, you're correct. We did enable it in Spark specifically, yes.

Without a default mechanism to authenticate to the connect server, this seems like a massive security vulnerability then, right?

> Yes, we should also use an ephemeral port. We are already doing that when running locally in some cases, e.g., the pyspark shell with --remote local (see DefaultChannelBuilder.default_port).

Yeah, saw that. Should an ephemeral port just always be used for the auto-launching of the local server?

@HyukjinKwon (Member, Author) commented:

@Kimahriman mind if I do that separately? We already have the logic for running it in any event.

@Kimahriman (Contributor) commented:

> @Kimahriman mind if I do that separately? We already have the logic for running it in any event.

Fine either way; I brought this up on the mailing list to raise a little more awareness.

@HyukjinKwon (Member, Author) commented:

Yup, thanks for chiming in. Let me follow up.

Merged to master and branch-4.0.

HyukjinKwon added a commit that referenced this pull request Feb 6, 2025
[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect

Closes #49107 from HyukjinKwon/api-mode-draft-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 96bd8f1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon (Member, Author) commented:

I am working on it now.

zecookiez pushed a commit to zecookiez/spark that referenced this pull request Feb 6, 2025
[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect

Closes apache#49107 from HyukjinKwon/api-mode-draft-3.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@@ -237,6 +265,16 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
testPySpark(false)
}

test("run Python application with Spark Connect in yarn-client mode") {
@LuciferYang (Contributor) commented Feb 6, 2025:

@HyukjinKwon These two new test cases should assume `sql.IntegratedUDFTestUtils.isPandasAvailable`; otherwise they may lead to test failures such as https://github.com/apache/spark/actions/runs/13179439122/job/36786327176.

However, the yarn module currently does not have a dependency on the sql test module. Should we add this dependency to the yarn module, or is there a better alternative?

@LuciferYang (Contributor) commented Feb 6, 2025:

hmm... It appears that we need to check the entire Connect runtime environment, not just pandas and pyarrow.
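
For illustration, a minimal Python-side guard that a test harness could use before submitting a Connect application; the dependency list is an assumption (pandas and pyarrow are named above, grpc/grpcio is assumed):

```python
def connect_python_deps_available() -> bool:
    """Best-effort check that the Spark Connect Python client dependencies import."""
    try:
        import pandas   # noqa: F401  -- named above as a Connect requirement
        import pyarrow  # noqa: F401  -- named above as a Connect requirement
        import grpc     # noqa: F401  -- assumed requirement (grpcio package)
    except ImportError:
        return False
    return True

# Tests that spark-submit a Connect application could be skipped when this
# returns False, instead of failing like the YARN run shown below.
if not connect_python_deps_available():
    print("Skipping Spark Connect tests: Python dependencies are missing")
```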

@HyukjinKwon (Member, Author) commented:

oh

@HyukjinKwon (Member, Author) commented:

let me skip the test for now.

@HyukjinKwon (Member, Author) commented:

Actually the reason seems to be:


```
  Traceback (most recent call last):
    File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 28, in require_minimum_pandas_version
  ModuleNotFoundError: No module named 'pandas'

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/home/runner/work/spark/spark/resource-managers/yarn/target/tmp/spark-ba2c7cc1-250b-4e3d-89aa-a6c729012dcf/test.py", line 13, in <module>
      "spark.api.mode", "connect").master("yarn").getOrCreate()
                                                  ^^^^^^^^^^^^^
    File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 492, in getOrCreate
    File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/session.py", line 19, in <module>
    File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/utils.py", line 35, in check_dependencies
    File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 43, in require_minimum_pandas_version
  pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] Pandas >= 2.0.0 must be installed; however, it was not found.
  14:03:30.599 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
  14:03:30.604 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-f973a6e2-72c5-4759-8709-b18b15afc3d2
  14:03:30.608 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/localPyFiles-ce5279c9-5a6a-4547-84e9-3d01302054d0 (BaseYarnClusterSuite.scala:242)
- run Python application with Spark Connect in yarn-cluster mode *** FAILED ***
  FAILED did not equal FINISHED WARNING: Using incubator modules: jdk.incubator.vector
  Exception in thread "main" org.apache.spark.SparkException: Application application_1738850370406_0018 finished with failed status
  	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1393)
  	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1827)
  	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1032)
  	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:204)
  	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:227)
  	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:96)
  	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1137)
  	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1146)
  	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) (BaseYarnClusterSuite.scala:242)
- run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```


HyukjinKwon added a commit that referenced this pull request Feb 20, 2025
### What changes were proposed in this pull request?

This PR proposes to disallow Spark Connect strings in `--master` when the Spark API mode is `connect`. This is a Python-specific issue.

### Why are the changes needed?

Should work as documented in #49107

### Does this PR introduce _any_ user-facing change?

Not yet because the main change has not been released (#49107)

### How was this patch tested?

Manually tested:

```
 ./bin/pyspark --master "sc://localhost:15002" --conf spark.api.mode=connect
```
```
Python 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
/.../spark/python/pyspark/shell.py:77: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 52, in <module>
    spark = SparkSession.builder.getOrCreate()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../spark/python/pyspark/sql/session.py", line 512, in getOrCreate
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_INVALID] Master must either be yarn or start with spark, k8s, or local.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50000 from HyukjinKwon/SPARK-51254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Feb 20, 2025
Closes #50000 from HyukjinKwon/SPARK-51254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6603a4e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>