[SPARK-50605][CONNECT] Support SQL API mode for easier migration to Spark Connect #49107
Conversation
Force-pushed from 9dd1ec7 to 6bdbebc.
```
@@ -620,8 +620,18 @@ object SparkConnectClient {
   * Configure the builder using the env SPARK_REMOTE environment variable.
   */
  def loadFromEnvironment(): Builder = {
    lazy val isAPIModeConnect = Option(System.getProperty("spark.api.mode"))
```
This is needed because we directly execute `ConnectRepl` (which is the same as the Python or R shells).
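For illustration, a minimal standalone sketch (not the actual `SparkConnectClient` code) of the kind of check the quoted `lazy val` performs: read `spark.api.mode` from a JVM system property and fall back to `classic`, the documented default. The case-insensitive comparison here is an assumption for illustration.

```scala
// Standalone sketch only; not the real SparkConnectClient implementation.
object ApiModeCheck {
  // Mirrors the quoted lazy val: consult the spark.api.mode system property,
  // defaulting to "classic" when it is not set.
  def isAPIModeConnect: Boolean =
    Option(System.getProperty("spark.api.mode"))
      .getOrElse("classic")
      .trim
      .equalsIgnoreCase("connect")

  def main(args: Array[String]): Unit = {
    println(s"default: $isAPIModeConnect")            // false: no property set
    System.setProperty("spark.api.mode", "connect")
    println(s"after setProperty: $isAPIModeConnect")  // true
  }
}
```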
Force-pushed from f6df8f4 to 0fabc3a.
Force-pushed from bc35f57 to 51186fb.
Force-pushed from 48d7163 to 041feed.
Force-pushed from 041feed to be264da.
Force-pushed from be264da to 8907020.
This is similar to using `--conf spark.api.mode=connect` with `--master ...`. However, note that `spark.remote` and `--remote` are limited to `local*` values, while `--conf spark.api.mode=connect` with `--master ...` supports additional cluster URLs, such as `spark://`, for broader compatibility with Spark Classic.
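A minimal sketch of the migration story described above: the application below is plain Spark Classic code with nothing Connect-specific, and under this PR the same binary can be pointed at Spark Connect just by adding `--conf spark.api.mode=connect` to its existing `--master ...` submission. Class and jar names in the comment are placeholders, not from this PR.

```scala
// Plain Spark Classic application. Whether it runs through Spark Classic or
// Spark Connect is decided at submit time, e.g.
//   spark-submit --master "local[*]" --conf spark.api.mode=connect \
//     --class ExampleApp example-app.jar
import org.apache.spark.sql.SparkSession

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-mode-example")
      .getOrCreate()

    // A tiny query exercising the DataFrame API through either backend.
    spark.range(10).selectExpr("sum(id) AS total").show()

    spark.stop()
  }
}
```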
@hvanhovell this part is slightly different from what we discussed offline. My logic here is to restrict it first instead of allowing it: `--remote` and `spark.remote` will support only `local*`, as before.
how is it going?

Needs some review from @hvanhovell ..
Force-pushed from e4cf517 to 137e68c.
rebased ..
Force-pushed from f019164 to da90697.
Questions based on the mailing list thread:
Yes. For
Yes, it will launch a Spark Connect server locally.
This is the same as the Py4J server. It does not enable the auth feature by default. We could have the implementation in Spark Connect too, but I would prefer to run this separately.
I am going to merge this if that sounds fine to you @Kimahriman
But if it uses the default 15002 port, it will hit issues when multiple drivers are running on the same node, unless it uses an ephemeral port.
Where is auth not enabled for Py4J? It looks like it always generates a secret that the client uses to connect to it.
Oops, sorry, you're correct. We enabled it in Spark specifically, yes.
Yes, we should also use an ephemeral port. We're already doing that when we run it locally in some cases, e.g.,
I think we can address them all anyway, because we already have Py4J, which works exactly like the Spark Connect server.
Without a default mechanism to authenticate to the Connect server, this seems like a massive security vulnerability then, right?
Yeah, saw that. Should an ephemeral port just always be used for auto-launching the local server?
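To make the ephemeral-port idea concrete, a small JVM-level illustration (not the Connect server code): binding to port 0 asks the OS for a free port, which avoids collisions when several drivers share a node.

```scala
// Illustration of the ephemeral-port idea only; not how Spark wires it up.
import java.net.ServerSocket

object EphemeralPortDemo {
  def main(args: Array[String]): Unit = {
    val socket = new ServerSocket(0) // port 0 = let the OS pick a free ephemeral port
    try {
      println(s"OS-assigned port: ${socket.getLocalPort}")
    } finally {
      socket.close()
    }
  }
}
```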
@Kimahriman mind if I handle this separately? We already have the logic for running it, in any event.
Fine either way; brought this up on the mailing list to raise a little more awareness.
Yup, thanks for chiming in. Let me follow up. Merged to master and branch-4.0.
am working on it now
```
@@ -237,6 +265,16 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
    testPySpark(false)
  }

  test("run Python application with Spark Connect in yarn-client mode") {
```
@HyukjinKwon These two new test cases should assume `sql.IntegratedUDFTestUtils.isPandasAvailable`, otherwise they may lead to test failures such as https://github.com/apache/spark/actions/runs/13179439122/job/36786327176. However, the `yarn` module currently does not have a dependency on the `sql` test module. Should we add this dependency to the `yarn` module, or is there a better alternative?
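As a sketch of the guard being proposed, ScalaTest's `assume` turns a missing dependency into a skipped test instead of a failure. Everything except the quoted `isPandasAvailable` name is hypothetical here; the real suite extends `BaseYarnClusterSuite`, not a bare `AnyFunSuite`.

```scala
// Hypothetical sketch only: shows the assume() pattern, not the real YarnClusterSuite.
import org.scalatest.funsuite.AnyFunSuite

class ConnectOnYarnSketch extends AnyFunSuite {
  // Stand-in for sql.IntegratedUDFTestUtils.isPandasAvailable; a real check
  // would probe the Python environment used by the test.
  private def isPandasAvailable: Boolean =
    sys.env.get("PANDAS_AVAILABLE").contains("1")

  test("run Python application with Spark Connect in yarn-client mode") {
    assume(isPandasAvailable, "pandas is required for the Spark Connect Python path")
    // ... submit the Python application to YARN here ...
  }
}
```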
hmm... It appears that we need to check the entire Connect runtime environment, not just pandas and pyarrow.
oh
let me skip the test for now.
Actually the reason seems to be:

```
Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 28, in require_minimum_pandas_version
ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/spark/spark/resource-managers/yarn/target/tmp/spark-ba2c7cc1-250b-4e3d-89aa-a6c729012dcf/test.py", line 13, in <module>
    "spark.api.mode", "connect").master("yarn").getOrCreate()
    ^^^^^^^^^^^^^
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 492, in getOrCreate
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/session.py", line 19, in <module>
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/utils.py", line 35, in check_dependencies
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 43, in require_minimum_pandas_version
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] Pandas >= 2.0.0 must be installed; however, it was not found.
14:03:30.599 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
14:03:30.604 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-f973a6e2-72c5-4759-8709-b18b15afc3d2
14:03:30.608 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/localPyFiles-ce5279c9-5a6a-4547-84e9-3d01302054d0 (BaseYarnClusterSuite.scala:242)
- run Python application with Spark Connect in yarn-cluster mode *** FAILED ***
  FAILED did not equal FINISHED WARNING: Using incubator modules: jdk.incubator.vector
  Exception in thread "main" org.apache.spark.SparkException: Application application_1738850370406_0018 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1393)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1827)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1032)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:204)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:96)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1137)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1146)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) (BaseYarnClusterSuite.scala:242)
- run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```
### What changes were proposed in this pull request?

This PR proposes to disallow Spark Connect strings in `--master` when the Spark API mode is `connect`. This is a Python-specific issue.

### Why are the changes needed?

It should work as documented in #49107.

### Does this PR introduce _any_ user-facing change?

Not yet, because the main change has not been released (#49107).

### How was this patch tested?

Manually tested:

```
./bin/pyspark --master "sc://localhost:15002" --conf spark.api.mode=connect
```

```
Python 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
/.../spark/python/pyspark/shell.py:77: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 52, in <module>
    spark = SparkSession.builder.getOrCreate()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../spark/python/pyspark/sql/session.py", line 512, in getOrCreate
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_INVALID] Master must either be yarn or start with spark, k8s, or local.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50000 from HyukjinKwon/SPARK-51254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
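A minimal sketch of the rule this follow-up enforces, reusing the error text shown above. This is not the actual implementation (the real check lives in PySpark's `pyspark/sql/session.py`); it only illustrates rejecting a `sc://` URL as `--master` when the API mode is `connect`.

```scala
// Sketch of the validation rule only; the real check is implemented in PySpark.
object MasterUrlCheck {
  def validateMaster(master: String, apiMode: String): Unit = {
    val isConnectUrl = master.startsWith("sc://")
    if (apiMode.equalsIgnoreCase("connect") && isConnectUrl) {
      throw new IllegalArgumentException(
        s"[MASTER_URL_INVALID] Master must either be yarn or start with spark, k8s, or local; got '$master'.")
    }
  }

  def main(args: Array[String]): Unit = {
    validateMaster("local[*]", "connect") // accepted: Connect over a local master
    try validateMaster("sc://localhost:15002", "connect") // rejected, as in the example above
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```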
What changes were proposed in this pull request?

This PR proposes to add a new configuration called `spark.api.mode` (default `classic`) that allows existing Spark Classic applications to easily use Spark Connect.

Why are the changes needed?

In order for users to easily try Spark Connect with their existing Spark Classic applications.

Does this PR introduce any user-facing change?

It adds a new configuration `spark.api.mode` without changing the default behaviour.

How was this patch tested?

For PySpark applications, added unit tests for Spark submissions on YARN and Kubernetes, plus manual tests.

For Scala applications, it is difficult to add a test because SBT automatically picks up the compiled jars into the classpath (whereas we can easily control this for a production environment).

For this case, I manually tested as below:
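```bash
git clone https://github.com/HyukjinKwon/spark-connect-example
cd spark-connect-example
build/sbt package
cd ..
git clone https://github.com/apache/spark.git
cd spark
build/sbt package
# sbin/start-connect-server.sh
bin/spark-submit --name "testApp" --master "local[*]" --conf spark.api.mode=connect --class com.hyukjinkwon.SparkConnectExample ../spark-connect-example/target/scala-2.13/spark-connect-example_2.13-0.0.1.jar
```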
Was this patch authored or co-authored using generative AI tooling?
No.