Docs improvements #176 (Merged)
forked from apache/spark

Commits (8):
* 0094929 Adding official alpha docker image to docs (foxish)
* b9913c2 Reorder sections and create a specific one for "advanced" (foxish)
* bcb779b Provide limitations and instructions about running on GKE (foxish)
* 5c7e787 Fix title of advanced section: submission (foxish)
* 0abf312 Improved section on running in the cloud (foxish)
* 37239f3 Update versioning (foxish)
* d9f4eb9 Address comments (foxish)
* 738f791 Address comments (foxish)
@@ -0,0 +1,24 @@
---
layout: global
title: Running Spark in the cloud with Kubernetes
---

For general information about running Spark on Kubernetes, refer to [running Spark on Kubernetes](running-on-kubernetes.md).

A Kubernetes cluster may be brought up on different cloud providers or on premise. It is commonly provisioned through [Google Container Engine](https://cloud.google.com/container-engine/), with [kops](https://github.com/kubernetes/kops) on AWS, or on premise using [kubeadm](https://kubernetes.io/docs/getting-started-guides/kubeadm/).

## Running on Google Container Engine (GKE)

* Create a GKE [container cluster](https://cloud.google.com/container-engine/docs/clusters/operations).
* Obtain kubectl and [configure](https://cloud.google.com/container-engine/docs/clusters/operations#configuring_kubectl) it appropriately.
* Find the identity of the master associated with this project.

      > kubectl cluster-info
      Kubernetes master is running at https://<master-ip>:443

* Run spark-submit with the master option set to `k8s://https://<master-ip>:443`, as in the sketch below. The instructions for running spark-submit are provided in the [running on kubernetes](running-on-kubernetes.md) tutorial.
* Check that your driver pod, and subsequently your executor pods, are launched using `kubectl get pods`.
* Read the stdout and stderr of the driver pod using `kubectl logs <name-of-driver-pod>`, or stream the logs using `kubectl logs -f <name-of-driver-pod>`.
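For example, a complete submission against the GKE master found above might look like the following sketch. The image names mirror the kubespark tags referenced elsewhere in these docs; the example class, jar, instance count, and namespace are illustrative, `--class` and `--deploy-mode` are the standard spark-submit flags, and `<master-ip>` remains a placeholder:

    bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --deploy-mode cluster \
      --master k8s://https://<master-ip>:443 \
      --kubernetes-namespace default \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1 \
      examples/jars/spark_examples_2.11-2.2.0.jar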
Known issues:
* If you face OAuth token expiry errors when you run spark-submit, it is likely because the token needs to be refreshed. The easiest way to fix this is to run any `kubectl` command, say `kubectl version`, and then retry your submission.
@@ -12,15 +12,28 @@ currently limited and not well-tested. This should not be used in production environments.
* You must have appropriate permissions to create and list [pods](https://kubernetes.io/docs/user-guide/pods/), [nodes](https://kubernetes.io/docs/admin/node/) and [services](https://kubernetes.io/docs/user-guide/services/) in your cluster. You can verify that you can list these resources by running `kubectl get nodes`, `kubectl get pods` and `kubectl get svc`, which should give you a list of nodes, pods and services (if any) respectively.
* You must have an extracted spark distribution with Kubernetes support, or build one from [source](https://github.com/apache-spark-on-k8s/spark).

-## Setting Up Docker Images
+## Driver & Executor Images

Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
frequently used with Kubernetes, so Spark provides some support for working with Docker to get started quickly.

-To use Spark on Kubernetes with Docker, images for the driver and the executors need to be built and published to an
-accessible Docker registry. Spark distributions include the Docker files for the driver and the executor at
-`dockerfiles/driver/Dockerfile` and `docker/executor/Dockerfile`, respectively. Use these Docker files to build the
+If you wish to use pre-built docker images, you may use the images published in [kubespark](https://hub.docker.com/u/kubespark/). The images are as follows:
+
+<table class="table">
+<tr><th>Component</th><th>Image</th></tr>
+<tr>
+<td>Spark Driver Image</td>
+<td><code>kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1</code></td>
+</tr>
+<tr>
+<td>Spark Executor Image</td>
+<td><code>kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1</code></td>
+</tr>
+</table>
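To fetch these pre-built images locally, a plain `docker pull` of the tags listed in the table works:

    docker pull kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1
    docker pull kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1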
+You may also build these docker images from sources, or customize them as required. Spark distributions include the Docker files for the driver and the executor at
+`dockerfiles/driver/Dockerfile` and `dockerfiles/executor/Dockerfile`, respectively. Use these Docker files to build the
Docker images, and then tag them with the registry that the images should be sent to. Finally, push the images to the
registry.
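As a sketch of that build-and-push flow, run from the root of the Spark distribution and assuming the placeholder registry `registry-host:5000` (the `-t` and `-f` docker flags are standard; the Dockerfile paths are the ones named above):

    # Build the driver and executor images from the Dockerfiles named above.
    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    # Push both images so the cluster can pull them.
    docker push registry-host:5000/spark-driver:latest
    docker push registry-host:5000/spark-executor:latest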

@@ -44,8 +57,8 @@ are set up as described above:
    --kubernetes-namespace default \
    --conf spark.executor.instances=5 \
    --conf spark.app.name=spark-pi \
-   --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
-   --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
+   --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1 \
+   --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1 \
    examples/jars/spark_examples_2.11-2.2.0.jar

The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
@@ -55,7 +68,6 @@ being contacted at `api_server_url`. If no HTTP protocol is specified in the URL,
setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to
connect without SSL on a different port, the master would be set to `k8s://http://example.com:8443`.
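To make the scheme handling concrete, here is a small sketch restating the rules above with the same example host:

    # No protocol given: HTTPS is assumed.
    --master k8s://example.com:443
    # Equivalent, with the protocol spelled out:
    --master k8s://https://example.com:443
    # Explicit HTTP for a non-SSL endpoint on a different port:
    --master k8s://http://example.com:8443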

If you have a Kubernetes cluster setup, one way to discover the apiserver URL is by executing `kubectl cluster-info`.

    > kubectl cluster-info

@@ -67,33 +79,17 @@ In the above example, the specific Kubernetes cluster can be used with spark-submit.
Note that applications can currently only be executed in cluster mode, where the driver and its executors are running on
the cluster.

-### Dependency Management and Docker Containers
+### Specifying input files

Spark supports specifying JAR paths that are either on the submitting host's disk, or are located on the disk of the
driver and executors. Refer to the [application submission](submitting-applications.html#advanced-dependency-management)
section for details. Note that files specified with the `local://` scheme should be added to the container image of both
the driver and the executors. Files without a scheme or with the scheme `file://` are treated as being on the disk of
the submitting machine, and are uploaded to the driver running in Kubernetes before launching the application.
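A hypothetical submission mixing the two schemes might look like this; `dep.jar`, `lib.jar`, and `my-app.jar` are made-up names used only to illustrate where each file must live:

    # dep.jar lives on the submitting machine and is uploaded to the driver;
    # lib.jar and my-app.jar must already be present in both the driver and executor images.
    bin/spark-submit \
      --master k8s://https://<master-ip>:443 \
      --jars file:///home/user/dep.jar,local:///opt/spark-extras/lib.jar \
      local:///opt/app/my-app.jar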

-### Setting Up SSL For Submitting the Driver

-When submitting to Kubernetes, a pod is started for the driver, and the pod starts an HTTP server. This HTTP server
-receives the driver's configuration, including uploaded driver jars, from the client before starting the application.
-Spark supports using SSL to encrypt the traffic in this bootstrapping process. It is recommended to configure this
-whenever possible.
+### Accessing Kubernetes Clusters

-See the [security page](security.html) and [configuration](configuration.html) sections for more information on
-configuring SSL; use the prefix `spark.ssl.kubernetes.submit` in configuring the SSL-related fields in the context
-of submitting to Kubernetes. For example, to set the trustStore used when the local machine communicates with the driver
-pod in starting the application, set `spark.ssl.kubernetes.submit.trustStore`.

-One note about the keyStore is that it can be specified as either a file on the client machine or a file in the
-container image's disk. Thus `spark.ssl.kubernetes.submit.keyStore` can be a URI with a scheme of either `file:`
-or `local:`. A scheme of `file:` corresponds to the keyStore being located on the client machine; it is mounted onto
-the driver container as a [secret volume](https://kubernetes.io/docs/user-guide/secrets/). When the URI has the scheme
-`local:`, the file is assumed to already be on the container's disk at the appropriate path.

-### Kubernetes Clusters and the authenticated proxy endpoint
+For details about running on public cloud environments, such as Google Container Engine (GKE), refer to [running Spark in the cloud with Kubernetes](running-on-kubernetes-cloud.md).

Spark-submit also supports submission through the
[local kubectl proxy](https://kubernetes.io/docs/user-guide/accessing-the-cluster/#using-kubectl-proxy). One can use the
@@ -112,16 +108,36 @@ If our local proxy were listening on port 8001, we would have our submission look like:
    --kubernetes-namespace default \
    --conf spark.executor.instances=5 \
    --conf spark.app.name=spark-pi \
-   --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
-   --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
+   --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1 \
+   --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1 \
    examples/jars/spark_examples_2.11-2.2.0.jar

||
Communication between Spark and Kubernetes clusters is performed using the fabric8 kubernetes-client library. | ||
The above mechanism using `kubectl proxy` can be used when we have authentication providers that the fabric8 | ||
kubernetes-client library does not support. Authentication using X509 Client Certs and oauth tokens | ||
kubernetes-client library does not support. Authentication using X509 Client Certs and OAuth tokens | ||
is currently supported. | ||
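As a sketch, starting the proxy and pointing the master at it could look like this; port 8001 matches the example above:

    # Terminal 1: open an authenticated local proxy to the apiserver.
    kubectl proxy --port=8001

    # Terminal 2: submit with the master pointed at the proxy (plain HTTP, no SSL).
    bin/spark-submit --master k8s://http://127.0.0.1:8001 ...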

-### Determining the Driver Base URI
+## Advanced

+### Setting Up SSL For Submitting the Driver

+When submitting to Kubernetes, a pod is started for the driver, and the pod starts an HTTP server. This HTTP server
+receives the driver's configuration, including uploaded driver jars, from the client before starting the application.
+Spark supports using SSL to encrypt the traffic in this bootstrapping process. It is recommended to configure this
+whenever possible.

+See the [security page](security.html) and [configuration](configuration.html) sections for more information on
+configuring SSL; use the prefix `spark.ssl.kubernetes.submit` in configuring the SSL-related fields in the context
+of submitting to Kubernetes. For example, to set the trustStore used when the local machine communicates with the driver
+pod in starting the application, set `spark.ssl.kubernetes.submit.trustStore`.

+One note about the keyStore is that it can be specified as either a file on the client machine or a file in the
+container image's disk. Thus `spark.ssl.kubernetes.submit.keyStore` can be a URI with a scheme of either `file:`
+or `local:`. A scheme of `file:` corresponds to the keyStore being located on the client machine; it is mounted onto
+the driver container as a [secret volume](https://kubernetes.io/docs/user-guide/secrets/). When the URI has the scheme
+`local:`, the file is assumed to already be on the container's disk at the appropriate path.
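As an illustrative sketch, a submission using both properties could add flags like the following; the store paths are placeholders, and only the `trustStore` and `keyStore` properties named above are used:

    # The trustStore lives on the submitting machine. The keyStore URI may use
    # file: (mounted into the driver pod as a secret volume) or local: (already
    # on the container's disk at that path).
    bin/spark-submit \
      ... \
      --conf spark.ssl.kubernetes.submit.trustStore=/path/to/trustStore.jks \
      --conf spark.ssl.kubernetes.submit.keyStore=file:///path/to/keyStore.jks \
      ...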

+### Submission of Local Files through Ingress/External controller

+Kubernetes pods run with their own IP address space. If Spark is run in cluster mode, the driver pod may not be
+accessible to the submitter. However, the submitter needs to send local dependencies from its local disk to the driver
Review comments:
* these don't match the tags I see at https://hub.docker.com/r/kubespark/spark-executor/tags/
* I pushed the tag an hour ago and it seems to be in there. Will rebuild and update after the rebase also.