
[SPARK-30626][K8S] Add SPARK_APPLICATION_ID into driver pod env #27347

Closed
wants to merge 1 commit

Conversation

Jeffwan
Contributor

@Jeffwan commented Jan 23, 2020

What changes were proposed in this pull request?

Add the SPARK_APPLICATION_ID environment variable when Spark configures the driver pod.
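
For illustration, the driver pod should end up with an env entry like the following (the value is shown as a placeholder; the actual ID is generated per submission):

  env:
    - name: SPARK_APPLICATION_ID
      value: <application-id>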

Why are the changes needed?

Currently, the driver doesn't have this variable in its environment, so it's not convenient to retrieve the Spark application ID.
The use case: we want to look up the Spark application ID, create a per-application folder, and redirect driver logs into that folder.
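
As a concrete sketch of that use case (not part of this PR; the wrapper script and log path are hypothetical), a container entrypoint wrapper could do:

  # Hypothetical wrapper around the stock entrypoint: create a per-application
  # folder and redirect the driver log into it.
  APP_LOG_DIR="/var/log/spark/${SPARK_APPLICATION_ID}"
  mkdir -p "${APP_LOG_DIR}"
  exec /opt/entrypoint.sh "$@" > "${APP_LOG_DIR}/driver.log" 2>&1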

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tested. I also built a new distribution and container image to kick off a job in Kubernetes, and I do see SPARK_APPLICATION_ID added there.
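
To reproduce the check, inspecting the env of the running driver pod is enough (pod name is a placeholder):

  kubectl exec <driver-pod-name> -- printenv SPARK_APPLICATION_ID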

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 23, 2020

@Jeffwan, the use case looks reasonable.

The use case: we want to look up the Spark application ID, create a per-application folder, and redirect driver logs into that folder.

BTW, as a side note, please participate in the VOTE if you can. You don't need to run the full test suite; you can test only what you are interested in (maybe EKS?) and vote.

@dongjoon-hyun
Member

cc @dbtsai and @holdenk

@dongjoon-hyun changed the title from "[SPARK-30626][K8S] Add SPARK_APPLICATION_ID env in driver pod env" to "[SPARK-30626][K8S] Add SPARK_APPLICATION_ID into driver pod env" on Jan 23, 2020
@SparkQA

SparkQA commented Jan 23, 2020

Test build #117322 has finished for PR 27347 at commit 1af2e75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22082/

@SparkQA

SparkQA commented Jan 23, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22082/

@dongjoon-hyun (Member) left a comment

Interesting. There are two failures. Could you run the K8s Integration Tests locally and share the result here?

KubernetesSuite:
- Run SparkPi with no resources *** FAILED ***
  The code passed to eventually never returned normally. Attempted 70 times over 2.0010790142166663 minutes. Last failure message: false was not true. (KubernetesSuite.scala:315)
- Run SparkPi with a very long application name. *** FAILED ***
  The code passed to eventually never returned normally. Attempted 70 times over 2.0007419548333334 minutes. Last failure message: false was not true. (KubernetesSuite.scala:315)

@Jeffwan
Contributor Author

Jeffwan commented Jan 24, 2020

@dongjoon-hyun Thanks for the review. Sure. I would love to. You mean 2.4.5 and 3.0.0-preview2?

I manually built Spark and tested SparkPi; let me run the integration tests and see if anything is different.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 24, 2020

I mean the 2.4.5 RC2 vote. The 2.4.5 RC1 vote happened last week; I tested EKS that time, but it would be great if you could join the party. After merging some correctness fixes, we will start the 2.4.5 RC2 vote soon.

@Jeffwan
Contributor Author

Jeffwan commented Jan 24, 2020

I ran the integration tests locally and noticed the problem. It seems:

+ [[ docker.io/kubespark == gcr.io* ]]
+ /Users/shjiaxin/Github/spark/bin/docker-image-tool.sh -r docker.io/kubespark -t 82565652-8092-4BF5-A804-9FCDC6CCE5AA push
docker.io/kubespark/spark:82565652-8092-4BF5-A804-9FCDC6CCE5AA image not found. Skipping push for this image.
docker.io/kubespark/spark-py:82565652-8092-4BF5-A804-9FCDC6CCE5AA image not found. Skipping push for this image.
docker.io/kubespark/spark-r:82565652-8092-4BF5-A804-9FCDC6CCE5AA image not found. Skipping push for this image.
+ cd -

Notice there's pod failure in my cluster.

kubectl get pods -n c99e1fb3b1b04171baf8f24b0f4a6666
NAME                                              READY   STATUS         RESTARTS   AGE
spark-test-app-2559fdde4ab749dfb45b03a59bdfed68   0/1     ErrImagePull   0          10s

Here are the pod events. The cause is pretty clear: the image wasn't pushed to the registry, so the pod cannot fetch it. But I am not sure whether this is the exact same reason as the 2 failures in CI. Because of the missing container image, all my tests failed; if there were only 2 failures there, it's probably due to some other problem.

  Type     Reason       Age                From                                                  Message
  ----     ------       ----               ----                                                  -------
  Normal   Scheduled    59s                default-scheduler                                     Successfully assigned c99e1fb3b1b04171baf8f24b0f4a6666/spark-test-app-a6037cf7dfa34ed9a500c1f752d82688 to ip-192-168-3-231.us-west-2.compute.internal
  Warning  FailedMount  58s (x2 over 59s)  kubelet, ip-192-168-3-231.us-west-2.compute.internal  MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "longlonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglonglong-1981f16fd6d3f76f-driver-conf-map" not found
  Normal   Pulling      18s (x3 over 57s)  kubelet, ip-192-168-3-231.us-west-2.compute.internal  Pulling image "docker.io/kubespark/spark:82565652-8092-4BF5-A804-9FCDC6CCE5AA"
  Warning  Failed       17s (x3 over 56s)  kubelet, ip-192-168-3-231.us-west-2.compute.internal  Failed to pull image "docker.io/kubespark/spark:82565652-8092-4BF5-A804-9FCDC6CCE5AA": rpc error: code = Unknown desc = Error response from daemon: manifest for kubespark/spark:82565652-8092-4BF5-A804-9FCDC6CCE5AA not found
  Warning  Failed       17s (x3 over 56s)  kubelet, ip-192-168-3-231.us-west-2.compute.internal  Error: ErrImagePull
  Normal   BackOff      5s (x3 over 55s)   kubelet, ip-192-168-3-231.us-west-2.compute.internal  Back-off pulling image "docker.io/kubespark/spark:82565652-8092-4BF5-A804-9FCDC6CCE5AA"
  Warning  Failed       5s (x3 over 55s)   kubelet, ip-192-168-3-231.us-west-2.compute.internal  Error: ImagePullBackOff
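
For the record, the "Skipping push" messages above mean the images were never built for that tag, so a local run needs a build step before the push (tag shown as a placeholder):

  ./bin/docker-image-tool.sh -r docker.io/kubespark -t <tag> build
  ./bin/docker-image-tool.sh -r docker.io/kubespark -t <tag> push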

@Jeffwan
Contributor Author

Jeffwan commented Jan 24, 2020

@dongjoon-hyun BTW, when I run ./dev/dev-run-integration-tests.sh I notice a problem: the base container image's apt package index is stale, and apt install runs into failures:

Step 6/12 : RUN apt install -y python python-pip &&     apt install -y python3 python3-pip &&     rm -r /usr/lib/python*/ensurepip &&     pip install --upgrade pip setuptools &&     rm -r /root/.cache && rm -rf /var/cache/apt/*
 ---> Running in 74256170f862

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  ........
  python2-minimal python2.7 python2.7-dev python2.7-minimal readline-common
  shared-mime-info xdg-user-dirs xz-utils
Suggested packages:
 .....
  python2.7-doc binfmt-support readline-doc
The following NEW packages will be installed:
 ....
  python2.7-minimal readline-common shared-mime-info xdg-user-dirs xz-utils
0 upgraded, 130 newly installed, 0 to remove and 0 not upgraded.
Need to get 112 MB of archives.
After this operation, 367 MB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 perl-modules-5.28 all 5.28.1-6 [2873 kB]
Get:3 http://deb.debian.org/debian buster/main amd64 libgdbm6 amd64 1.18.1-4 [64.7 kB]
Get:4 http://deb.debian.org/debian buster/main amd64 libgdbm-compat4 amd64 1.18.1-4 [44.1 kB]
.....
Get:128 http://deb.debian.org/debian buster/main amd64 python-xdg all 0.25-5 [35.9 kB]
Get:129 http://deb.debian.org/debian buster/main amd64 shared-mime-info amd64 1.10-1 [766 kB]
Get:130 http://deb.debian.org/debian buster/main amd64 xdg-user-dirs amd64 0.17-2 [53.8 kB]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/libpython2.7-minimal_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/python2.7-minimal_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/libpython2.7-stdlib_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/python2.7_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://security-cdn.debian.org/debian-security/pool/updates/main/l/linux/linux-libc-dev_4.19.67-2+deb10u1_amd64.deb  404  Not Found [IP: 151.101.52.204 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/glib2.0/libglib2.0-0_2.58.3-2+deb10u1_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/glib2.0/libglib2.0-data_2.58.3-2+deb10u1_all.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/libpython2.7_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/libpython2.7-dev_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python2.7/python2.7-dev_2.7.16-2_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/p/python-cryptography/python-cryptography_2.6.1-3_amd64.deb  404  Not Found [IP: 151.101.54.133 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Fetched 71.1 MB in 6s (12.3 MB/s)
The command '/bin/sh -c apt install -y python python-pip &&     apt install -y python3 python3-pip &&     rm -r /usr/lib/python*/ensurepip &&     pip install --upgrade pip setuptools &&     rm -r /root/.cache && rm -rf /var/cache/apt/*' returned a non-zero code: 100
Failed to build PySpark Docker image, please refer to Docker build output for details.

Some package URLs are no longer available and the index needs to be refreshed, so I added apt update before apt install in the Python and R Dockerfiles. Do you think it's worth filing a PR to mitigate this issue?
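
The mitigation I mean looks roughly like this in the affected Dockerfiles (a sketch; the remaining RUN steps are abridged from the build output above):

  # Refresh the package index first so apt doesn't fetch .deb files that were
  # removed upstream after the base image's index was generated.
  RUN apt update && \
      apt install -y python python-pip && \
      apt install -y python3 python3-pip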

@SparkQA

SparkQA commented Jan 24, 2020

Test build #117348 has finished for PR 27347 at commit ac50803.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

It's weird. I'll take a look tomorrow~

@SparkQA

SparkQA commented Jan 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22108/

@SparkQA

SparkQA commented Jan 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/22108/

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 24, 2020

Oh, the second run looks good. All tests passed.

KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 27 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @Jeffwan. This is merged for 3.0.0.

@dongjoon-hyun
Member

I added your new Jira ID to the Apache Spark contributor group, too. I'd recommend using a single ID; that will make it easier to find your contributions.

@Jeffwan
Contributor Author

Jeffwan commented Jan 24, 2020

I added your new Jira ID to the Apache Spark contributor group, too. I'd recommend using a single ID; that will make it easier to find your contributions.

I appreciate that, thanks! I will definitely get more involved in the community.
