This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Discuss how to improve Spark-on-K8s developer workflow #253

Open
kimoonkim opened this issue Apr 28, 2017 · 5 comments

Comments

@kimoonkim
Member

@foxish @varunkatta

We had a discussion on how to improve our workflow, in particular how we can shorten (1) the Maven/sbt build time and (2) the Docker image build/push time.

I was chatting a bit more about this with @varunkatta. He suggested building only the k8s module jars and updating an existing distribution with the new jars. @varunkatta Can you comment on how you did this?

It occurred to me that we could extend this idea to Docker image building. We could have a base image that contains the Spark core jars but not the k8s module jars. The driver and executor images would then extend the base image by adding the k8s module jars. Most of the time, when we modify our code, we would only need to rebuild the child Docker images, which should shorten the build/push time.
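Roughly, the layering could look like this (a sketch only; the image names, tags, and paths here are illustrative, not our actual Dockerfiles):

```
# Base image: Spark core jars only. Rebuilt rarely, e.g. when upstream Spark changes.
cat > Dockerfile.base <<'EOF'
FROM openjdk:8-jdk-alpine
COPY jars /opt/spark/jars
EOF
docker build -t spark-base:latest -f Dockerfile.base .

# Child image: only adds the k8s module jar on top of the base.
# This is the image we rebuild and push on every code change.
cat > Dockerfile.driver <<'EOF'
FROM spark-base:latest
COPY spark-kubernetes_2.11-*.jar /opt/spark/jars/
EOF
docker build -t spark-driver:latest -f Dockerfile.driver .
```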

What do you think?

@varunkatta
Member

This is what I did at a high level: I built the Kubernetes Spark core module and replaced the jar in the prebuilt tarred distribution with the freshly built one. Steps are outlined below.

# assumes you are in the top-level project dir
./build/mvn -pl resource-managers/kubernetes/core -Pkubernetes -am package -DskipTests;
# assumes you made a distribution named spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz
tar -zxf spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz;
# copy the freshly built spark-kubernetes jar into the unpacked distribution's jars/ dir
cp resource-managers/kubernetes/core/target/spark-kubernetes_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301/jars;
# repack the distribution with the updated jar
tar -zcf spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301;
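
For reference, `-pl resource-managers/kubernetes/core` restricts the build to the kubernetes core module, `-am` ("also make") additionally builds the modules it depends on, and `-Pkubernetes` activates the kubernetes profile; that is what keeps this incremental build much shorter than a full package of the whole project.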

@varunkatta
Member

I like the idea of having a base Docker image with all the required Spark jars except the spark-kubernetes jar, and incrementally building the new Docker image with the freshly baked spark-kubernetes jar. Curious to see how many seconds it will shave off the current Docker image build time.

@mccheah

mccheah commented May 1, 2017

@kimoonkim or @varunkatta - can either of you ship a PR to use base Docker images? Keep in mind to also update SparkDockerImageBuilder in the integration test package.

One question though is what tag of the base image the child images should depend on. The child dockerfiles reference the parent image by image-name:image-tag, e.g. spark-base:latest or spark-base:2.1. I'm not sure what the right version for the children to depend on is here. Whatever that version is, the SparkDockerImageBuilder in the integration test project needs to tag the base image accordingly.
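
In shell terms, whatever tag we pick, the base image has to be built and tagged before the children are built, something along these lines (a sketch only; the directory layout and tags here are illustrative assumptions):

```
# The base must be tagged with exactly the tag that the child Dockerfiles
# reference in their FROM line (spark-base:latest in this sketch).
docker build -t spark-base:latest dockerfiles/spark-base
docker build -t spark-driver:latest dockerfiles/driver       # FROM spark-base:latest
docker build -t spark-executor:latest dockerfiles/executor   # FROM spark-base:latest
```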

@varunkatta
Member

I will send a PR for this. I am thinking we should use spark-base:latest rather than putting the version number in the tag.

@varunkatta
Member

There is a PR in the Kubernetes ansible project for this: apache-spark-on-k8s/ansible#4

ifilonenko pushed a commit to ifilonenko/spark that referenced this issue Feb 26, 2019
ifilonenko pushed a commit to ifilonenko/spark that referenced this issue Apr 18, 2019