This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Discuss how to improve Spark-on-K8s developer workflow #253

Open
kimoonkim opened this issue Apr 28, 2017 · 5 comments

Comments

@kimoonkim
Member

@foxish @varunkatta

We had a discussion on how to improve our workflow, in particular how we can shorten (1) the Maven/sbt build time and (2) the Docker image build/push time.

I was chatting a bit more about this with @varunkatta. He suggested building only the k8s module jars and updating an existing distribution with the new jars. @varunkatta Can you comment on how you did this?

It occurred to me that we could extend this idea to Docker image building. We could have a base image that contains the Spark core jars but not the k8s module jars. The driver and executor images would then extend the base image by adding the k8s module jars. Most of the time, when we modify our code, we would only need to rebuild the child Docker images, which should shorten the build/push time.
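Roughly, the layering could look like this (a sketch only; the image names, tags, and paths here are illustrative, not our actual Dockerfiles):

```
# Base image: Spark core jars only. Rebuilt rarely, e.g. when upstream Spark changes.
cat > Dockerfile.base <<'EOF'
FROM openjdk:8-jdk-alpine
COPY jars /opt/spark/jars
EOF
docker build -t spark-base:latest -f Dockerfile.base .

# Child image: only adds the k8s module jar on top of the base.
# This is the image we rebuild and push on every code change.
cat > Dockerfile.driver <<'EOF'
FROM spark-base:latest
COPY spark-kubernetes_2.11-*.jar /opt/spark/jars/
EOF
docker build -t spark-driver:latest -f Dockerfile.driver .
```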

What do you think?

@varunkatta
Member

This is what I did at a high level: I built the Kubernetes Spark core module and replaced the jar in the prebuilt tarred distribution with the freshly built one. Steps are outlined below.

# assumes you are in the top-level project dir
./build/mvn -pl resource-managers/kubernetes/core -Pkubernetes -am package -DskipTests;
# assumes you made a distribution named spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz
tar -zxf spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz;
# copy the freshly built spark-kubernetes jar into the unpacked distribution's jars/ dir
cp resource-managers/kubernetes/core/target/spark-kubernetes_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301/jars;
# repack the distribution with the updated jar
tar -zcf spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301.tgz spark-2.1.0-k8s-0.1.0-SNAPSHOT-bin-k8s-20170301;
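
For reference, `-pl resource-managers/kubernetes/core` restricts the build to the kubernetes core module, `-am` ("also make") additionally builds the modules it depends on, and `-Pkubernetes` activates the kubernetes profile; that is what keeps this incremental build much shorter than a full package of the whole project.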

@varunkatta
Member

I like the idea of having a base Docker image with all the required Spark jars except the spark-kubernetes jar, and incrementally building the new Docker image with the freshly baked spark-kubernetes jar. Curious to see how many seconds it will shave off the current Docker image build time.

@mccheah

mccheah commented May 1, 2017

@kimoonkim or @varunkatta - can either of you ship a PR to use base Docker images? Keep in mind to also update SparkDockerImageBuilder in the integration test package.

One question though is what tag of the base image the child images should depend on. The child dockerfiles reference the parent image by image-name:image-tag, e.g. spark-base:latest or spark-base:2.1. I'm not sure what the right version for the children to depend on is here. Whatever that version is, the SparkDockerImageBuilder in the integration test project needs to tag the base image accordingly.
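
In shell terms, whatever tag we pick, the base image has to be built and tagged before the children are built, something along these lines (a sketch only; the directory layout and tags here are illustrative assumptions):

```
# The base must be tagged with exactly the tag that the child Dockerfiles
# reference in their FROM line (spark-base:latest in this sketch).
docker build -t spark-base:latest dockerfiles/spark-base
docker build -t spark-driver:latest dockerfiles/driver       # FROM spark-base:latest
docker build -t spark-executor:latest dockerfiles/executor   # FROM spark-base:latest
```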

@varunkatta
Member

I will send a PR for this. I am thinking we should use spark-base:latest rather than putting the version number in the tag.

@varunkatta
Member

There is a PR in the Kubernetes ansible project for this: apache-spark-on-k8s/ansible#4

ifilonenko pushed a commit to ifilonenko/spark that referenced this issue Feb 26, 2019
ifilonenko pushed a commit to ifilonenko/spark that referenced this issue Apr 18, 2019