Discuss how to improve Spark-on-K8s developer workflow #253
This is what I did at a high level: built the Spark Kubernetes module, then replaced the corresponding jar inside the prebuilt tar.gz distribution with the freshly built one. Steps are outlined below.
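A rough sketch of what those steps might look like (the module path, jar name, and distribution directory below are assumptions for illustration, not the exact commands from this thread):

```sh
# Build only the Kubernetes module (plus its in-project dependencies), skipping tests.
# The module path is an assumption about the fork's layout.
./build/mvn -pl resource-managers/kubernetes/core -am -DskipTests package

# Overwrite the jar shipped in the prebuilt distribution with the fresh build.
# Jar name and distribution path are placeholders.
cp resource-managers/kubernetes/core/target/spark-kubernetes_2.11-*.jar \
   /path/to/spark-dist/jars/
```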
I like the idea of having a base docker image with all the required Spark jars except the spark-kubernetes jar, and incrementally building the new docker image with the freshly built spark-kubernetes jar. Curious to see how many seconds it will shave off the current docker image build time.
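As a rough illustration of the incremental build (image names, jar names, and paths are placeholders, not existing artifacts):

```sh
# Write a minimal child Dockerfile that layers only the rebuilt jar on top of a
# base image that already contains every other Spark jar.
cat > Dockerfile.driver <<'EOF'
FROM spark-base:latest
COPY jars/spark-kubernetes_2.11-*.jar /opt/spark/jars/
EOF

# Only this thin layer is rebuilt after a code change; the large base layer stays cached.
docker build -t spark-driver:dev -f Dockerfile.driver .
```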
@kimoonkim or @varunkatta - can either of you ship a PR to use base docker images? Keep in mind to also update […]. One question, though, is what tag of the base image the child images should depend on; the child dockerfiles reference the parent image by its tag.
I will send a PR for this. I am thinking we should use spark-base:latest rather than putting the version number in the tag.
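A sketch of how that tagging could look (Dockerfile locations and image names are assumptions, not the paths from the eventual PR):

```sh
# Build and tag the base image once; child images reference it by the fixed
# "latest" tag, so their Dockerfiles don't need editing on every version bump.
docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile .

# Rebuild the children whenever the k8s jars change; they start from spark-base:latest.
docker build -t spark-driver:latest   -f dockerfiles/driver/Dockerfile .
docker build -t spark-executor:latest -f dockerfiles/executor/Dockerfile .
```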
There is a PR in the Kubernetes ansible project for this: apache-spark-on-k8s/ansible#4
@foxish @varunkatta
We had a discussion on how to improve our workflow. In particular, how we can shorten (1) the maven/sbt build time and (2) the docker image build/push time.
I was chatting a bit more on this with @varunkatta. He suggested one could build only the k8s module jars and update a distribution with the new jars. @varunkatta, can you comment on how you did this?
It occurred to me that we could extend this idea to docker image building. We could have a base image that has the Spark core jars, but not the k8s module jars. The driver and executor images would then extend the base image by adding the k8s module jars. Most of the time, when we modify our code, we would only need to rebuild the child docker images. Hopefully, this will shorten the build/push time; a sketch of the resulting inner loop is below.
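To make the proposal concrete, here is a sketch of the inner loop it would enable (the module path, Dockerfile locations, and $REGISTRY are assumptions, not existing scripts):

```sh
# Hypothetical inner loop after editing k8s module code:
./build/mvn -pl resource-managers/kubernetes/core -am -DskipTests package
docker build -t "$REGISTRY/spark-driver:dev"   -f dockerfiles/driver/Dockerfile .
docker build -t "$REGISTRY/spark-executor:dev" -f dockerfiles/executor/Dockerfile .
docker push "$REGISTRY/spark-driver:dev"
docker push "$REGISTRY/spark-executor:dev"
# The base image holding the core Spark jars is rebuilt only when core Spark itself changes.
```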
What do you think?