
Allow "spark.files" to be shipped through secrets or configmaps #393

Closed
mccheah opened this issue Jul 26, 2017 · 36 comments

@mccheah

mccheah commented Jul 26, 2017

Many applications will include all of their binaries in Docker images, but will require setting configurations dynamically upon their submission. There is a separate discussion to have about allowing arbitrary secret and configmap mounts to be pushed onto the driver and executor pods. However, application deployment strategies that are being ported over from YARN, Mesos, or Standalone mode will expect these files to be easily provided through spark.files.

Currently, Spark applications need to submit their local files through the resource staging server. Given the use case described above, however, it would be more convenient if application submitters did not need to ship their configuration files through the resource staging server. This is further supported by the general impression in the Spark community that data shipped through spark.files is intended to be small.

The proposed scheme to consider all of these factors is as follows:

  1. If a resource staging server URI is provided, ship all dependencies through the resource staging server.
  2. If a resource staging server is not provided, then prevent local jars from being submitted. We make this simplification because we assume that jars are generally larger than files, and that more jars than files are typically sent per submission.
  3. If a resource staging server is not provided, the size of each local file sent in spark.files is examined. We provide a configuration option called spark.kubernetes.files.maxSize (there's probably a better name for this to denote that we're submitting through a Kubernetes secret). If any file exceeds the max size then we fail the submission. The maximum size has a reasonable default that ensures that users do not accidentally overload etcd, but it can be adjusted if the submitter is aware of the potential consequences.
  4. Assuming every file passes the check described in (3), the files are pushed into a Kubernetes secret where every secret key-value pair corresponds to a single submitted file. We use secrets instead of ConfigMaps here in order to support binary data and not just textual files, although the files need not be sensitive data by any means.
  5. The main container for the driver and executor, prior to starting the JVM, copies the files from the secret mount point into the working directory. This is to satisfy the contract that files sent through spark.files are in the working directory of the driver and executors, but it has to be through a copy since secret mounts cannot be in the working directory of the container itself.

Again - this is strictly independent of the discussion on how custom volume mounts can be provided for the containers. This is a simpler scheme that basically makes spark.files easier to manage in the absence of a resource staging server. More complex use cases that require arbitrary mount points for arbitrary volume types should use something like a pod preset.
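
To make step (4) concrete, here is a minimal sketch of building such a secret with the fabric8 Kubernetes client; the object and method names are illustrative, not actual project code:

import java.nio.file.{Files, Paths}
import java.util.Base64

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{Secret, SecretBuilder}

object SubmittedFilesSecretSketch {

  // Steps (3) and (4): fail fast on any file over the configured maximum,
  // then bundle every file from spark.files into one secret, with one
  // key-value pair per file.
  def buildFilesSecret(
      secretName: String,
      filePaths: Seq[String],
      maxFileSizeBytes: Long): Secret = {
    val encodedFiles = filePaths.map { path =>
      val file = Paths.get(path)
      val sizeBytes = Files.size(file)
      require(sizeBytes <= maxFileSizeBytes,
        s"$path is $sizeBytes bytes, over the limit of $maxFileSizeBytes" +
          " bytes; ship it through the resource staging server instead.")
      // Secret values are base64-encoded, so binary files work, not just text.
      file.getFileName.toString ->
        Base64.getEncoder.encodeToString(Files.readAllBytes(file))
    }.toMap
    new SecretBuilder()
      .withNewMetadata().withName(secretName).endMetadata()
      .withData(encodedFiles.asJava)
      .build()
  }
}

For step (5), the container entrypoint would then copy the decoded files from the secret mount into the working directory before starting the JVM - something like cp -R /mnt/secrets/spark-files/. . (the mount path here is illustrative).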

@mccheah
Author

mccheah commented Jul 26, 2017

I'm not sure how large we can expect Python files to be relative to jars - does it make sense for Python files to require the resource staging server as well? @ifilonenko

@erikerlandson
Member

IMO, Python files should use the resource staging server. They may be zip files, so I assume the zips might be "large" in some cases.

I don't know how many people use this feature, but it is possible to include .pyc files in a Maven jar artifact, and Spark knows how to find and use those:
https://spark-packages.org/artifact-help

@mccheah
Author

mccheah commented Jul 26, 2017

The size checks would catch the big zips, but I agree that it is consistent to ship application "binaries" through the resource staging server.

@ash211

ash211 commented Jul 26, 2017

Further context around this change is that we are in the process of moving our applications from using the RSS to distribute jars and files to baking as much as possible into the Docker images themselves. This improves both the performance and the immutability of application launches.

As of now, we have applications running successfully with jars baked into the images rather than distributed through spark.jars, but the RSS is still needed to distribute the files. Those files are a logback.xml file that is configured at launch time, and a truststore that includes trusted TLS certificates for that particular launch environment. On a recent run the logback.xml file was 377 bytes and the truststore.jks was 119328 bytes.

It's quite heavyweight to have an RSS for those two tiny files.

In our observations, the init containers specifically add significantly to the startup time of both the driver and the executor.

Here's an observation where the overhead from the init container was 14 seconds:

kubectl describe pod -n $NS $POD
<snip>
  FirstSeen	LastSeen	Count	From					SubObjectPath					Type		Reason		Message
  ---------	--------	-----	----					-------------					--------	------		-------
<snip>
  47s		47s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Pulled		Container image "<snip>other-init:<snip>" already present on machine
  46s		46s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Created		Created container with id 5a0e67abc90aa37501880f590cc67a54de38d135371b4615bf58faf49fd34427
  46s		46s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Started		Started container with id 5a0e67abc90aa37501880f590cc67a54de38d135371b4615bf58faf49fd34427
  42s		42s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Pulled		Container image "<snip>spark/init-container:<snip>" already present on machine
  41s		41s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Created		Created container with id 2619e8047d64bc4a679dd46ee694e547875033cce55443278335f2e909e3691d
  41s		41s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Started		Started container with id 2619e8047d64bc4a679dd46ee694e547875033cce55443278335f2e909e3691d
  28s		28s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Pulled		Container image "<snip>spark-driver:<snip>" already present on machine
  27s		27s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Created		Created container with id 7fda530a4d9e93766a2d320998433a4522ae12deefb92439dedc4970e588ba44
  26s		26s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Started		Started container with id 7fda530a4d9e93766a2d320998433a4522ae12deefb92439dedc4970e588ba44

with logs from the spark-init container:

2017-07-26 22:58:55 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting init-container to download Spark application dependencies.
2017-07-26 22:58:57 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download files from resource staging server...
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download jars from resource staging server...
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading jars from resource staging server.
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading files from resource staging server.
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading application dependencies.
2017-07-26 22:59:00 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-07-26 22:59:00 INFO  ShutdownHookManager:54 - Deleting directory /tmp/temp-trustStore-fa9e7aaf-a81d-496a-a46c-52510d137dba

From my read of this, the total time the spark-init container takes (measuring from "image pulled" message of spark-init vs spark-driver) is 14 seconds.

Of that 14 seconds, the time spent actually doing work (downloading the two files and the zero jars) is sub-second: the KubernetesSparkDependencyDownloadInitContainer:54 lines are all in the same second.

Not only that, this 14 seconds is spent twice -- once for the driver, and once for the executors (in parallel).

So I believe we could cut about 30 seconds off the launch time of Spark on Kubernetes for our application by eliminating the RSS init containers. I suspect that creating a configmap/secret out of the files and letting Kubernetes place them on the pod will take much less than 30 seconds.

@ifilonenko
Member

+1. I agree with @erikerlandson that Python submission files could vary in size.

@erikerlandson
Member

@ash211 I'm not sure if this is what you mean by "heavyweight", but I noticed that the RSS image is based on spark-base, which means it has the entire Spark distribution on it. I was wondering whether that is necessary, since it doesn't actually run Spark per se.
cc @mccheah

@erikerlandson
Member

erikerlandson commented Jul 27, 2017

Examining the secret restrictions.

The maximum size of a secret is 1MB, which seems somewhat restrictive even for "small" files, but maybe that satisfies community standards. This comment from the docs seems like a potential issue down the road, in terms of limiting the total data being pushed this way:

However, creation of many smaller secrets could also exhaust memory. More comprehensive limits on memory usage due to secrets is a planned feature.

Also, they apparently have to be created before any pod that uses them, which seems doable but is a logistical requirement for the submission logic.

A missing secret prevents pod startup. It sounds like this might show up to the pod watcher as a particular failure mode.

@mccheah
Author

mccheah commented Jul 27, 2017

The plan is not to create many secrets but to create one secret that has a secret key-value pair for each added file. Is the restriction on secrets applicable for each secret key or for the entire secret bundle? How about for ConfigMaps - does the ConfigMap have a similar size restriction?

Regarding creation before or after - we can create the single secret in the otherKubernetesResources list in the driver spec as we do for everything else the driver pod depends on.

@ash211

ash211 commented Jul 27, 2017

@erikerlandson when I say the RSS is heavyweight, I mean the interacting with it more than the launching of it. We could probably trim down both the RSS pod and the init container, but I suspect that eliminating the need to interact with the RSS entirely for jobs that have no jars and only "small" files (which are the majority of what we run) would have a much higher impact on performance.

@mccheah
Author

mccheah commented Jul 27, 2017

If we need a single secret bundle to be < 1MB then we'll run into problems with larger numbers of spark.files or even single big entries in spark.files. We might have to compress them before putting them in the secret and decompress them on the driver, but I fear that's also going to be heavyweight anyway - though probably still faster than running an init-container.

@mccheah
Author

mccheah commented Jul 27, 2017

Decompression may also make the docker image commands more complex than we would perhaps like.

@erikerlandson
Member

My interpretation of the doc wording is that the total size of all k/v pairs in a single secret must be <= 1MB, although the wording is a bit vague on that distinction. Maybe @foxish can clarify. At this point it's hypothetical, but the idea that they might also somehow limit the total data across all secrets seems like it could cause problems if it becomes a policy.

@mccheah
Author

mccheah commented Jul 27, 2017

I would imagine it's a restriction on the total size, since it corresponds to the limit on a single entry in etcd. I don't think we want to compress for the user - the user should handle compression themselves or else just use the resource staging server for bigger files. For numerous small files we can try to group them into ~1MB "blocks" and create a single bundle for every 1MB group, but again we need to guard against creating too many secrets this way. I suggest therefore also having an upper limit on the total size of all files submitted this way before recommending the resource staging server - something on the order of 5-10MB.

I think even 1MB is sufficient for many use cases where only configuration is being sent to the application. In those cases the configuration files are just text files which would be on the order of KB at most.
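
A quick sketch of the grouping idea, assuming files are packed in order into bundles under a per-secret cap; groupIntoBundles and its inputs are hypothetical names, and any single file over the cap is assumed to have already failed the per-file size check:

import scala.collection.mutable.ListBuffer

// Pack (path, sizeBytes) pairs sequentially into bundles whose total size
// stays under capBytes; each resulting bundle would become one secret.
def groupIntoBundles(
    files: Seq[(String, Long)],
    capBytes: Long): Seq[Seq[String]] = {
  val bundles = ListBuffer.empty[Seq[String]]
  var current = ListBuffer.empty[String]
  var currentBytes = 0L
  files.foreach { case (path, sizeBytes) =>
    // Start a new bundle when adding this file would exceed the cap.
    if (current.nonEmpty && currentBytes + sizeBytes > capBytes) {
      bundles += current.toSeq
      current = ListBuffer.empty[String]
      currentBytes = 0L
    }
    current += path
    currentBytes += sizeBytes
  }
  if (current.nonEmpty) bundles += current.toSeq
  bundles.toSeq
}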

@erikerlandson
Member

The size limit is the same for both secrets and configmaps, and is driven by etcd limits:
kubernetes/kubernetes#19781

@erikerlandson
Member

If the primary use case is lightweight data like config, then this should still be helpful. The RSS provides the fallback for anything that exceeds the limit.

@ash211

ash211 commented Jul 27, 2017

I like using k8s configmap/secret for distributing "small" files, and RSS for distributing "large" files.

@mccheah
Author

mccheah commented Jul 27, 2017

Should we then just gate at a total of 1MB? That would be the easiest implementation.

@erikerlandson
Member

Makes sense to me. If there is enough demand, fancier options like compression can always be added later.

@mccheah
Author

mccheah commented Jul 27, 2017

If we gate at a total of 1MB then theoretically we still can't handle all "only small files" cases, because we can't handle, say, 100 small files whose combined size exceeds the limit. But maybe that use case is also justifiably one where we enforce the usage of the resource staging server anyway.

@erikerlandson
Member

I think the simple design should hit the 80/20 case.

@ash211

ash211 commented Jul 27, 2017

So the logic could be:

  • if sum(files) < 1MB
    • distribute files via CM/secret
  • else
    • require RSS, and distribute via RSS
  • if any(jars)
    • require RSS, and distribute via RSS
  • if RSS in use:
    • submit files/jars to RSS
    • apply spark-init containers to pods to download those files/jars
  • else:
    • no spark-init containers on pods
    • files retrieved from cm/secret

That way if you have only small spark.files, and you have no spark.jars (because e.g. they're already baked into docker images), then you don't need an RSS at all.
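
A sketch of that dispatch logic, assuming the 1MB total cap discussed above; all names here (DistributionPlanner, planDistribution, etc.) are illustrative rather than real configuration or code from this project:

object DistributionPlanner {

  sealed trait DistributionPlan
  case object FilesViaSecret extends DistributionPlan
  case object ViaResourceStagingServer extends DistributionPlan

  // 1MB cap on the total size of spark.files, per the etcd-driven limit.
  val MaxSecretFilesBytes: Long = 1024L * 1024L

  def planDistribution(
      fileSizesBytes: Seq[Long],
      localJars: Seq[String],
      stagingServerUri: Option[String]): DistributionPlan = {
    val needsStagingServer =
      localJars.nonEmpty || fileSizesBytes.sum >= MaxSecretFilesBytes
    if (needsStagingServer) {
      require(stagingServerUri.isDefined,
        "Local jars, or files totaling 1MB or more, require a resource" +
          " staging server.")
      // spark-init containers are attached to download from the RSS.
      ViaResourceStagingServer
    } else {
      // No init-containers; files reach the pods via a configmap/secret mount.
      FilesViaSecret
    }
  }
}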

@ash211

ash211 commented Jul 27, 2017

In our particular case we are distributing two files, sized 377 and 119328 bytes, and are working to eliminate the need to distribute the larger file. So our use case falls into the working 80 side of the 80/20 case.

@foxish
Member

foxish commented Jul 27, 2017

Storing lots of files that are close to the limit (900k/1MB) is going to cause cluster performance issues; we'd probably be hitting etcd's upper limits at that point, and we don't want to affect the cluster's operations because of too many running jobs. Secrets may additionally be encrypted in etcd, making them a less than ideal choice for shipping config.

It does seem like the truststore should be shipped in a secret, and that could probably leverage the mechanism that Yinan and I are designing for referencing and mounting secrets. However, for general config files, that appears less than ideal. The way the RSS is designed, we should have long-lived instances storing files in it.

The underlying problem here seems to be the init-container startup time. If that's taking 14 seconds, that's something we should be trying to fix so that it doesn't take that long, and we can prioritize that. Can you guys provide the kubelet log from one of the nodes for this? I'll also try a local repro.

@liyinan926
Member

Agreed with @foxish. Using secrets to ship file dependencies seems less than ideal. Another concern that I think is worth calling out is that secrets will stay around unless explicitly deleted. This needs to be taken care of by the driver, as the submission client is gone once it's done with submission. So the driver needs to remember which files in the set of spark.files were put into secrets, along with the corresponding secret names, and needs to guarantee that they are deleted upon job completion.

@foxish
Member

foxish commented Jul 27, 2017

The thing that feels missing, before we can reliably allow people to ship things and utilize etcd with their own "files", is a feedback loop/monitoring/rate limiting that stops them from impacting cluster operations. Without that, it's more likely that they stop the cluster from functioning correctly. However, I do think init-containers are not fulfilling their purpose if they add that much overhead, and we should address that upstream.

@mccheah
Author

mccheah commented Jul 27, 2017

Another concern that I think is worth calling out is that secrets will stay around unless explicitly deleted. This needs to be taken care of by the driver, as the submission client is gone once it's done with submission.

This can be achieved just by using owner references, as we do with the other resources the driver depends on.
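
For illustration, a minimal sketch of that with the fabric8 client, marking the driver pod as the owner of the files secret so Kubernetes garbage-collects the secret when the pod is deleted (the helper name is hypothetical):

import java.util.Collections

import io.fabric8.kubernetes.api.model.{OwnerReferenceBuilder, Pod, Secret}

// Point an owner reference at the driver pod so the secret's lifetime is
// tied to the pod's: deleting the driver pod garbage-collects the secret.
def addDriverOwnerReference(driverPod: Pod, filesSecret: Secret): Unit = {
  val driverReference = new OwnerReferenceBuilder()
    .withApiVersion(driverPod.getApiVersion)
    .withKind(driverPod.getKind)
    .withName(driverPod.getMetadata.getName)
    .withUid(driverPod.getMetadata.getUid)
    .withController(true)
    .build()
  filesSecret.getMetadata.setOwnerReferences(
    Collections.singletonList(driverReference))
}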

It does seem like the truststore should be shipped in a secret, and that could probably leverage the mechanism that Yinan and I are designing for referencing and mounting secrets. However, for general config files, that appears less than ideal. The way the RSS is designed, we should have long-lived instances storing files in it.

We could put the files in config maps by base64-encoding the files and decoding them before launching the driver process in the driver Docker image. This assumes that all content sent in spark.files isn't sensitive, but I suppose we have a separate discussion about secret volume mounts.

Storing lots of files which are close to the limit(900k/1M) is going to cause cluster performance issues...

We can make the max size of the file bundle smaller if necessary.

@foxish
Member

foxish commented Jul 27, 2017

The secrets mounting mechanism should let us reference arbitrary secrets and mounts for them within the pods. Does that alleviate your problem? That has similar risks, but is explicit in that the user knows they're creating secrets, and if they choose to store files/config in secrets, that is transparent. I'm not sure we want to implicitly do that for the user however, as part of spark-submit.

@mccheah
Author

mccheah commented Jul 27, 2017

The problem is that we want to port our application over from YARN submission which expects the files to be sent via spark.files. We would like to be able to use the same configuration option as-is without having to change our application, and I suspect this will be the case for other YARN to K8s migrations.

@erikerlandson
Member

In theory, the purely feature-parity issue of "supporting spark.files" could be met using the RSS under the hood - just take the files listed in spark.files and stage them through the RSS. OTOH, one drawback is that you'd lose the advantage of having a mechanism that doesn't require the RSS.

The init container performance issue is a separate thing. No doubt speeding it up benefits the entire community. Is that something that would have to work its way downstream as a core kube enhancement?

@erikerlandson
Member

@foxish, does secrets mounting also include configmap mounting? Is it about allowing a user to create them and then instructing them to be mounted on the driver pod?

@foxish
Member

foxish commented Jul 27, 2017

We didn't plan for configmap mounts so far, but the same mechanism (as the one for secrets in #397) could be generalized, if needed. Yeah, a user could create them prior to launching the job, and just reference them in spark config, and specify a mount-path for them.

@foxish
Member

foxish commented Jul 27, 2017

So, I just tested on a 1.6.4 cluster. The general performance of init-containers appears to be fine. The kubelet will check the docker container status every 1 second and post that back to the APIServer.

Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath				Type	Reason		Message
  ---------	--------	-----	----							-------------				--------	------		-------
  10s		10s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice}	Normal		Created		Created container with id 0b191e5104b85221d72f57c449cc3161bed43fa43c769c1fc0117e9762755a01
  10s		10s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice}	Normal		Started		Started container with id 0b191e5104b85221d72f57c449cc3161bed43fa43c769c1fc0117e9762755a01
  8s		8s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice2}	Normal		Created		Created container with id 8987a35639cd2514ccdc52d8eec4b4062d35cd4a6726e27fb2242c222e768f69
  8s		8s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice2}	Normal		Started		Started container with id 8987a35639cd2514ccdc52d8eec4b4062d35cd4a6726e27fb2242c222e768f69
  7s		7s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice3}	Normal		Created		Created container with id b3334441e403b55c5cf67ad387f8e77db6a652ccfc50621cdc5ea97558c66f45
  7s		7s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice3}	Normal		Started		Started container with id b3334441e403b55c5cf67ad387f8e77db6a652ccfc50621cdc5ea97558c66f45
  6s		6s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.containers{myapp-container}	Normal		Created		Created container with id 05e91d159f83dd59808301786f13ae5f13777444f964696cb311d08d4790365f
  6s		6s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.containers{myapp-container}	Normal		Started		Started container with id 05e91d159f83dd59808301786f13ae5f13777444f964696cb311d08d4790365f

So, chaining multiple init-containers may not be the problem here. We should investigate more into what's causing the delays. Maybe we could make the init container that fetches resources lighter weight?

@ash211

ash211 commented Jul 27, 2017

@foxish I don't think it's general init-container overhead that's causing the 14 seconds we're observing -- it's the specific spark-init container, which 1) is fairly large in byte weight, 2) has a lot of jars for the JVM to parse through, and 3) has a heavyweight service start process (my observation above was that it takes 4 seconds from the first log line, Starting init-container to download Spark application dependencies., to starting the jar download).

How long do you see the spark-init container running on your cluster?

@foxish
Member

foxish commented Jul 27, 2017

I see. I'll run a couple of experiments and report back. But if it is the spark-init container, should we look at profiling and optimizing that instead? For the function that it performs, the overhead seems disproportionate.

@mccheah
Author

mccheah commented Jul 27, 2017

I think it's fine to also implement the workaround for the specific situation where users porting over from YARN have only very small files: they don't need to stand up the resource staging server if they put their dependencies in Docker images. If we made the file size limit something like 10kb by default then we should be fine.

@foxish
Member

foxish commented Jul 31, 2017

A 10k limit and making this behavior opt-in sounds reasonable to me for now. It's not ideal to use secrets, and I expect we'll use configmaps after kubernetes/kubernetes#32432 is resolved.
