
Allow "spark.files" to be shipped through secrets or configmaps #393

Closed
mccheah opened this issue Jul 26, 2017 · 36 comments

@mccheah

mccheah commented Jul 26, 2017

Many applications will include all of their binaries in Docker images, but will require setting configurations dynamically upon their submission. There is a separate discussion to have about allowing arbitrary secret and configmap mounts to be pushed onto the driver and executor pods. However, application deployment strategies that are being ported over from YARN, Mesos, or Standalone mode will expect these files to be easily provided through spark.files.

Currently, Spark applications need to submit their local files through the resource staging server. Given the use case described above, however, it would be more convenient if application submitters did not need to ship their configuration files through the resource staging server. This is further supported by the general impression in the Spark community that data shipped through spark.files is intended to be small.

The proposed scheme to consider all of these factors is as follows:

  1. If a resource staging server URI is provided, ship all dependencies through the resource staging server.
  2. If a resource staging server is not provided, then prevent local jars from being submitted. We make this simplification because we assume that jars are generally larger than files, and that more jars than files are typically sent per submission.
  3. If a resource staging server is not provided, the size of each local file sent in spark.files is examined. We provide a configuration option called spark.kubernetes.files.maxSize (there's probably a better name for this to denote that we're submitting through a Kubernetes secret). If any file exceeds the max size then we fail the submission. The maximum size has a reasonable default that ensures that users do not accidentally overload etcd, but it can be adjusted if the submitter is aware of the potential consequences.
  4. Assuming every file passes the check described in (3), the files are pushed into a Kubernetes secret where every secret key-value pair corresponds to a single submitted file. We use secrets instead of ConfigMaps here in order to support binary data and not just textual files, although the files need not be sensitive data by any means.
  5. The main container for the driver and executor, prior to starting the JVM, copies the files from the secret mount point into the working directory. This is to satisfy the contract that files sent through spark.files are in the working directory of the driver and executors, but it has to be through a copy since secret mounts cannot be in the working directory of the container itself.

Again - this is strictly independent of the discussion on how custom volume mounts can be provided for the containers. This is a simpler scheme that basically makes spark.files easier to manage in the absence of a resource staging server. More complex use cases that require arbitrary mount points for arbitrary volume types should use something like a pod preset.
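
To make step (4) concrete, here is a minimal sketch of building such a secret with the fabric8 Kubernetes client; the object and method names are illustrative, not actual project code:

import java.nio.file.{Files, Paths}
import java.util.Base64

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{Secret, SecretBuilder}

object SubmittedFilesSecretSketch {

  // Steps (3) and (4): fail fast on any file over the configured maximum,
  // then bundle every file from spark.files into one secret, with one
  // key-value pair per file.
  def buildFilesSecret(
      secretName: String,
      filePaths: Seq[String],
      maxFileSizeBytes: Long): Secret = {
    val encodedFiles = filePaths.map { path =>
      val file = Paths.get(path)
      val sizeBytes = Files.size(file)
      require(sizeBytes <= maxFileSizeBytes,
        s"$path is $sizeBytes bytes, over the limit of $maxFileSizeBytes" +
          " bytes; ship it through the resource staging server instead.")
      // Secret values are base64-encoded, so binary files work, not just text.
      file.getFileName.toString ->
        Base64.getEncoder.encodeToString(Files.readAllBytes(file))
    }.toMap
    new SecretBuilder()
      .withNewMetadata().withName(secretName).endMetadata()
      .withData(encodedFiles.asJava)
      .build()
  }
}

For step (5), the container entrypoint would then copy the decoded files from the secret mount into the working directory before starting the JVM - something like cp -R /mnt/secrets/spark-files/. . (the mount path here is illustrative).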

@mccheah
Author

mccheah commented Jul 26, 2017

I'm not sure how large we can expect Python files to be relative to jars - does it make sense for Python files to require the resource staging server as well? @ifilonenko

@erikerlandson
Member

IMO, Python files should use the resource staging server. They may be zip files, so I assume the zips might be "large" in some cases.

I don't know how many people use this feature, but it is possible to include .pyc files in a Maven jar artifact, and Spark knows how to find and use those:
https://spark-packages.org/artifact-help

@mccheah
Author

mccheah commented Jul 26, 2017

The size checks would catch the big zips, but I agree that it is consistent to ship application "binaries" through the resource staging server.

@ash211

ash211 commented Jul 26, 2017

Further context around this change is that we are in the process of moving our applications from using the RSS to distribute jars and files to baking as much as possible into the Docker images themselves. This improves both the performance and the immutability of application launches.

As of now, we have applications running successfully with jars baked into the images rather than distributed through spark.jars, but the RSS is still needed to distribute the files. Those files are a logback.xml file that is configured at launch time, and a truststore that includes trusted TLS certificates for that particular launch environment. On a recent run the logback.xml file was 377 bytes and the truststore.jks was 119328 bytes.

It's quite heavyweight to have an RSS for those two tiny files.

In our observations, the init containers specifically add significantly to the startup time of both the driver and the executor.

Here's an observation where the overhead from the init container was 14 seconds:

kubectl describe pod -n $NS $POD
<snip>
  FirstSeen	LastSeen	Count	From					SubObjectPath					Type		Reason		Message
  ---------	--------	-----	----					-------------					--------	------		-------
<snip>
  47s		47s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Pulled		Container image "<snip>other-init:<snip>" already present on machine
  46s		46s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Created		Created container with id 5a0e67abc90aa37501880f590cc67a54de38d135371b4615bf58faf49fd34427
  46s		46s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{other-init-container}		Normal		Started		Started container with id 5a0e67abc90aa37501880f590cc67a54de38d135371b4615bf58faf49fd34427
  42s		42s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Pulled		Container image "<snip>spark/init-container:<snip>" already present on machine
  41s		41s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Created		Created container with id 2619e8047d64bc4a679dd46ee694e547875033cce55443278335f2e909e3691d
  41s		41s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.initContainers{spark-init}			Normal		Started		Started container with id 2619e8047d64bc4a679dd46ee694e547875033cce55443278335f2e909e3691d
  28s		28s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Pulled		Container image "<snip>spark-driver:<snip>" already present on machine
  27s		27s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Created		Created container with id 7fda530a4d9e93766a2d320998433a4522ae12deefb92439dedc4970e588ba44
  26s		26s		1	kubelet, ip-10-0-10-178.ec2.internal	spec.containers{spark-kubernetes-driver}	Normal		Started		Started container with id 7fda530a4d9e93766a2d320998433a4522ae12deefb92439dedc4970e588ba44

with logs from the spark-init container:

2017-07-26 22:58:55 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting init-container to download Spark application dependencies.
2017-07-26 22:58:57 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls to: root
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing view acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-07-26 22:58:57 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download files from resource staging server...
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Starting to download jars from resource staging server...
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading jars from resource staging server.
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading files from resource staging server.
2017-07-26 22:58:59 INFO  KubernetesSparkDependencyDownloadInitContainer:54 - Finished downloading application dependencies.
2017-07-26 22:59:00 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-07-26 22:59:00 INFO  ShutdownHookManager:54 - Deleting directory /tmp/temp-trustStore-fa9e7aaf-a81d-496a-a46c-52510d137dba

From my read of this, the total time the spark-init container takes (measuring from "image pulled" message of spark-init vs spark-driver) is 14 seconds.

Of that 14 seconds, the time spent actually doing work (downloading the two files and the zero jars) is sub-second: the KubernetesSparkDependencyDownloadInitContainer:54 lines are all in the same second.

Not only that, this 14 seconds is spent twice -- once for the driver, and once for the executors (in parallel).

So I believe we could cut about 30 seconds off the launch time of Spark on Kubernetes for our application by eliminating the RSS init containers. I suspect that creating a configmap/secret out of the files and letting Kubernetes place them on the pod will take much less than 30 seconds.

@ifilonenko
Member

+1. I agree with @erikerlandson that Python submission files could vary in size.

@erikerlandson
Member

@ash211 I'm not sure if this is what you mean by "heavyweight", but I noticed that the RSS image is based on spark-base, which means it has the entire Spark distribution on it. I was wondering whether that is necessary, since it doesn't actually run Spark per se.
cc @mccheah

@erikerlandson
Member

erikerlandson commented Jul 27, 2017

Examining the secret restrictions.

The maximum size of a secret is 1MB, which seems somewhat restrictive even for "small" files, but maybe that satisfies community standards. This comment from the docs seems like a potential issue down the road, in terms of limiting the total data being pushed this way:

However, creation of many smaller secrets could also exhaust memory. More comprehensive limits on memory usage due to secrets is a planned feature.

Also, they apparently have to be created before any pod that uses them, which seems doable but is a logistical requirement for the submission logic.

A missing secret prevents pod startup. It sounds like this might show up to the pod watcher as a particular failure mode.

@mccheah
Author

mccheah commented Jul 27, 2017

The plan is not to create many secrets but to create one secret that has a secret key-value pair for each added file. Is the restriction on secrets applicable for each secret key or for the entire secret bundle? How about for ConfigMaps - does the ConfigMap have a similar size restriction?

Regarding creation before or after - we can create the single secret in the otherKubernetesResources list in the driver spec as we do for everything else the driver pod depends on.

@ash211

ash211 commented Jul 27, 2017

@erikerlandson when I say the RSS is heavyweight, I mean the interacting with it more than the launching of it. We could probably trim down both the RSS pod and the init container, but I suspect that eliminating the need to interact with the RSS entirely for jobs that have no jars and only "small" files (which are the majority of what we run) would have a much higher impact on performance.

@mccheah
Author

mccheah commented Jul 27, 2017

If we need a single secret bundle to be < 1MB then we'll run into problems with larger numbers of spark.files or even single big entries in spark.files. We might have to compress them before putting them in the secret and decompress them on the driver, but I fear that's also going to be heavyweight anyway - though probably still faster than running an init-container.

@mccheah
Author

mccheah commented Jul 27, 2017

Decompression may also make the docker image commands more complex than we would perhaps like.

@erikerlandson
Member

My interpretation of the doc wording is that the total size of all k/v pairs in a single secret must be <= 1MB, although the wording is a bit vague on that distinction. Maybe @foxish can clarify. At this point it's hypothetical, but the idea that they might also somehow limit the total data across all secrets seems like it could cause problems if it becomes a policy.

@mccheah
Author

mccheah commented Jul 27, 2017

I would imagine it's a restriction on the total size, since it corresponds to the limit on a single entry in etcd. I don't think we want to compress for the user - the user should handle compression themselves or else just use the resource staging server for bigger files. For numerous small files we can try to group them into ~1MB "blocks" and create a single bundle for every 1MB group, but again we need to guard against creating too many secrets this way. I suggest therefore also having an upper limit on the total size of all files submitted this way before recommending the resource staging server - something on the order of 5-10MB.

I think even 1MB is sufficient for many use cases where only configuration is being sent to the application. In those cases the configuration files are just text files which would be on the order of KB at most.
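
A quick sketch of the grouping idea, assuming files are packed in order into bundles under a per-secret cap; groupIntoBundles and its inputs are hypothetical names, and any single file over the cap is assumed to have already failed the per-file size check:

import scala.collection.mutable.ListBuffer

// Pack (path, sizeBytes) pairs sequentially into bundles whose total size
// stays under capBytes; each resulting bundle would become one secret.
def groupIntoBundles(
    files: Seq[(String, Long)],
    capBytes: Long): Seq[Seq[String]] = {
  val bundles = ListBuffer.empty[Seq[String]]
  var current = ListBuffer.empty[String]
  var currentBytes = 0L
  files.foreach { case (path, sizeBytes) =>
    // Start a new bundle when adding this file would exceed the cap.
    if (current.nonEmpty && currentBytes + sizeBytes > capBytes) {
      bundles += current.toSeq
      current = ListBuffer.empty[String]
      currentBytes = 0L
    }
    current += path
    currentBytes += sizeBytes
  }
  if (current.nonEmpty) bundles += current.toSeq
  bundles.toSeq
}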

@erikerlandson
Member

The size limit is the same for both secrets and configmaps, and is driven by etcd limits:
kubernetes/kubernetes#19781

@erikerlandson
Member

If the primary use case is lightweight data like config, then this should still be helpful. The RSS provides the fallback for anything that exceeds the limit.

@ash211

ash211 commented Jul 27, 2017

I like using k8s configmap/secret for distributing "small" files, and RSS for distributing "large" files.

@mccheah
Author

mccheah commented Jul 27, 2017

Should we then just gate at a total of 1MB? That would be the easiest implementation.

@erikerlandson
Member

Makes sense to me. If there is enough demand, fancier options like compression can always be added later.

@mccheah
Author

mccheah commented Jul 27, 2017

If we gate at a total of 1MB then theoretically we still can't handle all "only small files" cases, because we can't handle, say, 100 small files whose combined size exceeds the limit. But maybe that use case is also justifiably one where we enforce the usage of the resource staging server anyway.

@erikerlandson
Member

I think the simple design should hit the 80/20 case.

@ash211

ash211 commented Jul 27, 2017

So the logic could be:

  • if sum(files) < 1MB
    • distribute files via CM/secret
  • else
    • require RSS, and distribute via RSS
  • if any(jars)
    • require RSS, and distribute via RSS
  • if RSS in use:
    • submit files/jars to RSS
    • apply spark-init containers to pods to download those files/jars
  • else:
    • no spark-init containers on pods
    • files retrieved from cm/secret

That way if you have only small spark.files, and you have no spark.jars (because e.g. they're already baked into docker images), then you don't need an RSS at all.
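
A sketch of that dispatch logic, assuming the 1MB total cap discussed above; all names here (DistributionPlanner, planDistribution, etc.) are illustrative rather than real configuration or code from this project:

object DistributionPlanner {

  sealed trait DistributionPlan
  case object FilesViaSecret extends DistributionPlan
  case object ViaResourceStagingServer extends DistributionPlan

  // 1MB cap on the total size of spark.files, per the etcd-driven limit.
  val MaxSecretFilesBytes: Long = 1024L * 1024L

  def planDistribution(
      fileSizesBytes: Seq[Long],
      localJars: Seq[String],
      stagingServerUri: Option[String]): DistributionPlan = {
    val needsStagingServer =
      localJars.nonEmpty || fileSizesBytes.sum >= MaxSecretFilesBytes
    if (needsStagingServer) {
      require(stagingServerUri.isDefined,
        "Local jars, or files totaling 1MB or more, require a resource" +
          " staging server.")
      // spark-init containers are attached to download from the RSS.
      ViaResourceStagingServer
    } else {
      // No init-containers; files reach the pods via a configmap/secret mount.
      FilesViaSecret
    }
  }
}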

@ash211

ash211 commented Jul 27, 2017

In our particular case we are distributing two files, sized 377 and 119328 bytes, and are working to eliminate the need to distribute the larger file. So our use case falls into the working 80 side of the 80/20 case.

@foxish
Member

foxish commented Jul 27, 2017

Storing lots of files that are close to the limit (900k/1MB) is going to cause cluster performance issues; we'd probably be hitting etcd's upper limits at that point, and we don't want to affect the cluster's operations because of too many running jobs. Secrets may additionally be encrypted in etcd, making them a less than ideal choice for shipping config.

It does seem like the truststore should be shipped in a secret, and that could probably leverage the mechanism that Yinan and I are designing for referencing and mounting secrets. However, for general config files, that appears less than ideal. The way the RSS is designed, we should have long-lived instances storing files in it.

The underlying problem here seems to be the init-container startup time. If that's taking 14 seconds, that's something we should be trying to fix so that it doesn't take that long, and we can prioritize that. Can you guys provide the kubelet log from one of the nodes for this? I'll also try a local repro.

@liyinan926
Member

Agreed with @foxish. Using secrets to ship file dependencies seems less than ideal. Another concern that I think is worth calling out is that secrets will stay around unless explicitly deleted. This needs to be taken care of by the driver, as the submission client is gone once it's done with submission. So the driver needs to remember which files in the set of spark.files were put into secrets, along with the corresponding secret names, and needs to guarantee that they are deleted upon job completion.

@foxish
Member

foxish commented Jul 27, 2017

The thing that feels missing, before we can reliably allow people to ship things and utilize etcd with their own "files", is a feedback loop/monitoring/rate limiting that stops them from impacting cluster operations. Without that, it's more likely that they stop the cluster from functioning correctly. However, I do think init-containers are not fulfilling their purpose if they add that much overhead, and we should address that upstream.

@mccheah
Author

mccheah commented Jul 27, 2017

Another concern that I think is worth calling out is that secrets will stay around unless explicitly deleted. This needs to be taken care of by the driver, as the submission client is gone once it's done with submission.

This can be achieved just by using owner references, as we do with the other resources the driver depends on.
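
For illustration, a minimal sketch of that with the fabric8 client, marking the driver pod as the owner of the files secret so Kubernetes garbage-collects the secret when the pod is deleted (the helper name is hypothetical):

import java.util.Collections

import io.fabric8.kubernetes.api.model.{OwnerReferenceBuilder, Pod, Secret}

// Point an owner reference at the driver pod so the secret's lifetime is
// tied to the pod's: deleting the driver pod garbage-collects the secret.
def addDriverOwnerReference(driverPod: Pod, filesSecret: Secret): Unit = {
  val driverReference = new OwnerReferenceBuilder()
    .withApiVersion(driverPod.getApiVersion)
    .withKind(driverPod.getKind)
    .withName(driverPod.getMetadata.getName)
    .withUid(driverPod.getMetadata.getUid)
    .withController(true)
    .build()
  filesSecret.getMetadata.setOwnerReferences(
    Collections.singletonList(driverReference))
}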

It does seem like the truststore should be shipped in a secret, and that could probably leverage the mechanism that Yinan and I are designing for referencing and mounting secrets. However, for general config files, that appears less than ideal. The way the RSS is designed, we should have long-lived instances storing files in it.

We could put the files in config maps by base64-encoding the files and decoding them before launching the driver process in the driver Docker image. This assumes that all content sent in spark.files isn't sensitive, but I suppose we have a separate discussion about secret volume mounts.

Storing lots of files which are close to the limit(900k/1M) is going to cause cluster performance issues...

We can make the max size of the file bundle smaller if necessary.

@foxish
Member

foxish commented Jul 27, 2017

The secrets mounting mechanism should let us reference arbitrary secrets and mounts for them within the pods. Does that alleviate your problem? That has similar risks, but is explicit in that the user knows they're creating secrets, and if they choose to store files/config in secrets, that is transparent. I'm not sure we want to implicitly do that for the user however, as part of spark-submit.

@mccheah
Author

mccheah commented Jul 27, 2017

The problem is that we want to port our application over from YARN submission which expects the files to be sent via spark.files. We would like to be able to use the same configuration option as-is without having to change our application, and I suspect this will be the case for other YARN to K8s migrations.

@erikerlandson
Member

In theory, the purely feature-parity issue of "supporting spark.files" could be met using the RSS under the hood - just take the files listed in spark.files and stage them through the RSS. OTOH, one drawback is that you'd lose the advantage of having a mechanism that doesn't require the RSS.

The init container performance issue is a separate thing. No doubt speeding it up benefits the entire community. Is that something that would have to work its way downstream as a core kube enhancement?

@erikerlandson
Member

@foxish, does secrets mounting also include configmap mounting? Is it about allowing a user to create them and then instructing them to be mounted on the driver pod?

@foxish
Member

foxish commented Jul 27, 2017

We didn't plan for configmap mounts so far, but the same mechanism (as the one for secrets in #397) could be generalized, if needed. Yeah, a user could create them prior to launching the job, and just reference them in spark config, and specify a mount-path for them.

@foxish
Member

foxish commented Jul 27, 2017

So, I just tested on a 1.6.4 cluster. The general performance of init-containers appears to be fine. The kubelet will check the docker container status every 1 second and post that back to the APIServer.

Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath				Type	Reason		Message
  ---------	--------	-----	----							-------------				--------	------		-------
  10s		10s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice}	Normal		Created		Created container with id 0b191e5104b85221d72f57c449cc3161bed43fa43c769c1fc0117e9762755a01
  10s		10s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice}	Normal		Started		Started container with id 0b191e5104b85221d72f57c449cc3161bed43fa43c769c1fc0117e9762755a01
  8s		8s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice2}	Normal		Created		Created container with id 8987a35639cd2514ccdc52d8eec4b4062d35cd4a6726e27fb2242c222e768f69
  8s		8s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice2}	Normal		Started		Started container with id 8987a35639cd2514ccdc52d8eec4b4062d35cd4a6726e27fb2242c222e768f69
  7s		7s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice3}	Normal		Created		Created container with id b3334441e403b55c5cf67ad387f8e77db6a652ccfc50621cdc5ea97558c66f45
  7s		7s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.initContainers{init-myservice3}	Normal		Started		Started container with id b3334441e403b55c5cf67ad387f8e77db6a652ccfc50621cdc5ea97558c66f45
  6s		6s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.containers{myapp-container}	Normal		Created		Created container with id 05e91d159f83dd59808301786f13ae5f13777444f964696cb311d08d4790365f
  6s		6s		1	kubelet, gke-sparky-2-default-pool-7e62541a-mlvh	spec.containers{myapp-container}	Normal		Started		Started container with id 05e91d159f83dd59808301786f13ae5f13777444f964696cb311d08d4790365f

So, chaining multiple init-containers may not be the problem here. We should investigate more into what's causing the delays. Maybe we could make the init container that fetches resources lighter weight?

@ash211

ash211 commented Jul 27, 2017

@foxish I don't think it's general init-container overhead that's causing the 14 seconds we're observing -- it's the specific spark-init container, which 1) is fairly large in byte weight, 2) has a lot of jars for the JVM to parse through, and 3) has a heavyweight service start process (my observation above was that it takes 4 seconds from the first log line, Starting init-container to download Spark application dependencies., to starting the jar download).

How long do you see the spark-init container running on your cluster?

@foxish
Member

foxish commented Jul 27, 2017

I see. I'll run a couple of experiments and report back. But if it is the spark-init container, should we look at profiling and optimizing that instead? For the function that it performs, the overhead seems disproportionate.

@mccheah
Author

mccheah commented Jul 27, 2017

I think it's fine to also implement the workaround for the specific situation where users porting over from YARN have only very small files: they don't need to stand up the resource staging server if they put their dependencies in Docker images. If we made the file size limit something like 10kb by default then we should be fine.

@foxish
Member

foxish commented Jul 31, 2017

A 10k limit and making this behavior opt-in sounds reasonable to me for now. It's not ideal to use secrets, and I expect we'll use configmaps after kubernetes/kubernetes#32432 is resolved.
