Allow "spark.files" to be shipped through secrets or configmaps #393
I'm not sure how large we can expect Python files to be relative to jars - does it make sense for Python files to require the resource staging server as well? @ifilonenko
IMO, Python files should use the resource staging server. They may be zip files, so I assume the zips might be "large" in some cases. I don't know how many people use this feature, but it is possible to include .pyc files in a Maven jar artifact, and Spark knows how to find and use those.
The size checks would catch the big zips, but I agree that it is consistent to ship application "binaries" through the resource staging server.
Further context around this change is that we are in the process of moving our applications from using the RSS to distribute jars and files, to baking as much as possible into the Docker images themselves. This improves performance and also the immutability of application launches. As of now, we have applications running successfully with jars baked into the images rather than distributed through the RSS; the only remaining dependencies are two small files. It's quite heavyweight to have an RSS for those two tiny files. In our observations, the init containers specifically add significantly to the startup time of both the driver and the executor. Here's an observation where the overhead from the init container was 14 seconds,
with logs from the pod:
From my read of this, the total time the spark-init container takes (measuring from the "image pulled" message of spark-init vs. spark-driver) is 14 seconds. Of that 14 seconds, the time spent actually doing work (downloading the two files and the zero jars) is sub-second. Not only that, this 14 seconds is spent twice -- once for the driver, and once for the executors (in parallel). So I believe we have about 30 seconds we could cut off the launch times of Spark on Kubernetes for our application by eliminating the init containers from RSS usage. I suspect that creating a configmap/secret out of the files and letting Kubernetes place them on the pod will be much faster than 30 seconds.
+1, I agree with @erikerlandson that Python submission files could vary in size.
Examining the secret restrictions: the maximum size of a secret is 1MB, which seems somewhat restrictive even for "small" files, but maybe that satisfies community standards. This comment seems like a potential issue down the road, in terms of limiting the total data being pushed in this way:
Also, secrets apparently have to be created before any pod that uses them. That seems feasible, but it is a logistical requirement for the submission logic. A missing secret prevents pod startup; it sounds like this might show up to the pod watcher as a particular failure mode.
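To make the ordering and size constraints concrete, here is a minimal sketch using the fabric8 Kubernetes client (which this project already uses for API calls); the secret name, the exact limit, and the error handling are all assumptions for illustration, not an agreed design:

```scala
import java.nio.file.{Files, Paths}
import java.util.Base64

import scala.jdk.CollectionConverters._

import io.fabric8.kubernetes.api.model.SecretBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object SubmittedFilesSecret {
  // etcd caps a single object at roughly 1MB; base64 inflates data by ~33%,
  // so check the encoded size to be conservative.
  private val MaxSecretBytes = 1024L * 1024L

  def createBeforeDriverPod(namespace: String, filePaths: Seq[String]): Unit = {
    val data = filePaths.map { p =>
      val path = Paths.get(p)
      path.getFileName.toString -> Base64.getEncoder.encodeToString(Files.readAllBytes(path))
    }.toMap
    require(data.values.map(_.length.toLong).sum <= MaxSecretBytes,
      "bundle too large for a single secret; use the resource staging server instead")

    val secret = new SecretBuilder()
      .withNewMetadata().withName("spark-submitted-files").endMetadata() // assumed name
      .withData(data.asJava)
      .build()

    val client = new DefaultKubernetesClient()
    try {
      // Must happen before the driver pod is created, or the pod won't start.
      client.secrets().inNamespace(namespace).createOrReplace(secret)
    } finally {
      client.close()
    }
  }
}
```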
The plan is not to create many secrets but to create one secret that has a key-value pair for each added file. Is the restriction on secrets applicable to each secret key or to the entire secret bundle? How about ConfigMaps - does a ConfigMap have a similar size restriction? Regarding creation before or after - we can create the single secret in the submission client before creating the driver pod.
@erikerlandson when I say the RSS is heavyweight, I mean more the interacting with it than launching it. We could probably trim down both the RSS pod and the init container, but I suspect that eliminating the need to interact with the RSS entirely for certain jobs that have no jars and only "small" files (which are the majority of what we run) would have a much higher impact on performance.
If we need a single secret bundle to be < 1MB then we'll run into problems with larger numbers of files.
Decompression may also make the Docker image commands more complex than we would perhaps like.
My interpretation of the doc wording is that the total size of all k/v pairs in a single secret must be <= 1MB, although the wording is a bit vague on that distinction. Maybe @foxish can clarify. At this point it's hypothetical, but the idea that they might also somehow limit the total data in all secrets seems like it could cause problems if it becomes a policy.
I would imagine it is a restriction on the total size, since it corresponds to the limit on a single entry in etcd. I don't think we want to compress for the user - the user should handle compression themselves, or else just use the resource staging server for bigger files. For numerous small files we can try to group them into ~1MB "blocks" and create a single bundle for every 1MB group, but again we need to guard against creating too many secrets this way. I therefore suggest also having an upper limit on the total size of all files submitted this way before recommending the resource staging server - something on the order of 5-10MB. I think even 1MB is sufficient for many use cases where only configuration is being sent to the application; in those cases the configuration files are just text files, on the order of KB at most.
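A sketch of that grouping-plus-cap idea; the 1MB and 10MB numbers are the suggestions from this comment, not settled values, and the function name is invented:

```scala
// Greedily pack (fileName, sizeInBytes) pairs into ~1MB bundles, one secret
// per bundle, refusing outright once the suggested overall cap is exceeded.
def groupIntoBundles(
    files: Seq[(String, Long)],
    perBundleLimit: Long = 1L << 20,   // ~1MB per secret
    totalLimit: Long = 10L << 20       // suggested 5-10MB overall cap
  ): Seq[Seq[String]] = {
  require(files.map(_._2).sum <= totalLimit,
    "total submitted-file size exceeds the cap; use the resource staging server")
  val (bundles, _) = files.foldLeft((List(List.empty[String]), 0L)) {
    case ((current :: done, used), (name, size)) =>
      require(size <= perBundleLimit, s"$name alone exceeds the per-secret limit")
      if (used + size <= perBundleLimit) ((name :: current) :: done, used + size)
      else (List(name) :: current :: done, size)
  }
  bundles.reverse.map(_.reverse).filter(_.nonEmpty)
}
```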
The size limit is the same for both secrets and configmaps, and is driven by etcd limits:
If the primary use case is lightweight data like config, then this should still be helpful. The RSS provides the fallback for anything that exceeds the limit.
I like using k8s configmap/secret for distributing "small" files, and RSS for distributing "large" files.
Should we then just gate at a total of 1MB? That would be the easiest implementation.
Makes sense to me. If there is enough mandate later, fancier options like compression can always be added in the future.
If we gate at a total of 1MB then theoretically we're still not able to handle the "only small files" case in full, because we can't handle 100 small files. But maybe in that case it is also justifiable to require the resource staging server anyway.
I think the simple design should hit the 80/20 case.
So the logic could be:

- if every file in spark.files is under the size limit and there are no spark.jars to ship, bundle the files into a secret/configmap and skip the RSS entirely
- otherwise, fall back to the resource staging server

That way if you have only small spark.files, and you have no spark.jars (because e.g. they're already baked into Docker images), then you don't need an RSS at all.
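A sketch of that dispatch; the strategy names and the threshold parameter are invented for illustration:

```scala
sealed trait FileShippingStrategy
case object ShipViaSecret extends FileShippingStrategy
case object ShipViaStagingServer extends FileShippingStrategy

// Only skip the RSS when there are no jars to ship (e.g. they are baked
// into the Docker images) and every file is under the size threshold.
def chooseStrategy(
    jarsToShip: Seq[String],
    fileSizes: Seq[Long],
    smallFileLimit: Long): FileShippingStrategy =
  if (jarsToShip.isEmpty && fileSizes.forall(_ <= smallFileLimit)) ShipViaSecret
  else ShipViaStagingServer
```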
In our particular case we are distributing two files, sized 377 and 119328 bytes, and are working to eliminate the need to distribute the larger file. So our use case falls into the working 80 side of the 80/20 case.
Storing lots of files which are close to the limit (900KB/1MB) is going to cause cluster performance issues; we're probably hitting the upper limits at that point, and we don't want to affect the cluster's operations because of too many running jobs. Secrets may additionally be encrypted in etcd, making them a less than ideal choice for shipping config.

The underlying problem here seems to be the init-container startup time. If that's taking 14 seconds, that's something we should be trying to fix so that it doesn't take that long, and we can prioritize that. Can you guys provide the kubelet log from one of the nodes for this? I'll also try a local repro.
Agreed with @foxish. It seems less than ideal to use secrets to ship file dependencies. Another concern that I think is worth calling out is that secrets will stay around unless explicitly deleted. This needs to be taken care of by the driver, since the submission client is gone once it's done with submission. So the driver needs to remember which files in the set of spark.files were shipped through secrets, so that it can clean them up.
The thing that feels missing before we can reliably allow people to ship things and utilize etcd with their own "files" is a feedback loop/monitoring/rate limiting which stops them from impacting cluster operations. Without that, it's more likely that they stop the cluster from functioning correctly. However, I do think init-containers are not fulfilling their purpose if they add that much overhead, and we should address that upstream.
This can be achieved just by using owner references, as we do with the other resources the driver depends on.
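For instance, with the fabric8 builders, attaching the driver pod as the owner could look roughly like this (a sketch; the function name is made up, and it assumes the driver pod's UID is known at the point the secret is created or patched):

```scala
import java.util.Collections

import io.fabric8.kubernetes.api.model.{OwnerReferenceBuilder, Pod, Secret}

// Mark the driver pod as the owning resource of the files secret, so the
// secret is garbage-collected automatically when the driver pod is deleted.
def ownedByDriver(secret: Secret, driverPod: Pod): Secret = {
  val ownerRef = new OwnerReferenceBuilder()
    .withApiVersion("v1")
    .withKind("Pod")
    .withName(driverPod.getMetadata.getName)
    .withUid(driverPod.getMetadata.getUid)
    .withController(true)
    .build()
  secret.getMetadata.setOwnerReferences(Collections.singletonList(ownerRef))
  secret
}
```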
We could put the files in config maps by base64-encoding the files and decoding them before launching the driver process in the driver Docker image. This assumes that all content sent in spark.files still fits within the size limit once encoded.
We can make the max size of the file bundle smaller if necessary.
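A sketch of the encoding side (the matching decode would run in the driver image before the driver process starts); the configmap name and the file-name-as-key convention are assumptions:

```scala
import java.nio.file.{Files, Paths}
import java.util.Base64

import scala.jdk.CollectionConverters._

import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder}

// Base64-encode each file so arbitrary (even binary) content survives the
// trip through a ConfigMap, whose values must be plain strings.
def filesAsConfigMap(name: String, filePaths: Seq[String]): ConfigMap = {
  val data = filePaths.map { p =>
    val path = Paths.get(p)
    path.getFileName.toString -> Base64.getEncoder.encodeToString(Files.readAllBytes(path))
  }.toMap
  new ConfigMapBuilder()
    .withNewMetadata().withName(name).endMetadata()
    .withData(data.asJava)
    .build()
}
```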
The secrets mounting mechanism should let us reference arbitrary secrets and mount them within the pods. Does that alleviate your problem? That has similar risks, but is explicit in that the user knows they're creating secrets, and if they choose to store files/config in secrets, that is transparent. I'm not sure we want to do that implicitly for the user as part of spark-submit, however.
The problem is that we want to port our application over from YARN submission, which expects the files to be sent via spark.files.
In theory, the purely feature-parity issue of "supporting spark.files" could be met by using the RSS under the hood - just take the files listed in spark.files and stage them through the RSS. OTOH, one drawback is that you'd lose the advantage of having a mechanism that doesn't require the RSS. The init container performance issue is a separate thing; no doubt speeding it up benefits the entire community. Is that something that would have to work its way downstream as a core kube enhancement?
@foxish, does secrets mounting also include configmap mounting? Is the idea to allow a user to create them, and then instruct them to be mounted on the driver pod?
We didn't plan for configmap mounts so far, but the same mechanism (as the one for secrets in #397) could be generalized, if needed. Yeah, a user could create them prior to launching the job, just reference them in Spark config, and specify a mount-path for them.
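For reference, a user-created secret would be wired into the pod roughly like this with the fabric8 builders (the volume name, secret name, and mount path are made up; the configmap case would use `withNewConfigMap` analogously):

```scala
import io.fabric8.kubernetes.api.model.{VolumeBuilder, VolumeMountBuilder}

// Reference a secret the user created before spark-submit ran...
val filesVolume = new VolumeBuilder()
  .withName("user-files")
  .withNewSecret()
    .withSecretName("my-app-files") // assumed, user-chosen secret name
    .endSecret()
  .build()

// ...and mount it read-only at a user-specified path in the driver container.
val filesMount = new VolumeMountBuilder()
  .withName("user-files")
  .withMountPath("/etc/spark-files") // assumed mount path
  .withReadOnly(true)
  .build()
```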
So, I just tested on a 1.6.4 cluster. The general performance of init-containers appears to be fine. The kubelet will check the Docker container status every second and post that back to the API server.
So, chaining multiple init-containers is maybe not the problem here. We should investigate further into what's causing the delays. Maybe we could make the init container that fetches resources lighter weight?
@foxish I don't think it's general init-container overhead that's causing the 14 seconds we're observing -- it's the specific spark-init container, which 1) is kinda large in byte weight, 2) has a lot of jars for the JVM to parse through, and 3) has a heavyweight service start process (my observation above was that it takes 4 seconds from the first log line). How long do you see the spark-init container running on your cluster?
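One way to compare numbers: the kubelet reports start/finish timestamps for terminated init containers, so the wall time of spark-init can be read off the pod status. A sketch using the fabric8 model classes, for illustration only:

```scala
import java.time.{Duration, Instant}

import scala.jdk.CollectionConverters._

import io.fabric8.kubernetes.api.model.Pod

// Wall time of each completed init container, from kubelet-reported status.
def initContainerDurations(pod: Pod): Map[String, Duration] =
  pod.getStatus.getInitContainerStatuses.asScala.flatMap { status =>
    Option(status.getState.getTerminated).map { t =>
      status.getName -> Duration.between(
        Instant.parse(t.getStartedAt), Instant.parse(t.getFinishedAt))
    }
  }.toMap
```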
I see. I'll run a couple of experiments and report back. But if it is the spark-init container, should we look at profiling and optimizing that instead? For the function that it performs, the overhead seems disproportionate.
I think it's fine to also implement the workaround for the specific situation: users with only very small files who are porting over from YARN don't need to stand up the resource staging server if they put their other dependencies in Docker images. If we made the file size limit something like 10KB by default then we should be fine.
A 10k limit and making this behavior opt-in sounds reasonable to me for now. It's not ideal to use secrets, and I expect we'll use configmaps after kubernetes/kubernetes#32432 is resolved.
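If this lands as an opt-in, usage might look something like the following; `spark.kubernetes.files.maxSize` is the name floated in the proposal below, while the opt-in flag name is purely hypothetical:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files", "/opt/conf/app.properties,/opt/conf/log4j.properties")
  // Hypothetical opt-in flag: ship small files via a secret instead of the RSS.
  .set("spark.kubernetes.shipFilesViaSecret", "true")
  // Proposed per-file cap; submission fails if any file exceeds it.
  .set("spark.kubernetes.files.maxSize", "10k")
```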
Many applications will include all of their binaries in Docker images, but will require setting configurations dynamically upon their submission. There is a separate discussion to have about allowing arbitrary secret and configmap mounts to be pushed onto the driver and executor pods. However, application deployment strategies that are being ported over from YARN, Mesos, or Standalone mode will expect these files to be easily provided through `spark.files`.

Currently, Spark applications need to submit their local files through the resource staging server. Given the use case described above, however, it would be more convenient if application submitters did not need to use the resource staging server to ship their configuration files. This is further confirmed by the general impression from the Spark community that data shipped through `spark.files` is intended to be small.

The proposed scheme to consider all of these factors is as follows:

- `spark.files` is examined. We provide a configuration option called `spark.kubernetes.files.maxSize` (there's probably a better name for this, to denote that we're submitting through a Kubernetes secret). If any file exceeds the max size then we fail the submission. The maximum size has a reasonable default that ensures that users do not accidentally overload etcd, but it can be adjusted if the submitter is aware of the potential consequences.
- The files in `spark.files` end up in the working directory of the driver and executors, but this has to happen through a copy, since secret mounts cannot be placed in the working directory of the container itself (see the sketch after this section).

Again - this is strictly independent of the discussion on how custom volume mounts can be provided for the containers. This is a simpler scheme that basically makes `spark.files` easier to manage in the absence of a resource staging server. More complex use cases that require arbitrary mount points for arbitrary volume types should use something like a pod preset.
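To make the copy step concrete, here is a rough sketch of what could run inside the container before the driver or executor process starts; the mount path and environment variable are invented for illustration:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

import scala.jdk.CollectionConverters._

object CopySubmittedFiles extends App {
  // A secret volume can only be mounted at a dedicated directory, so copy its
  // contents into the process working directory before the JVM starts.
  val mountDir = Paths.get(
    sys.env.getOrElse("SPARK_FILES_MOUNT", "/etc/spark-submitted-files"))
  if (Files.isDirectory(mountDir)) {
    Files.list(mountDir).iterator().asScala
      .filter(p => Files.isRegularFile(p)) // skips the kubelet's ..data symlink dirs
      .foreach { p =>
        Files.copy(p, Paths.get(p.getFileName.toString),
          StandardCopyOption.REPLACE_EXISTING)
      }
  }
}
```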