documentation on resource staging server #386
On further reflection, there doesn't seem to be a way to use ConfigMaps / Secrets. We'd want some way to inject environment variables without baking them into the images, so that the same image can be used in different environments. Is there a mechanism in the resource staging server for that? If not, it could be useful to provide a more Kubernetes-specific approach, i.e. the ability to specify templates for this purpose.
After some more investigation, it looks like PodPresets may be the way to go: apply a label to our Spark jobs and then provide a PodPreset with the environment variables we need. Possibly problematic, though, as PodPreset is a v1alpha1 API, which may not be something we can run in production.
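For anyone following along, the PodPreset approach described above might look like the sketch below. This is a hypothetical manifest: the label, preset name, and environment values are made up, and PodPreset requires the `settings.k8s.io/v1alpha1` API and its admission controller to be enabled on the cluster.

```yaml
# Hypothetical PodPreset injecting env vars into any pod labeled role=spark-job.
# PodPreset is v1alpha1 and must be explicitly enabled on the cluster.
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: spark-env-preset          # assumed name
spec:
  selector:
    matchLabels:
      role: spark-job             # assumed label applied to the Spark pods
  env:
    - name: AWS_REGION            # example variables; replace with your own
      value: us-west-2
    - name: DEPLOY_ENV
      value: staging
```

With this in place, any pod carrying the `role=spark-job` label is mutated at admission time to include those variables, so the image itself stays environment-agnostic.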
There's still basically no documentation on the resource staging server. I'm looking at what mechanisms I can use to make my dependencies (hadoop-aws, in this case) available to Spark. I could bake them into the image, and probably will, but it looks like the resource staging server is intended for this use case, and there's little to no documentation on it. Thanks!
Found this; shouldn't it be linked in the top-level docs? https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs/submission-client.md
@luck02 custom environment variables, I believe, are not supported right now, but we could easily add configuration options for that. The architecture documentation has been improved somewhat, but there may still be gaps.
A valid point, but at least in the YARN docs there isn't much description of how it works (https://spark.apache.org/docs/latest/running-on-yarn.html), and for both YARN and Mesos there is no description of the motivation for using the cluster manager in question. I think understanding when each cluster manager is appropriate depends not so much on Spark as on an understanding of the cluster manager itself. In other words, the Kubernetes, YARN, and Mesos documentation, not Spark's, is the right place to look when deciding which cluster manager to use.
I believe we discuss this here.
Slightly confused; I was referring to the "docs on dependency management" section.
I don't think it is described there. There appears to be no language in those paragraphs that spells out what the resource staging server does or how it does it.
What's missing from that section is language that clearly describes what the resource staging server does to facilitate this. What I eventually found was https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs/submission-client.md, which does have language that was useful to me; with it I was able to close the last few unknowns and get my jobs working. Specifically, this was the description of how the resource staging server works and what it accomplishes:
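To make the dependency flow concrete, a submission that routes local jars through the staging server looks roughly like the sketch below. This is a hedged example, not official documentation: the API server address, staging server URI, image, and jar paths are all placeholders, and the property name comes from the apache-spark-on-k8s fork's configuration.

```
# Sketch only: local jars listed with --jars are uploaded to the resource
# staging server at submit time, then fetched into the driver and executor
# pods before the application starts.
spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<api-server-host>:443 \
  --conf spark.kubernetes.resourceStagingServer.uri=http://<staging-server-host>:10000 \
  --jars /local/path/hadoop-aws.jar \
  /local/path/my-spark-job.jar
```

The point for end users is that files on the submitting machine (not baked into the image, not on a shared filesystem) can still reach the pods, with the staging server acting as the intermediate store.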
Is there a way to provide a set of environment variables via this same approach? I see the following issue addresses this: #424. However, I'm not sure I want to attach 30 environment variables to the spark-submit invocation; in an ideal world I'd be able to provide a ConfigMap. Right now, as I mentioned above, I'm using a PodPreset. If it would help, I'd be happy to submit a documentation PR describing what I consider the shortfalls to be.
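For comparison, plain Kubernetes already lets a pod pull an entire ConfigMap in as environment variables via `envFrom`; the gap is that the submission client offers no option to attach one to the driver/executor pods. A hypothetical sketch of what that would look like (all names and values are made up):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-job-env             # hypothetical name
data:
  AWS_REGION: us-west-2
  DEPLOY_ENV: staging
---
# How a pod would consume it, if the submission client exposed such an option:
apiVersion: v1
kind: Pod
metadata:
  name: example-driver            # placeholder
spec:
  containers:
    - name: main
      image: example/spark:latest # placeholder image
      envFrom:
        - configMapRef:
            name: spark-job-env   # every key becomes an env var
```

This keeps the 30 variables in one versioned cluster object instead of on the spark-submit command line.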
I'm not sure the user documentation should describe exactly how the resource staging server works; that should be an implementation detail for end users. The main piece of information an administrator needs in order to decide how to deploy the server is the storage backend holding the files, since that may require provisioning volumes, etc. Beyond that, I don't think we need much detail in the user documentation, since the resource staging server should be more or less abstracted away from the user. It should be a component similar to the external shuffle service, in that one needs to know how to install it but not how it works. Feel free to submit a documentation PR to suggest otherwise, but in doing so we should probably avoid going into too much technical detail or being too specific.
In this case I think that the PodPreset is the correct mechanism to use. The submission client is mainly for users who are deploying the application via spark-submit. We discussed trying to have arbitrary pod YAML "templates" in #38, but we concluded that Pod Presets are the correct approach here.
The docs at https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dependency-management don't do a good job of describing what this is or how it works. I've also read/searched https://docs.google.com/document/d/1_bBzOZ8rKiOSjQg78DXOA3ZBIo_KkDJjqxVuq0yXdew/edit#heading=h.22iurepifhgt
From an end user's standpoint this is problematic, as I am not sure which problems it is intended to solve. If there are fuller docs, I'd be happy to edit what's currently there.
This came up when I was trying to figure out the best way to load environment variables, and whether it was better to bake them into the images or to provide them some other way (Kubernetes Secrets / ConfigMaps).
Thanks!