Add node selectors for driver and executor pods #355
Conversation
Thanks for the contribution @sandflee !
Can you explain more about the use case you need this for? I can imagine one around wanting to run Spark pods only on nodes labeled as running HDFS, for data locality purposes. Are there others you have in mind?
On annotations, one of the things we've been trying to do in this project is maintain compatibility with two versions of k8s at the same time -- "current" and "previous". Right now that's 1.6 and 1.5. Is the syntax you used supported in kubernetes 1.5?
@@ -430,6 +433,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
      .endMetadata()
      .withNewSpec()
        .withHostname(hostname)
        .withNodeSelector(nodeSelector.asJava)
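For context, a minimal self-contained sketch of what that builder call does, assuming the fabric8 PodBuilder fluent API used in the diff above; the pod name, hostname, and label map are made up for illustration:

```scala
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.PodBuilder

// Hypothetical label map; in the PR it is parsed from the Spark configuration.
val nodeSelector: Map[String, String] = Map("disktype" -> "ssd")

val pod = new PodBuilder()
  .withNewMetadata()
    .withName("spark-executor-example")      // illustrative name
  .endMetadata()
  .withNewSpec()
    .withHostname("spark-executor-example")  // illustrative hostname
    // An empty map leaves scheduling unconstrained; otherwise the pod is only
    // schedulable onto nodes that carry all of these labels.
    .withNodeSelector(nodeSelector.asJava)
  .endSpec()
  .build()
```

Plain nodeSelector predates Kubernetes 1.5, so this particular syntax should not run into the 1.5-vs-1.6 compatibility question raised above.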
@kimoonkim was this the k8s 1.6+ feature you were referencing in https://github.com/apache-spark-on-k8s/spark/pull/316/files#diff-206ce9343d722622e995071cbb69c330R338 ?
The main reason to add this is easy debugging: with several kubelets in the cluster, I just want the driver and executors to run on a specified machine so I can analyze the processes (for example, look at CPU usage).
We also run a YARN cluster with MR and Spark jobs, where there is a strong need for node labels (e.g. running some jobs on high-memory machines), so I think Spark on k8s jobs need this as well.
For the high-memory machine use case, I would expect that setting an appropriately large memory request on the driver/executor pods would cause the k8s scheduler to place them only on nodes where they fit, i.e. the high-memory machines (a rough sketch of this follows below).
The performance benchmarking use case is a good one, though.
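As a rough illustration of that reasoning (standard Spark memory settings; the sizes are arbitrary), a large memory request already steers the pods toward nodes that can hold them:

```scala
import org.apache.spark.SparkConf

// Spark on k8s turns these into container memory requests, so the k8s scheduler
// will only bind the driver/executor pods to nodes with enough allocatable
// memory -- in practice, the high-memory machines.
val conf = new SparkConf()
  .set("spark.driver.memory", "20g")
  .set("spark.executor.memory", "48g")
```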
High memory is just one example; there are other factors like SSD or PPC hardware.
I do think that restricting the nodes a job runs on is a use case several people will have. But I like the solution of using node affinity (an annotation up to 1.5, a field in 1.6+), because it lets us express a superset of what we can express using node selectors (a rough sketch of the annotation form follows below).
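A rough sketch of the annotation form, assuming the 1.5-era alpha affinity annotation key (scheduler.alpha.kubernetes.io/affinity) and an invented label; the exact key and JSON schema should be checked against the 1.5 docs:

```scala
import io.fabric8.kubernetes.api.model.PodBuilder

// On 1.5 the affinity is carried as JSON in an alpha annotation; on 1.6+ it is
// a first-class field of the pod spec. The annotation key is assumed from the
// 1.5 docs and the label key/value below are invented for illustration.
val affinityJson =
  """{"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution":
    |  {"nodeSelectorTerms": [{"matchExpressions":
    |    [{"key": "disktype", "operator": "In", "values": ["ssd"]}]}]}}}""".stripMargin

val pod = new PodBuilder()
  .withNewMetadata()
    .withName("spark-driver-example")  // illustrative name
    .addToAnnotations("scheduler.alpha.kubernetes.io/affinity", affinityJson)
  .endMetadata()
  .withNewSpec()
  .endSpec()
  .build()
```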
It may be valuable to support just node selectors now and add custom pod YAMLs (#38) for affinity later, but we should have a discussion about this before adding the option.
@ash211 This node selector is not the same as the node affinity that I referred to as a 1.6+ feature. As @foxish mentioned, node affinity is a superset of node selector; k8s added node affinity later to support more general use cases. From the reference doc:
nodeSelector provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity feature, currently in beta, greatly expands the types of constraints you can express.
...
nodeSelector continues to work as usual, but will eventually be deprecated, as node affinity can express everything that nodeSelector can express.
@kimoonkim we probably want to think about how this would interact with the locality work you've been doing. Should the preferences stack on top of each other, or should one replace the other? Curious about your thoughts.
Linking to #38, which is the umbrella issue for fully-generic customization of things like this via user-provided YAML files instead of incrementally adding options for every k8s feature.
@ash211, it seems the main use case of @sandflee is to debug a job, which I think will be useful. Then this, if specified as an option, would take precedence and suppress the locality node affinity for the job. I'd prefer renaming the option to express the debugging intent more clearly, so people are less confused about precedence or how multiple node preferences work together. In terms of implementation, it might be better to switch this code to use node affinity.
I think this is about more than just debugging use cases. Suppose you have a Spark job that requires a GPU, so you have to schedule Spark pods only on nodes with the special GPU hardware. You would still want to use HDFS locality where possible, in addition to the required GPU placement (the job fails without a GPU). So we need to express both a hard requirement (GPU nodes) and a soft preference (HDFS locality) at the same time.
Am I over-thinking this?
I like the @ash211 proposal. The reference doc has an example of combining required and preferred node affinity in this way.
As discussed in our weekly call, I think we can take this for now and support the more complex affinity semantics at a later time. @sandflee can you rebase this PR and resolve conflicts?
@foxish patch rebased
rerun unit tests please
rerun integration test please
@@ -385,6 +385,9 @@ private[spark] class KubernetesClusterSchedulerBackend(
        .withValue(cp)
        .build()
    }
    val nodeSelector = ConfigurationUtils.parseKeyValuePairs(
We should use the new key-value pair structure; see how we handle labels and annotations, i.e. spark.kubernetes.driver.nodeselector.<key>=<value>.
Use the new key-value paradigm.
docs/running-on-kubernetes.md
Outdated
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.node.selector</code></td>
we shouldn't introduce this config at all if it's already deprecated
@@ -82,6 +82,12 @@ private[spark] class KubernetesClusterSchedulerBackend(
    KUBERNETES_EXECUTOR_ANNOTATION_PREFIX,
    KUBERNETES_EXECUTOR_ANNOTATIONS,
    "executor annotation")
  private val nodeSelector =
    ConfigurationUtils.combinePrefixedKeyValuePairsWithDeprecatedConf(
Let's use just ConfigurationUtils.parseKeyValuePairs so we don't introduce the deprecated conf.
ConfigurationUtils.parseKeyValuePairs only parses the deprecated config format. It's probably sufficient to use SparkConf.getAllWithPrefix instead of using anything in ConfigurationUtils.
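A minimal sketch of that suggestion (the prefix string is the one proposed in this review thread, not a finalized config name, and the disktype=ssd label is invented):

```scala
import org.apache.spark.SparkConf

// getAllWithPrefix returns (key, value) pairs with the prefix already stripped,
// so this yields e.g. Map("disktype" -> "ssd"), ready to pass to withNodeSelector.
val sparkConf = new SparkConf()
  .set("spark.kubernetes.driver.nodeselector.disktype", "ssd")

val driverNodeSelector: Map[String, String] =
  sparkConf.getAllWithPrefix("spark.kubernetes.driver.nodeselector.").toMap
```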
@sandflee looks like there's a merge conflict with the latest master -- are you able to fix the conflicts?
LGTM -- @mccheah ?
Will merge once build is green.
rerun integration test please
Thanks @sandflee for the contribution!
Fixes #358
Allows assigning pods to specific nodes. Annotations seem a little complicated to use, so we can simply use a node selector.