Allow setting memory on the driver submission server. #161
Conversation
```diff
@@ -156,6 +168,14 @@ package object config {
       .stringConf
       .createOptional

+  private[spark] val KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY =
+    ConfigBuilder("spark.kubernetes.driver.submitServerMemory")
```
Is "submit server" another name for the REST server that receives the submit command? We might have some terminology inconsistencies for this service in docs/code at this point.
I've been trying to refer to this as the submission server everywhere but we can audit at a later point.
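For reference, a complete `ConfigBuilder` entry of this shape could look like the sketch below; the doc text and the `256m` default are assumptions (a 256mb default heap is mentioned later in this thread), not necessarily the merged code:

```scala
import org.apache.spark.network.util.ByteUnit

private[spark] val KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY =
  ConfigBuilder("spark.kubernetes.driver.submitServerMemory")
    .doc("The amount of memory to allocate for the driver submission server.")
    .bytesConf(ByteUnit.MiB)
    .createWithDefaultString("256m")

// Read back as megabytes, matching its use later in this diff:
// val driverSubmitServerMemoryMb = sparkConf.get(KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY)
```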
```scala
        | allocated for the driver. This is memory that accounts for
        | things like VM overheads, interned strings, other native
        | overheads, etc. This tends to grow with the driver's memory
        | size (typically 6-10%).
```
This doc should say whether this overhead applies to (driver + submit server) or just the driver
```scala
  .withNewResources()
    .addToRequests("memory", driverMemoryQuantity)
    .addToLimits("memory", driverMemoryLimitQuantity)
    .endResources()
```
what were the defaults that we were relying on before?
There was no reliable default - it probably just fell back to whatever the underlying container runtime chose.
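For illustration, the quantities referenced in that diff could be constructed with the fabric8 client roughly as follows; the concrete sizes are assumptions (they match the worked example further down):

```scala
import io.fabric8.kubernetes.api.model.{Quantity, QuantityBuilder}

// Assumed sizes in MB; in the PR these derive from the memory configs.
val driverContainerMemoryMb = 1280L             // driver heap + submission server heap
val driverContainerMemoryWithOverheadMb = 1664L // plus the non-heap overhead allowance

// Request: the combined heaps. Limit: heaps plus overhead.
val driverMemoryQuantity: Quantity = new QuantityBuilder(false)
  .withAmount(s"${driverContainerMemoryMb}M")
  .build()
val driverMemoryLimitQuantity: Quantity = new QuantityBuilder(false)
  .withAmount(s"${driverContainerMemoryWithOverheadMb}M")
  .build()
```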
docs/running-on-kubernetes.md
```html
</tr>
<tr>
  <td><code>spark.kubernetes.driver.memoryOverhead</code></td>
  <td>(driverMemory + driverSubmissionServerMemory) * 0.10, with minimum of 384 </td>
```
Extra space right before the `</td>` close tag.
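To make the default formula concrete, a quick worked example with assumed sizes:

```scala
// Assumed heap sizes, in MB:
val driverMemoryMb = 1024             // spark.driver.memory = 1g
val driverSubmitServerMemoryMb = 256  // spark.kubernetes.driver.submitServerMemory = 256m

// Default overhead: 10% of the combined heaps, floored at 384 MB.
val memoryOverheadMb =
  math.max(((driverMemoryMb + driverSubmitServerMemoryMb) * 0.10).toInt, 384)  // = 384

// Total container memory limit:
val containerMemoryMb =
  driverMemoryMb + driverSubmitServerMemoryMb + memoryOverheadMb               // = 1664
```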
This looks like it does indeed expose a config value for the size of the driver and the submission service separately, and pass that config through to the JVMs and the k8s pod request. For working around #146 I would set

@foxish are you able to take a quick look? Also curious if @ssuchter has seen any OOM issues with real-world Spark jobs, where files and jars are much larger than Spark's example jobs jar.
@mccheah Compile fails here:
Ok verified that this change does indeed make the submission server's heap configurable. I tested by creating a 32MB random file and attempting to submit it using
This shows a failure in the driver pod's logs indicating that the heap filled up:
but running with the new higher memory flag:
causes the job to complete successfully. So I think that confirms that this config does what we expect. And as a side note, it's kind of crazy that there are enough copies and base64 inflation in the current implementation that a 256mb heap doesn't support uploading files that are only 32mb in the default config.
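A rough back-of-the-envelope sketch of why a 32mb upload can pressure a 256mb heap; the copy count is an assumption about the implementation, not a measurement:

```scala
val fileBytes = 32L * 1024 * 1024            // 32 MB on disk
val base64Bytes = ((fileBytes + 2) / 3) * 4  // ~42.7 MB once base64-encoded
val asUtf16String = base64Bytes * 2          // ~85 MB if held as a Java String (2 bytes/char)
// If the raw byte array, the encoded copy, the String, and any HTTP-layer
// buffers are all live at once, transient usage of a few hundred MB is
// plausible, which is consistent with the OOM seen on a 256 MB heap.
```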
```diff
@@ -63,6 +63,19 @@ private[spark] class Client(
     .map(_.split(","))
     .getOrElse(Array.empty[String])

+  // Memory settings
+  private val driverMemory = sparkConf.get("spark.driver.memory", "1g")
```
Why not use `sparkConf.get(org.apache.spark.internal.config.DRIVER_MEMORY)`?
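The two lookup styles differ in return type; a quick sketch of the contrast:

```scala
// String-based lookup with a string default:
val driverMemoryString: String = sparkConf.get("spark.driver.memory", "1g")  // "1g"
// Typed ConfigEntry lookup, which parses the value into megabytes:
val driverMemoryMb: Long =
  sparkConf.get(org.apache.spark.internal.config.DRIVER_MEMORY)              // 1024
```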
```scala
      .getOrElse(math.max((MEMORY_OVERHEAD_FACTOR * driverContainerMemoryBytes).toInt,
        MEMORY_OVERHEAD_MIN))
  private val driverContainerMemoryWithOverhead = driverContainerMemoryBytes + memoryOverheadBytes
```
Can we move the above calculation to config.scala?
I'd rather leave `config.scala` for just the config values and keys - there should be no logic there.
```diff
@@ -60,7 +60,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
     .getOrElse(
       throw new SparkException("Must specify the driver pod name"))

-  private val executorMemory = conf.getOption("spark.executor.memory").getOrElse("1g")
+  private val executorMemory = conf.get("spark.executor.memory", "1g")
```
Here we can also use `conf.get(EXECUTOR_MEMORY)`.
@mccheah a couple of small changes on here, but I think this is basically there.
Force-pushed from a0d6165 to 7321a3e: …te-incremental' into configurable-driver-memory
Any further comments, @lins05?
Want to make sure this gets in before the code freeze today.
Looking at this now.
```scala
  private val driverSubmitServerMemoryString = sparkConf.get(
    KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY.key,
    KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY.defaultValueString)
  private val driverContainerMemoryMb = driverMemoryMb + driverSubmitServerMemoryMb
```
Why do we add the two here? Shouldn't we ideally take the max(...) since only one of them is running at a time, either the submit server or the driver code?
Actually we don't shut down the submit server after it starts the application, so both will be running for the whole time.
Saw @ash211 opened an issue. We should solve this in a better way after the alpha.
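A tiny numeric contrast of the two sizing strategies, with assumed heap sizes:

```scala
val driverMemoryMb = 1024             // driver heap
val driverSubmitServerMemoryMb = 256  // submission server heap

// If only one JVM ran at a time, sizing by the max would suffice:
val maxSizing = math.max(driverMemoryMb, driverSubmitServerMemoryMb)  // 1024
// But the submit server stays up alongside the driver, so the container
// must accommodate both heaps at once:
val sumSizing = driverMemoryMb + driverSubmitServerMemoryMb           // 1280
```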
```scala
  // Memory settings
  private val driverMemoryMb = sparkConf.get(org.apache.spark.internal.config.DRIVER_MEMORY)
  private val driverSubmitServerMemoryMb = sparkConf.get(KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY)
  private val driverSubmitServerMemoryString = sparkConf.get(
```
Why do we have separate `driverSubmitServerMemoryMb` and `driverSubmitServerMemoryString`, and similarly for executors?
The strings are used in the environment variables so that they can be passed directly to the JVMs' launch commands.
It seems like we could achieve this using `Utils.memoryStringToMb`. Just a minor nit because it seems like repeated logic to fetch the same parameter in different ways.
The trouble there is that when we use `SparkConf.get(config)` it returns us a long, not a string - where the long is the number of megabytes pre-converted from the string value. I think the issue is that we want to go in the other direction; that is, to convert the numeric value we get from `SparkConf.get` into a memory string.
Hmm.. I wonder if the Utils package has a helper for that. But in any case, it isn't a major concern. Merging. Thanks!
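Pulling the thread together, here is a sketch of the pattern under discussion: the typed lookup, one possible way to render the megabyte count back into a JVM-style memory string, and wiring that string into the container environment (the interpolation and the fabric8 `EnvVarBuilder` usage are assumptions, not necessarily the merged code):

```scala
import io.fabric8.kubernetes.api.model.EnvVarBuilder

// Typed lookup: the ConfigEntry parses the setting into megabytes (a Long).
val driverSubmitServerMemoryMb: Long = sparkConf.get(KUBERNETES_DRIVER_SUBMIT_SERVER_MEMORY)
// Going the other direction: render a JVM-style memory string from the Long.
val driverSubmitServerMemoryString: String = s"${driverSubmitServerMemoryMb}m"  // e.g. "256m"
// Expose it to the container so the launch script can size the JVM heap.
val driverMemoryEnvVar = new EnvVarBuilder()
  .withName("SPARK_DRIVER_MEMORY")
  .withValue(driverSubmitServerMemoryString)
  .build()
```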
```diff
@@ -63,6 +63,7 @@ package object constants {
   private[spark] val ENV_EXECUTOR_MEMORY = "SPARK_EXECUTOR_MEMORY"
   private[spark] val ENV_APPLICATION_ID = "SPARK_APPLICATION_ID"
   private[spark] val ENV_EXECUTOR_ID = "SPARK_EXECUTOR_ID"
+  private[spark] val ENV_DRIVER_MEMORY = "SPARK_DRIVER_MEMORY"
```
Where is this environment variable used? I didn't see any corresponding change to the Dockerfile launching the driver.
It's used in the `spark-class` script which launches the REST server itself: https://github.com/apache-spark-on-k8s/spark/blob/k8s-support-alternate-incremental/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java#L78
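Conceptually, the launcher side consumes that variable along these lines (a hedged paraphrase in Scala of what the linked Java builder does, not the actual source):

```scala
// Read the heap size that the pod spec exported, falling back to a default,
// and turn it into JVM flags for the submission server process.
val mem = sys.env.getOrElse("SPARK_DRIVER_MEMORY", "1g")
val jvmFlags = Seq(s"-Xms$mem", s"-Xmx$mem")
```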
LGTM except for the minor nit in the comment above.
Tested this change with our internal app and verified that it can now complete the submission process by setting a large heap size for the REST service (2g suffices).
* Allow setting memory on the driver submission server.
* Address comments
* Address comments

(cherry picked from commit f6823f3)
Note that the setting here is in addition to setting `spark.driver.memory`. This is so that the memory request and memory limit on the driver can take into account both of these amounts.
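For example, a configuration that sets both values (sizes are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "4g")                         // driver heap
  .set("spark.kubernetes.driver.submitServerMemory", "1g")  // submission server heap
// Container memory request: 4g + 1g = 5g; the limit additionally includes
// spark.kubernetes.driver.memoryOverhead, defaulting to max(10% of 5g, 384m) = 512m.
```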