LOcal REplay framework: dump and replay GpuProjectExec runtime [databricks] #10825
Conversation
Force-pushed from d704928 to 2702c0a
docs/dev/replay-exec.md
Outdated
store files in. Remote path is supported, e.g.: `hdfs://url:9000/path/to/save`

```
spark.conf.set("spark.rapids.sql.test.replay.exec.threshold.timeMS", 100)
```
Nit: Hmm, let's start with 1000 ms as a default value.
Done.
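Not part of the PR diff, but the threshold gate being discussed might look roughly like this sketch; `ThresholdGate` and `shouldDump` are illustrative names, using the 1000 ms default the reviewer suggested:

```scala
import java.util.concurrent.TimeUnit

// Hypothetical sketch of the threshold gate: dump a batch only when the
// measured execution time exceeds the configured threshold.
object ThresholdGate {
  // 1000 ms default, per the review suggestion.
  val DefaultThresholdMs: Long = 1000L

  // elapsedNanos is the op time measured for one batch.
  def shouldDump(elapsedNanos: Long, thresholdMs: Long = DefaultThresholdMs): Boolean =
    TimeUnit.NANOSECONDS.toMillis(elapsedNanos) > thresholdMs
}
```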
docs/dev/replay-exec.md
Outdated
Only dump the column batches when its execution time exceeds the threshold time.

```
spark.conf.set("spark.rapids.sql.test.replay.exec.maxBatchNum", 1)
```
nit: it would be more suitable to call it `spark.rapids.sql.test.replay.batch.limit`,
and mention whether it's per executor or per task.
Done.
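A minimal sketch of what such a per-Exec-instance batch limit could look like; `BatchDumpLimiter` is a hypothetical name, not the PR's actual class:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical per-Exec-instance limiter: once `limit` batches have been
// dumped, further batches are skipped.
class BatchDumpLimiter(limit: Int) {
  private val dumped = new AtomicInteger(0)

  // Returns true while the caller is still allowed to dump a batch.
  def tryAcquire(): Boolean = dumped.getAndIncrement() < limit
}
```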
@@ -2188,6 +2188,40 @@ val SHUFFLE_COMPRESSION_LZ4_CHUNK_SIZE = conf("spark.rapids.shuffle.compression.
    .integerConf
    .createWithDefault(1024)

/**
 * refer to dev doc: `replay-exec.md`
 * only supports "project", will support "agg" later
Nit: TODO. any ticket to link here?
override def otherCopyArgs: Seq[AnyRef] =
  Seq[AnyRef](useTieredProject.asInstanceOf[java.lang.Boolean])
nit: are the extra blank lines necessary?
Done.
override def output: Seq[Attribute] = projectList.map(_.toAttribute)

override lazy val additionalMetrics: Map[String, GpuMetric] = Map(
  OP_TIME -> createNanoTimingMetric(MODERATE_LEVEL, DESCRIPTION_OP_TIME))

- override def internalDoExecuteColumnar() : RDD[ColumnarBatch] = {
+ override def internalDoExecuteColumnar(): RDD[ColumnarBatch] = {
ditto
Keep it, because it follows the code format.
val numOutputRows = gpuLongMetric(NUM_OUTPUT_ROWS)
val numOutputBatches = gpuLongMetric(NUM_OUTPUT_BATCHES)
val opTime = gpuLongMetric(OP_TIME)
val boundProjectList = GpuBindReferences.bindGpuReferencesTiered(projectList, child.output,
  useTieredProject)

val rdd = child.executeColumnar()

// This is for test purpose; dump project list
replayDumper.foreach(d => d.dumpMeta[GpuTieredProject]("GpuTieredProject", boundProjectList))
hide this behind `dumpForReplay = true`?
Renamed to `replayDumperOpt`, so it shows it's an Option.
Now using `replayDumperOpt` to distinguish `dumpForReplay = true` from `dumpForReplay = false`.
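The Option-based gating described in this reply can be sketched like so; `RecordingDumper` and `ProjectLikeExec` are stand-ins for the real classes, not the PR's code:

```scala
import scala.collection.mutable.ListBuffer

// Stand-in for the real replay dumper interface.
trait ReplayDumper {
  def dumpMeta(name: String): Unit
}

// Test double that records what was dumped.
class RecordingDumper extends ReplayDumper {
  val dumped = ListBuffer.empty[String]
  override def dumpMeta(name: String): Unit = dumped += name
}

// When dumpForReplay is false the Option is None, so every dump site
// (replayDumperOpt.foreach) is naturally a no-op.
class ProjectLikeExec(dumpForReplay: Boolean) {
  val replayDumperOpt: Option[RecordingDumper] =
    if (dumpForReplay) Some(new RecordingDumper) else None

  def execute(): Unit =
    replayDumperOpt.foreach(_.dumpMeta("GpuTieredProject"))
}
```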
  f => f.getName.startsWith(s"${projectHash}_cb_data_") &&
    f.getName.endsWith(".parquet"))
if (parquets == null || parquets.isEmpty) {
  logError(s"Project Exec replayer: there is no cb_data_xxx.parquet file in $replayDir")
nit: add a `usage` method and replace all those `logError` calls with it?
Done.
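The suggested `usage` helper might look like this sketch; the message text is illustrative, based on the file layout mentioned elsewhere in the thread:

```scala
// Hypothetical helper consolidating the replayer's scattered logError calls
// into a single usage message.
object ReplayerUsage {
  def usage(problem: String): String =
    s"""Project Exec replayer: $problem
       |Usage: ProjectExecReplayer <path_to_saved_replay_dir> <hash_code_of_project_exec>
       |The replay dir should contain a xxx_GpuTieredProject.meta file and
       |matching cb_data_xxx.parquet files.
       |""".stripMargin
}
```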
build
Force-pushed from dee4dfc to fee9e38
Tested dump/replay with a remote directory (e.g. hdfs://path/to/dir) successfully.
build
I mostly just looked at the docs, not the full code, but here are some comments.
docs/dev/replay-exec.md
Outdated
Set the following configurations to enable this feature:

```
spark.conf.set("spark.rapids.sql.test.replay.exec.type", "project")
```
why is there a ".test." in the name of this config? Isn't this for debugging? "test" makes it sound like it's only for testing purposes and not to be used in real workloads.
Now updated to `.debug`.
docs/dev/replay-exec.md
Outdated
```
spark.conf.set("spark.rapids.sql.test.replay.exec.type", "project")
```
Default `type` value is empty, which means do not dump.
Set this `type` to `project` if you want to dump Project Exec runtime data. Currently only supports
can you set it to multiple execs?
Updated to `exec.types`.
Doc: "Define the Exec types for dumping, separated by comma, e.g.: `project,aggregate,sort`".
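Parsing a comma-separated `exec.types` value is straightforward; the following is a sketch under the assumption (stated later in the thread) that only `project` is supported so far, with hypothetical names:

```scala
// Hypothetical parser for the comma-separated exec.types config value.
object ReplayExecTypes {
  // Only "project" is supported in this PR; "aggregate"/"sort" are future work.
  val Supported: Set[String] = Set("project")

  def parse(conf: String): Set[String] =
    conf.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

  // Keep only the types the dumper actually knows how to handle.
  def validate(types: Set[String]): Set[String] = types.intersect(Supported)
}
```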
docs/dev/replay-exec.md
Outdated
`project` and empty.

```
spark.conf.set("spark.rapids.sql.test.replay.exec.dumpDir", "file:/tmp")
```
same comment here, the config shouldn't have .test in its name
Done.
docs/dev/replay-exec.md
Outdated
```
spark.conf.set("spark.rapids.sql.test.replay.exec.dumpDir", "file:/tmp")
```
Default value is `file:/tmp`.
Specify the dump directory, e.g.: `file:/tmp/my-debug-path`.
can this be a distributed filesystem? You should specify very specifically what is supported. I think we do this in other places in the docs, like for dumping parquet files.
Yes, it can be a distributed/remote path. We use the Hadoop FileSystem API to open/write a file stream, so if the user specifies the corresponding configs to get access to the remote file system, our code can handle it.
Refer to https://github.com/NVIDIA/spark-rapids/blob/branch-24.06/docs/dev/get-json-object-dump-tool.md
'spark.rapids.sql.expression.GetJsonObject.debugPath': '/tmp/DEBUG_JSON_DUMP/'
docs/dev/replay-exec.md
Outdated
`<path_to_saved_replay_dir>` is the replay directory.
`<hash_code_of_project_exec>` is the hash code for a specific Project Exec. Dumping may generate
this just replays one exec then? What does the replay do specifically?
What about other things you may want to collect, like number of workers, executor sizes, etc., to try to get the same environment? That seems like good data to have to try to reproduce the same issue.
Typically, this tool is used to run a single batch (fits in memory) to debug the underlying cuDF/JNI kernel, so it's not related to the number of workers, executor sizes, etc.
that's fine, but a user of this tool likely wants that other information too, and when they replay they likely want to be as close to the customer setup as possible, so making a comment about that would help, I think.
 * refer to dev doc: `replay-exec.md`
 * only supports "project" now
 */
val TEST_REPLAY_EXEC_TYPE =
same comment as above: I would rather see these called DEBUG instead of TEST.
Done.
please change the define from TEST_ to DEBUG_ as well, or just remove it:
TEST_REPLAY_EXEC_TYPE -> REPLAY_EXEC_TYPE or DEBUG_REPLAY_EXEC_TYPE
build
A couple of high-level questions. How does this scale to other execs? I assume you have to modify every single exec separately?
Also, does this actually work for other execs, like hash aggregate and join? What about when multiple batches come in and the exec generates multiple batches? What about if the exec carries some state across batches? Wondering if you have thought about all the different cases.
```
mvn clean install -DskipTests -pl dist -am -DallowConventionalDistJar=true -Dbuildver=330
```
Note: Should specify `-DallowConventionalDistJar`; this config will make sure to generate a
I would rather see this first say that you must build with this option, and then below that have an example and point to our build docs.
Done.
docs/dev/replay-exec.md
Outdated
```
spark.conf.set("spark.rapids.sql.debug.replay.exec.types", "project")
```
Default `types` value is empty which means do not dump.
Define the Exec types for dumping, separated by comma, e.g.: `project,aggregate,sort`.
can you specify different types of joins and other execs, like broadcast join vs hash join?
For agg and sort, it's a future feature; this PR only handles project.
Will change to:

> Default `types` value is empty which means do not dump.
> Define the Exec types for dumping, separated by comma, e.g.: `project`.
> Note: currently only `project` is supported, so there is no need to specify a comma.
```
spark.conf.set("spark.rapids.sql.debug.replay.exec.batch.limit", 1)
```
This config defines the max dumping number of column batches.
what does this mean exactly? If it exceeds the threshold while processing multiple batches, does it dump all of them? If it exceeds the threshold on one batch, do we dump that batch and the ones after it?
I assume this is column batches per exec?
Yes, it's a limit per Exec instance.
If the current batch count exceeds the max limit, the remaining batches will be skipped and not dumped.
docs/dev/replay-exec.md
Outdated
After the job is done, check the dump path; it will have files like:
```
/tmp/replay-exec:
  - xxx_GpuTieredProject.meta // this is serialized GPU Tiered Project case class
```
if I have a huge job that has, say, 10000 projects, is this directory going to become too big? I assume the job keeps running after dumping this out, correct? Is there a way to limit the total number you get?
> if I have a huge job that has say 10000 projects is this directory going to become too big?

Yes, it will be big. We have a config `threshold.timeMS` to reduce the number of batches to dump.
Typical usage is: run the query in the customer env, then check the eventlog to find a threshold time; then run in the customer env again with dump enabled and with this threshold time. Usually we only need one batch to reproduce the slowness, and reproduce with NSYS.

> I assume the job keeps running after dumping this out, correct?

Yes, it keeps running; if needed, we can terminate the process when the first dump is done.

> Is there a way to limit the total number you get?

No. In the future we will collect batches for multiple Execs, and we will not know in advance how many slow batches there are.
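Putting the workflow from this reply together, the dump-enabled rerun at the customer environment might be configured like this. The `.debug.` names for `dumpDir` and `threshold.timeMS` are assumed to follow the same rename as the other configs in this PR, and all values are illustrative:

```scala
// Illustrative settings for the dump-enabled rerun; the threshold value
// comes from inspecting the eventlog of the first run.
spark.conf.set("spark.rapids.sql.debug.replay.exec.types", "project")
spark.conf.set("spark.rapids.sql.debug.replay.exec.dumpDir", "hdfs://url:9000/path/to/save") // assumed name
spark.conf.set("spark.rapids.sql.debug.replay.exec.threshold.timeMS", 1000)                  // assumed name
spark.conf.set("spark.rapids.sql.debug.replay.exec.batch.limit", 1)
```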
Yes. This PR only dumps Project, which is a simple Exec.
It does not work for hash aggregate or join.
That case is covered in this PR; for the agg Exec, it does handle multiple batches.
Allen and I dug into Hash Agg and found what you mentioned. Maybe it's a follow-up to have a convenient method to handle all the Execs.
build
Changed to draft, because there are errors when testing on Dataproc.
Force-pushed from 06aeb96 to 5742c4b
Signed-off-by: Chong Gao <res_life@163.com>
build
Has this been superseded by #11084?
Should we close this PR?
We have another implementation, so closing this.
Closes #10862

Collecting nsys/ncu files is time-consuming when running on customer data, because customer data is usually huge. Actually, we only need the small data segment that runs in a JNI/cuDF kernel.
This PR aims to dump/replay the runtime (data and meta) when a Project Exec's execution time on a batch exceeds the threshold time.
This is a feature to dump and replay Project Exec runtime (by column batch) for performance debugging:
- store/restore the GpuTieredProject case class
- store/restore ColumnarBatch data
- replay GpuProjectExec