Add class for documented Nvtx Ranges #12035

Draft
wants to merge 8 commits into
base: branch-25.04
3 changes: 3 additions & 0 deletions dist/pom.xml
@@ -358,6 +358,9 @@ self.log("... OK")
<arg value="${project.basedir}/../docs/configs.md"/>
<arg value="${project.basedir}/../docs/additional-functionality/advanced_configs.md"/>
</java>
<java classname="com.nvidia.spark.rapids.NvtxRangeDocs" failonerror="true">
<arg value="${project.basedir}/../docs/dev/nvtx_ranges.md"/>
</java>
<java classname="com.nvidia.spark.rapids.SupportedOpsDocs" failonerror="true">
<arg value="${project.basedir}/../docs/supported_ops.md"/>
</java>
2 changes: 1 addition & 1 deletion docs/dev/compute_sanitizer.md
@@ -1,7 +1,7 @@
---
layout: page
title: Compute Sanitizer
nav_order: 7
Collaborator Author

This PR adds a new page to the developer docs. I also noticed that two existing pages had the same nav_order, which I am fixing here as well, so the end result is that all the later pages get moved back by 2.

Collaborator

This is a separate concern worth its own PR.

nav_order: 9
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/get-json-object-dump-tool.md
@@ -1,7 +1,7 @@
---
layout: page
title: Dump tool for get_json_object
nav_order: 12
nav_order: 14
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/gpu-core-dumps.md
@@ -1,7 +1,7 @@
---
layout: page
title: GPU Core Dumps
nav_order: 9
nav_order: 11
parent: Developer Overview
---
# GPU Core Dumps
2 changes: 1 addition & 1 deletion docs/dev/idea-code-style-settings.md
@@ -1,7 +1,7 @@
---
layout: page
title: IDEA Code Style Settings
nav_order: 5
nav_order: 7
parent: Developer Overview
---
```xml
2 changes: 1 addition & 1 deletion docs/dev/lore.md
@@ -1,7 +1,7 @@
---
layout: page
title: The Local Replay Framework
nav_order: 13
nav_order: 15
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/mem_debug.md
@@ -1,7 +1,7 @@
---
layout: page
title: Memory Debugging
nav_order: 10
nav_order: 12
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/microk8s.md
@@ -1,7 +1,7 @@
---
layout: page
title: Setting up a Microk8s Environment
nav_order: 6
nav_order: 8
parent: Developer Overview
---

8 changes: 5 additions & 3 deletions docs/dev/nvtx_profiling.md
@@ -1,7 +1,7 @@
---
layout: page
title: NVTX Ranges
nav_order: 3
title: NVTX Profiling
nav_order: 4
parent: Developer Overview
---
# Using NVTX Ranges with the RAPIDS Plugin for Spark
@@ -46,13 +46,15 @@ You should have a *.qdrep file once the trace completes. This can now be opened
If you are in Java or Scala land you can do the following:

```
val nvtxRange = new NvtxRange(<name of the range>, NvtxColor.YELLOW)
val nvtxRange = new NvtxRangeWithDoc(<NvtxId>, NvtxColor.YELLOW)
try {
// the code you want to profile
} finally {
nvtxRange.close()
}
```
See [nvtx_ranges.md](https://nvidia.github.io/spark-rapids/docs/dev/nvtx_ranges.html) for documentation on existing ranges and registering a new range.

In C++ land:
```
gdf_nvtx_range_push_hex("write_orc_all", 0xffff0000);
23 changes: 23 additions & 0 deletions docs/dev/nvtx_ranges.md
@@ -0,0 +1,23 @@
---
layout: page
title: NVTX Ranges
nav_order: 5
parent: Developer Overview
---
<!-- Generated by NvtxRangeDocs.help. DO NOT EDIT! -->
# RAPIDS Accelerator for Apache Spark Nvtx Range Glossary
The following is the list of Nvtx ranges that are used throughout
the plugin. To add your own Nvtx range to the code, create an NvtxId
entry in NvtxRangeWithDoc.scala and create an `NvtxRangeWithDoc` in the
code location that you want to cover, passing in the newly created NvtxId.

See [nvtx_profiling.md](https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html) for more info.



## Nvtx Ranges

Name | Description
-----|-------------
Acquire GPU|Time waiting for GPU semaphore to be acquired
Release GPU|Releasing the GPU semaphore
2 changes: 1 addition & 1 deletion docs/dev/shimplify.md
@@ -1,7 +1,7 @@
---
layout: page
title: Shim Source Code Layout Simplification with Shimplify
nav_order: 8
nav_order: 10
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/shims.md
@@ -1,7 +1,7 @@
---
layout: page
title: Shim Development
nav_order: 4
nav_order: 6
parent: Developer Overview
---

2 changes: 1 addition & 1 deletion docs/dev/testing.md
@@ -1,7 +1,7 @@
---
layout: page
title: Testing
nav_order: 2
nav_order: 3
parent: Developer Overview
---
An overview of testing can be found within the repository at:
3 changes: 3 additions & 0 deletions scala2.13/dist/pom.xml
@@ -358,6 +358,9 @@ self.log("... OK")
<arg value="${project.basedir}/../docs/configs.md"/>
<arg value="${project.basedir}/../docs/additional-functionality/advanced_configs.md"/>
</java>
<java classname="com.nvidia.spark.rapids.NvtxRangeDocs" failonerror="true">
<arg value="${project.basedir}/../docs/dev/nvtx_ranges.md"/>
</java>
<java classname="com.nvidia.spark.rapids.SupportedOpsDocs" failonerror="true">
<arg value="${project.basedir}/../docs/supported_ops.md"/>
</java>
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
* Copyright (c) 2019-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -22,7 +22,7 @@ import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import ai.rapids.cudf.{NvtxColor, NvtxRange, NvtxUniqueRange}
import ai.rapids.cudf.{NvtxColor, NvtxUniqueRange}
import com.nvidia.spark.rapids.ScalableTaskCompletion.onTaskCompletion

import org.apache.spark.TaskContext
@@ -382,7 +382,7 @@ private final class GpuSemaphore() extends Logging {
}

def releaseIfNecessary(context: TaskContext): Unit = {
val nvtxRange = new NvtxRange("Release GPU", NvtxColor.RED)
val nvtxRange = new NvtxRangeWithDoc(NvtxId.RELEASE_GPU, NvtxColor.RED)
try {
val taskAttemptId = context.taskAttemptId()
GpuTaskMetrics.get.updateRetry(taskAttemptId)
@@ -0,0 +1,84 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.nvidia.spark.rapids

import ai.rapids.cudf.{NvtxColor, NvtxRange}
import java.io.{File, FileOutputStream}
import scala.collection.mutable.ListBuffer

sealed class NvtxId private(val name: String, val doc: String) {
def help(): Unit = println(s"$name|$doc")
}

object NvtxId {
val registeredRanges = new ListBuffer[NvtxId]()

private def register(nvtxId: NvtxId): Unit = registeredRanges += nvtxId
Collaborator

I would make it a Map and detect collisions.
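
One way to read this suggestion is a registry keyed by range name that fails fast when the same name is registered twice. The sketch below is only an illustration under stated assumptions: `RangeId` and `RangeRegistry` are hypothetical stand-ins rather than names from the PR, and using a `LinkedHashMap` to keep the generated table in registration order is an assumption, not something the comment asks for.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the PR's NvtxId; only name and doc are modeled.
final case class RangeId(name: String, doc: String)

// Sketch of a registry keyed by range name that rejects duplicate names.
object RangeRegistry {
  // LinkedHashMap preserves insertion order, so generated docs stay stable.
  private val byName = mutable.LinkedHashMap.empty[String, RangeId]

  // Registers a new range; throws IllegalArgumentException on a name collision.
  def register(name: String, doc: String): RangeId = synchronized {
    require(!byName.contains(name), s"Duplicate NVTX range name: $name")
    val id = RangeId(name, doc)
    byName.put(name, id)
    id
  }

  // Snapshot of all registered ranges in registration order.
  def registered: List[RangeId] = synchronized { byName.values.toList }
}
```

With this shape, registering "Acquire GPU" a second time fails when the owning object initializes instead of silently producing two rows in the generated glossary.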


private def apply(name: String, doc: String): NvtxId = {
val ret = new NvtxId(name, doc)
register(ret)
ret
}

val ACQUIRE_GPU: NvtxId = NvtxId(name = "Acquire GPU", doc = "Time waiting for GPU semaphore " +
"to be acquired")

val RELEASE_GPU: NvtxId = NvtxId(name = "Release GPU", doc = "Releasing the GPU semaphore")
}

object NvtxRangeDocs {
def helpCommon(): Unit = {
println("---")
println("layout: page")
println("title: NVTX Ranges")
println("nav_order: 5")
println("parent: Developer Overview")
println("---")
println(s"<!-- Generated by NvtxRangeDocs.help. DO NOT EDIT! -->")
// scalastyle:off line.size.limit
println("""# RAPIDS Accelerator for Apache Spark Nvtx Range Glossary
|The following is the list of Nvtx ranges that are used throughout
|the plugin. To add your own Nvtx range to the code, create an NvtxId
|entry in NvtxRangeWithDoc.scala and create an `NvtxRangeWithDoc` in the
|code location that you want to cover, passing in the newly created NvtxId.
|
|See [nvtx_profiling.md](https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html) for more info.
|
|""".stripMargin)
// scalastyle:on line.size.limit
println("\n## Nvtx Ranges\n")
println("Name | Description")
println("-----|-------------")
}

def main(args: Array[String]): Unit = {
val configs = new FileOutputStream(new File(args(0)))
Console.withOut(configs) {
Console.withErr(configs) {
helpCommon()
NvtxId.registeredRanges.foreach(_.help())
}
}
}
}

class NvtxRangeWithDoc(val id: NvtxId, color: NvtxColor) extends AutoCloseable {
Collaborator

nit

Suggested change
class NvtxRangeWithDoc(val id: NvtxId, color: NvtxColor) extends AutoCloseable {
case class NvtxRangeWithDoc(val id: NvtxId, color: NvtxColor) extends AutoCloseable {

Collaborator

Can we please change the API? The AutoCloseable was me trying to do an RAII-like API, but that is not a great API for Java/Scala. I know this will require changes to the cuDF code to open up the API, but can we please add some static push and pop methods in NvtxRange that take the name and the color bits and that do check whether NVTX is enabled?

Then each of the NvtxIds (we might want to rename them) can have an apply method, so we get the following code instead.

Nvtx.ACQUIRE_GPU(semWaitTimeNs) {
  f()
}
Nvtx.RELEASE_GPU {
  ...
}

You might have guessed from this that I would like the color to be a part of the NvtxId as well. I think it is better if we are consistent in the color for a given range and that we have it documented.
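
A rough sketch of the shape this comment describes, with the color stored on the range and an `apply` method that wraps a body. Everything here is hypothetical: `FakeNvtx.push` and `FakeNvtx.pop` stand in for the static methods that, per the comment, would first have to be added to `NvtxRange`, and they only record events instead of calling NVTX. `DocumentedRange`, `Color`, and the event log are likewise illustration names, and the timer argument from the comment's first call is left out to keep the example small.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical color type; the plugin uses ai.rapids.cudf.NvtxColor instead.
sealed trait Color
case object Red extends Color
case object Yellow extends Color

// Hypothetical stand-in for static push/pop; records events rather than calling NVTX.
object FakeNvtx {
  val events = ArrayBuffer.empty[String]
  def push(name: String, color: Color): Unit = events += s"push:$name"
  def pop(): Unit = events += "pop"
}

// A range that owns its name, color, and doc, so every call site stays consistent.
final class DocumentedRange(val name: String, val color: Color, val doc: String) {
  // Runs the body inside the range; the pop happens even if the body throws.
  def apply[A](body: => A): A = {
    FakeNvtx.push(name, color)
    try body finally FakeNvtx.pop()
  }
}

object Nvtx {
  val ACQUIRE_GPU = new DocumentedRange("Acquire GPU", Red,
    "Time waiting for GPU semaphore to be acquired")
  val RELEASE_GPU = new DocumentedRange("Release GPU", Red,
    "Releasing the GPU semaphore")
}
```

A call site then reads `Nvtx.RELEASE_GPU { ... }`, with no explicit `close()` and no color argument to get wrong.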

private val nvtxRange: NvtxRange = new NvtxRange(id.name, color)

override def close(): Unit = nvtxRange.close()
}
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
* Copyright (c) 2023-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -24,6 +24,7 @@ import java.util.concurrent.TimeUnit
import scala.collection.mutable

import ai.rapids.cudf.{NvtxColor, NvtxRange}
import com.nvidia.spark.rapids.{NvtxId, NvtxRangeWithDoc}
import com.nvidia.spark.rapids.Arm.withResource
import com.nvidia.spark.rapids.ScalableTaskCompletion.onTaskCompletion
import com.nvidia.spark.rapids.jni.RmmSpark
@@ -289,11 +290,26 @@ class GpuTaskMetrics extends Serializable {
}
}

private def timeIt[A](timer: NanoSecondAccumulator,
range: NvtxId,
color: NvtxColor,
f: => A): A = {
val start = System.nanoTime()
withResource(new NvtxRangeWithDoc(range, color)) { _ =>
Collaborator

nit: if you make it a case class, or manually add a companion object with a factory apply, you can drop the `new`:


Suggested change
withResource(new NvtxRangeWithDoc(range, color)) { _ =>
withResource(NvtxRangeWithDoc(range, color)) { _ =>
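
For readers less familiar with the companion-factory idiom, here is a minimal sketch of the second option. `TracedRange` and `Loan.withResource` are hypothetical stand-ins: the first replaces `NvtxRangeWithDoc` with a class that only tracks whether it was closed, and the second is a simplified loan-pattern helper in the spirit of `Arm.withResource`, not the plugin's implementation.

```scala
// Hypothetical stand-in for NvtxRangeWithDoc; it only remembers whether it was closed.
class TracedRange(val name: String) extends AutoCloseable {
  var closed = false
  override def close(): Unit = closed = true
}

// Companion factory, so Scala 2 callers can write TracedRange("x") without `new`.
object TracedRange {
  def apply(name: String): TracedRange = new TracedRange(name)
}

object Loan {
  // Simplified loan pattern: run the body, then always close the resource.
  def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
    try body(resource) finally resource.close()
}
```

A case class gives the same call syntax for free, but it also generates `equals`, `hashCode`, and `copy`, which may not be meaningful for an object that wraps a live range; that trade-off is a design consideration rather than something the comment states.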

try {
f
} finally {
timer.add(System.nanoTime() - start)
}
}
}

def addSemaphoreHoldingTime(duration: Long): Unit = semaphoreHoldingTime.add(duration)

def getSemWaitTime(): Long = semWaitTimeNs.value.value

def semWaitTime[A](f: => A): A = timeIt(semWaitTimeNs, "Acquire GPU", NvtxColor.RED, f)
def semWaitTime[A](f: => A): A = timeIt(semWaitTimeNs, NvtxId.ACQUIRE_GPU,
NvtxColor.RED, f)

def spillToHostTime[A](f: => A): A = {
timeIt(spillToHostTimeNs, "spillToHostTime", NvtxColor.RED, f)