This repository is a collection of examples to enable new users to start using ROCm, as well as provide more advanced examples for experienced users.
The examples are structured in several categories:
- HIP-Basic showcases some basic functionality without any additional dependencies
- Libraries contains examples for ROCm libraries that provide higher-level functionality
- Applications showcases some common applications, using HIP to accelerate them
- AI contains instructions on how to use ROCm for AI
- Tutorials contains the code accompanying the HIP Tutorials that can be found in the HIP documentation
For a full overview of the examples, see the section repository contents.
- CMake (at least version 3.21)
- A number of examples also support building via GNU Make - available through the distribution's package manager
- ROCm (at least version 6.x.x)
- For example-specific prerequisites, see the example subdirectories.
- Visual Studio 2019 or 2022 with the "Desktop Development with C++" workload
- HIP SDK for Windows
- The Visual Studio ROCm extension needs to be installed to build with the solution files.
- CMake (optional, to build with CMake. Requires at least version 3.21)
- Ninja (optional, to build with CMake)
These instructions assume that the prerequisites for every example are installed on the system.
See CMake build options for an overview of build options.
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ cmake -S . -B build (on ROCm) or
$ cmake -S . -B build -D GPU_RUNTIME=CUDA (on CUDA)
$ cmake --build build
$ cmake --install build --prefix install
Beware that only a subset of the examples support building via Make.
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ make (on ROCm) or
$ make GPU_RUNTIME=CUDA (on CUDA)
Alternatively, instead of installing the prerequisites on the system, the Dockerfiles in this repository can be used to build images that provide all required prerequisites. Note that the ROCm kernel GPU driver still needs to be installed on the host system.
The following instructions showcase building the Docker image and full example suite inside the container using CMake:
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples/Dockerfiles
$ docker build . -t rocm-examples -f hip-libraries-rocm-ubuntu.Dockerfile --build-arg GID="$(getent group render | cut -d':' -f 3)" (on ROCm) or
$ docker build . -t rocm-examples -f hip-libraries-cuda-ubuntu.Dockerfile (on CUDA)
$ docker run -it --device /dev/kfd --device /dev/dri rocm-examples bash (on ROCm) or
$ docker run -it --gpus=all rocm-examples bash (on CUDA)
# git clone https://github.com/ROCm/rocm-examples.git
# cd rocm-examples
# cmake -S . -B build (on ROCm) or
# cmake -S . -B build -D GPU_RUNTIME=CUDA (on CUDA)
# cmake --build build
The built executables can be found and run in the build directory:
# ./build/Libraries/rocRAND/simple_distributions_cpp/simple_distributions_cpp
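
The `--build-arg GID=...` argument in the ROCm `docker build` command above passes the numeric ID of the host's render group into the image, presumably so that the user inside the container can access the GPU device nodes. As a minimal sketch of what that pipeline extracts, the snippet below applies the same `cut` invocation to a sample group entry. The entry `render:x:109:alice` and the helper name `gid_of_entry` are made up for illustration; the real GID differs between systems.

```shell
# Hypothetical helper: extract the numeric GID (third colon-separated field)
# from a group database entry of the form name:password:GID:members.
gid_of_entry() {
    printf '%s\n' "$1" | cut -d':' -f 3
}

# Illustrative entry only; on a real host, `getent group render` prints the actual one.
gid_of_entry 'render:x:109:alice'   # prints 109
```

If `getent group render` prints nothing on the host, the build argument ends up empty, so it is worth checking the output of that command before building the image.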
The repository has Visual Studio project files for all examples and individually for each example.
- Project files for Visual Studio are named as the example with the _vs<Visual Studio Version> suffix added, e.g. device_sum_vs2019.sln for the device sum example.
- The project files can be built from Visual Studio or from the command line using MSBuild.
  - Use the build solution command in Visual Studio to build.
  - To build from the command line, execute C:\Program Files (x86)\Microsoft Visual Studio\<Visual Studio Version>\<Edition>\MSBuild\Current\Bin\MSBuild.exe <path to project folder>.
    - To build in Release mode, pass the /p:Configuration=Release option to MSBuild.
    - The executables will be created in a subfolder named "Debug" or "Release" inside the project folder.
- The HIP specific project settings like the GPU architectures targeted can be set on the General [AMD HIP C++] tab of project properties.
- The top level solution files come in two flavors: ROCm-Examples-VS<Visual Studio Version>.sln and ROCm-Examples-Portable-VS<Visual Studio Version>.sln. The former contains all examples, while the latter contains the examples that support both ROCm and CUDA.
First, clone the repository and go to the source directory.
git clone https://github.com/ROCm/rocm-examples.git
cd rocm-examples
There are two ways to build the project using CMake: with the Visual Studio Developer Command Prompt (recommended) or with a standard Command Prompt. See CMake build options for an overview of build options.
Select Start, search for "x64 Native Tools Command Prompt for VS 2019", and open the resulting Command Prompt. Ninja must be selected as generator, and Clang as C++ compiler.
cmake -S . -B build -G Ninja -D CMAKE_CXX_COMPILER=clang
cmake --build build
Run the standard Command Prompt. When using the standard Command Prompt to build the project, the Resource Compiler (RC) path must be specified. The RC is a tool used to build Windows-based applications; its default path is C:/Program Files (x86)/Windows Kits/10/bin/<Windows version>/x64/rc.exe. Finally, the generator must be set to Ninja.
cmake -S . -B build -G Ninja -D CMAKE_RC_COMPILER="<path to rc compiler>"
cmake --build build
The following options are available when building with CMake.
Option | Relevant to | Default value | Description |
---|---|---|---|
GPU_RUNTIME | HIP / CUDA | "HIP" | GPU runtime to compile for. Set to "CUDA" to compile for NVIDIA GPUs and to "HIP" for AMD GPUs. |
CMAKE_HIP_ARCHITECTURES | HIP | Compiler default | HIP device architectures to target, e.g. "gfx908;gfx1030" to target architectures gfx908 and gfx1030. |
CMAKE_CUDA_ARCHITECTURES | CUDA | Compiler default | CUDA architecture to compile for, e.g. "50;72" to target compute capability 50 and 72. |
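
As a sketch of how the options in the table combine on the command line, the helper below prints a configure command for a chosen runtime and architecture list. The function name `configure_cmd` and its two positional parameters are hypothetical and not part of the repository; only the `-D` options are taken from the table above.

```shell
# Hypothetical helper: print (not run) a CMake configure command.
#   $1: GPU runtime, "HIP" or "CUDA" (defaults to "HIP", as in the table)
#   $2: optional semicolon-separated architecture list
configure_cmd() {
    local runtime="${1:-HIP}"
    local archs="${2:-}"
    local cmd="cmake -S . -B build -D GPU_RUNTIME=${runtime}"
    if [ -n "${archs}" ]; then
        # Pick the architecture option that matches the runtime:
        # CMAKE_HIP_ARCHITECTURES or CMAKE_CUDA_ARCHITECTURES.
        cmd="${cmd} -D CMAKE_${runtime}_ARCHITECTURES=\"${archs}\""
    fi
    printf '%s\n' "${cmd}"
}

configure_cmd HIP 'gfx908;gfx1030'
configure_cmd CUDA '50;72'
```

The quotes around the architecture list matter when the printed command is run: without them, the shell would treat the `;` between architectures as a command separator.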
- AI showcases the functionality for executing quantized models using Torch-MIGraphX.
- Applications groups a number of examples ... .
  - bitonic_sort: Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort.
  - convolution: A simple GPU implementation for the calculation of discrete convolutions.
  - floyd_warshall: Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs.
  - histogram: Histogram over a byte array with memory bank optimization.
  - monte_carlo_pi: Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation.
  - prefix_sum: Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm.
- Common contains common utility functionality shared between the examples.
- HIP-Basic hosts self-contained recipes showcasing HIP runtime functionality.
  - assembly_to_executable: Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code.
  - bandwidth: Program that measures memory bandwidth from host to device, device to host, and device to device.
  - bit_extract: Program that showcases how to use HIP built-in bit extract.
  - device_globals: Showcases how to set global variables on the device from the host.
  - device_query: Program that showcases how properties from the device may be queried.
  - dynamic_shared: Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
  - events: Measuring execution time and synchronizing with HIP events.
  - gpu_arch: Program that showcases how to implement GPU architecture-specific code.
  - hello_world: Simple program that showcases launching kernels and printing from the device.
  - hello_world_cuda: Simple HIP program that showcases setting up CMake to target the CUDA platform.
  - hipify: Simple program and build definitions that showcase automatically converting a CUDA .cu source into portable HIP .hip source.
  - llvm_ir_to_executable: Shows how to create a HIP executable from LLVM IR.
  - inline_assembly: Program that showcases how to use inline assembly in a portable manner.
  - matrix_multiplication: Multiply two dynamically sized matrices utilizing shared memory.
  - module_api: Shows how to load and execute a HIP module at runtime.
  - moving_average: Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
  - multi_gpu_data_transfer: Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
  - occupancy: Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
  - opengl_interop: Showcases how to share resources and computation between HIP and OpenGL.
  - runtime_compilation: Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
  - saxpy: Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality.
  - shared_memory: Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
  - static_device_library: Shows how to create a static library containing device functions, and how to link it with an executable.
  - static_host_library: Shows how to create a static library containing HIP host functions, and how to link it with an executable.
  - streams: Program that showcases usage of multiple streams each with their own tasks.
  - texture_management: Shows the usage of texture memory.
  - vulkan_interop: Showcases how to share resources and computation between HIP and Vulkan.
  - warp_shuffle: Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations.
- Dockerfiles hosts Dockerfiles with ready-to-use environments for the various samples. See Dockerfiles/README.md for details.
- Docs
  - CONTRIBUTING.md contains information on how to contribute to the examples.
- Libraries
  - hipBLAS
    - gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
    - her: Showcases a rank-2 update of a Hermitian matrix with complex values.
    - scal: Simple program that showcases vector scaling (SCAL) operation.
  - hipCUB
    - device_radix_sort: Simple program that showcases hipcub::DeviceRadixSort::SortPairs.
    - device_sum: Simple program that showcases hipcub::DeviceReduce::Sum.
  - hipSOLVER
    - gels: Solve a linear system of the form $A\times X=B$.
    - geqrf: Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
    - gesvd: Program that showcases how to obtain a singular value decomposition with the hipSOLVER API.
    - getrf: Program that showcases how to perform an LU factorization with hipSOLVER.
    - potrf: Perform Cholesky factorization and solve linear system with result.
    - syevd: Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER.
    - syevdx: Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER.
    - sygvd: Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B\cdot X \cdot \Lambda$.
    - syevj: Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method.
    - syevj_batched: Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices.
    - sygvj: Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
  - rocBLAS
    - level_1: Operations between vectors and vectors.
    - level_2: Operations between vectors and matrices.
    - level_3: Operations between matrices and matrices.
      - gemm: Showcases the general matrix product operation.
      - gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
  - rocFFT
  - rocPRIM
    - block_sum: Simple program that showcases rocprim::block_reduce with an addition operator.
    - device_sum: Simple program that showcases rocprim::reduce with an addition operator.
  - hipFFT
  - rocRAND
    - simple_distributions_cpp: A command-line app to compare random number generation on the CPU and on the GPU with rocRAND.
  - rocSOLVER
    - getf2: Program that showcases how to perform an LU factorization with rocSOLVER.
    - getri: Program that showcases matrix inversion by LU decomposition using rocSOLVER.
    - syev: Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix.
    - syev_batched: Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices.
    - syev_strided_batched: Shows how to compute the eigenvalues and eigenvectors for multiple symmetrical real matrices that are stored with an arbitrary stride.
  - rocSPARSE
    - level_2: Operations between sparse matrices and dense vectors.
      - bsrmv: Showcases a sparse matrix-vector multiplication using BSR storage format.
      - bsrxmv: Showcases a masked sparse matrix-vector multiplication using BSR storage format.
      - bsrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix.
      - coomv: Showcases a sparse matrix-vector multiplication using COO storage format.
      - csritsv: Showcases how to find an iterative solution with the Jacobi method for a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
      - csrmv: Showcases a sparse matrix-vector multiplication using CSR storage format.
      - csrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
      - ellmv: Showcases a sparse matrix-vector multiplication using ELL storage format.
      - gebsrmv: Showcases a sparse matrix-dense vector multiplication using GEBSR storage format.
      - gemvi: Showcases a dense matrix-sparse vector multiplication.
      - spitsv: Showcases how to iteratively solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
      - spmv: Showcases a general sparse matrix-dense vector multiplication.
      - spsv: Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix.
    - level_3: Operations between sparse and dense matrices.
      - bsrmm: Showcases a sparse matrix-matrix multiplication using BSR storage format.
      - bsrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
      - csrmm: Showcases a sparse matrix-matrix multiplication using CSR storage format.
      - csrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
      - gebsrmm: Showcases a sparse matrix-matrix multiplication using GEBSR storage format.
      - gemmi: Showcases a dense matrix-sparse matrix multiplication using CSR storage format.
      - sddmm: Showcases a sampled dense-dense matrix multiplication using CSR storage format.
      - spmm: Showcases a sparse matrix-dense matrix multiplication.
      - spsm: Showcases a sparse triangular linear system solver using CSR storage format.
    - preconditioner: Manipulations on sparse matrices to obtain sparse preconditioner matrices.
      - bsric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix.
      - bsrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix.
      - csric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix.
      - csrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix.
      - csritilu0: Showcases how to iteratively obtain the incomplete LU decomposition of a sparse CSR square matrix.
      - gpsv: Shows how to compute the solution of a pentadiagonal linear system.
      - gtsv: Shows how to compute the solution of a tridiagonal linear system.
  - rocThrust
    - device_ptr: Simple program that showcases the usage of the thrust::device_ptr template.
    - norm: An example that computes the Euclidean norm of a thrust::device_vector.
    - reduce_sum: An example that computes the sum of a thrust::device_vector integer vector using the thrust::reduce() generalized summation and the thrust::plus operator.
    - remove_points: Simple program that demonstrates the usage of the thrust random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle.
    - saxpy: Simple program that implements the SAXPY operation (y[i] = a * x[i] + y[i]) using rocThrust and showcases the usage of the vector and functor templates and of thrust::fill and thrust::transform operations.
    - vectors: Simple program that showcases the host_vector and the device_vector of rocThrust.
- Tutorials: Showcases HIP Documentation Tutorials.
  - reduction: Showcases a reduction tutorial for HIP Documentation.