This is an NVIDIA AI Workbench example project that provides a short introduction to the cuDF library, a GPU-accelerated Python DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming. Users who have installed AI Workbench can get up and running with this project in minutes.
Have questions? Please direct any issues, fixes, suggestions, and discussion on this project to the DevZone Members Only Forum thread here.
Included in this project are eight tutorial notebooks. The first five are relatively easy to run; for the last three (*), users with less than 16GB of GPU RAM may need to push the project to heavier hardware to run all of the performance benchmarks. Good news: Workbench makes this easy!
-
cudf-pandas-demo: This notebook demonstrates the acceleration that cudf.pandas gives over vanilla Pandas. The example runs through loading some data with Pandas and getting some performance numbers, then running the same code again with the cudf.pandas plugin to show the speedup that is possible with NVIDIA hardware.
-
10min: This is a short introduction to cuDF and Dask-cuDF, geared mainly towards new users.
cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.
Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.
Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call dask_cudf.read_csv(...), your cluster’s GPUs do the work of parsing the CSV file(s) by calling cudf.read_csv().
Which libraries do I use? If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF.
-
cupy-interop: This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations).
-
missing-data: This notebook discusses missing (also referred to as NA) values in cuDF. cuDF supports missing values in all dtypes. These missing values are represented by <NA> and are also referred to as “null values”.
-
Introduction_to_Strings: This notebook shows how to manipulate strings with cuDF DataFrames.
-
Introduction_to_Exploratory_Data_Analysis_using_cuDF: This notebook shows how to perform basic exploratory data analysis (EDA) with cuDF DataFrames.
-
Introduction_to_Time_Series_Data_Analysis_using_cuDF: This notebook shows how to perform EDA on time-series DataFrames with cuDF.
-
performance-comparisons (*): This notebook compares the performance of cuDF and pandas on identical data sizes. It primarily showcases the speedup factors users can see when equivalent pandas-style APIs run on GPUs using cuDF. The notebook is written to measure performance on NVIDIA GPUs with large memory. Performance results may vary by data size, as well as by the CPU and GPU used.
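The "Which libraries do I use?" guidance in the 10min entry above can be sketched as a small decision helper. This is only an illustration of that rule of thumb: the function name `choose_library`, its parameters, and the `headroom` fraction used to approximate "comfortably fits" are all assumptions made for this sketch, not part of cuDF or Dask-cuDF.

```python
GiB = 1024 ** 3


def choose_library(data_bytes, gpu_memory_bytes, *, multi_gpu=False,
                   many_files=False, headroom=0.5):
    """Return "cudf" or "dask_cudf" following the README's rule of thumb.

    headroom is a hypothetical safety factor: the data is treated as
    "comfortably fitting" only if it uses at most this fraction of GPU memory.
    """
    fits_comfortably = data_bytes <= headroom * gpu_memory_bytes
    # Any of these conditions points to Dask-cuDF in the guidance above:
    # multiple GPUs, data spread across many files, or data too large for one GPU.
    if multi_gpu or many_files or not fits_comfortably:
        return "dask_cudf"
    return "cudf"


small_job = choose_library(2 * GiB, 16 * GiB)
large_job = choose_library(30 * GiB, 16 * GiB)
distributed_job = choose_library(2 * GiB, 16 * GiB, multi_gpu=True)
```

The helper only encodes the memory and distribution criteria; whether a workflow is "fast enough" on a single GPU still has to be judged by measuring it.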
Important Considerations:
-
The notebook titled performance-comparisons.ipynb may take a long time to execute on laptop and/or workstation hardware. This is because we are running benchmarks and conducting dataframe operations on massive datasets using both Pandas and cuDF. Feel free to adjust the num_rows variable as needed.
-
If working locally on a laptop or workstation, also consider pushing this project to heavier hardware (original notebook authors used 2x H100 GPUs) to run this notebook. Good news: NVIDIA AI Workbench makes this push easy!
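To get a feel for how run time scales before committing to a full benchmark, a timing harness along the following lines can help. It is a minimal sketch using only the standard library: `best_time`, `speedup`, and `workload` are hypothetical names, and the loop inside `workload` is a stand-in for a real DataFrame operation. Only the variable name `num_rows` comes from the notebook; the value shown is illustrative.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(cpu_seconds, gpu_seconds):
    """Express a pair of timings as a speedup factor (CPU time / GPU time)."""
    return cpu_seconds / gpu_seconds


num_rows = 100_000  # illustrative value; lower it on smaller hardware


def workload():
    # Placeholder computation standing in for a pandas or cuDF operation.
    return sum(i * i for i in range(num_rows))


elapsed = best_time(workload)
print(f"best of 3 runs: {elapsed:.4f} s for {num_rows} rows")
```

Timing the same workload at a few small values of num_rows gives a rough idea of whether the full-size run is practical on the current machine.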
- Operating System: Ubuntu 22.04
- CPU requirements: None, tested with Intel® Xeon® Gold 6240R CPU @ 2.40GHz
- GPU requirements: Any NVIDIA training GPU, tested with NVIDIA A100-40GB
- NVIDIA driver requirements: Latest driver version
- Storage requirements: 40GB
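Whether a machine falls under the 16GB GPU RAM threshold mentioned earlier can be checked with a short script. The sketch below assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on the PATH and uses its `--query-gpu=memory.total` query; the helper names and the exact cutoff are assumptions for illustration. Reported memory is in MiB and is often slightly below the marketed capacity, so treat the result as a hint.

```python
import shutil
import subprocess

LOW_MEMORY_MIB = 16 * 1024  # approximation of the "< 16GB" threshold


def parse_memory_mib(smi_output):
    """Parse one integer (MiB) per line; skip blank or non-numeric lines."""
    return [int(line) for line in smi_output.splitlines() if line.strip().isdigit()]


def gpu_memory_mib():
    """Return total memory per visible GPU in MiB, or [] if it cannot be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_mib(result.stdout)


def needs_bigger_gpu(memory_mib):
    """True when no GPU was found or the largest GPU is under the threshold."""
    return not memory_mib or max(memory_mib) < LOW_MEMORY_MIB


if __name__ == "__main__":
    print("Consider heavier hardware:", needs_bigger_gpu(gpu_memory_mib()))
```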
The notebooks in this project were adapted from the RAPIDS cuDF GitHub repository, which can be found here.
If you have NVIDIA AI Workbench already installed, you can use this Project in AI Workbench on your choice of machine by:
-
Forking this Project to your own GitHub namespace and copying the clone link
https://github.com/[your_namespace]/<project_name>.git
-
Opening a shell and activating the Context you want to clone into by
$ nvwb list contexts
$ nvwb activate <desired_context>
-
Cloning this Project onto your desired machine by running
$ nvwb clone project <your_project_url>
-
Opening the Project by
$ nvwb list projects
$ nvwb open <project_name>
-
Starting JupyterLab by
$ nvwb start jupyterlab
-
Navigate to the code directory of the project. Then, open the notebooks provided and begin working through them at your own pace. Happy coding!
Tip: Use nvwb help to see a full list of commands.
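Once JupyterLab is running, the cudf-pandas-demo notebook depends on cudf.pandas being enabled before pandas is imported. In a notebook this is typically done with the `%load_ext cudf.pandas` magic; the sketch below shows the programmatic equivalent, `cudf.pandas.install()`, wrapped so that it falls back to plain pandas when cuDF or a GPU is unavailable. The helper names `enable_cudf_pandas` and `demo_groupby` are hypothetical, and the broad exception handling is an assumption made so the sketch degrades gracefully on CPU-only machines.

```python
def enable_cudf_pandas():
    """Try to switch on the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas  # needs the RAPIDS cuDF package and an NVIDIA GPU
        cudf.pandas.install()  # has to run before pandas is first imported
        return True
    except Exception:
        # Assumption: a missing package or missing GPU may surface as different
        # exception types, so any failure means "stay on CPU pandas".
        return False


ACCELERATED = enable_cudf_pandas()


def demo_groupby():
    """Ordinary pandas code; it is unchanged whether or not acceleration is on."""
    import pandas as pd  # imported after the accelerator was (possibly) installed

    df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    return df.groupby("key")["value"].sum().to_dict()


print("cudf.pandas active:", ACCELERATED)
```

The point of the design is that `demo_groupby` contains no cuDF-specific code: the same pandas calls run on the GPU when the accelerator is active and on the CPU otherwise.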
This project has been tested with an NVIDIA A100-40GB GPU and an Intel(R) Xeon(R) Gold 6240R CPU (2.40GHz) on the following version of NVIDIA AI Workbench: nvwb 0.2.66 (internal; linux; amd64; go1.18.10; Tue Sep 12 18:50:21 UTC 2023)
This NVIDIA AI Workbench example project is under the Apache 2.0 License.