This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Support reduce and scan for more than 2^31 items #584

Closed
milthorpe opened this issue Oct 25, 2022 · 6 comments

milthorpe commented Oct 25, 2022

cub::DeviceReduce and cub::DeviceScan declare the input size as int num_items, which limits reductions and scans to 2^31-1 items. Other CUB functions take the item-count type as a template parameter; for example, cub::DeviceRadixSort accepts NumItemsT num_items.

The num_items parameter should be replaced with a templated parameter NumItemsT num_items for both cub::DeviceReduce and cub::DeviceScan, to allow reductions and scans of more than 2^31 items.
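
To make the limit concrete, here is a minimal host-only sketch that checks whether an item count fits in a candidate count type. `can_represent` and `kTwoPow31` are hypothetical names introduced for this illustration and are not part of CUB; the sketch also assumes a platform where `int` is 32 bits wide. It only mirrors the difference between an `int num_items` parameter and a templated `NumItemsT num_items`.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not a CUB API): true if `count` fits in NumItemsT.
template <typename NumItemsT>
constexpr bool can_represent(std::uint64_t count) {
    static_assert(std::is_integral<NumItemsT>::value,
                  "NumItemsT must be an integer type");
    return count <= static_cast<std::uint64_t>(std::numeric_limits<NumItemsT>::max());
}

// 2^31: the smallest count that a 32-bit int cannot hold.
constexpr std::uint64_t kTwoPow31 = std::uint64_t{1} << 31;

// Assumes a 32-bit int, as on the common CUDA host platforms.
static_assert(can_represent<int>(kTwoPow31 - 1), "2^31 - 1 fits in int");
static_assert(!can_represent<int>(kTwoPow31), "2^31 does not fit in int");
static_assert(can_represent<std::int64_t>(kTwoPow31), "2^31 fits in a 64-bit count");
```

With a templated count type, a caller passing a 64-bit count would keep the full value instead of having it narrowed to `int` at the call boundary.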

A simple test code for DeviceReduce with size_t num_items would be as follows:

#include <cstdio>
#include <cub/cub.cuh>

using namespace cub;

// Sum num_items values from d_in into *d_out using cub::DeviceReduce::Sum.
template <typename T> void cubSum(const T *d_in, T *d_out, size_t num_items) {
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // First call only queries the required temporary storage size.
  CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));
  CubDebugExit(cudaMalloc(&d_temp_storage, temp_storage_bytes));

  // Second call performs the reduction.
  CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));

  CubDebugExit(cudaFree(d_temp_storage));
}

int main(int argc, char* argv[]) {
  // 2^31 items: one more than an int num_items can represent.
  size_t num_items = size_t{1} << 31;
  double *h_in = new double[num_items];
  for (size_t i = 0; i < num_items; i++) h_in[i] = static_cast<double>(i);
  double result = 0;

  CachingDeviceAllocator g_allocator(true);

  double *d_in = nullptr;
  CubDebugExit(g_allocator.DeviceAllocate((void**)&d_in, sizeof(double) * num_items));

  // Initialize device input
  CubDebugExit(cudaMemcpy(d_in, h_in, sizeof(double) * num_items, cudaMemcpyHostToDevice));

  double *d_out = nullptr;
  CubDebugExit(g_allocator.DeviceAllocate((void**)&d_out, sizeof(double)));

  cubSum(d_in, d_out, num_items);

  // Copy the single result value back to the host and print it.
  CubDebugExit(cudaMemcpy(&result, d_out, sizeof(double), cudaMemcpyDeviceToHost));
  printf("result: %e\n", result);

  delete[] h_in;
  if (d_in) CubDebugExit(g_allocator.DeviceFree(d_in));
  if (d_out) CubDebugExit(g_allocator.DeviceFree(d_out));
  return 0;
}
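
As a sanity check for the value the test prints, the input `h_in[i] = i` has the closed-form sum n(n-1)/2. The sketch below computes that reference value on the host. `expected_sum` is a hypothetical helper added for illustration and is not part of the original test or of CUB. A device reduction may differ from it in the low-order digits, because floating-point additions are performed in a different order.

```cpp
#include <cstdint>

// Hypothetical helper: closed-form value of 0 + 1 + ... + (n - 1).
// The even factor is halved first so the integer product stays exact.
// Assumes n is small enough that the product fits in 64 bits
// (true for n = 2^31, which is the size used in the test).
double expected_sum(std::uint64_t n) {
    if (n == 0) return 0.0;
    std::uint64_t a = n;
    std::uint64_t b = n - 1;
    if (a % 2 == 0) {
        a /= 2;  // n is even: halve n
    } else {
        b /= 2;  // n is odd: n - 1 is even, halve that instead
    }
    return static_cast<double>(a * b);
}
```

For n = 2^31 the closed form is 2^30 * (2^31 - 1) = 2305843008139952128, which is exactly representable as a double, so the test's `%e` output should be close to 2.305843e+18.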

milthorpe (Author) commented

I'd be happy to work on a patch for this if the CUB developers think the proposed change would be acceptable.

gevtushenko (Collaborator) commented

Hello, @milthorpe!

Thank you for offering help!
I've just filed a PR that illustrates what we expect from patches that add support for more than 2^31 items to our device algorithms. If you'd like, you can start looking into scan as soon as the PR is merged. Alternatively, you can wait until we prioritize this work; it's definitely on our list.

jrhemstad added the cub label Feb 22, 2023
wmaxey moved this to Todo in CCCL Feb 23, 2023
wmaxey moved this from Todo to Awaiting Feedback in CCCL Feb 23, 2023

wmaxey (Member) commented Feb 23, 2023

Does anyone have plans on tackling this change?

gevtushenko (Collaborator) commented

@wmaxey unless @milthorpe would like to contribute 64-bit support for scan, we'll create a separate issue for that.

milthorpe (Author) commented

I don't actually have a current use case for scan (only reduce, which @senior-zero kindly fixed already), which is why I've been so slow to respond. Sorry! I think it is better to create a separate issue.

gevtushenko (Collaborator) commented

@milthorpe I'm closing this issue then
