SingleDecodeWithKVCache meets illegal memory access when setting input tensors to cuda:1 #452

jason-huang03 · 2024-08-17T15:47:54Z

This is from the given example in the repo:

import torch
import flashinfer

device_id = 1

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(device_id) 
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(device_id) 

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(device_id)

o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(device_id) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(device_id) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask

When device_id=0, everything is fine. However, when device_id=1, the following error is thrown:

    out = _decode.single_decode_with_kv_cache(
RuntimeError: SingleDecodeWithKVCache kernel launch failed, error: an illegal memory access was encountered

I am using an A100 (SM 80). I thought this problem had been solved by the commit related to #349, but I still hit it. Can you see why it happens? I want to deploy a 70B model on multiple GPUs, so being able to run the kernel on different GPUs is really important. Thanks a lot!


yzh119 · 2024-08-17T23:36:13Z

Hi @jason-huang03, which version of flashinfer are you using? The issue should have been fixed in 0.0.9.

I can't reproduce it with the latest version of flashinfer (v0.1.5).
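For anyone comparing environments in a report like this, a minimal sketch for printing the installed versions follows. The distribution names `flashinfer` and `torch` are assumptions about how the packages were installed (an editable install or a differently named wheel may register under another name), and the helper names are made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(dist_names=("flashinfer", "torch")):
    """Map each distribution name (assumed names) to its version or None."""
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    for name, version in version_report().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the traceback makes it easier to tell whether a rebuild actually replaced the older package.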

jason-huang03 · 2024-08-18T02:02:18Z

I checked out v0.1.5 and rebuilt with pip install --no-cache-dir --force-reinstall -e . but the problem persists. The full error message is:

CUDA Error: an illegal memory access was encountered (700) /mnt/huanghaofeng/flashinfer/python/include/flashinfer/attention/decode.cuh: line 658 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Traceback (most recent call last):
  File "/mnt/huanghaofeng/flashinfer/test.py", line 19, in <module>
    o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
  File "/mnt/huanghaofeng/flashinfer/python/flashinfer/decode.py", line 194, in single_decode_with_kv_cache
    out = _decode.single_decode_with_kv_cache(
RuntimeError: SingleDecodeWithKVCache kernel launch failed, error: an illegal memory access was encountered

You can see that the problem is from cudaFuncSetAttribute.
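The failing call can be pulled out of the first line of that log mechanically. The sketch below assumes the `CUDA Error: <message> (<code>) <file>: line <n> at function <call>(...)` layout seen in this one log; the regular expression and the `parse_cuda_error` name are illustrative and are not part of flashinfer.

```python
import re

# Pattern inferred from the single log line quoted in this thread (an assumption).
CUDA_ERROR_RE = re.compile(
    r"CUDA Error: (?P<message>.+?) \((?P<code>\d+)\) "
    r"(?P<file>\S+): line (?P<line>\d+) at function (?P<call>\w+)\("
)


def parse_cuda_error(log_line):
    """Return the error message, code, file, line and failing call, or None."""
    match = CUDA_ERROR_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    line = (
        "CUDA Error: an illegal memory access was encountered (700) "
        "/mnt/huanghaofeng/flashinfer/python/include/flashinfer/attention/decode.cuh: "
        "line 658 at function cudaFuncSetAttribute(kernel, "
        "cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)"
    )
    print(parse_cuda_error(line))
```

For the log above this reports error code 700 raised at line 658 of decode.cuh in the `cudaFuncSetAttribute` call, which runs before the kernel itself is launched.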

I am using CUDA 11.8 and torch 2.2.0 in a containerized development environment. Could this be the problem?

jason-huang03 · 2024-08-18T03:17:19Z

Also, I find that device_id in the function SinglePrefillWithKVCacheDispatched in python/include/flashinfer/attention/prefill.cuh seems to be 0 regardless of the device_id set in the Python code.

yzh119 · 2024-08-18T03:34:21Z

@jason-huang03 would you mind checking the device id here?

jason-huang03 · 2024-08-18T03:36:22Z

I printed both with std::cout: device.index() here is empty, but device is correct (like cuda:1). I am now trying CUDA 12.4 and torch 2.4 to see whether that solves the problem.

jason-huang03 · 2024-08-18T03:53:45Z

After switching to PyTorch 2.4 and CUDA 12.4, the error disappears. Thanks for your time. It seems that the device and device-index API behaves differently across CUDA or PyTorch versions.

yzh119 · 2024-08-18T06:37:01Z

Thanks for reporting, I'll check the behavior on cu118 platforms.
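For readers stuck on an affected build, one possible user-side workaround is to make the tensors' device the current CUDA device around the call. This is a sketch under the assumption that the failure comes from the kernel being set up on the current device (cuda:0 by default) while the tensors live on cuda:1; the thread does not confirm that this avoids the error on cu118. `call_on_tensor_device` is a hypothetical helper, while `torch.cuda.device` is PyTorch's existing context manager.

```python
def call_on_tensor_device(fn, *tensors, **kwargs):
    """Call fn(*tensors, **kwargs) with the tensors' CUDA device made current.

    Hypothetical helper: it only wraps the call, it does not change the kernel.
    """
    if not tensors:
        raise ValueError("expected at least one tensor")
    devices = {str(t.device) for t in tensors}
    if len(devices) != 1:
        raise ValueError(f"tensors live on different devices: {sorted(devices)}")
    device = tensors[0].device
    if getattr(device, "type", None) == "cuda":
        import torch  # imported lazily so CPU-only callers do not need CUDA

        # Temporarily switch the current device, then restore it on exit.
        with torch.cuda.device(device):
            return fn(*tensors, **kwargs)
    return fn(*tensors, **kwargs)
```

With the script from the first comment, the decode call would become `o = call_on_tensor_device(flashinfer.single_decode_with_kv_cache, q, k, v)`.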

## plan

- [x] Check all kernels and add device guard
- [x] Complete the tests

FIX: #452

yzh119 · 2024-11-15T19:24:18Z

Should have been fixed in #611.
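To illustrate what a device guard does, here is a toy model in pure Python. It is not the code from #611: `ToyRuntime` is a made-up stand-in that tracks a single current device and rejects launches for tensors on any other device, which is a simplification of real CUDA behavior. The guard saves the current device, switches to the tensor's device for the duration of the call, and restores the previous one afterwards.

```python
from contextlib import contextmanager


class ToyRuntime:
    """Made-up stand-in for a GPU runtime that tracks one current device."""

    def __init__(self):
        self.current_device = 0  # like CUDA, device 0 is current by default

    def launch(self, tensor_device):
        # Simplification: touching memory of a non-current device always fails.
        if tensor_device != self.current_device:
            raise RuntimeError("an illegal memory access was encountered")
        return f"ran on cuda:{self.current_device}"


@contextmanager
def device_guard(runtime, device):
    """Switch the current device for the duration of a block, then restore it."""
    previous = runtime.current_device
    runtime.current_device = device
    try:
        yield
    finally:
        runtime.current_device = previous


def unguarded_kernel(runtime, tensor_device):
    return runtime.launch(tensor_device)


def guarded_kernel(runtime, tensor_device):
    with device_guard(runtime, tensor_device):
        return runtime.launch(tensor_device)
```

In this model `unguarded_kernel` succeeds for device 0 and fails for device 1, matching the symptom reported above, while `guarded_kernel` succeeds for both and leaves the current device unchanged.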

🤖 I have created a release *beep* *boop*

---

## [0.2.0](v0.1.6...v0.2.0) (2024-12-17)

[Release Blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html).

### Features

* add `rotary_dim` argument to rope APIs for partial apply rope ([#599](#599)) ([eb9bc71](eb9bc71))
* add a `use_softmax` field in variant class ([#533](#533)) ([d81af97](d81af97))
* add an option `non_blocking` to plan function ([#622](#622)) ([560af6f](560af6f))
* add gemma_rmsnorm and gemma_fused_add_rmsnorm ([#477](#477)) ([1a6b17e](1a6b17e))
* add group size 3 to GQA decode dispatch ([#558](#558)) ([6227562](6227562))
* add JIT compilation support for FA3 templates ([#672](#672)) ([d4e8d79](d4e8d79))
* allow the cascade kernels to be executed using varying sequence lenghts ([#627](#627)) ([92ac440](92ac440))
* CUDAGraph compatibility of multi-level cascade inference APIs ([#586](#586)) ([2332e8a](2332e8a))
* fix the maximal grid dimension in prefill planning with CUDA graphs ([#639](#639)) ([86ca89a](86ca89a))
* improve the precision of the FusedAddRMSNormKernel function ([#587](#587)) ([c7dc921](c7dc921))
* JIT compilation ([#507](#507)) ([3613a5b](3613a5b))
* modify group-gemm stage number ([#497](#497)) ([52dab1d](52dab1d))
* non-contiguous query with paged kv cache ([#553](#553)) ([89f2c4a](89f2c4a))
* pass a dynamic token count to the cascade kernels ([#635](#635)) ([5fe9f7d](5fe9f7d))
* simplify prefill JIT compilation ([#605](#605)) ([fe4f898](fe4f898))
* specify gemm backend ([#648](#648)) ([0cc1a51](0cc1a51))
* support cached cos/sin in rope APIs ([#585](#585)) ([83e541d](83e541d))
* support huggingface transformer style rope interface ([#568](#568)) ([4f40420](4f40420))
* support sm90 cutlass group gemm ([#509](#509)) ([794bdda](794bdda))
* torch custom_op fix for rope ([#569](#569)) ([3e104bc](3e104bc))
* torch custom_op support: norm ([#552](#552)) ([f6e0010](f6e0010))
* torch.compile and custom_op support ([#554](#554)) ([9bf916f](9bf916f))
* warmup for jit kernel tests ([#629](#629)) ([8f5f349](8f5f349))

### Bug Fixes

* AOT compiler flags on non-sm90 ([#522](#522)) ([0aa4726](0aa4726))
* batch decode kernel redundant store output to gmem ([#505](#505)) ([90e42a7](90e42a7))
* compatible with torch 2.2 ([#478](#478)) ([ac41d1b](ac41d1b))
* #452 ([b53a46f](b53a46f))
* remove redundant load ([#495](#495)) ([2de16b0](2de16b0))
* update bmm fp8 test ([#487](#487)) ([45eac04](45eac04))

### Performance Improvements

* accelerate JIT compilation speed ([#618](#618)) ([eaf73fd](eaf73fd))
* Dense and sparse customizable flashattention-3 template ([#667](#667)) ([51236c9](51236c9))
* fix prefill kernel performance degradation (step 1) ([#602](#602)) ([595cf60](595cf60))
* fix the performance issue of `append_paged_kv_cache` ([#588](#588)) ([e15f7c9](e15f7c9))
* improve parallelism in RoPE with pos_ids ([#609](#609)) ([ff05155](ff05155))
* improve plan performance by using non-blocking memcpy ([#547](#547)) ([41ebe6d](41ebe6d))
* reduce the read and write of shared memory in the FusedAddRMSNormKernel ([#592](#592)) ([2043ca2](2043ca2))
* reduce total_num_tiles_q by one ([#644](#644)) ([553ace5](553ace5))
* remove unnecessary contiguous operation in block sparse attention ([#561](#561)) ([7a7ad46](7a7ad46))
* speedup jit compilation of prefill attention kernels ([#632](#632)) ([a059586](a059586))
* use cuda-core implemention for io-bound block-sparse attention ([#560](#560)) ([3fbf028](3fbf028))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>

yzh119 added the bug label Aug 18, 2024

jeejeelee mentioned this issue Nov 15, 2024

misc: add device guard for kernels #611

Merged

2 tasks

yzh119 closed this as completed in #611 Nov 15, 2024

yzh119 pushed a commit that referenced this issue Nov 15, 2024

misc: add device guard for kernels (#611)

b53a46f


github-actions bot mentioned this issue Nov 14, 2024

chore(main): release 0.2.0 #476

Merged

github-actions bot mentioned this issue Dec 1, 2024

chore(main): release 0.2.0 ur4t/flashinfer#1

Closed

github-actions bot mentioned this issue Dec 13, 2024

chore(main): release 0.2.0 xslingcn/flashinfer#1

Closed

github-actions bot mentioned this issue Dec 17, 2024

chore(main): release 0.3.0 xslingcn/flashinfer#2

Open

github-actions bot mentioned this issue Dec 25, 2024

chore(main): release 0.3.0 #698

Closed
