Bamba VLLM Draft #2

Draft · fabianlim wants to merge 9 commits into main from pr-draft

Conversation

fabianlim (Owner) commented Nov 28, 2024

NOTES

  • refactor to make it TP-able
  • add tests and make sure the non-chunked prefill tests pass
  • ensure the chunked prefill tests pass
  • investigate the CUDA invalid access for long input sequences
  • fix the precision problem for the gated norm (see the sketch after this list)
  • fix the mamba kernels for long sequences
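
On the gated norm item: a minimal sketch of what a precision fix could look like, assuming the "gated norm" is a Mamba-2-style gated RMSNorm and that the problem is fp16/bf16 accumulation. The function name, eps, and the decision to upcast are illustrative assumptions, not the code in this PR.

import torch
import torch.nn.functional as F

def gated_rms_norm_fp32(x, gate, weight, eps=1e-5):
    # Hypothetical fix: run the gating and normalization in float32 so the
    # variance/rsqrt do not lose precision when the model runs in fp16/bf16.
    orig_dtype = x.dtype
    x = x.float() * F.silu(gate.float())           # gate the hidden states
    var = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(var + eps)                 # RMS-normalize in fp32
    return (x * weight.float()).to(orig_dtype)     # cast back at the end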

Tests

Currently all tests pass except the chunked prefill ones:

================================================================================================== short test summary info ==================================================================================================
FAILED tests/models/decoder_only/language/test_bamba.py::test_chunked_prefill_with_parallel_sampling[10-float-/workspace/bamba-ckpt-fp16] - ValueError: too many values to unpack (expected 2)
FAILED tests/models/decoder_only/language/test_bamba.py::test_chunked_prefill[1-32-float-/workspace/bamba-ckpt-fp16] - AssertionError: Test0:
FAILED tests/models/decoder_only/language/test_bamba.py::test_chunked_prefill[4-32-float-/workspace/bamba-ckpt-fp16] - AssertionError: Test0:
FAILED tests/models/decoder_only/language/test_bamba.py::test_chunked_prefill[16-32-float-/workspace/bamba-ckpt-fp16] - AssertionError: Test0:
==================================================================================== 4 failed, 9 passed, 1 warning in 496.02s (0:08:16) =====================================================================================
(mamba-vllm) 1000960000@flim-mamba-master-0:~/data/vllm$ pytest tests/models/decoder_only/language/test_bamba.py::test_chunked_prefill
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.3, 

fabianlim marked this pull request as draft on November 28, 2024 14:33
fabianlim force-pushed the pr-draft branch 2 times, most recently from 4d67c31 to 98ba4fa on November 30, 2024 05:35
z = z.contiguous()
if D is not None and D.stride(-1) != 1:
    D = D.contiguous()
if initial_states is not None:
fabianlim (Owner, Author) commented on the code above:

I feel chunked prefill support requires updating the kernels. This is because initial_states seems to be implemented to handle batch > 1 only when cu_seqlens == None; the cu_seqlens path is only supported when we flatten the input x such that batch == 1.
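
To make the shape issue concrete, a hedged sketch (not vLLM or kernel code; shapes follow my reading of the mamba_chunk_scan_combined convention, and the sizes are made up):

import torch

nheads, headdim, dstate = 8, 64, 16

# Chunked prefill packs several prompts into one flattened batch of size 1,
# with sequence boundaries carried in cu_seqlens.
seqlens = [5, 3, 7]
cu_seqlens = torch.tensor([0, 5, 8, 15])            # prefix sums of seqlens
x = torch.randn(1, sum(seqlens), nheads, headdim)   # batch == 1 after flattening

# initial_states is indexed by the leading batch dimension, so in the flattened
# layout there is room for only ONE initial state, not one per packed sequence.
# The kernel would have to select per-sequence init states via seq_idx instead.
initial_states = torch.randn(1, nheads, headdim, dstate)             # what the kernel handles today
per_seq_states = torch.randn(len(seqlens), nheads, headdim, dstate)  # what chunked prefill needs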

fabianlim (Owner, Author) commented:

For the chunked prefill effort, I made some mind maps of the _mamba_chunk_scan_combined_fwd function:

  • since this is inference, we only care about the forward pass
  • we only need to target _state_passing_fwd, as that is the function that accepts initial_states
  • in _state_passing_fwd, initial_states overrides states in the computation states = scale * states + new_states, so we need to detect when a sequence has finished. We can do this using seq_idx_new and seq_idx: a new sequence is starting when their indices differ (see the kernel excerpt below and the reference sketch after it)
   for c in range(nchunks):
        new_states = tl.load(states_ptrs, mask=offs_m < dim, other=0.0).to(tl.float32)
        dA_cs = tl.load(dA_cs_ptr).to(tl.float32)
        scale = tl.exp(dA_cs)
        if HAS_SEQ_IDX:
            seq_idx_new = tl.load(seq_idx_ptr + (min((c + 1) * chunk_size, seqlen) - 1) * stride_seq_idx_seqlen)
            scale = tl.where(seq_idx_new == seq_idx, scale, 0.0)
            seq_idx = seq_idx_new
        states = scale * states + new_states
        if c < nchunks - 1:
            tl.store(out_ptrs, states, mask=offs_m < dim)
        else:
            tl.store(final_states_ptrs, states, mask=offs_m < dim)
        states_ptrs += stride_states_chunk
        dA_cs_ptr += stride_dA_cs_chunk
        out_ptrs += stride_out_chunk

    # modification: where we detect that seq_idx and seq_idx_new differ (a new
    # sequence has started), load that sequence's initial states instead of only
    # zeroing the carried state via scale:
    initstates_ptrs = initstates_ptr + offs_m * stride_batch
    states = tl.load(initstates_ptrs, mask=offs_m < dim, other=0.0).to(tl.float32)
[mind map images of _mamba_chunk_scan_combined_fwd]
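
A plain-Python reading of that loop and the proposed change, to make the boundary handling explicit. This is a hedged reference sketch with made-up names, not the kernel; in particular, how the re-seeded state should be decayed across the boundary chunk is exactly what the kernel update has to define.

import numpy as np

def state_passing_fwd_ref(new_states, dA_cs, chunk_seq_idx, init_states=None):
    """new_states: (nchunks, dim) per-chunk contributions,
    dA_cs: (nchunks,) cumulative decay per chunk,
    chunk_seq_idx: (nchunks,) sequence id at the end of each chunk,
    init_states: optional (nseq, dim) per-sequence initial states."""
    nchunks, dim = new_states.shape
    out = np.zeros((nchunks, dim), dtype=np.float32)  # out[c]: state at the end of chunk c
    states = np.zeros(dim, dtype=np.float32)
    seq_idx = chunk_seq_idx[0]
    if init_states is not None:
        states = init_states[seq_idx].astype(np.float32)   # seed the first sequence
    for c in range(nchunks):
        scale = np.exp(dA_cs[c])
        if chunk_seq_idx[c] != seq_idx:
            # Sequence boundary: the current kernel zeroes the carried state
            # (scale = 0). The proposal is to instead re-seed from the new
            # sequence's initial state before continuing the recurrence.
            seq_idx = chunk_seq_idx[c]
            if init_states is None:
                scale = 0.0
            else:
                states = init_states[seq_idx].astype(np.float32)
                scale = 1.0   # how to decay the re-seeded state is left open here
        states = scale * states + new_states[c]
        out[c] = states
    return out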

cyang49 commented Dec 4, 2024

To get stable latency measurements in Nsight Systems we rely on enabling CUDA graphs. However, in our test with long context (64k), CUDA graph usage is not seen in the profile. After digging in the code, I found that the usage is controlled by the engine argument max_seq_len_to_capture, which defaults to 8192. Overriding this config enables CUDA graph usage correctly (see the sketch below).

[screenshot]
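
For reference, a hedged sketch of overriding this at the offline LLM entry point (the checkpoint path is just the placeholder reused from the test output above; 65536 stands in for the 64k case mentioned here):

from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/bamba-ckpt-fp16",   # placeholder path from the tests above
    max_model_len=65536,                  # 64k context
    max_seq_len_to_capture=65536,         # raise from the 8192 default so CUDA graphs cover long sequences
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))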
