merge upstream #44

Merged 51 commits on Nov 5, 2024. The diff below shows changes from 1 commit.

Commits (51):
8841ce3  llama : switch KQ multiplication to F32 precision by default (#10015)  [ggerganov, Oct 27, 2024]
8125e6c  server : don't overfill the batch during infill (#10018)  [ggerganov, Oct 28, 2024]
524afee  musa: workaround for Guilty Lockup in cleaning src0 (#10042)  [yeahdongcn, Oct 28, 2024]
07028f9  flake.lock: Update (#10063)  [ggerganov, Oct 28, 2024]
61715d5  llama : Add IBM granite template (#10013)  [arch-btw, Oct 28, 2024]
8d8ff71  llama : remove Tail-Free sampling (#10071)  [ggerganov, Oct 29, 2024]
8f275a7  ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the…  [cyzero-kim, Oct 29, 2024]
c5b0f4b  llama : refactor model loader with backend registry (#10026)  [slaren, Oct 30, 2024]
fc83a9e  ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029)  [xctan, Oct 30, 2024]
79a2bc0  convert : more detailed convert lora usage docs (#10065)  [richdougherty, Oct 30, 2024]
6763f71  readme : more lora detail in main example readme (#10064)  [richdougherty, Oct 30, 2024]
b9e02e8  ggml : fix memory leaks when loading invalid gguf files (#10094)  [slaren, Oct 30, 2024]
61408e7  kompute: add backend registry / device interfaces (#10045)  [slp, Oct 30, 2024]
1329c0a  kompute: add mul_mat_q4_k shader (#10097)  [slp, Oct 31, 2024]
dea5e86  ggml : check tensor name lengths in gguf files (#10100)  [slaren, Oct 31, 2024]
0a683e8  server : include scheme when printing URL (#10106)  [bakkot, Oct 31, 2024]
ab3d71f  loader: refactor tensor weights storage (#9935)  [kylo5aby, Oct 31, 2024]
c02e5ab  llama : fix buffer checks for mamba and rwk (#10111)  [slaren, Oct 31, 2024]
1e9f949  quantize : fix --keep-split (#10114)  [slaren, Oct 31, 2024]
85679d3  llama : improve output buffer type selection (#10098)  [slaren, Oct 31, 2024]
e597e50  build: fix build error in Windows env with OneAPI setup (#10107)  [kylo5aby, Nov 1, 2024]
f221d56  ggml : alloc ggml_contexts on the heap (whisper/2525)  [ggerganov, Nov 1, 2024]
815fe72  sync : ggml  [ggerganov, Nov 1, 2024]
1804adb  ggml : remove ggml_scratch (#10121)  [ggerganov, Nov 1, 2024]
d865d14  server : fix smart selection of available slot (#10120)  [sasha0552, Nov 1, 2024]
ba6f62e  readme : update hot topics  [ggerganov, Nov 1, 2024]
418f5ee  vulkan : improve ggml_vk_create_buffer error handling (#9898)  [FanShupei, Nov 1, 2024]
e991e31  llama : use smart pointers for ggml resources (#10117)  [slaren, Nov 1, 2024]
a6744e4  llama : add simple-chat example (#10124)  [slaren, Nov 1, 2024]
7554aa4  convert-lora : make `--base` optional (#10110)  [ngxson, Nov 2, 2024]
b634f8a  simple-chat : only add bos on first prompt (#10129)  [slaren, Nov 2, 2024]
1926d6e  llama : adjust default context size + print warnings (#10136)  [ggerganov, Nov 2, 2024]
4595041  server : fix endpoint checks (#10135)  [ggerganov, Nov 2, 2024]
42cadc7  server : fix slot selection by lru (#10126)  [sasha0552, Nov 2, 2024]
9830b69  Add apple arm to presets (#10134)  [kohnech, Nov 2, 2024]
1839f69  flake.lock: Update (#10146)  [ggerganov, Nov 3, 2024]
08828a6  metal : minor fixup in FA kernel (#10143)  [ggerganov, Nov 3, 2024]
9f40989  ggml : move CPU backend to a separate file (#10144)  [slaren, Nov 3, 2024]
e2292aa  metal : fix minor string leaks (ggml/1004)  [pminev, Nov 1, 2024]
284e5b0  cmake : make it possible linking ggml as external lib (ggml/1003)  [ykhrustalev, Nov 2, 2024]
ce027ad  sync : ggml  [ggerganov, Nov 4, 2024]
329ed91  CANN: adjust backend registry refactor. (#10158)  [leo-pony, Nov 4, 2024]
f8e5813  metal : move dequantize templates to beginning of MSL source (#0)  [ggerganov, Nov 4, 2024]
05697f6  metal : simplify f16 and f32 dequant kernels (#0)  [ggerganov, Nov 4, 2024]
ea02c75  cuda : clear error after changing peer access (#10153)  [slaren, Nov 4, 2024]
6a066b9  fix build break on arm64 linux (#10166)  [snadampal, Nov 4, 2024]
9e0ecfb  server : clarify /slots endpoint, add is_processing (#10162)  [ngxson, Nov 4, 2024]
401558b  ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167)  [slaren, Nov 4, 2024]
d5a409e  ggml : fix gelu tables initialization (#10172)  [slaren, Nov 4, 2024]
3407364  Q6_K AVX improvements (#10118)  [netrunnereve, Nov 4, 2024]
a9e8a9a  ggml : fix arch check in bf16_to_fp32 (#10164)  [slaren, Nov 4, 2024]
musa: workaround for Guilty Lockup in cleaning src0 (ggml-org#10042)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
yeahdongcn authored Oct 28, 2024

commit 524afeec9dad7d765ce91f5cf30c73703867cb47
ggml/src/ggml-cuda.cu (7 changes: 6 additions & 1 deletion)
@@ -1484,14 +1484,19 @@ static void ggml_cuda_op_mul_mat(
             const size_t nbytes_data    = ggml_nbytes(src0);
             const size_t nbytes_padding = ggml_row_size(src0->type, MATRIX_ROW_PADDING - ne00 % MATRIX_ROW_PADDING);
             dev[id].src0_dd = dev[id].src0_dd_alloc.alloc(ctx.pool(id), nbytes_data + nbytes_padding);
+            // TODO: remove this for MUSA once the Guilty Lockup issue is resolved
+#ifndef GGML_USE_MUSA
             CUDA_CHECK(cudaMemsetAsync(dev[id].src0_dd, 0, nbytes_data + nbytes_padding, stream));
+#else // GGML_USE_MUSA
+            CUDA_CHECK(cudaMemsetAsync(dev[id].src0_dd + nbytes_data, 0, nbytes_padding, stream));
+#endif // !GGML_USE_MUSA
         }

         // If src0 is on a temporary compute buffer (partial offloading) there may be some padding that needs to be cleared:
         if (ne00 % MATRIX_ROW_PADDING != 0 && ggml_is_quantized(src0->type) && ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE && src0->view_src == nullptr) {
             const size_t nbytes_data    = ggml_row_size(src0->type, (dev[id].row_high - dev[id].row_low)*ne00);
             const size_t nbytes_padding = ggml_row_size(src0->type, MATRIX_ROW_PADDING - ne00 % MATRIX_ROW_PADDING);
-            CUDA_CHECK(cudaMemsetAsync(dev[id].src0_dd + nbytes_data , 0, nbytes_padding, stream));
+            CUDA_CHECK(cudaMemsetAsync(dev[id].src0_dd + nbytes_data, 0, nbytes_padding, stream));
         }

         if (src1_on_device && src1_is_contiguous) {
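In short, the workaround only changes which bytes get zeroed after the padded src0 buffer is allocated: the default path clears the entire allocation, while the MUSA build clears just the padding tail past the real data, since the full-buffer cudaMemsetAsync was triggering the Guilty Lockup on MUSA devices. Zeroing the padding still matters because the quantized matmul kernels can read whole padded rows, and the pad bytes must read as zeros. Below is a minimal standalone sketch of the same pattern using the plain CUDA runtime API; the sizes, the CHECK macro, and the assumption that the data region is overwritten by a later copy (as the surrounding ggml code path appears to do) are illustrative, not taken from the repository.

// Sketch: clear either the whole padded buffer or only the padding tail,
// mirroring the #ifdef in the diff above. Sizes are made up for illustration.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
        return 1; \
    } \
} while (0)

int main(void) {
    const size_t nbytes_data    = 1024; // payload, assumed overwritten by a later copy
    const size_t nbytes_padding = 512;  // row-padding tail that must read as zeros

    char * src0_dd = NULL;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc((void **) &src0_dd, nbytes_data + nbytes_padding));

#ifndef GGML_USE_MUSA
    // Default path: zero the entire allocation in one call.
    CHECK(cudaMemsetAsync(src0_dd, 0, nbytes_data + nbytes_padding, stream));
#else
    // MUSA workaround: the data region is filled by a subsequent copy anyway,
    // so zeroing only the padding tail avoids the full-buffer memset that
    // triggered the lockup, while keeping the pad bytes zero for the kernels.
    CHECK(cudaMemsetAsync(src0_dd + nbytes_data, 0, nbytes_padding, stream));
#endif

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaFree(src0_dd));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}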