Revert previous attempt at Triton patch; use CustomCacheManager approach instead. (#35)

I tested the previous fix for the Triton cache collision issue (see:
#34) and it didn't work.

I now see errors like:
```
FileNotFoundError: [Errno 2] No such file or directory: '/home/vllm/.triton/cache/1feb415f3280ca46eea8c4407a58c23e/fused_moe_kernel.json.tmp.pid_72_c0a0033e-6147-4520-ae3a-3847d02598f8'
```
The temporary file name now contains a `uuid` instead of a random integer, but the
problem remains.

This PR implements a different workaround, proposed by @cyang49, that
tells Triton to use a custom cache manager which assigns a different
directory based on the process id.
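
A minimal sketch of the per-process directory scheme described above, assuming a hypothetical helper `per_process_cache_dir` (not part of Triton or of this PR). It only illustrates how appending the process id to the base cache directory gives each process its own location:

```python
import os
from typing import Optional


def per_process_cache_dir(base_dir: str, key: str,
                          pid: Optional[int] = None) -> str:
    """Return <base_dir>_<pid>/<key>, a cache path unique to one process.

    `per_process_cache_dir` is a hypothetical name used for illustration;
    the PR applies the same naming inside CustomCacheManager.__init__.
    """
    if pid is None:
        # Default to the calling process, as the cache manager does.
        pid = os.getpid()
    return os.path.join(f"{base_dir}_{pid}", key)


if __name__ == "__main__":
    # Two processes with different pids resolve to different directories.
    print(per_process_cache_dir("/tmp/triton_cache", "somekey", pid=1))
    print(per_process_cache_dir("/tmp/triton_cache", "somekey", pid=2))
```

With the base directory from the error above, a process with pid 72 would write under `/home/vllm/.triton/cache_72/<key>` rather than the shared `/home/vllm/.triton/cache/<key>`.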

This time I have tested it and it seems to work.

---------

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <joe@joerun.de>
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Joe Runde <joseph.runde@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
4 people authored Jun 3, 2024
1 parent f4ec244 commit a17c8fb
Showing 3 changed files with 35 additions and 15 deletions.
10 changes: 3 additions & 7 deletions Dockerfile.ubi
@@ -270,19 +270,15 @@ RUN microdnf install -y gcc \
     && microdnf clean all

 # patch triton (fix for #720)
-COPY triton_patch/cache_fix.patch .
-RUN microdnf install -y patch \
-    && patch /opt/vllm/lib/python3.11/site-packages/triton/runtime/cache.py cache_fix.patch \
-    && microdnf remove -y patch \
-    && microdnf clean all \
-    && rm cache_fix.patch
+COPY triton_patch/custom_cache_manager.py /opt/vllm/lib/python3.11/site-packages/triton/runtime/custom_cache_manager.py

 ENV HF_HUB_OFFLINE=1 \
     PORT=8000 \
     GRPC_PORT=8033 \
     HOME=/home/vllm \
     VLLM_USAGE_SOURCE=production-docker-image \
-    VLLM_WORKER_MULTIPROC_METHOD=fork
+    VLLM_WORKER_MULTIPROC_METHOD=fork \
+    TRITON_CACHE_MANAGER="triton.runtime.custom_cache_manager:CustomCacheManager"

 # setup non-root user for OpenShift
 RUN microdnf install -y shadow-utils \
8 changes: 0 additions & 8 deletions triton_patch/cache_fix.patch

This file was deleted.

32 changes: 32 additions & 0 deletions triton_patch/custom_cache_manager.py
@@ -0,0 +1,32 @@
+import os
+
+from triton.runtime.cache import (FileCacheManager, default_cache_dir,
+                                  default_dump_dir, default_override_dir)
+
+
+class CustomCacheManager(FileCacheManager):
+
+    def __init__(self, key, override=False, dump=False):
+        self.key = key
+        self.lock_path = None
+        if dump:
+            self.cache_dir = default_dump_dir()
+            self.cache_dir = os.path.join(self.cache_dir, self.key)
+            self.lock_path = os.path.join(self.cache_dir, "lock")
+            os.makedirs(self.cache_dir, exist_ok=True)
+        elif override:
+            self.cache_dir = default_override_dir()
+            self.cache_dir = os.path.join(self.cache_dir, self.key)
+        else:
+            # create cache directory if it doesn't exist
+            self.cache_dir = os.getenv("TRITON_CACHE_DIR",
+                                       "").strip() or default_cache_dir()
+            if self.cache_dir:
+                self.cache_dir = f"{self.cache_dir}_{os.getpid()}"
+                self.cache_dir = os.path.join(self.cache_dir, self.key)
+                self.lock_path = os.path.join(self.cache_dir, "lock")
+                os.makedirs(self.cache_dir, exist_ok=True)
+            else:
+                raise RuntimeError("Could not create or locate cache dir")
+
+        print(f"Triton cache dir: {self.cache_dir=}")
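
The `TRITON_CACHE_MANAGER` value set in `Dockerfile.ubi` has the form `module.path:ClassName`. The sketch below is not Triton's code; it shows one way such a string can be resolved to a class with `importlib`, using a hypothetical helper `load_class_from_spec` and a standard-library class so that it runs without Triton installed:

```python
import importlib


def load_class_from_spec(spec: str) -> type:
    """Resolve a "module.path:ClassName" string to the class it names.

    Hypothetical helper for illustration only; it is an assumption about
    how a spec of this shape can be handled, not Triton's implementation.
    """
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    # Import the module, then look the class up by name on it.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Stand-in spec; in the image the value would be
    # "triton.runtime.custom_cache_manager:CustomCacheManager".
    cls = load_class_from_spec("collections:OrderedDict")
    print(cls)
```

Because the class is looked up by module path, the file has to be importable under that path, which is why the Dockerfile copies `custom_cache_manager.py` into the `triton/runtime` package directory.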
