Dependencies
- Triton for custom kernels
- transformer-engine for experimental fp8 support (not yet integrated)
- PyTorch 2.2 & CUDA 12.1 (a quick version check is sketched below)
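
As a quick sanity check that the versions above are in place (illustrative snippet only, not part of the repo):

```python
# Illustrative check of the dependency versions listed above.
import torch
import triton

print(torch.__version__)          # expect 2.2.x
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # the custom Triton kernels need a CUDA GPU
print(triton.__version__)
```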
To Run
- Clone CodeLlama-7b-Instruct-hf and CodeLlama-70b-Instruct-hf into the directory above this repo or directly into it (a download sketch follows this list).
- Use `fp16_to_int4.py` to convert the fp16 weights into a single int4-quantized model (the general idea is sketched below).
- Run `load_q40.py`.
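
The checkpoints are hosted on the Hugging Face Hub. As a sketch of the download step, assuming the upstream repo IDs `codellama/CodeLlama-7b-Instruct-hf` / `codellama/CodeLlama-70b-Instruct-hf` and a parent-directory layout (adjust `local_dir` if you keep them inside the repo instead):

```python
# Sketch: fetch both checkpoints into the directory above this repo.
from huggingface_hub import snapshot_download

for name in ("CodeLlama-7b-Instruct-hf", "CodeLlama-70b-Instruct-hf"):
    snapshot_download(repo_id=f"codellama/{name}", local_dir=f"../{name}")
```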
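
`fp16_to_int4.py` is the authoritative conversion. As a rough sketch of what blockwise int4 (q4_0-style) quantization does, assuming a block size of 32 and symmetric scales, both of which may differ from the script's actual choices:

```python
# Sketch of blockwise symmetric int4 quantization (q4_0-style). Block size,
# packing layout, and scale storage are assumptions, not the script's exact logic.
import torch

def quantize_q40(weight: torch.Tensor, block_size: int = 32):
    """Quantize an fp16 tensor to packed int4 nibbles plus one fp16 scale per block."""
    w = weight.float().reshape(-1, block_size)  # numel must divide evenly into blocks
    # Symmetric per-block scale: map the largest magnitude onto the int4 range.
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    # Shift to unsigned and pack two 4-bit values per byte.
    u = (q + 8).to(torch.uint8).reshape(-1, block_size // 2, 2)
    packed = u[..., 0] | (u[..., 1] << 4)
    return packed.reshape(-1), scale.half().reshape(-1)

def dequantize_q40(packed: torch.Tensor, scale: torch.Tensor, block_size: int = 32):
    """Invert quantize_q40 for a quick round-trip check."""
    u = torch.stack([packed & 0x0F, packed >> 4], dim=-1)  # low nibble, high nibble
    q = u.reshape(-1, block_size).float() - 8.0
    return (q * scale.float().unsqueeze(1)).reshape(-1).half()

w = torch.randn(4096, dtype=torch.float16)
packed, scale = quantize_q40(w)
print((dequantize_q40(packed, scale) - w).abs().max())  # small quantization error
```

Each block keeps one fp16 scale plus 4 bits per weight, roughly quartering the fp16 model's size at the cost of the round-trip error the example prints.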