Running model from a GGUF file, only #326

Closed
MoonRide303 opened this issue May 17, 2024 · 47 comments
Labels
new feature New feature or request

Comments

@MoonRide303

MoonRide303 commented May 17, 2024

Describe the bug
Running a model from a GGUF file using llama.cpp is very straightforward:
server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf
and if the model is supported, it just works.

I tried to do the same using mistral.rs, and got this:

mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:21:48.581660Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:21:48.581743Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:21:48.581820Z  INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:21:48.581873Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:21:48.583625Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:21:48.583707Z  INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "tokenizer.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Why does it ask me for a tokenizer file when one is already included in the GGUF file? I understand having this as an option (if I wanted to try out a different tokenizer / configuration), but by default it should just use the information provided in the GGUF file itself.

Next attempt, when I copied tokenizer.json from the original model repo:

mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:28:34.987235Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:28:34.987332Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:28:34.987382Z  INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:28:34.987431Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:28:34.989190Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:28:34.989270Z  INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:28:35.532371Z  INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
File "config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

And another attempt, after copying config.json (which I think is also unnecessary, as llama.cpp works fine without it):

mistralrs-server gguf -m . -t . -f .\Phi-3-mini-4k-instruct-Q6_K.gguf
2024-05-17T07:30:00.352139Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-17T07:30:00.352236Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-17T07:30:00.352301Z  INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-17T07:30:00.352344Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-17T07:30:00.354085Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.354168Z  INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:00.601055Z  INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `".\\tokenizer.json"`
2024-05-17T07:30:00.814258Z  INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `".\\config.json"`
2024-05-17T07:30:00.814412Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-05-17T07:30:00.814505Z  INFO mistralrs_core::utils::tokens: Could not load token at "C:\\Users\\[REDACTED]/.cache/huggingface/token", using no HF token.
2024-05-17T07:30:01.022055Z  INFO mistralrs_core::pipeline: Loading `".\\Phi-3-mini-4k-instruct-Q6_K.gguf"` locally at `".\\.\\Phi-3-mini-4k-instruct-Q6_K.gguf"`
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I wanted to give mistral.rs a shot, but it's a really painful experience for now.

Latest commit
ca9bf7d (v0.1.8)

MoonRide303 added the bug (Something isn't working) label May 17, 2024
@joshpopelka20
Contributor

Did you add the HuggingFace Token? I got the same error RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main])) until I added the token.

Here are the ways you can add it https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#getting-models-from-hf-hub.

Also, this chat helped me as I was getting a 403 Error after that https://discuss.huggingface.co/t/error-403-what-to-do-about-it/12983. I had to accept the Llama license.

@MoonRide303
Author

@joshpopelka20 I want to run a model from a local GGUF file only, exactly the same way as in llama.cpp. Communication with HF (or any other) servers shouldn't ever be required for that.

@polarathene
Contributor

polarathene commented May 18, 2024

A recent issue also showed UX issues with this: #295 (comment)

UPDATE: This local model support may have been a very new feature it seems, which might explain the current UX issues: #308


I also found the README a bit confusing compared to llama-cpp for local GGUF use. It doesn't help that it refers to terms you need to configure but then uses short option names, and the linked CLI args output appears outdated compared to what a git build shows.

I was not able to use absolute or relative paths in a way that mistral.rs would seem to understand / accept, so based on the linked issue above, I had to ensure the binary was adjacent to the model (and forced tokenizer.json + config.json files)...


Like you, I still see it fail, but here is the extra output that shows why:

$ RUST_BACKTRACE=1 ./mistralrs-server --token-source none gguf -m . -t . -f model.gguf

2024-05-18T01:41:28.388727Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-18T01:41:28.388775Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-18T01:41:28.388781Z  INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-18T01:41:28.388828Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-18T01:41:28.388869Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:28.658484Z  INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"./tokenizer.json"`
2024-05-18T01:41:29.024145Z  INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `"./config.json"`
2024-05-18T01:41:29.024256Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:29.333151Z  INFO mistralrs_core::pipeline: Loading `"model.gguf"` locally at `"./model.gguf"`
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: tokio::runtime::context::runtime::enter_runtime
   4: tokio::runtime::runtime::Runtime::block_on
   5: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Note that --token-source none has no effect (it must be given prior to the gguf subcommand where it is not considered a valid option), the code path still goes through load_model_from_hf (which will then forward to load_model_from_path if it didn't panic):

fn load_model_from_path(

/// The source of the HF token.
pub enum TokenSource {
    Literal(String),
    EnvVar(String),
    Path(String),
    CacheToken,
    None,
}

"none" => Ok(TokenSource::None),

--token-source does accept none and refuses None or not-a-valid-variant, so I'm not sure why the output suggests it's still trying to use the default TokenSource::CacheToken? But it should be TokenSource::None (EDIT: Confirmed, this is a check hf_hub crate does regardless)

/// Defaults to using a cached token.
#[arg(long, default_value_t = TokenSource::CacheToken, value_parser = parse_token_source)]
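
To make the case-sensitivity point concrete, here is a minimal standalone sketch of a parser in the spirit of parse_token_source. It is not the project's actual implementation: only the "none" arm and the enum variants come from the snippets above, "cache" is taken from the default mentioned later in this thread, and the "literal:", "env:" and "path:" prefixes are assumptions made for illustration.

```rust
/// Mirror of the `TokenSource` enum quoted above, with derives added for testing.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenSource {
    Literal(String),
    EnvVar(String),
    Path(String),
    CacheToken,
    None,
}

/// Hypothetical parser: matching is case-sensitive, so "none" is accepted
/// while "None" is rejected. The `literal:`, `env:` and `path:` prefixes are
/// assumed for this sketch.
pub fn parse_token_source(s: &str) -> Result<TokenSource, String> {
    match s.split_once(':') {
        Some(("literal", value)) => Ok(TokenSource::Literal(value.to_string())),
        Some(("env", value)) => Ok(TokenSource::EnvVar(value.to_string())),
        Some(("path", value)) => Ok(TokenSource::Path(value.to_string())),
        None if s == "cache" => Ok(TokenSource::CacheToken),
        None if s == "none" => Ok(TokenSource::None),
        _ => Err(format!("invalid token source: `{s}`")),
    }
}
```

With a parser like this, a successfully parsed "none" yields TokenSource::None, so the "Token file not found" log line would have to come from somewhere other than argument parsing.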


Initial attempt

Let's follow the problem from CLI to hugging face API call with the token:

EDIT: Collapsed for brevity (not relevant)

let loader: Box<dyn Loader> = LoaderBuilder::new(args.model)

info!("Model kind is: {}", loader.get_kind().to_string());
let pipeline = loader.load_model_from_hf(
    None,
    args.token_source,
    None,
    &device,
    false,
    args.num_device_layers
        .map(DeviceMapMetadata::from_num_device_layers)
        .unwrap_or(DeviceMapMetadata::dummy()),
    args.in_situ_quant,
)?;
info!("Model loaded.");

impl Loader for GGUFLoader {
    #[allow(clippy::type_complexity, clippy::too_many_arguments)]
    fn load_model_from_hf(
        &self,
        revision: Option<String>,
        token_source: TokenSource,
        _dtype: Option<DType>,
        device: &Device,
        silent: bool,
        mapper: DeviceMapMetadata,
        in_situ_quant: Option<GgmlDType>,
    ) -> Result<Arc<Mutex<dyn Pipeline + Send + Sync>>> {
        let paths: anyhow::Result<Box<dyn ModelPaths>> = get_paths!(
            LocalModelPaths,
            &token_source,
            revision,
            self,
            self.quantized_model_id,
            self.quantized_filename,
            silent
        );
        self.load_model_from_path(&paths?, _dtype, device, silent, mapper, in_situ_quant)
    }
    #[allow(clippy::type_complexity, clippy::too_many_arguments)]
    fn load_model_from_path(

macro_rules! get_paths {
    ($path_name:ident, $token_source:expr, $revision:expr, $this:expr, $quantized_model_id:expr, $quantized_filename:expr, $silent:expr) => {{
        let api = ApiBuilder::new()
            .with_progress(!$silent)
            .with_token(Some(get_token($token_source)?))
            .build()?;

use hf_hub::{api::sync::ApiBuilder, Repo, RepoType};

/// This reads a token from a specified source. If the token cannot be read, a warning is logged with `tracing`
/// and *no token is used*.
pub(crate) fn get_token(source: &TokenSource) -> Result<String> {
    Ok(match source {

        TokenSource::None => "".to_string(),

The macro calls the huggingface API and adds the token .with_token(Some(get_token($token_source)?)), which is an empty string for TokenSource::None. The upstream crate .with_token() expects an option, so it's being wrapped with Some() to always assume a valid token is provided to it?

https://docs.rs/hf-hub/latest/hf_hub/api/sync/struct.ApiBuilder.html#method.with_token

https://github.com/huggingface/hf-hub/blob/9d6502f5bc2e69061c132f523c76a76dad470477/src/api/sync.rs#L143-L157

    /// Sets the token to be used in the API
    pub fn with_token(mut self, token: Option<String>) -> Self {
        self.token = token;
        self
    }

    fn build_headers(&self) -> HeaderMap {
        let mut headers = HeaderMap::new();
        let user_agent = format!("unkown/None; {NAME}/{VERSION}; rust/unknown");
        headers.insert(USER_AGENT, user_agent);
        if let Some(token) = &self.token {
            headers.insert(AUTHORIZATION, format!("Bearer {token}"));
        }
        headers
    }

Because the empty string was passed in wrapped in Some(), it passes that conditional and the Authorization HTTP header is added with the value "Bearer " (an empty token). If None were passed to the API here instead, the header would be skipped and that error would be avoided.
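
The idea can be sketched in isolation. The following is a minimal illustration under stated assumptions, not the actual mistral.rs patch: token_for_api is a hypothetical helper name, and authorization_header is a simplified stand-in for the hf-hub build_headers logic quoted above.

```rust
/// Hypothetical helper: convert a raw token string into the `Option<String>`
/// shape that `ApiBuilder::with_token` accepts. Empty or whitespace-only
/// tokens become `None`, so no `Authorization` header would be produced.
pub fn token_for_api(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

/// Simplified stand-in for the header logic in hf-hub's `build_headers`:
/// returns the `Authorization` header value, if one would be set at all.
pub fn authorization_header(token: &Option<String>) -> Option<String> {
    token.as_ref().map(|t| format!("Bearer {t}"))
}
```

Passing Some(String::new()) reproduces the empty "Bearer " header described above, while routing the token through token_for_api first yields None and the header is omitted.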

Next up, back in mistral.rs with that same macro, the expected tokenizer.json and config.json files are presumably being enforced by the logic here:

let tokenizer_filename = if let Some(ref p) = $this.tokenizer_json {
    info!("Using tokenizer.json at `{p}`");
    PathBuf::from_str(p)?
} else {
    $crate::api_get_file!(api, "tokenizer.json", model_id)
};
let config_filename = $crate::api_get_file!(api, "config.json", model_id);


Workaround

I'm terrible at debugging, so I sprinkled a bunch of info! lines to track where the logic in the macro was failing:

let gen_conf = if $crate::api_dir_list!(api, model_id)
    .collect::<Vec<_>>()
    .contains(&"generation_config.json".to_string())
{
    Some($crate::api_get_file!(
        api,
        "generation_config.json",
        model_id
    ))
} else {
    None
};

The api_dir_list call fails at this point due to the 401 response:

macro_rules! api_dir_list {
    ($api:expr, $model_id:expr) => {
        $api.info()
            .map(|repo| {
                repo.siblings
                    .iter()
                    .map(|x| x.rfilename.clone())
                    .collect::<Vec<String>>()
            })
            .unwrap_or_else(|e| {
                // If we do not get a 404, it was something else.
                let format = format!("{e:?}");
                if let hf_hub::api::sync::ApiError::RequestError(resp) = e {
                    if resp.into_response().is_some_and(|r| r.status() != 404) {
                        panic!("{format}");
                    }
                }

I'm not familiar with what this part of the code is trying to do, but for local/offline use the HF API shouldn't be queried at all... but it seems to be enforced?

Since 404 errors are exempt from the panic, exempting 401 in the same way makes it happy (tokenizer_config.json must also be provided, though):

if resp.into_response().is_some_and(|r| r.status() != 404) {

// Patched to also tolerate 401:
if resp.into_response().is_some_and(|r| !(matches!(r.status(), 401 | 404))) {

The proper solution is probably to opt out of the HF API entirely, though?
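
The status check in the workaround can be pulled out into a small predicate, which makes the intent easier to read and test. This is only a sketch of that one condition, assuming the status code is available as a plain u16; is_fatal_status is a hypothetical name, not something from mistral.rs.

```rust
/// Hypothetical predicate for the patched check: 401 (unauthorized) and
/// 404 (not found) are treated as "fall back to local files", while any
/// other status is considered fatal.
pub fn is_fatal_status(status: u16) -> bool {
    !matches!(status, 401 | 404)
}
```

This only papers over the symptom. It does not stop the request to the HF API from being made in the first place.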

@EricLBuehler
Owner

Hi @MoonRide303!

Our close integration with the HF hub is intentional, as generally it is better to use the official tokenizer. However, I agree that it would be nice to enable loading from only a GGUF file. I'll begin work on this, and it shouldn't be too hard.

@polarathene:

Note that --token-source none has no effect (it must be given prior to the gguf subcommand where it is not considered a valid option), the code path still goes through load_model_from_hf (which will then forward to load_model_from_path if it didn't panic):

I think this behavior can be improved, I'll make a modification.

@Jeadie
Contributor

Jeadie commented May 20, 2024

Agree with #326 (comment). The prior PR made only the minimal changes needed to load a known HF model from local files. It is an awkward UX to have for a local-only model.

@ShelbyJenkins

I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.

Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.

@EricLBuehler
Owner

I think there is a strong use case for loading from file without access to hugging face. HF is good! But, if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM model might deny access to the repo at some point in the future.

Yes, especially when using a GGUF file as otherwise, there is always ISQ. I'm working on adding this in #345.

Anyways, trying to get this to work locally now with the rust library. load_model_from_path requires the ModelPaths object, which doesn't seem to be importable from mistralrs/src/lib.rs.

Ah, sorry, that was an oversight. I just merged #348, which both exposes those, and also exposes the Device, DType and a few other useful types so that you do not need to explicitly depend on our Candle branch.

@EricLBuehler
Owner

@MoonRide303, @polarathene, @Jeadie, @joshpopelka20, @ShelbyJenkins

I just merged #345, which enables using the GGUF tokenizer. The implementation is tested against the HF tokenizer in CI, so you have a guarantee that it is correct. This is the applicable readme section.

Here is an example:

cargo run --release --features ... -- -i --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf

I would appreciate your thoughts on how this can be improved!

@MoonRide303
Author

@EricLBuehler Not strictly related to this issue, but I updated to the current CUDA version (12.5) a few days ago, and mistral.rs (as of v0.1.11) no longer compiles. Not blocking compilation with newer (and possibly backward-compatible) versions of CUDA would definitely be an improvement, as it would let me verify whether and how the fix works ^^ (alternative: provide binary releases).

   Compiling onig_sys v69.8.1
error: failed to run custom build command for `cudarc v0.11.1`

Caused by:
  process didn't exit successfully: `D:\repos-git\mistral.rs\target\release\build\cudarc-2198e5ff31cf1aaa\build-script-build` (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-env-changed=CUDA_ROOT
  cargo:rerun-if-env-changed=CUDA_PATH
  cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR

  --- stderr
  thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.1\build.rs:54:14:
  Unsupported cuda toolkit version: `12050`. Please raise a github issue.
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...

@EricLBuehler
Owner

EricLBuehler commented May 28, 2024

@MoonRide303, yes, but unfortunately that's a problem higher up in the dependency graph. There's a PR for that here: coreylowman/cudarc#238, and I'll let you know when it gets merged.

Alternatively, could you try out one of our docker containers: https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs

@solaoi

solaoi commented May 29, 2024

I am currently working on a project where I need to use the gguf model locally. However, I am not very familiar with calling Rust libraries directly.

Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.

Thank you for your assistance!

@EricLBuehler
Owner

Could you please provide an example of how to invoke gguf locally in Rust? A simple example would be very helpful for my understanding.

Absolutely, here is a simple example of running a GGUF model purely locally:

// Imports assume the types re-exported by the `mistralrs` crate (see #348 above).
use std::sync::Arc;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs,
    MistralRsBuilder, NormalRequest, Request, RequestMessage, Response, SamplingParams,
    SchedulerMethod, TokenSource,
};
use tokio::sync::mpsc::channel;

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    // We do not use any files from HF servers here, and instead load the
    // chat template from the specified file, and the tokenizer and model from a
    // local GGUF file at the path `.`
    let loader = GGUFLoaderBuilder::new(
        GGUFSpecificConfig { repeat_last_n: 64 },
        Some("chat_templates/mistral.json".to_string()),
        None,
        ".".to_string(),
        "mistral-7b-instruct-v0.1.Q4_K_M.gguf".to_string(),
    )
    .build();
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        None,
        &Device::cuda_if_available(0)?,
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;
    // Channel on which the runner sends back the response
    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: "Hello! My name is ".to_string(),
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender().blocking_send(request)?;
    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

Please feel free to let me know if you have any questions!

@solaoi

solaoi commented May 29, 2024

@EricLBuehler
Thank you for providing the example!

I tried running it, but I encountered the following error:

RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))

Since this is a local example, I assumed that the HuggingFace Token wouldn't be necessary. Is this not the case?

@EricLBuehler
Owner

EricLBuehler commented May 29, 2024

Hi @solaoi, that should be fixed now, can you please try it again after a git pull?

@EricLBuehler
Owner

@MoonRide303

Not strictly related to this issue, but I updated to the current CUDA version (12.5) a few days ago, and mistral.rs (as of v0.1.11) no longer compiles.

cudarc just merged 12.5 support, so this should compile now.

@MoonRide303
Author

@EricLBuehler Pros: it compiles. Cons: doesn't work.

mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.2\src\driver\sys\mod.rs:43:71:
called `Result::unwrap()` on an `Err` value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "The specified module could not be found." } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Not sure what might be causing this - but llama.cpp compiles and works without issues, so I'd assume my env is fine.

@EricLBuehler
Owner

@MoonRide303, it looks like you are using Windows. This issue has been reported here (coreylowman/cudarc#219) and here (huggingface/candle#2175). Can you add the path to your libcuda.so to LD_LIBRARY_PATH?

@MoonRide303
Author

MoonRide303 commented May 29, 2024

@EricLBuehler .so ELFs and LD_LIBRARY_PATH won't work on Windows. I am compiling and using dynamically linked CUDA-accelerated llama.cpp builds without issues, so CUDA .dlls should be in my path already.

$ which nvcuda.dll
/c/Windows/system32/nvcuda.dll
$ which nvrtc64_120_0.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/nvrtc64_120_0.dll
$ which cudart64_12.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/cudart64_12.dll

@EricLBuehler
Owner

Right, sorry, my mistake. Do you know if you have multiple CUDA installations on Windows? Can you run:
$ dir "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/"

@MoonRide303
Author

@EricLBuehler Got some, but those were just empty dirs from old versions:

$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/"
v11.7/  v11.8/  v12.1/  v12.4/  v12.5/

I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as the necessary .dll files are on the path (and they are).

@polarathene
Contributor

that should be fixed now, can you please try it again after a git pull?

I did a fresh git clone and build in a container, so I'm not sure why I'm still encountering the 401.

@EricLBuehler I assume the reason you're not experiencing that is because the default for --token-source is cache which you've probably been utilizing? You need to add --token-source none and should also hit the same problem?

I pointed out the 401 issue earlier. It can be bypassed with a patch, but the proper solution would be to skip calling out to HF in the first place?


The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:

let paths: anyhow::Result<Box<dyn ModelPaths>> = get_paths_gguf!(
    LocalModelPaths,
    &token_source,
    revision,
    self,
    self.quantized_model_id.clone(),
    self.quantized_filename.clone(),
    silent
);
self.load_model_from_path(&paths?, _dtype, device, silent, mapper, in_situ_quant)

What is the actual minimum set of paths needed? Can that whole method be skipped if paths are provided locally? Is the chat template required, or can it fall back to a default (perhaps with a warning)? I'm not sure what llama-cpp does, but their local GGUF loader support doesn't require much upfront to run.

Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative? (also I'm not too fond of duplicating the macro to adjust for local GGUF):

macro_rules! get_paths_gguf {
    ($path_name:ident, $token_source:expr, $revision:expr, $this:expr, $quantized_model_id:expr, $quantized_filename:expr, $silent:expr) => {{
        let api = ApiBuilder::new()
            .with_progress(!$silent)
            .with_token(get_token($token_source)?)
            .build()?;
        let revision = $revision.unwrap_or("main".to_string());
        let model_id_this = $this.model_id.clone().unwrap_or($this.quantized_model_id.clone());
        let model_id_copy = model_id_this.clone();
        let api = api.repo(Repo::with_revision(
            model_id_this.clone(),
            RepoType::Model,
            revision.clone(),
        ));
        let model_id = std::path::Path::new(&model_id_copy);
        let chat_template = if let Some(ref p) = $this.chat_template {
            if p.ends_with(".json") {
                info!("Using chat template file at `{p}`");
                PathBuf::from_str(p)?
            } else {
                PathBuf::from_str("")?
            }
        } else {
            $crate::api_get_file!(
                api,
                "tokenizer_config.json",
                model_id
            ) // Will be loaded from inside gguf file
        };
        let filenames = get_model_paths(
            revision.clone(),
            &$token_source,
            &Some($quantized_model_id),
            &Some($quantized_filename),
            &api,
            &model_id,
        )?;
        let XLoraPaths {
            adapter_configs,
            adapter_safetensors,
            classifier_path,
            xlora_order,
            xlora_config,
            lora_preload_adapter_info,
        } = get_xlora_paths(
            model_id_this,
            &$this.xlora_model_id,
            &$token_source,
            revision.clone(),
            &$this.xlora_order,
        )?;
        let gen_conf = if $crate::api_dir_list!(api, model_id)
            .collect::<Vec<_>>()
            .contains(&"generation_config.json".to_string())
        {
            Some($crate::api_get_file!(
                api,
                "generation_config.json",
                model_id
            ))
        } else {
            None
        };
        let tokenizer_filename = if $this.model_id.is_some() {
            $crate::api_get_file!(api, "tokenizer.json", model_id)
        } else {
            PathBuf::from_str("")?
        };
        Ok(Box::new($path_name {
            tokenizer_filename,
            config_filename: PathBuf::from_str("")?,
            filenames,
            xlora_adapter_configs: adapter_configs,
            xlora_adapter_filenames: adapter_safetensors,
            classifier_path,
            classifier_config: xlora_config,
            xlora_ordering: xlora_order,
            template_filename: chat_template,
            gen_conf,
            lora_preload_adapter_info,
        }))
    }};
}

Ah I see the paths struct here:

#[derive(Clone)]
pub struct LocalModelPaths<P> {
    tokenizer_filename: P,
    config_filename: P,
    template_filename: P,
    filenames: Vec<P>,
    xlora_adapter_filenames: Option<Vec<(String, P)>>,
    xlora_adapter_configs: Option<Vec<((String, String), LoraConfig)>>,
    classifier_path: Option<P>,
    classifier_config: Option<XLoraConfig>,
    xlora_ordering: Option<Ordering>,
    gen_conf: Option<P>,
    lora_preload_adapter_info: Option<HashMap<String, (P, LoraConfig)>>,
}

How about this?:

  1. Some condition to opt-out of HF API when providing local file paths?
  2. Based on that condition handle either:
     • Assign any user supplied paths from CLI
     • Get all paths info via HF API
  3. Apply fallback paths for any mandatory paths that are still None, or fail.
  4. load_model_from_path() can be called now. The load_model_from_hf() would be changed to only return the paths, instead of minor convenience of calling load_model_from_path() internally?
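
The proposed steps could look roughly like the sketch below. Everything in it is hypothetical and heavily trimmed: PartialPaths and ResolvedPaths stand in for the much larger LocalModelPaths struct, the offline flag stands in for whatever opt-out condition would be chosen, and the fetch_from_hub closure stands in for the HF API calls.

```rust
use std::path::PathBuf;

/// Hypothetical, trimmed-down set of paths gathered before any fallbacks.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct PartialPaths {
    pub model: Option<PathBuf>,
    pub tokenizer: Option<PathBuf>,
    pub chat_template: Option<PathBuf>,
}

/// Hypothetical final result handed to something like `load_model_from_path()`.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedPaths {
    pub model: PathBuf,
    /// `None` means "use the tokenizer embedded in the GGUF file".
    pub tokenizer: Option<PathBuf>,
    pub chat_template: PathBuf,
}

pub fn resolve_paths(
    offline: bool,
    user: PartialPaths,
    fetch_from_hub: impl FnOnce() -> Result<PartialPaths, String>,
    default_chat_template: Option<PathBuf>,
) -> Result<ResolvedPaths, String> {
    // Steps 1 and 2: either trust the user-supplied paths or ask the hub.
    // In offline mode the hub closure is never invoked.
    let paths = if offline { user } else { fetch_from_hub()? };

    // Step 3: apply fallbacks for mandatory paths that are still missing, or fail.
    let model = paths
        .model
        .ok_or_else(|| "no GGUF model file was provided".to_string())?;
    let chat_template = paths
        .chat_template
        .or(default_chat_template)
        .ok_or_else(|| "no chat template and no default available".to_string())?;

    // Step 4: the caller can now load the model from these paths.
    Ok(ResolvedPaths {
        model,
        tokenizer: paths.tokenizer,
        chat_template,
    })
}
```

Keeping the hub access behind a closure means the offline branch cannot reach the network by accident, which is the property this issue is asking for.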

I am a bit more familiar with this area of the project now, I might be able to take a shot at it once my active PR is merged 😅

Original response

Perhaps I am not using the command correctly:

Attempts

Can ignore most of this, mistralrs-server gguf -m . -f model.gguf fails with 401 unauthorized.


From the mistral.rs git repo at /mist, absolute path to model location for -m (EDIT: this probably should have been part of -f):

$ RUST_BACKTRACE=1 target/release/mistralrs-server gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

2024-05-30T00:06:47.470010Z  INFO mistralrs_core::pipeline::gguf: Loading model `/models/Hermes-2-Pro-Mistral-7B.Q4_K_M` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T00:06:47.508433Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_display
   3: core::option::expect_failed
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: tokio::runtime::context::runtime::enter_runtime
   7: mistralrs_server::main

Error "no entry found for key"


From the model directory, absolute path to mistralrs-server:

$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: tokio::runtime::context::runtime::enter_runtime
   4: mistralrs_server::main

401 unauthorized.


Just to double check I copied the mistralrs-server executable to the same folder which is how I previously tried to run it in past comments:

$ RUST_BACKTRACE=1 ./server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: tokio::runtime::context::runtime::enter_runtime
   4: mistralrs_server::main

401 unauthorized again.

  • Adding --token-source none doesn't help. Nor -t ., both were the two other args I had originally used in past comments but AFAIK aren't necessary anymore? Produces same error as above with 401.
  • If I don't use -t ., but use -m with an absolute path to the directory for the -f it'll give the first output I shared in this comment. So I figured maybe the same for the -t which then results in:
$ RUST_BACKTRACE=1 ./server --token-source none gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -t /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(404, Response[status: 404, status_text: Not Found, url: https://huggingface.co//models/Hermes-2-Pro-Mistral-7B.Q4_K_M/resolve/main/tokenizer.json]))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf::{{closure}}
   3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   4: tokio::runtime::context::runtime::enter_runtime
   5: mistralrs_server::main

So it's still trying to connect to HF 🤷‍♂️ (because of the mandatory -m arg I guess when I don't use .)

This model was one that you mentioned had a duplicate field (that error isn't being encountered here, although previously I had to add a patch to bypass a 401 panic, which you can see above).

@EricLBuehler
Owner

@MoonRide303, @polarathene, the following command works on my machine after I merged #362:

cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Note: as documented in the README here, you need to specify the model id, file, and chat template when loading a local GGUF model without using the HF tokenizer. If you are using the HF tokenizer, you may specify -t/--tok-model-id which is a HF/local model ID to the tokenizer.json and tokenizer_config.json.


The loaders HF method isn't doing much beyond getting the paths and then calling the local method directly after implicitly with that extra data?:

Yes, it just queries the HTTP side and, if that fails, treats them as local paths. My thinking was that we should always try HTTP first, but maybe you can flip that in a future PR?

Otherwise this macro is presumably juggling conditions of API call vs fallback/alternative?

Not really, the api_dir_list! and api_get_file! macros handle that. get_paths_gguf! just handles the tokenizer loading differences between GGUF and anything else. I don't love it though, maybe we can use akin or something like that to deduplicate it? I haven't looked into that area.

Some condition to opt-out of HF API when providing local file paths?

That seems like a great idea, perhaps --local in the CLI and a flag in the MistralRs builder so that the Python and Rust APIs can accept it? Happy to accept a PR for that too.
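
A minimal sketch of the local-first lookup discussed above. The `resolve` helper and `Source` enum are hypothetical names for illustration and are not part of mistral.rs; the `local_only` flag stands in for the proposed `--local` option. The idea is to check the filesystem before deciding whether the HF API is needed at all:

```rust
use std::path::{Path, PathBuf};

/// Where a model file should be loaded from (hypothetical type, not a mistral.rs API).
#[derive(Debug, PartialEq)]
enum Source {
    /// The file exists on disk, so no network request is needed.
    Local(PathBuf),
    /// The file was not found locally and should be fetched from the Hub.
    Remote { model_id: String, filename: String },
}

/// Resolve `filename` relative to `model_id`, preferring the local filesystem.
/// With `local_only` set, a missing file is an error instead of a Hub request.
fn resolve(model_id: &str, filename: &str, local_only: bool) -> Result<Source, String> {
    // An absolute `filename` replaces `model_id` entirely (`Path::join` semantics).
    let candidate = Path::new(model_id).join(filename);
    if candidate.is_file() {
        return Ok(Source::Local(candidate));
    }
    if local_only {
        return Err(format!("file {filename:?} not found at {model_id:?}"));
    }
    Ok(Source::Remote {
        model_id: model_id.to_string(),
        filename: filename.to_string(),
    })
}
```

With this ordering, `-m . -f model.gguf` would never touch the network when the file is present, and the 401/404 panics shown earlier could only occur on the remote branch.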

@polarathene
Contributor

@polarathene, the following command works on my machine after I merged #362:

cargo run --release --features cuda -- -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Just realized you were referencing a change in the past hour, built again and your example works properly now 🎉

Original response
$ RUST_BACKTRACE=1 ./server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/mistral-7b-instruct-v0.1.Q4_K_M.gguf

thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   3: tokio::runtime::context::runtime::enter_runtime
   4: mistralrs_server::main

That is the same model as you used? I saw you link to it the other day (might be worth having the link by the README example if troubleshooting with a common model is advised?).

$ ./server --version
mistralrs-server 0.1.11

$ git log
commit 527e7f5282c991d399110e21ddbef6c51bba607c (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
Date:   Wed May 29 10:12:24 2024 -0400

    Merge pull request #360 from EricLBuehler/fix_unauth

    Fix no auth token for local loading

Oh... mistook the PR you referenced as an older one, I see that's new.


However, the same command with just -f changed to another Mistral model hit the "no entry found for key" failure that I mentioned in my previous message:

$ RUST_BACKTRACE=1 target/release/mistralrs-server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf

2024-05-30T11:06:44.051393Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T11:06:44.099117Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_display
   3: core::option::expect_failed
   4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
   5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
   6: tokio::runtime::context::runtime::enter_runtime
   7: mistralrs_server::main

I tried another model and it started loading, then panicked because I had mistyped a different chat template filename. It probably should have verified that the file existed before attempting to load the model.

Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)

@EricLBuehler
Owner

EricLBuehler commented May 30, 2024

@polarathene I think you should be able to run the Hermes model now. I just merged #363, which allows the default unigram UNK token (0) in case it is missing.
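
For context, "no entry found for key" is the message Rust's `HashMap` produces when indexed with `[]` on a missing key. A minimal sketch of the fallback idea, assuming the metadata is simplified to a flat `HashMap<String, u32>` (the real loader's types differ; `unk_token_id` is a hypothetical helper, and the key name is taken from the GGUF spec):

```rust
use std::collections::HashMap;

/// Look up the unknown-token id in GGUF-style metadata, defaulting to 0 when absent.
/// The flat `HashMap<String, u32>` is a simplification made for this sketch.
fn unk_token_id(metadata: &HashMap<String, u32>) -> u32 {
    // `metadata["..."]` would panic with "no entry found for key" on a missing key;
    // `get` returns an `Option`, so a fallback can be supplied instead.
    metadata
        .get("tokenizer.ggml.unknown_token_id")
        .copied()
        .unwrap_or(0)
}
```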

Tried a few other GGUF models from HF and some also failed with no entry found for key, yet these seem to work with llama-cpp so probably not quite there yet? 🤔 (Here's the Hermes model I tried that gave the above failure)

Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram. After I merge #356, I'll add support for chat template via the GGUF file (I don't want to cause any more rebases :)), but until then, you should provide the Hermes chat template in the chat template file. In this case, it would be:

{
    "chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
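
To make it easier to see what that template produces, here is a dependency-free sketch that builds the same string by hand. `render_chatml` is a hypothetical helper for illustration only; the server itself renders the Jinja template rather than using code like this:

```rust
/// Render messages the way the ChatML-style template above would, without a Jinja engine.
/// Each message is a `(role, content)` pair.
fn render_chatml(
    bos_token: &str,
    messages: &[(&str, &str)],
    add_generation_prompt: bool,
) -> String {
    let mut out = String::from(bos_token);
    for (role, content) in messages {
        // One block per message: start marker, role, newline, content, end marker.
        out.push_str(&format!("<|im_start|>{role}\n{content}<|im_end|>\n"));
    }
    if add_generation_prompt {
        // Leave the assistant turn open so the model continues from here.
        out.push_str("<|im_start|>assistant\n");
    }
    out
}
```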

@EricLBuehler
Owner

@EricLBuehler Got some, but those were just empty dirs from old versions:

$ ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/"
v11.7/  v11.8/  v12.1/  v12.4/  v12.5/

I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as necessary .dll are in the path (and they are).

@MoonRide303, coreylowman/cudarc#240 should fix this.

@EricLBuehler
Owner

@MoonRide303, I think it should be fixed now, coreylowman/cudarc#240 was merged and there are reports that it works for others. Can you please try it again after a git pull and cargo update?

@polarathene
Contributor

Yeah, we only support the llama/replit GGUF tokenizer model for now as they are both unigram.

I don't know much about these, but after a git pull I can confirm Hermes is working now while the models that aren't now report an error about the actual tokenizer missing support which is good to see 👍

@EricLBuehler
Owner

I don't know much about these, but after a git pull I can confirm Hermes is working now while the models that aren't now report an error about the actual tokenizer missing support which is good to see 👍

Great! For reference, see these docs: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#ggml

@MoonRide303
Author

@EricLBuehler tested (full rebuild) on current master (1d21c5f), result:

mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf

thread 'main' panicked at C:\Users\[REDACTED]\.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.3\src\curand\sys\mod.rs:51:9:
Unable to find curand lib under the names ["curand", "curand64", "curand64_12", "curand64_125", "curand64_125_0", "curand64_120_5"]. Please open GitHub issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Proper name for this one (as of CUDA 12.5 on Windows) should be:

$ which curand64_10.dll
/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.5/bin/curand64_10.dll

@EricLBuehler
Owner

@MoonRide303 I opened an issue: coreylowman/cudarc#242. I'll let you know when a fix gets merged.

@MoonRide303
Author

@EricLBuehler I've noticed the change in cudarc was merged, so I tried to rebuild - it seems problems with CUDA dlls are now solved. But it still asks me for tokenizer_config.json file:

mistralrs-server.exe gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:294:58:
File "tokenizer_config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

When I copied this config from the original HF repo and added port parameter, I managed to launch the server:

mistralrs-server.exe -p 1234 gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
2024-06-01T20:00:38.909745Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-06-01T20:00:39.046703Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 18
general.name: Mistral-7B-Instruct-v0.3
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32768
2024-06-01T20:00:39.072195Z  INFO mistralrs_core::pipeline::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2024-06-01T20:00:40.665088Z  INFO mistralrs_core::pipeline::paths: `tokenizer_config.json` does not contain a chat template, attempting to use specified JINJA chat template.
2024-06-01T20:00:40.665937Z  INFO mistralrs_core::pipeline::paths: No specified chat template. No chat template will be used. Only prompts will be accepted, not messages.
2024-06-01T20:00:40.718935Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
2024-06-01T20:00:40.721333Z  INFO mistralrs_server: Model loaded.
2024-06-01T20:00:40.721698Z  INFO mistralrs_core: GEMM reduced precision in BF16 not supported.
2024-06-01T20:00:40.747338Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-06-01T20:00:40.748197Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

Though it seems it didn't read chat template at all (not from the GGUF, and not from the separate config file). Trying to chat after running python chat.py (from examples/server/) resulted in

openai.UnprocessableEntityError: Error code: 422 - {'message': 'Received messages for a model which does not have a chat template. Either use a different model or pass a single string as the prompt'}
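
The 422 matches the "Only prompts will be accepted, not messages" log line: without a chat template, a list of messages cannot be turned into a prompt. A rough sketch of that check, with a hypothetical `Input` enum and `prepare` function (not the server's actual types):

```rust
/// A request body is either a list of chat messages or a single raw prompt.
enum Input {
    Messages(Vec<(String, String)>),
    Prompt(String),
}

/// Reject chat messages when no chat template is available, mirroring the 422 above.
/// Returns the text that would be sent to the model.
fn prepare(input: Input, chat_template: Option<&str>) -> Result<String, String> {
    match (input, chat_template) {
        // A raw prompt never needs a template.
        (Input::Prompt(text), _) => Ok(text),
        (Input::Messages(_), None) => Err(
            "Received messages for a model which does not have a chat template".to_string(),
        ),
        (Input::Messages(msgs), Some(_template)) => {
            // A real implementation would render `_template` with a Jinja engine;
            // joining the contents keeps this sketch dependency-free.
            let contents: Vec<String> = msgs.into_iter().map(|(_, content)| content).collect();
            Ok(contents.join("\n"))
        }
    }
}
```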

@ShelbyJenkins

Great work! I just re-started implementing for my crate. For reference I'm giving two options for loading GGUFs. Both are designed to be as easy as possible (for the Python immigrants).

Option 1) From presets with pre-downloaded tokenizer.json, tokenizer_config, and config.json. Given the user's VRAM, it then downloads the largest quant that will fit. The tokenizer.json is no longer required since you've implemented tokenizer from GGUF (legend), and presumably I can use that interface or use the code in my crate for my tokenizer needs.
Option 2) The user supplies the full HF URL or file path to the quantized model.
Option 3) Not implemented: safetensors -> ISQ :D

I then plan to pass those paths after loading to mistral.rs just like I currently do with llama.cpp.
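
The "largest quant that fits" selection in Option 1 could look roughly like the sketch below. `pick_quant` and the `overhead` parameter are hypothetical and only illustrate the idea; a real implementation would need an actual estimate for KV cache and activation memory:

```rust
/// Pick the largest quantized file that fits in the available VRAM.
/// `quants` pairs a file name with its size in bytes; `overhead` reserves room
/// for the KV cache and activations (both are assumptions of this sketch).
fn pick_quant<'a>(quants: &[(&'a str, u64)], vram_bytes: u64, overhead: u64) -> Option<&'a str> {
    // Never underflow if the reserved overhead exceeds the available memory.
    let budget = vram_bytes.saturating_sub(overhead);
    quants
        .iter()
        .filter(|(_, size)| *size <= budget)
        .max_by_key(|(_, size)| *size)
        .map(|(name, _)| *name)
}
```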


IMO the GGUFLoaderBuilder or the GGUFLoader interface could be simplified for clarity/discoverability and maintenance.

  • Split into two code paths - one for loading from hf_hub, and one for loading from local.
  • Or combine with something like this:
impl GGUFLoaderBuilder {
    pub fn new() -> Self {
        let kind = ModelKind::Quantized {
            quant: QuantizationKind::Gguf,
        };

        Self {
            config: GGUFSpecificConfig { repeat_last_n: 64 },
            kind,
            ..Default::default()
        }
    }

    /// Load the model from Hugging Face.
    /// A full path to the quantized model's download link,
    /// OR it could be repo_id/model_id + filename.
    pub fn from_hf(mut self, model_url: &str) -> Self {
        self.model_url = Some(model_url.to_string());
        self
    }

    /// Load the model from a local path.
    pub fn from_local_path(mut self, model_path: &str) -> Self {
        self.model_path = Some(model_path.to_string());
        self
    }

    /// A config for a GGUF loader.
    pub fn repeat_last_n(mut self, repeat_last_n: usize) -> Self {
        self.config.repeat_last_n = repeat_last_n;
        self
    }

    /// The Jinja chat template to use.
    /// Not required if loading chat template from GGUF implemented.
    pub fn chat_template(mut self, chat_template: &str) -> Self {
        self.chat_template = Some(chat_template.to_string());
        self
    }

    /// The file name and path of the chat template JSON file.
    /// Not required if loading chat template from GGUF implemented.
    pub fn chat_template_path(mut self, chat_template_path: &str) -> Self {
        // Stored separately so it does not overwrite the inline template above.
        self.chat_template_path = Some(chat_template_path.to_string());
        self
    }

    pub fn build(self) ...
}

I would implement this and submit a PR, but I haven't looked at the downstream code enough to understand the implications. If you think it's a good idea, I can make an attempt.

On chat templates - If we could implement loading the chat template from GGUF, I'm not sure we'd need anything else. The reason llama.cpp doesn't do this is because they don't/won't implement a Jinja parser, but as most models are including the chat template in the GGUF now I'm not sure if there is a reason to manually load it. I don't mind manually adding a chat_template.json to the presets I have, but it makes loading a model from a single file more difficult. Another option might be to accept the chat template as a string.
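
As a rough illustration of that idea: GGUF files store the template under the `tokenizer.chat_template` metadata key, so a loader could fall back to it whenever nothing is supplied explicitly. The function name, the flat string map, and the precedence order below are all assumptions made for this sketch:

```rust
use std::collections::HashMap;

/// Choose a chat template: an explicit one wins, then the one embedded in the GGUF metadata.
/// `tokenizer.chat_template` is the GGUF metadata key; the precedence order is an assumption.
fn select_chat_template<'a>(
    explicit: Option<&'a str>,
    gguf_metadata: &'a HashMap<String, String>,
) -> Option<&'a str> {
    explicit.or_else(|| {
        gguf_metadata
            .get("tokenizer.chat_template")
            .map(String::as_str)
    })
}
```

Returning `None` leaves the caller free to accept raw prompts only, as the server currently does when no template is found.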

@EricLBuehler
Owner

@MoonRide303 that was a bug and should be fixed now.

@EricLBuehler
Owner

Great work! I just re-started implementing for my crate. For reference I'm giving two options for loading GGUFs. Both are designed to be as easy as possible (for the Python immigrants).

Hi @ShelbyJenkins! That's exciting, I look forward to seeing future developments.

I would implement this and submit a PR, but I haven't looked at the downstream code enough to understand the implications. If you think it's a good idea, I can make an attempt.

That sounds good. If you could also be sure to implement it for the Normal and GGML loaders too, that would be great.

On chat templates - If we could implement loading the chat template from GGUF, I'm not sure we'd need anything else. The reason llama.cpp doesn't do this is because they don't/won't implement a Jinja parser, but as most models are including the chat template in the GGUF now I'm not sure if there is a reason to manually load it. I don't mind manually adding a chat_template.json to the presets I have, but it makes loading a model from a single file more difficult. Another option might be to accept the chat template as a string.

Yes, that is something which will be easy to do. I'll add support for that in the coming days and will write here when I do.

@EricLBuehler
Owner

@ShelbyJenkins I just merged support for multiple GGUF files in #379.

@polarathene
Contributor

IMO the GGUFLoaderBuilder or the GGUFLoader interface could be simplified for clarity/discoverability and maintenance.

I'll be refactoring this, it's on my todo.

I would implement this and submit a PR, but I haven't looked at the downstream code enough to understand the implications. If you think it's a good idea, I can make an attempt.

That sounds good. If you could also be sure to implement it for the Normal and GGML loaders too, that would be great.

Again heads up that this is something I'll be tackling 😅

My interests are in simplifying the existing code for maintenance, I also am likely to change the ModelKind enum/type as part of this.

After that's been handled, it would be a better time for you to PR your own improvements for UX. Just mentioning this to avoid us both working on conflicting changes at the same time for this portion of mistral.rs.

@polarathene
Contributor

I just merged support for multiple GGUF files in #379.

I'm having trouble following this requirement?

Where is the example of sharded GGUF? Was there a misunderstanding with what @ShelbyJenkins was describing? (Multiple GGUF files that are distinct by their quantization to support lower VRAM requirements)

Your PR changed quite a bit and I'm lacking clarity as to why it was needed as it seems otherwise redundant complexity?

@MoonRide303
Author

@MoonRide303 that was a bug and should be fixed now.

Doesn't look fixed as of v0.1.15 (9712da6) - it still asks for tokenizer config in separate file:

mistralrs-server.exe -p 1234 gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
thread 'main' panicked at mistralrs-core\src\pipeline\gguf.rs:272:58:
File "tokenizer_config.json" not found at model id "."
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

And even when provided in separate file (which shouldn't be needed) - yes, the server starts without warnings:

mistralrs-server.exe -p 1234 gguf -m . -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
2024-06-05T06:42:12.616525Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-06-05T06:42:12.801370Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `unigram`, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2024-06-05T06:42:20.837169Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
2024-06-05T06:42:20.839905Z  INFO mistralrs_server: Model loaded.
2024-06-05T06:42:20.840295Z  INFO mistralrs_core: GEMM reduced precision in BF16 not supported.
2024-06-05T06:42:20.898012Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-06-05T06:42:20.899032Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

But the chat template still doesn't work:

python chat.py
Enter system prompt >>> Just be yourself.
>>> Who are you?
Traceback (most recent call last):
  File "D:\repos-git\mistral.rs\examples\server\chat.py", line 47, in <module>
    completion = openai.chat.completions.create(
  File "D:\anaconda3\lib\site-packages\openai\_utils\_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\lib\site-packages\openai\resources\chat\completions.py", line 590, in create
    return self._post(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 921, in request
    return self._request(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1005, in _request
    return self._retry_request(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1053, in _retry_request
    return self._request(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1005, in _request
    return self._retry_request(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1053, in _retry_request
    return self._request(
  File "D:\anaconda3\lib\site-packages\openai\_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'message': 'invalid operation: Conversation roles must alternate user/assistant/user/assistant/... (in chat_template:1)'}

@EricLBuehler
Owner

@polarathene, great, looking forward to those refactors.

Regarding the sharded GGUF files, here is one such example. That PR also reduced GGUF code duplication by centralizing the get_metadata and loading from a reader.
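
For readers unfamiliar with sharded GGUF: shards are commonly named `<prefix>-00001-of-00003.gguf`, following llama.cpp's gguf-split tool. A small sketch of recognizing that pattern, assuming this naming convention (`parse_shard` is a hypothetical helper, and nothing here reflects how the PR itself detects shards):

```rust
/// Parse a shard file name of the form `<prefix>-00001-of-00003.gguf`
/// into `(prefix, index, total)`. Returns `None` for unsharded names.
fn parse_shard(name: &str) -> Option<(&str, u32, u32)> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, total) = stem.rsplit_once("-of-")?;
    let (prefix, index) = rest.rsplit_once('-')?;
    // gguf-split pads both numbers to five digits.
    let five_digits = |s: &str| s.len() == 5 && s.bytes().all(|b| b.is_ascii_digit());
    if !five_digits(index) || !five_digits(total) {
        return None;
    }
    let index: u32 = index.parse().ok()?;
    let total: u32 = total.parse().ok()?;
    // Shard indices are 1-based and cannot exceed the total.
    (index >= 1 && index <= total).then_some((prefix, index, total))
}
```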

@EricLBuehler
Owner

Hi @MoonRide303, I just merged #385 which adds more verbose and hopefully more helpful logging during loading.

I think there are a few things going on here.

  1. We don't support loading the chat template directly from the GGUF file. As such, you should provide a chat template file or a Hugging Face tokenizer model ID. I will add support for loading the chat template from the GGUF file soon (tracking issue is Load chat template from GGUF file #386).
  2. The Mistral v0.1 Instruct model does not accept a system prompt:
    From the HF Jinja template:
raise_exception('Only user and assistant roles are supported!')
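
To show why a leading system message triggers the "Conversation roles must alternate" error seen above, here is a sketch of the two checks written in Rust. It assumes the template tests alternation first and rejects other roles second; `validate_roles` is a hypothetical helper, and the actual checks live in the model's Jinja template:

```rust
/// Check roles in the order assumed for the template: positions must alternate
/// user/assistant starting with user, and no other role is accepted.
fn validate_roles(roles: &[&str]) -> Result<(), &'static str> {
    for (i, role) in roles.iter().enumerate() {
        // Even positions must be "user", odd positions must not be.
        if (*role == "user") != (i % 2 == 0) {
            return Err("Conversation roles must alternate user/assistant/user/assistant/...");
        }
        if *role != "user" && *role != "assistant" {
            return Err("Only user and assistant roles are supported!");
        }
    }
    Ok(())
}
```

A conversation that starts with `system` fails the alternation check at position 0, which is the error the chat.py run reported.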

@MoonRide303
Author

MoonRide303 commented Jun 5, 2024

@EricLBuehler The model itself knows how to handle the system role; it's just a limitation of the default template. But you're right that it's a separate issue and mistral.rs behaviour was okay (the exception is defined in the template) - and it can also be worked around by providing a custom template that accepts the system role. As for this issue - it seems that #386 is the last missing part, then.

@polarathene
Contributor

That PR also reduced GGUF code duplication by centralizing the get_metadata and loading from a reader.

I was actually going to handle that (and do it better IMO), but your changes would have caused heavy conflicts to resolve, so I'm somewhat glad I was checking development activity as I was about to start around the time I discovered the merged PR 😓

@ShelbyJenkins if you want to tackle your own improvements to the loader, go ahead. The pace of development in this area is a bit too frequent for me to want to try touching it for the time being.

@EricLBuehler
Owner

EricLBuehler commented Jun 5, 2024

I was actually going to handle that (and do it better IMO), but your changes would have caused heavy conflicts to resolve, so I'm somewhat glad I was checking development activity as I was about to start around the time I discovered the merged PR 😓

@polarathene as mentioned in #380, I will roll back those changes. Looking forward to seeing your implementation!

@ShelbyJenkins if you want to tackle your own improvements to the loader, go ahead. The pace of development in this area is a bit too frequent for me to want to try touching it for the time being.

I don't plan on working much in that area for the foreseeable future, aside from the rollback, which I'll do shortly. The pace of development there should not be excessive.

@ShelbyJenkins

@polarathene @EricLBuehler Ok, I'll take a look at this over this week and weekend. It's not a high priority for my project, but I think it's valuable to people consuming this library 👍

@EricLBuehler
Owner

Hi @MoonRide303! I just merged #416, which enables loading the chat template from a GGUF file, as well as #397, which adds support for the GPT2 (BPE) tokenizer type and so extends model support. This command now works:

cargo run --release --features cuda -- --token-source none -i gguf -m . -f Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

Output

2024-06-10T12:38:05.106472Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-06-10T12:38:05.106510Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-10T12:38:05.106523Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-06-10T12:38:05.132855Z  INFO mistralrs_core::pipeline::paths: Loading `"Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"` locally at `"./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"`
2024-06-10T12:38:05.180204Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-06-10T12:38:06.002567Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: models
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-06-10T12:38:06.281404Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-06-10T12:38:06.292431Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}`
2024-06-10T12:38:08.128146Z  INFO mistralrs_core::pipeline::paths: Using literal chat template.
2024-06-10T12:38:08.399407Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|end_of_text|>", "<|eot_id|>", unk_tok = `None`
2024-06-10T12:38:08.414974Z  INFO mistralrs_server: Model loaded.
2024-06-10T12:38:08.415363Z  INFO mistralrs_core: GEMM reduced precision in BF16 not supported.
2024-06-10T12:38:08.429916Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-06-10T12:38:08.432349Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-06-10T12:38:08.432442Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
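
The tokenizer models mentioned in this thread map onto two tokenizer kinds, as the `gguf_tokenizer` log lines show. A minimal sketch of that dispatch, using a hypothetical enum and error message rather than the actual mistral.rs types:

```rust
/// Tokenizer families named in this thread (hypothetical enum, not the mistral.rs type).
#[derive(Debug, PartialEq)]
enum TokenizerKind {
    Unigram,
    Bpe,
}

/// Map the GGUF `tokenizer.ggml.model` value to a tokenizer kind.
fn tokenizer_kind(model: &str) -> Result<TokenizerKind, String> {
    match model {
        // Both were described earlier in the thread as unigram tokenizers.
        "llama" | "replit" => Ok(TokenizerKind::Unigram),
        "gpt2" => Ok(TokenizerKind::Bpe),
        // Report the unsupported model by name instead of failing on a missing key.
        other => Err(format!("GGUF tokenizer model `{other}` is not supported")),
    }
}
```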

@EricLBuehler EricLBuehler added new feature New feature or request and removed bug Something isn't working labels Jun 10, 2024
@MoonRide303
Author

cargo run --release --features cuda -- --token-source none -i gguf -m . -f Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

Something is wrong with the tokenizer: it fails the "What is 3333+777?" test (using ac1537d):

PS D:\repos-git\mistral.rs> cargo run --release --features cuda -- --token-source none -i gguf -m . -f E:\ML-models\Meta-Llama-3-8B-Instruct-GGUF\Meta-Llama-3-8B-Instruct-Q6_K.gguf
    Finished `release` profile [optimized] target(s) in 0.42s
     Running `target\release\mistralrs-server.exe --token-source none -i gguf -m . -f E:\ML-models\Meta-Llama-3-8B-Instruct-GGUF\Meta-Llama-3-8B-Instruct-Q6_K.gguf`
2024-06-10T14:51:47.647104Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-06-10T14:51:47.647227Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-06-10T14:51:47.647344Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-06-10T14:51:47.648949Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-06-10T14:51:47.649053Z  INFO hf_hub: Token file not found "C:\\Users\\[REDACTED]\\.cache\\huggingface\\token"
2024-06-10T14:51:47.951484Z  INFO mistralrs_core::pipeline::paths: Loading `"E:\\ML-models\\Meta-Llama-3-8B-Instruct-GGUF\\Meta-Llama-3-8B-Instruct-Q6_K.gguf"` locally at `"E:\\ML-models\\Meta-Llama-3-8B-Instruct-GGUF\\Meta-Llama-3-8B-Instruct-Q6_K.gguf"`
2024-06-10T14:51:48.475791Z  INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-06-10T14:51:49.520064Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 18
general.name: Meta-Llama-3-8B-Instruct
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-06-10T14:51:49.712946Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-06-10T14:51:49.721485Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}`
2024-06-10T14:51:51.559330Z  INFO mistralrs_core::pipeline::paths: Using literal chat template.
2024-06-10T14:51:51.761652Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|eot_id|>", unk_tok = `None`
2024-06-10T14:51:51.778165Z  INFO mistralrs_server: Model loaded.
2024-06-10T14:51:51.778496Z  INFO mistralrs_core: GEMM reduced precision in BF16 not supported.
2024-06-10T14:51:51.802510Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-06-10T14:51:51.806780Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-06-10T14:51:51.806864Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
> What is 3333+777?
Let me calculate that for you!

33,333 + 777 = 34,110
>

Same model file used in llama.cpp:
image
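
For reference, the correct answer to the test prompt is 4110, while the output above works with 33,333, as if the prompt had gained an extra digit. One possible explanation, which is an assumption and not something confirmed in this thread, is a mismatch in digit pre-tokenization: the Llama 3 tokenizer's split pattern contains `\p{N}{1,3}`, which breaks a run of digits into groups of at most three from left to right. The sketch below only illustrates that grouping on the numbers from the prompt. `chunk_digits` is a hypothetical helper written for this illustration and is not part of mistral.rs or llama.cpp.

```rust
/// Hypothetical helper: split a run of ASCII digits into groups of at most
/// three, left to right, the way a greedy `\p{N}{1,3}` match would.
/// Assumes the input contains ASCII digits only.
fn chunk_digits(run: &str) -> Vec<&str> {
    run.as_bytes()
        .chunks(3)
        .map(|c| std::str::from_utf8(c).expect("ASCII digits are valid UTF-8"))
        .collect()
}

fn main() {
    // The arithmetic the model was asked to do.
    assert_eq!(3333 + 777, 4110);

    // Digit grouping for the two operands in the prompt.
    println!("{:?}", chunk_digits("3333"));
    println!("{:?}", chunk_digits("777"));
}
```

If the tokenizer rebuilt from the GGUF metadata grouped digits differently from this, the model would see number tokens it was not trained on, which could show up as exactly this kind of arithmetic slip. Comparing the token IDs produced for the prompt in both runtimes would confirm or rule this out.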

@EricLBuehler
Owner

EricLBuehler commented Jul 12, 2024

@MoonRide303 I'm not sure about that; please open another issue. I'm closing this one because the feature is complete, but please feel free to reopen it!
