
Server: Update /props endpoint to correctly return default server parameters #8418

Merged
merged 1 commit into ggerganov:master
Jul 11, 2024

Conversation

HanClinto
Collaborator

@HanClinto HanClinto commented Jul 10, 2024

In #8402, we added the ability to set default request parameters on the command line.

One shortcoming of that PR is that it did not update the /props endpoint, so that endpoint was still returning bogus information.

Example:

  1. Start the server with some changed values for grammar and n_ctx:
./llama-server -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file "./grammars/no-e.gbnf" -c 1024
  2. Navigate to http://localhost:8080/props and note that -- other than correctly listing the model that is loaded -- none of the default sampling parameters set via the CLI are shown:
{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 2048,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}

In particular, note that this endpoint (incorrectly) returns 2048 for n_ctx, and a blank string for grammar.

There were a few possible ways to fix this, but the lowest-friction method was, during init(), to initialize each slot's sampling parameters by copying from the global context's sampling parameters. This is similar to the one-liner method that we used in #8402, but while that operated at runtime (when jobs are fired off), this one operates at initialization.
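
Conceptually, the change boils down to something like the following. This is only a minimal sketch of the pattern, using simplified stand-in types: server_slot, slot_sampling_params, and init_slots() are illustrative names, not the exact structures in server.cpp.

#include <string>
#include <vector>

// Sketch only: simplified stand-ins for the real server types.
struct slot_sampling_params { int top_k; float temp; std::string grammar; /* ... */ };
struct server_slot          { int n_ctx; slot_sampling_params sparams;    /* ... */ };

void init_slots(std::vector<server_slot> & slots,
                int n_ctx_per_slot,
                const slot_sampling_params & cli_defaults) {
    for (server_slot & slot : slots) {
        slot.n_ctx   = n_ctx_per_slot; // per-slot context size derived from -c
        slot.sparams = cli_defaults;   // copy CLI-provided sampling defaults into the slot
    }
}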

The first slot is then chosen, and its default parameters are serialized to JSON and stored in default_generation_settings_for_props -- the same as before. It's nice to have these serialized and saved this way, because even if a slot's parameters are overwritten by a later request, the value stored in default_generation_settings_for_props will always represent the defaults.
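
Continuing the sketch above (nlohmann::json is what the server already uses for its HTTP responses; slot_to_json is a hypothetical stand-in for the existing slot-serialization logic), the snapshot for /props could be captured like this:

#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch only: serialize a slot's settings to JSON.
static json slot_to_json(const server_slot & slot) {
    return json {
        { "n_ctx",       slot.n_ctx           },
        { "top_k",       slot.sparams.top_k   },
        { "temperature", slot.sparams.temp    },
        { "grammar",     slot.sparams.grammar },
        // ... the remaining fields shown in the /props output above
    };
}

json default_generation_settings_for_props;

// Called once, right after init_slots(); because the snapshot is taken before
// any request runs, later per-request overrides to a slot's parameters never
// change what /props reports as the defaults.
void capture_props_defaults(const std::vector<server_slot> & slots) {
    default_generation_settings_for_props = slot_to_json(slots.front());
}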

And this is what the end result looks like when querying /props:

{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 1024,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "root ::= [^eE]*\n",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}

It now contains the correct values of n_ctx = 1024 and our non-blank grammar -- success!

This solution does not increase memory usage, and I can't think of any edge cases where it falls down. Yesterday, when I first tried to fix this, I got wrapped around the axle with an overly complicated approach. I'm glad I slept on it for a day, because I think today's solution is much more elegant.

Tagging @ngxson in particular for review on this one.

Thank you!

@HanClinto HanClinto merged commit 278d0e1 into ggerganov:master Jul 11, 2024
53 checks passed