
Server: Update /props endpoint to correctly return default server parameters #8418

Merged
merged 1 commit into ggerganov:master
Jul 11, 2024

Conversation

HanClinto
Collaborator

@HanClinto HanClinto commented Jul 10, 2024

In #8402, we added the ability to set default request parameters on the command line.

One shortcoming of that PR is that it did not update the /props endpoint, so that endpoint was still returning bogus information.

Example:

  1. Start the server with some changed values for grammar and n_ctx:
./llama-server -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file "./grammars/no-e.gbnf" -c 1024
  2. Navigate to http://localhost:8080/props and note that -- other than correctly listing the model that is loaded -- none of the default sampling parameters set via the CLI are shown:
{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 2048,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}

In particular, note that this endpoint (incorrectly) returns 2048 for n_ctx, and a blank string for grammar.

There were a few possible ways to fix this, but the lowest-friction method was, during init(), to initialize each slot's sampling parameters by copying from the global context's sampling parameters. This is similar to the one-liner method that we used in #8402, but while that operated at runtime (when jobs are fired off), this one operates at initialization.
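
Conceptually, the change boils down to something like the following. This is only a minimal sketch of the pattern, using simplified stand-in types: server_slot, slot_sampling_params, and init_slots() are illustrative names, not the exact structures in server.cpp.

#include <string>
#include <vector>

// Sketch only: simplified stand-ins for the real server types.
struct slot_sampling_params { int top_k; float temp; std::string grammar; /* ... */ };
struct server_slot          { int n_ctx; slot_sampling_params sparams;    /* ... */ };

void init_slots(std::vector<server_slot> & slots,
                int n_ctx_per_slot,
                const slot_sampling_params & cli_defaults) {
    for (server_slot & slot : slots) {
        slot.n_ctx   = n_ctx_per_slot; // per-slot context size derived from -c
        slot.sparams = cli_defaults;   // copy CLI-provided sampling defaults into the slot
    }
}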

The first slot is then chosen, and its default parameters are serialized to JSON and stored in default_generation_settings_for_props -- the same as before. It's nice to have these serialized and saved this way, because even if a slot's parameters are overwritten by a later request, the value stored in default_generation_settings_for_props will always represent the defaults.
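
Continuing the sketch above (nlohmann::json is what the server already uses for its HTTP responses; slot_to_json is a hypothetical stand-in for the existing slot-serialization logic), the snapshot for /props could be captured like this:

#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch only: serialize a slot's settings to JSON.
static json slot_to_json(const server_slot & slot) {
    return json {
        { "n_ctx",       slot.n_ctx           },
        { "top_k",       slot.sparams.top_k   },
        { "temperature", slot.sparams.temp    },
        { "grammar",     slot.sparams.grammar },
        // ... the remaining fields shown in the /props output above
    };
}

json default_generation_settings_for_props;

// Called once, right after init_slots(); because the snapshot is taken before
// any request runs, later per-request overrides to a slot's parameters never
// change what /props reports as the defaults.
void capture_props_defaults(const std::vector<server_slot> & slots) {
    default_generation_settings_for_props = slot_to_json(slots.front());
}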

And this is what the end result looks like when querying /props:

{
  "system_prompt": "",
  "default_generation_settings": {
    "n_ctx": 1024,
    "n_predict": -1,
    "model": "/Users/snakamoto/Library/Caches/llama.cpp/phi-2.Q4_K_M.gguf",
    "seed": -1,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.0500000007450581,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.100000001490116,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "root ::= [^eE]*\n",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "total_slots": 1,
  "chat_template": ""
}

It now contains the correct values of n_ctx = 1024 and our non-blank grammar -- success!

This solution does not increase memory usage, and I can't think of any edge cases where it falls down. Yesterday, when I first tried to fix this, I got wrapped around the axle with an overly complicated approach. I'm glad I slept on it for a day, because I think today's solution is much more elegant.

Tagging @ngxson in particular for review on this one.

Thank you!

@HanClinto HanClinto merged commit 278d0e1 into ggerganov:master Jul 11, 2024
53 checks passed