llama : refactor session file management #8699
Conversation
* llama : saving and restoring state checks for overflow
  The size of the buffers should now be given to the functions working with them, otherwise a truncated file could cause out-of-bounds reads.
* llama : stream from session file instead of copying into a big buffer
  Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
  This is a breaking change, but makes that function *much* easier to keep up to date, and it also makes it reflect the behavior of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
  Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
  Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading. Storing all the hparams is not necessary.
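The overflow checks boil down to passing the buffer size along with the pointer and refusing any read past the end. A minimal sketch of the idea, using a hypothetical helper rather than the PR's actual code:

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical bounded reader: every read is checked against the
// remaining size, so a truncated session file raises an error
// instead of causing an out-of-bounds read.
struct bounded_reader {
    const uint8_t * data;
    size_t size;
    size_t offset = 0;

    void read(void * dst, size_t n) {
        if (n > size - offset) { // offset <= size is an invariant, so no underflow
            throw std::runtime_error("unexpected end of session file");
        }
        std::memcpy(dst, data + offset, n);
        offset += n;
    }
};
```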
Force-pushed from 7de7c17 to f1b0a1f
Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors.
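One standard portable alternative is the `PRIu64` macro from `<cinttypes>`; the cast to `size_t` with `%zu` used here is just as valid and avoids the macro noise. A sketch of the two options:

```cpp
#include <cinttypes>
#include <cstdio>

int main() {
    uint64_t n_read = 42;
    // Approach used in this PR: cast to size_t and format with %zu
    printf("read %zu bytes\n", (size_t) n_read);
    // Alternative: PRIu64 expands to the right conversion on each platform
    printf("read %" PRIu64 " bytes\n", n_read);
}
```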
Force-pushed from f1b0a1f to cddc899
Nice job!

I guess you can drop the `_context` suffix for these new types:

- `llama_data_context` -> `llama_data`
- `llama_data_read_context` -> `llama_data_read`
- etc.

The reason to have the suffix in `llama_context`, `ggml_context`, `gguf_context` is that they are top-level objects and without the suffix it would be too ambiguous in terms of symbol names (i.e. grep for `ggml` will match the entire codebase). For example, `llama_model` is OK and does not need to be `llama_model_context`. Same for `ggml_tensor`, `llama_kv_cache`, `llama_vocab`, etc.

The `llama_sampling_context` in `common` is a mistake and I will try to fix this soon.

But even if you keep the names as they are, it's fine too.
llama_state_get_size cannot be used to get the max size anymore.
I recently bumped into the fact that there are both `llama_sampling` and `llama_sampling_context`. This means that the RNG save and restore for the sampling state is not the right one. I'd say this is out of the scope of this PR. Fixing the sampling state save and restore will at least be easier after this refactor if the session file format needs to be changed again.
Yup, I'm aware and will address this within #8643
* llama : remove LLAMA_MAX_RNG_STATE
  It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore.
Nice! :)
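For illustration, dropping the limit is possible because the RNG state can be written as a length-prefixed blob whose size is only known at serialization time. A minimal sketch of the idea, with hypothetical helper names rather than the actual llama.cpp functions:

```cpp
#include <cstdint>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical writer: stream the RNG state as text and length-prefix it,
// so no fixed maximum size needs to be assumed up front.
static void write_rng_state(std::vector<uint8_t> & out, const std::mt19937 & rng) {
    std::ostringstream ss;
    ss << rng; // the textual state size varies across implementations
    const std::string s = ss.str();
    const uint64_t len = s.size();
    out.insert(out.end(), (const uint8_t *) &len, (const uint8_t *) &len + sizeof(len));
    out.insert(out.end(), s.begin(), s.end());
}
```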
* llama : refactor session file management
* llama : saving and restoring state checks for overflow
  The size of the buffers should now be given to the functions working with them, otherwise a truncated file could cause out-of-bounds reads.
* llama : stream from session file instead of copying into a big buffer
  Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
  This is a breaking change, but makes that function *much* easier to keep up to date, and it also makes it reflect the behavior of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
  Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
  Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading. Storing all the hparams is not necessary.
* llama : fix uint64_t format type
* llama : various integer type cast and format string fixes
  Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors.
* llama : remove _context suffix for llama_data_context
* llama : fix session file loading
  llama_state_get_size cannot be used to get the max size anymore.
* llama : more graceful error handling of invalid session files
* llama : remove LLAMA_MAX_RNG_STATE
  It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore.
* llama : cast seq_id in comparison with unsigned n_seq_max
Follow-up from #8526 (comment)
When the KV cache structure changes, keeping session files working means updating at least 6 places (2 types of session files, each with read, write, and size calculation), which is hard to do consistently.
To make this easier to maintain, I've unified the format of the `seq_id`-specific session files and the whole-KV-cache session files. They still use separate MAGIC and VERSION values, but at least now they share most of the save-and-restore code.

After this, there are only 2 places to update when making changes: the writing and the reading. There is no need to get the size calculation right anymore (see the changes to `llama_state_get_size` in the summary below), and no need to maintain 2 separate ways of both reading and writing the KV cache content.
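The reason the exact size comes for free is worth spelling out: when the save routine is written once against a small writer interface, a "dummy" backend that only counts bytes can reuse the very same code path, so the size calculation can never drift apart from the real writes. A simplified sketch of the pattern, not the PR's exact interfaces:

```cpp
#include <cstddef>

// One save routine runs against any backend: a file writer, a buffer
// writer, or the dummy below that only measures the state size.
struct data_write {
    virtual ~data_write() = default;
    virtual void write(const void * src, size_t size) = 0;
};

// Counts bytes without writing anything; calling the save routine with
// this backend yields the exact session size.
struct data_write_dummy : data_write {
    size_t size_written = 0;
    void write(const void * /*src*/, size_t size) override { size_written += size; }
};
```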
Summary

- `llama_state_get_size` returns the actual size instead of the max, like `llama_state_seq_get_size`.
- no more `size_t` in session files
- `llama_hparams` is no longer technically required to be trivially copyable, which should allow removing the `LLAMA_MAX_LAYERS` limit in a future refactor.
- `seq_id`-specific session files should now also work with recurrent models like Mamba

TODO
- test with `llama-save-load-state`
  - `seq_id`-specific state loading fails because slots are not contiguous (did not work before anyway)
- test with `llama-cli` and `llama-save-load-state`
  - the restored `rng` is not the right one, because of `llama_sampling` vs `llama_sampling_context`. Same problem as on `master`.
- test with `llama-server`