Update train.py #897
Conversation
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
feat(ci): add `pip` caching to CI
add flash_attn_kvpacked
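For context, a rough sketch of what a packed-KV FlashAttention call can look like. This is hedged: it assumes the `flash_attn_kvpacked_func` interface from the flash-attn package, and tensor shapes/defaults may differ between versions.

```python
import torch
from flash_attn import flash_attn_kvpacked_func  # flash-attn 2.x import path (assumption)

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
# Keys and values packed along one extra dimension: (batch, seqlen, 2, nheads, headdim)
kv = torch.randn(batch, seqlen, 2, nheads, headdim, device="cuda", dtype=torch.float16)

# Causal self-attention with K/V passed as a single packed tensor
out = flash_attn_kvpacked_func(q, kv, dropout_p=0.0, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```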
Changed is_pipe_parallel setting to fix pipeline-parallel inference (EleutherAI#866). Update NeoXArgs docs automatically. Co-authored-by: github-actions <github-actions@github.com>, Quentin Anthony <qganthony@yahoo.com>
feat(logging): improve typing
Reqs correction
update train.py: (1) apply the Black formatter, (2) remove an unnecessary import, (3) add more arguments
removed num_proc; temporarily disabled emoji handling; added continuing subword prefix option (does not work well with ByteLevel)
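For reference, a minimal sketch of setting a continuing-subword prefix when training a BPE tokenizer with the Hugging Face `tokenizers` library; the file name, vocab size, and prefix string are illustrative, not taken from this PR.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

# continuing_subword_prefix marks word-internal pieces (e.g. "##");
# as noted in the commit above, it interacts poorly with a ByteLevel pre-tokenizer.
trainer = BpeTrainer(vocab_size=32000, continuing_subword_prefix="##")
tokenizer.train(files=["data.txt"], trainer=trainer)  # "data.txt" is a placeholder
```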
improve reader error handling
add whitespace-related handling: add whitespace argument, expose/reconstruct pre_tokenizer_list, add more whitespace to the tokenizer invertibility check
remove unnecessary print
set dropout default to None; import path-related code; change normalizer; change buffer_tokens; change whitespace reservation handling
Clear whitespace_reservation TODO; add single_whitespace argument (might be necessary for invertibility)
add gitignore file to ignore artifacts
add directory parsing error checks; add more metrics (tokenizer reconstruction, unicode fallback portion)
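A simple way to measure the reconstruction metric mentioned here is an encode/decode round trip. A hedged sketch, assuming a `tokenizers.Tokenizer`-style object whose `encode` returns an object with `.ids`; the sample strings are illustrative.

```python
def reconstruction_rate(tokenizer, texts):
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    exact = 0
    for text in texts:
        decoded = tokenizer.decode(tokenizer.encode(text).ids)
        exact += int(decoded == text)
    return exact / len(texts)

# Illustrative inputs stressing whitespace and casing, per the commits above
samples = ["hello   world", "    indented line", "camelCaseIdentifier"]
print(reconstruction_rate(tokenizer, samples))  # `tokenizer` built elsewhere
```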
path handling changes; Black formatting
change from GPT2TokenizerFast to PreTrainedTokenizerFast class
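A hedged sketch of wrapping a trained tokenizer file with `PreTrainedTokenizerFast` instead of `GPT2TokenizerFast`; the file path and special-token strings are placeholders, not values from this PR.

```python
from transformers import PreTrainedTokenizerFast

# PreTrainedTokenizerFast can load an arbitrary tokenizers-library JSON file,
# without assuming GPT-2-specific vocab/merges files.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",   # placeholder path
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    unk_token="<|endoftext|>",
)
print(tokenizer("hello   world")["input_ids"])
```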
enhanced test string
add logic to handle jsonl/txt input; add logic to handle a folder containing jsonl/txt files or an Arrow dataset
expose byte_fallback option (incompatible with the current transformers wrapper); change dataset loading with new util.py; add dataset shuffling option
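A minimal sketch of loading jsonl/txt files, a folder of such files, or an on-disk Arrow dataset with `datasets`, plus optional shuffling. Paths, column handling, and the seed are illustrative; the actual util.py helper is not shown in this PR excerpt.

```python
import os
from datasets import load_dataset, load_from_disk

def load_text_dataset(path, shuffle=False, seed=42):
    """Load a jsonl/txt file, a folder of such files, or a saved Arrow dataset."""
    if os.path.isdir(path):
        files = [os.path.join(path, f) for f in os.listdir(path)]
        jsonl = [f for f in files if f.endswith(".jsonl")]
        txt = [f for f in files if f.endswith(".txt")]
        if jsonl:
            ds = load_dataset("json", data_files=jsonl, split="train")
        elif txt:
            ds = load_dataset("text", data_files=txt, split="train")
        else:
            ds = load_from_disk(path)  # assume an Arrow dataset saved with save_to_disk
    elif path.endswith(".jsonl"):
        ds = load_dataset("json", data_files=path, split="train")
    else:
        ds = load_dataset("text", data_files=path, split="train")
    return ds.shuffle(seed=seed) if shuffle else ds
```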
fix error in loading sequence
fix whitespace preservation logic
simplify data loading logic. remove unnecessary special tokens
remove emoji related code
add whitespace processing regex r"\s{16,}"
add whitespace pre-tokenizer (only processes very long whitespace runs)
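A hedged sketch of a pre-tokenizer that isolates only very long whitespace runs, using the `r"\s{16,}"` pattern from the commit above; the `behavior` value and the surrounding ByteLevel sequence are assumptions about how it might be wired up.

```python
from tokenizers import Regex, pre_tokenizers

# Split off runs of 16+ whitespace characters as their own pieces, leaving
# ordinary short whitespace for the downstream ByteLevel pre-tokenizer.
long_ws = pre_tokenizers.Split(Regex(r"\s{16,}"), behavior="isolated")
pre_tok = pre_tokenizers.Sequence([long_ws, pre_tokenizers.ByteLevel(add_prefix_space=False)])

print(pre_tok.pre_tokenize_str("a" + " " * 20 + "b"))
```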
add camel case regex
separate camel_case regex
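A sketch of what a separate camel-case split can look like in plain Python regex terms; the exact pattern used in the PR is not shown here, so this is illustrative only.

```python
import re

# Insert a split point at lower-to-upper case transitions,
# e.g. "camelCaseIdentifier" -> ["camel", "Case", "Identifier"]
CAMEL_CASE = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_camel_case(text):
    return CAMEL_CASE.split(text)

print(split_camel_case("camelCaseIdentifier"))  # ['camel', 'Case', 'Identifier']
```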
@satpalsr @dashstander @StellaAthena Is it intentional to add commits on this branch or PR?
I have not touched this PR myself.
Oh, I think this happened while merging other updates. Thanks for clarifying :)
update train.py