Update train.py #897
Conversation
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
feat(ci): add `pip` caching to CI
add flash_attn_kvpacked
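For context, a rough sketch of what a packed-KV FlashAttention call can look like. This is hedged: it assumes the `flash_attn_kvpacked_func` interface from the flash-attn package, and tensor shapes/defaults may differ between versions.

```python
import torch
from flash_attn import flash_attn_kvpacked_func  # flash-attn 2.x import path (assumption)

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
# Keys and values packed along one extra dimension: (batch, seqlen, 2, nheads, headdim)
kv = torch.randn(batch, seqlen, 2, nheads, headdim, device="cuda", dtype=torch.float16)

# Causal self-attention with K/V passed as a single packed tensor
out = flash_attn_kvpacked_func(q, kv, dropout_p=0.0, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```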
Changed is_pipe_parallel setting to fix pipeline-parallel inference (EleutherAI#866). Update NeoXArgs docs automatically. Co-authored-by: github-actions <github-actions@github.com>, Quentin Anthony <qganthony@yahoo.com>
feat(logging): improve typing
Reqs correction
update train.py: (1) apply the Black formatter, (2) remove an unnecessary import, (3) add more arguments
removed num_proc; temporarily disabled emoji handling; added continuing subword prefix option (does not work well with ByteLevel)
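For reference, a minimal sketch of setting a continuing-subword prefix when training a BPE tokenizer with the Hugging Face `tokenizers` library; the file name, vocab size, and prefix string are illustrative, not taken from this PR.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

# continuing_subword_prefix marks word-internal pieces (e.g. "##");
# as noted in the commit above, it interacts poorly with a ByteLevel pre-tokenizer.
trainer = BpeTrainer(vocab_size=32000, continuing_subword_prefix="##")
tokenizer.train(files=["data.txt"], trainer=trainer)  # "data.txt" is a placeholder
```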
improve reader error handling
add whitespace-related handling: add whitespace argument, expose/reconstruct pre_tokenizer_list, add more whitespace to the tokenizer invertibility check
remove unnecessary print
set dropout default to None; import path-related code; change normalizer; change buffer_tokens; change whitespace reservation handling
Clear whitespace_reservation TODO; add single_whitespace argument (might be necessary for invertibility)
add gitignore file to ignore artifacts
add directory parsing error checks; add more metrics (tokenizer reconstruction, unicode fallback portion)
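A simple way to measure the reconstruction metric mentioned here is an encode/decode round trip. A hedged sketch, assuming a `tokenizers.Tokenizer`-style object whose `encode` returns an object with `.ids`; the sample strings are illustrative.

```python
def reconstruction_rate(tokenizer, texts):
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    exact = 0
    for text in texts:
        decoded = tokenizer.decode(tokenizer.encode(text).ids)
        exact += int(decoded == text)
    return exact / len(texts)

# Illustrative inputs stressing whitespace and casing, per the commits above
samples = ["hello   world", "    indented line", "camelCaseIdentifier"]
print(reconstruction_rate(tokenizer, samples))  # `tokenizer` built elsewhere
```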
path handling changes; Black formatting
change from GPT2TokenizerFast to PreTrainedTokenizerFast class
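A hedged sketch of wrapping a trained tokenizer file with `PreTrainedTokenizerFast` instead of `GPT2TokenizerFast`; the file path and special-token strings are placeholders, not values from this PR.

```python
from transformers import PreTrainedTokenizerFast

# PreTrainedTokenizerFast can load an arbitrary tokenizers-library JSON file,
# without assuming GPT-2-specific vocab/merges files.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",   # placeholder path
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    unk_token="<|endoftext|>",
)
print(tokenizer("hello   world")["input_ids"])
```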
enhanced test string
add logic to handle jsonl/txt input; add logic to handle a folder containing jsonl/txt files or an Arrow dataset
expose byte_fallback option (incompatible with the current transformers wrapper); change dataset loading with new util.py; add dataset shuffling option
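A minimal sketch of loading jsonl/txt files, a folder of such files, or an on-disk Arrow dataset with `datasets`, plus optional shuffling. Paths, column handling, and the seed are illustrative; the actual util.py helper is not shown in this PR excerpt.

```python
import os
from datasets import load_dataset, load_from_disk

def load_text_dataset(path, shuffle=False, seed=42):
    """Load a jsonl/txt file, a folder of such files, or a saved Arrow dataset."""
    if os.path.isdir(path):
        files = [os.path.join(path, f) for f in os.listdir(path)]
        jsonl = [f for f in files if f.endswith(".jsonl")]
        txt = [f for f in files if f.endswith(".txt")]
        if jsonl:
            ds = load_dataset("json", data_files=jsonl, split="train")
        elif txt:
            ds = load_dataset("text", data_files=txt, split="train")
        else:
            ds = load_from_disk(path)  # assume an Arrow dataset saved with save_to_disk
    elif path.endswith(".jsonl"):
        ds = load_dataset("json", data_files=path, split="train")
    else:
        ds = load_dataset("text", data_files=path, split="train")
    return ds.shuffle(seed=seed) if shuffle else ds
```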
fix error in loading sequence
fix whitespace preservation logic
simplify data loading logic. remove unnecessary special tokens
remove emoji related code
add whitespace processing regex r"\s{16,}"
add whitespace pre-tokenizer (only processes very long whitespace runs)
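A hedged sketch of a pre-tokenizer that isolates only very long whitespace runs, using the `r"\s{16,}"` pattern from the commit above; the `behavior` value and the surrounding ByteLevel sequence are assumptions about how it might be wired up.

```python
from tokenizers import Regex, pre_tokenizers

# Split off runs of 16+ whitespace characters as their own pieces, leaving
# ordinary short whitespace for the downstream ByteLevel pre-tokenizer.
long_ws = pre_tokenizers.Split(Regex(r"\s{16,}"), behavior="isolated")
pre_tok = pre_tokenizers.Sequence([long_ws, pre_tokenizers.ByteLevel(add_prefix_space=False)])

print(pre_tok.pre_tokenize_str("a" + " " * 20 + "b"))
```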
add camel case regex
separate camel_case regex
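A sketch of what a separate camel-case split can look like in plain Python regex terms; the exact pattern used in the PR is not shown here, so this is illustrative only.

```python
import re

# Insert a split point at lower-to-upper case transitions,
# e.g. "camelCaseIdentifier" -> ["camel", "Case", "Identifier"]
CAMEL_CASE = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_camel_case(text):
    return CAMEL_CASE.split(text)

print(split_camel_case("camelCaseIdentifier"))  # ['camel', 'Case', 'Identifier']
```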
@satpalsr @dashstander @StellaAthena Is it intentional to add commits on this branch or PR?
I have not touched this PR myself.
Oh, I think this happened while merging other updates. Thanks for clarifying :)
update train.py