Is there a need to integrate Byte-pair encodings (BPE)? #13

Closed
GeorgeS2019 opened this issue Sep 11, 2020 · 1 comment

Comments

@GeorgeS2019 commented Sep 11, 2020

Byte-pair encodings (BPE) are now very commonly used in NLP.

Is there a plan to integrate BPE into Seq2SeqSharp in the future?

If so, would that be a C# wrapper (analogous to e.g. the Swift wrapper) around e.g. FastBPE?

Would you consider a pure C# version of e.g. FastBPE [link to pure Python FastBPE]?

This issue is more of a feature proposal. Looking forward to getting some feedback.

@zhongkaifu (Owner) commented
Thanks @GeorgeS2019 for your suggestion.

I know sub-word tokenization is really useful for text generation tasks; for example, MT tasks typically gain 2~3 BLEU points on average. Some NN frameworks have also integrated sub-word tokenization, e.g. Marian uses built-in SentencePiece for data processing.

However, since it is part of data processing and includes several key steps, such as model training, encoding, and decoding, I would prefer to create a separate project for it rather than integrating it into the Seq2SeqSharp project.
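To make those steps concrete, here is a minimal, word-level C# sketch of BPE training (learning merge rules from pair frequencies), encoding (replaying the merges), and decoding. It is only an illustration of the algorithm under simplified assumptions, not an existing SubwordSharp or Seq2SeqSharp API; all names in it are made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BpeSketch
{
    // Learn merge rules: repeatedly merge the most frequent adjacent symbol pair.
    public static List<(string, string)> Train(IEnumerable<string> words, int numMerges)
    {
        // Start with each word as a sequence of single-character symbols.
        var corpus = words.Select(w => w.Select(c => c.ToString()).ToList()).ToList();
        var merges = new List<(string, string)>();

        for (int i = 0; i < numMerges; i++)
        {
            // Count adjacent symbol pairs across the corpus.
            var pairCounts = new Dictionary<(string, string), int>();
            foreach (var symbols in corpus)
                for (int j = 0; j < symbols.Count - 1; j++)
                {
                    var pair = (symbols[j], symbols[j + 1]);
                    pairCounts[pair] = pairCounts.TryGetValue(pair, out var c) ? c + 1 : 1;
                }
            if (pairCounts.Count == 0) break;

            // Merge the most frequent pair everywhere it occurs.
            var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
            merges.Add(best);
            foreach (var symbols in corpus)
                for (int j = 0; j < symbols.Count - 1; j++)
                    if (symbols[j] == best.Item1 && symbols[j + 1] == best.Item2)
                    {
                        symbols[j] = best.Item1 + best.Item2;
                        symbols.RemoveAt(j + 1);
                    }
        }
        return merges;
    }

    // Encode a word by replaying the learned merges in order.
    public static List<string> Encode(string word, List<(string, string)> merges)
    {
        var symbols = word.Select(c => c.ToString()).ToList();
        foreach (var (a, b) in merges)
            for (int j = 0; j < symbols.Count - 1; j++)
                if (symbols[j] == a && symbols[j + 1] == b)
                {
                    symbols[j] = a + b;
                    symbols.RemoveAt(j + 1);
                    j--; // re-check the merged symbol against its new neighbor
                }
        return symbols;
    }

    // Decoding is just concatenation in this word-level sketch.
    public static string Decode(IEnumerable<string> symbols) => string.Concat(symbols);
}
```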

So, in my opinion, my plan would be:

1) Create a project for BPE training/encoding/decoding, called SubwordSharp. :)
2) Create a training pipeline that ties together SubwordSharp BPE model training, BPE encoding, Seq2SeqSharp training, BPE decoding, and evaluation.
3) Create a runtime pipeline that ties together BPE encoding, Seq2SeqSharp inference, and BPE decoding.
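As a rough outline of how the runtime pipeline in 3) could be wired, here is a hedged C# sketch; ISubwordCodec, ISeq2SeqTranslator, and RuntimePipeline are hypothetical placeholder types for this example, not part of any existing Seq2SeqSharp or SubwordSharp code.

```csharp
// Hypothetical contracts for the two stages of the runtime pipeline.
public interface ISubwordCodec
{
    string[] Encode(string text);    // raw text -> subword tokens
    string Decode(string[] tokens);  // subword tokens -> raw text
}

public interface ISeq2SeqTranslator
{
    string[] Translate(string[] srcTokens); // subword-level inference
}

public sealed class RuntimePipeline
{
    private readonly ISubwordCodec _bpe;
    private readonly ISeq2SeqTranslator _model;

    public RuntimePipeline(ISubwordCodec bpe, ISeq2SeqTranslator model)
    {
        _bpe = bpe;
        _model = model;
    }

    // BPE encoding -> Seq2SeqSharp inference -> BPE decoding, as in step 3).
    public string Translate(string source) =>
        _bpe.Decode(_model.Translate(_bpe.Encode(source)));
}
```

The training pipeline in 2) would follow the same shape, with the subword model trained first and its encoder applied to the parallel corpus before Seq2SeqSharp training and evaluation.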
