I know sub-word tokenization is really useful for text generation tasks; for example, MT tasks gain 2~3 BLEU points on average. Some NN frameworks already integrate sub-word tokenization, for example Marian uses built-in SentencePiece for data processing.
However, since it is part of data processing and includes several key steps (model training, encoding, and decoding), I would prefer to create a separate project for it rather than integrating it into the Seq2SeqSharp project.
So, in my opinion, my plan would be:
1. Create a project for BPE training/encoding/decoding called SubwordSharp. :)
2. Create a training pipeline that chains SubwordSharp BPE model training, BPE encoding, Seq2SeqSharp training, BPE decoding, and evaluation.
3. Create a runtime pipeline that chains BPE encoding, Seq2SeqSharp inference, and BPE decoding (see the sketch after this list).
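A minimal sketch of what the SubwordSharp surface and the runtime pipeline could look like; every type and member name below is hypothetical, not an existing API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical SubwordSharp surface; every name here is illustrative only.
public interface ISubwordModel
{
    void Train(IEnumerable<string> corpus, int numMerges); // learn a BPE merge table from raw text
    string Encode(string sentence);  // e.g. "lowest" -> "low@@ est"
    string Decode(string sentence);  // e.g. "low@@ est" -> "lowest"
    void Save(string modelPath);
    void Load(string modelPath);
}

// Runtime pipeline sketch: wrap a translator callback (here just a delegate,
// standing in for Seq2SeqSharp inference) so callers see raw text in, raw text out.
public sealed class BpePipeline
{
    private readonly ISubwordModel _bpe;
    private readonly Func<string, string> _translate;

    public BpePipeline(ISubwordModel bpe, Func<string, string> translate)
    {
        _bpe = bpe;
        _translate = translate;
    }

    public string Translate(string rawInput) =>
        _bpe.Decode(_translate(_bpe.Encode(rawInput)));
}
```

The training pipeline would be the same chain, with `Train` at the front and the evaluation step at the end.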
Byte-pair encoding (BPE) is now very commonly used in NLP.
Is there a plan to integrate BPE into Seq2SeqSharp in the future?
If so, will that be a C# wrapper (e.g. a SWIG wrapper) around e.g. FastBPE?
Would you consider a pure C# port of e.g. FastBPE [link to pure Python FastBPE]?
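For context on what a pure C# port would involve, here is a rough sketch of the core BPE training loop (the count-and-merge algorithm from Sennrich et al.); this is a toy illustration, not FastBPE's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
static class BpeSketch
{
    // vocab maps a word, stored as a space-separated symbol sequence, to its corpus frequency.
    static Dictionary<(string L, string R), int> CountPairs(Dictionary<string, int> vocab)
    {
        var pairs = new Dictionary<(string L, string R), int>();
        foreach (var (word, freq) in vocab)
        {
            var symbols = word.Split(' ');
            for (int i = 0; i < symbols.Length - 1; i++)
            {
                var pair = (symbols[i], symbols[i + 1]);
                pairs[pair] = pairs.GetValueOrDefault(pair) + freq;
            }
        }
        return pairs;
    }

    // Apply one merge rule to every word in the vocabulary.
    static Dictionary<string, int> MergePair((string L, string R) pair, Dictionary<string, int> vocab)
    {
        var result = new Dictionary<string, int>();
        foreach (var (word, freq) in vocab)
        {
            var symbols = word.Split(' ');
            var merged = new List<string>();
            for (int i = 0; i < symbols.Length; i++)
            {
                if (i < symbols.Length - 1 && symbols[i] == pair.L && symbols[i + 1] == pair.R)
                {
                    merged.Add(pair.L + pair.R);
                    i++; // the right-hand symbol was consumed by the merge
                }
                else
                {
                    merged.Add(symbols[i]);
                }
            }
            var key = string.Join(" ", merged);
            result[key] = result.GetValueOrDefault(key) + freq;
        }
        return result;
    }

    static void Main()
    {
        // Words pre-split into characters, with corpus frequencies.
        var vocab = new Dictionary<string, int>
        {
            ["l o w"] = 5, ["l o w e r"] = 2, ["n e w e s t"] = 6, ["w i d e s t"] = 3
        };

        for (int step = 1; step <= 10; step++) // learn 10 merge operations
        {
            var pairs = CountPairs(vocab);
            if (pairs.Count == 0) break;
            var best = pairs.OrderByDescending(p => p.Value).First().Key;
            vocab = MergePair(best, vocab);
            Console.WriteLine($"merge {step}: {best.L} + {best.R}");
        }
    }
}
```

A production version would mostly differ in indexing (incremental pair counts instead of full recounts) and in the encoder that applies the learned merges at runtime, so the algorithm itself seems small enough for a pure C# implementation.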
This issue is more of a feature proposal. Looking forward to getting some feedback.