Missing TorchText.NN feature #257

Closed
GeorgeS2019 opened this issue May 16, 2021 · 8 comments

GeorgeS2019 commented May 16, 2021

The recent example SequenceToSequence.cs is an excellent implementation of [Tutorial 1]: SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.

Together with the SequenceToSequence Modeling example, a number of TorchText classes were implemented:

Organized according to TorchText namespaces

  • TorchText.Data.Utils => TorchText.Data.Utils.cs
  • TorchText.Data => AG_NEWSReader.cs
  • TorchText.Vocab => Vocab.cs
  • TorchText.Datasets => Datasets.cs
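
For context, here is a minimal sketch of how these ported pieces fit together, patterned after the TorchSharp examples; the signatures (get_tokenizer, Counter, Vocab, the vocab indexer) are assumptions based on the example code, not a definitive API, and "train.txt" is a hypothetical path.

// Hedged sketch: wiring the ported TorchText pieces together,
// patterned after the TorchSharp examples. Signatures are assumptions.
using System;
using System.Collections.Generic;
using System.IO;

class VocabDemo
{
    static void Main()
    {
        // Tokenizer from TorchText.Data.Utils.cs ("basic_english" in the examples).
        Func<string, IEnumerable<string>> tokenizer =
            TorchText.Data.Utils.get_tokenizer("basic_english");

        // Build a vocabulary (Vocab.cs) from token counts over a training file.
        var counter = new TorchText.Vocab.Counter<string>();
        foreach (var line in File.ReadLines("train.txt"))  // hypothetical corpus path
            counter.update(tokenizer(line));
        var vocab = new TorchText.Vocab.Vocab(counter);

        // Map a sentence to token ids for the model.
        var ids = new List<int>();
        foreach (var tok in tokenizer("the quick brown fox"))
            ids.Add(vocab[tok]);
    }
}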

Implementing the following feature and example (see the screenshots below) would make TorchSharp more complete in terms of TorchText:

[screenshots: the requested TorchText.NN feature and its tutorial example]

NiklasGustafsson (Contributor) commented

@GeorgeS2019 -- I have been thinking about the side-packages for a while. TorchText, TorchVision, etc. Because I didn't have a good idea of the full scenarios, I just added the functionality to the Examples, as utils. I think it's a good idea to start working on these packages for real, though.

I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.

That said, torchvision is another package that we need to get included, but I'll probably start working on the text functionality first.


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson
The .NET Transformer framework Seq2SeqSharp has, like TorchText, integrated Transformers with multi-head attention.


The TorchText.NN module, which provides Transformers with multi-head attention, is in my view essential for TorchSharp; a sketch of what its usage could look like follows.
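
To make the request concrete, here is a hedged sketch of what multi-head attention could look like in TorchSharp if the binding mirrored PyTorch's torch.nn.MultiheadAttention; the constructor, tensor layout, and return tuple below are assumptions carried over from the PyTorch API, not an existing TorchSharp surface.

// Hedged sketch: multi-head attention in TorchSharp, assuming a binding
// that mirrors PyTorch's torch.nn.MultiheadAttention. Illustrative only.
using TorchSharp;
using static TorchSharp.torch;

class MultiheadAttentionDemo
{
    static void Main()
    {
        const int embedDim = 512;  // model dimension
        const int numHeads = 8;    // attention heads

        // Assumed constructor mirroring nn.MultiheadAttention(embed_dim, num_heads).
        var mha = nn.MultiheadAttention(embedDim, numHeads);

        // (seq_len, batch, embed_dim) layout, as in the PyTorch default.
        var query = rand(10, 32, embedDim);
        var key   = rand(10, 32, embedDim);
        var value = rand(10, 32, embedDim);

        // Returns (attn_output, attn_weights), as in PyTorch.
        var (attnOutput, attnWeights) = mha.forward(query, key, value);
    }
}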

@zhongkaifu FYI


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson

There are six NLP tutorials: three are "NLP From Scratch", independent of TorchText; the other three use TorchText (which is designed to make NLP deep-learning pipelines more industry-standard).

[screenshot: the six PyTorch NLP tutorials]


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson

> I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.

Perhaps it is more important to focus on the remaining TorchText tutorials listed above.
The multi-head implementation discussed in the Tutorial 6 I quoted was, if I am not mistaken, implemented independently of TorchText.NN's multi-head attention.

There is a lengthy discussion of why the multi-head feature was moved to TorchText.NN, for various reasons, many of which are beyond my comprehension :-)

NiklasGustafsson (Contributor) commented

Thanks for those thoughts. The first of the two NLP tutorials that you list is, I believe, implemented here. The second one, the language translation example, should be a great one to tackle next.

NiklasGustafsson (Contributor) commented

The translation tutorial depends on the 'spacy' package for language tokenization, and I suspect there is nothing similar for .NET. This speaks to a broader need to specify, design, and prioritize data-processing libraries for TorchSharp.


GeorgeS2019 commented May 18, 2021

@NiklasGustafsson

FYI: Related discussions => Proposal - .NET tokenization library & Proposal for common .NET tokenization library

The data-processing step anticipates different types of tokenizers, and spaCy is only one of them.

From the translation tutorial, using torchtext.data.utils (ported as TorchText.Data.Utils.cs):

de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

The tokenizers considered in torchtext.data.utils are spacy, moses, toktok, revtok, and subword (see the dispatch code below).

I have not seen all of them ported or made available to .NET.

=> I am interested to learn from others what possible substitutes exist for the above list of tokenizers, without resorting to Python.NET.

@zhongkaifu => any feedback?

I know sub-word tokenization is really useful for text-generation tasks; an MT task can gain 2~3 BLEU points on average. Some NN frameworks have integrated sub-word tokenization; Marian, for example, uses a built-in SentencePiece in its data-processing step.

FYI: SentencePiece is being implemented in TorchText.Data => functional.py. For reference, torchtext's tokenizer dispatch (get_tokenizer in torchtext/data/utils.py) looks like this:

from functools import partial

def _spacy_tokenize(x, spacy):
    # torchtext's spaCy adapter: run the tokenizer and return surface forms.
    return [tok.text for tok in spacy.tokenizer(x)]

def get_tokenizer(tokenizer, language='en'):
    if tokenizer == "spacy":
        try:
            import spacy
            try:
                spacy = spacy.load(language)
            except IOError:
                # Model shortcuts no longer work in spaCy 3.0+, try using fullnames
                # List is from https://github.com/explosion/spaCy/blob/b903de3fcb56df2f7247e5b6cfa6b66f4ff02b62/spacy/errors.py#L789
                OLD_MODEL_SHORTCUTS = spacy.errors.OLD_MODEL_SHORTCUTS if hasattr(spacy.errors, 'OLD_MODEL_SHORTCUTS') else {}
                if language not in OLD_MODEL_SHORTCUTS:
                    raise
                import warnings
                warnings.warn(f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead')
                spacy = spacy.load(OLD_MODEL_SHORTCUTS[language])
            return partial(_spacy_tokenize, spacy=spacy)
        except ImportError:
            print("Please install SpaCy. "
                  "See the docs at https://spacy.io for more information.")
            raise
        except AttributeError:
            print("Please install SpaCy and the SpaCy {} tokenizer. "
                  "See the docs at https://spacy.io for more "
                  "information.".format(language))
            raise
    elif tokenizer == "moses":
        try:
            from sacremoses import MosesTokenizer
            moses_tokenizer = MosesTokenizer()
            return moses_tokenizer.tokenize
        except ImportError:
            print("Please install SacreMoses. "
                  "See the docs at https://github.com/alvations/sacremoses "
                  "for more information.")
            raise
    elif tokenizer == "toktok":
        try:
            from nltk.tokenize.toktok import ToktokTokenizer
            toktok = ToktokTokenizer()
            return toktok.tokenize
        except ImportError:
            print("Please install NLTK. "
                  "See the docs at https://nltk.org for more information.")
            raise
    elif tokenizer == 'revtok':
        try:
            import revtok
            return revtok.tokenize
        except ImportError:
            print("Please install revtok.")
            raise
    elif tokenizer == 'subword':
        try:
            import revtok
            return partial(revtok.tokenize, decap=True)
        except ImportError:
            print("Please install revtok.")
            raise
    raise ValueError("Requested tokenizer {}, valid choices are a "
                     "callable that takes a single string as input, "
                     "\"revtok\" for the revtok reversible tokenizer, "
                     "\"subword\" for the revtok caps-aware tokenizer, "
                     "\"spacy\" for the SpaCy English tokenizer, or "
                     "\"moses\" for the NLTK port of the Moses tokenization "
                     "script.".format(tokenizer))

NiklasGustafsson (Contributor) commented

@GeorgeS2019 -- in light of more recent discussions about tokenization, I'm closing this issue as outdated. If you disagree, please reopen with an explanation of what you think needs to be tracked.
