Missing TorchText.NN feature #257

Closed
GeorgeS2019 opened this issue May 16, 2021 · 8 comments

GeorgeS2019 commented May 16, 2021

The recent example SequenceToSequence.cs is an excellent implementation of [Tutorial 1]: SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.

Together with the SequenceToSequence Modeling example, a number of TorchText classes were implemented:

Organized according to TorchText namespaces

  • TorchText.Data.Utils => TorchText.Data.Utils.cs
  • TorchText.Data => AG_NEWSReader.cs
  • TorchText.Vocab => Vocab.cs
  • TorchText.Datasets => Datasets.cs
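
For context, here is a minimal sketch of how these ported pieces fit together, patterned after the TorchSharp examples; the signatures (get_tokenizer, Counter, Vocab, the vocab indexer) are assumptions based on the example code, not a definitive API, and "train.txt" is a hypothetical path.

// Hedged sketch: wiring the ported TorchText pieces together,
// patterned after the TorchSharp examples. Signatures are assumptions.
using System;
using System.Collections.Generic;
using System.IO;

class VocabDemo
{
    static void Main()
    {
        // Tokenizer from TorchText.Data.Utils.cs ("basic_english" in the examples).
        Func<string, IEnumerable<string>> tokenizer =
            TorchText.Data.Utils.get_tokenizer("basic_english");

        // Build a vocabulary (Vocab.cs) from token counts over a training file.
        var counter = new TorchText.Vocab.Counter<string>();
        foreach (var line in File.ReadLines("train.txt"))  // hypothetical corpus path
            counter.update(tokenizer(line));
        var vocab = new TorchText.Vocab.Vocab(counter);

        // Map a sentence to token ids for the model.
        var ids = new List<int>();
        foreach (var tok in tokenizer("the quick brown fox"))
            ids.Add(vocab[tok]);
    }
}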

Implementing the following feature and example (see the screenshots below) would make TorchSharp more complete in terms of TorchText:

[screenshots: the requested TorchText.NN feature and its tutorial example]

NiklasGustafsson (Contributor) commented

@GeorgeS2019 -- I have been thinking about the side-packages for a while. TorchText, TorchVision, etc. Because I didn't have a good idea of the full scenarios, I just added the functionality to the Examples, as utils. I think it's a good idea to start working on these packages for real, though.

I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.

That said, torchvision is another package that we need to get included, but I'll probably start working on the text functionality first.


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson
The .NET Transformer framework Seq2SeqSharp has, like TorchText, integrated Transformers with multi-head attention.


The TorchText.NN module, which provides Transformers with multi-head attention, is in my view essential for TorchSharp; a sketch of what its usage could look like follows.
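
To make the request concrete, here is a hedged sketch of what multi-head attention could look like in TorchSharp if the binding mirrored PyTorch's torch.nn.MultiheadAttention; the constructor, tensor layout, and return tuple below are assumptions carried over from the PyTorch API, not an existing TorchSharp surface.

// Hedged sketch: multi-head attention in TorchSharp, assuming a binding
// that mirrors PyTorch's torch.nn.MultiheadAttention. Illustrative only.
using TorchSharp;
using static TorchSharp.torch;

class MultiheadAttentionDemo
{
    static void Main()
    {
        const int embedDim = 512;  // model dimension
        const int numHeads = 8;    // attention heads

        // Assumed constructor mirroring nn.MultiheadAttention(embed_dim, num_heads).
        var mha = nn.MultiheadAttention(embedDim, numHeads);

        // (seq_len, batch, embed_dim) layout, as in the PyTorch default.
        var query = rand(10, 32, embedDim);
        var key   = rand(10, 32, embedDim);
        var value = rand(10, 32, embedDim);

        // Returns (attn_output, attn_weights), as in PyTorch.
        var (attnOutput, attnWeights) = mha.forward(query, key, value);
    }
}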

@zhongkaifu FYI


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson

There are six NLP tutorials: three are "NLP From Scratch", independent of TorchText; the other three use TorchText (which is designed to make NLP deep-learning pipelines more industry-standard).

[screenshot: the six PyTorch NLP tutorials]


GeorgeS2019 commented May 17, 2021

@NiklasGustafsson

> I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.

Perhaps it is more important to focus on the remaining TorchText tutorials listed above.
The multi-head implementation discussed in the Tutorial 6 I quoted was, if I am not mistaken, implemented independently of TorchText.NN's multi-head attention.

There is a lengthy discussion of why the multi-head feature was moved to TorchText.NN, for various reasons, many of which are beyond my comprehension :-)

NiklasGustafsson (Contributor) commented

Thanks for those thoughts. The first of the two NLP tutorials that you list is, I believe, implemented here. The second one, the language translation example, should be a great one to tackle next.

NiklasGustafsson (Contributor) commented

The translation tutorial depends on the 'spacy' package for language tokenization, and I suspect there is nothing similar for .NET. This speaks to a broader need to specify, design, and prioritize data-processing libraries for TorchSharp.


GeorgeS2019 commented May 18, 2021

@NiklasGustafsson

FYI: Related discussions => Proposal - .NET tokenization library & Proposal for common .NET tokenization library

The data-processing step anticipates different types of tokenizers, and spaCy is only one of them.

From the translation tutorial, using torchtext.data.utils (ported as TorchText.Data.Utils.cs):

de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

The tokenizers considered in torchtext.data.utils are spacy, moses, toktok, revtok, and subword (see the dispatch code below).

I have not seen all of them ported or made available to .NET.

=> I am interested to learn from others what possible substitutes exist for the above list of tokenizers, without resorting to Python.NET.

@zhongkaifu => any feedback?

I know sub-word tokenization is really useful for text-generation tasks; an MT task can gain 2~3 BLEU points on average. Some NN frameworks have integrated sub-word tokenization; Marian, for example, uses a built-in SentencePiece in its data-processing step.

FYI: SentencePiece is being implemented in TorchText.Data => functional.py. For reference, torchtext's tokenizer dispatch (get_tokenizer in torchtext/data/utils.py) looks like this:

from functools import partial

def _spacy_tokenize(x, spacy):
    # torchtext's spaCy adapter: run the tokenizer and return surface forms.
    return [tok.text for tok in spacy.tokenizer(x)]

def get_tokenizer(tokenizer, language='en'):
    if tokenizer == "spacy":
        try:
            import spacy
            try:
                spacy = spacy.load(language)
            except IOError:
                # Model shortcuts no longer work in spaCy 3.0+, try using fullnames
                # List is from https://github.com/explosion/spaCy/blob/b903de3fcb56df2f7247e5b6cfa6b66f4ff02b62/spacy/errors.py#L789
                OLD_MODEL_SHORTCUTS = spacy.errors.OLD_MODEL_SHORTCUTS if hasattr(spacy.errors, 'OLD_MODEL_SHORTCUTS') else {}
                if language not in OLD_MODEL_SHORTCUTS:
                    raise
                import warnings
                warnings.warn(f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead')
                spacy = spacy.load(OLD_MODEL_SHORTCUTS[language])
            return partial(_spacy_tokenize, spacy=spacy)
        except ImportError:
            print("Please install SpaCy. "
                  "See the docs at https://spacy.io for more information.")
            raise
        except AttributeError:
            print("Please install SpaCy and the SpaCy {} tokenizer. "
                  "See the docs at https://spacy.io for more "
                  "information.".format(language))
            raise
    elif tokenizer == "moses":
        try:
            from sacremoses import MosesTokenizer
            moses_tokenizer = MosesTokenizer()
            return moses_tokenizer.tokenize
        except ImportError:
            print("Please install SacreMoses. "
                  "See the docs at https://github.com/alvations/sacremoses "
                  "for more information.")
            raise
    elif tokenizer == "toktok":
        try:
            from nltk.tokenize.toktok import ToktokTokenizer
            toktok = ToktokTokenizer()
            return toktok.tokenize
        except ImportError:
            print("Please install NLTK. "
                  "See the docs at https://nltk.org for more information.")
            raise
    elif tokenizer == 'revtok':
        try:
            import revtok
            return revtok.tokenize
        except ImportError:
            print("Please install revtok.")
            raise
    elif tokenizer == 'subword':
        try:
            import revtok
            return partial(revtok.tokenize, decap=True)
        except ImportError:
            print("Please install revtok.")
            raise
    raise ValueError("Requested tokenizer {}, valid choices are a "
                     "callable that takes a single string as input, "
                     "\"revtok\" for the revtok reversible tokenizer, "
                     "\"subword\" for the revtok caps-aware tokenizer, "
                     "\"spacy\" for the SpaCy English tokenizer, or "
                     "\"moses\" for the NLTK port of the Moses tokenization "
                     "script.".format(tokenizer))

NiklasGustafsson (Contributor) commented

@GeorgeS2019 -- in light of more recent discussions about tokenization, I'm closing this issue as outdated. If you disagree, please reopen with an explanation of what you think needs to be tracked.
