Missing TorchText.NN feature #257
@GeorgeS2019 -- I have been thinking about the side-packages for a while: TorchText, TorchVision, etc. Because I didn't have a good idea of the full scenarios, I just added the functionality to the Examples, as utils. I think it's a good idea to start working on these packages for real, though. I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision. That said, torchvision is another package that we need to get included, but I'll probably start working on the text functionality first.
@NiklasGustafsson The TorchText.NN module that provides the Transformers with Multi-Head Attention is, in my view, essential for TorchSharp. @zhongkaifu FYI
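For context, each head in multi-head attention computes scaled dot-product attention, softmax(QKᵀ/√d)·V. The per-head computation is small enough to sketch in plain Python; this is a hand-rolled illustration of the math only, not the TorchText.NN or TorchSharp API:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.
    Q, K, V are lists of vectors (lists of floats); K and V have equal length."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Attention output is the weight-averaged value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Tiny example: 2 query positions, 3 key/value positions, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Multi-head attention runs this computation h times on learned linear projections of Q, K, and V and concatenates the results, which is the part TorchText.NN packages up.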
There are 6 NLP tutorials; 3 of them are "NLP From Scratch" tutorials, independent of TorchText.
Perhaps it is more important to focus on the last remaining TorchText tutorials listed above. There is a lengthy discussion of why the multi-head attention feature was moved into TorchText.NN, for various reasons, many of which are beyond my comprehension :-)
Thanks for those thoughts. The first of the two NLP tutorials that you list is, I believe, implemented here. The second one, the language translation example, should be a great one to tackle next. |
The translation tutorial depends on the 'spacy' package for language tokenization, and I suspect there is nothing similar for .NET. This speaks to a broader need to specify, design, and prioritize data processing libraries for TorchSharp.
FYI: Related discussions => Proposal - .NET tokenization library & Proposal for common .NET tokenization library.

The Data Processing step anticipates different types of tokenizers, and spacy is only one of them. From TorchText.Data Utils.cs:

```python
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
```

The list of tokenizers considered in TorchText.Data Utils.cs is given below. I have not seen all of them ported/made available to .NET => interested to learn from others what possible substitutes exist for the above list of tokenizers without resorting to Python.NET. @zhongkaifu => any feedback?
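As a stopgap before a full .NET tokenization library exists, the simplest of torchtext's tokenizers ("basic_english": lowercase, split punctuation from words) is easy to replicate in any language. The sketch below is an illustrative approximation in Python of that kind of rule-based normalization, not the exact torchtext implementation; the specific regex patterns are my own:

```python
import re

# Illustrative rule-based tokenizer in the spirit of torchtext's
# "basic_english": lowercase the input, then separate common punctuation
# into standalone tokens. Patterns here are assumptions, not torchtext's.
_patterns = [
    (re.compile(r"\."), " . "),
    (re.compile(r","), " , "),
    (re.compile(r"\("), " ( "),
    (re.compile(r"\)"), " ) "),
    (re.compile(r"!"), " ! "),
    (re.compile(r"\?"), " ? "),
    (re.compile(r"[;:]"), " "),
    (re.compile(r"\s+"), " "),   # collapse runs of whitespace last
]

def basic_english_tokenize(line):
    line = line.lower()
    for pattern, repl in _patterns:
        line = pattern.sub(repl, line)
    return line.split()

tokens = basic_english_tokenize("Hello, world!")
# tokens -> ['hello', ',', 'world', '!']
```

A .NET port could express the same thing with `System.Text.RegularExpressions`; the harder cases in the list (spacy, moses, sentencepiece) carry trained models or large rule sets and are the real gap.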
FYI: SentencePiece is being implemented in TorchText.Data => functional.py (excerpt):

```python
if tokenizer == "spacy":
    try:
        import spacy
        try:
            spacy = spacy.load(language)
        except IOError:
            # Model shortcuts no longer work in spaCy 3.0+, try using fullnames
            # List is from https://github.com/explosion/spaCy/blob/b903de3fcb56df2f7247e5b6cfa6b66f4ff02b62/spacy/errors.py#L789
            OLD_MODEL_SHORTCUTS = spacy.errors.OLD_MODEL_SHORTCUTS if hasattr(spacy.errors, 'OLD_MODEL_SHORTCUTS') else {}
            if language not in OLD_MODEL_SHORTCUTS:
                raise
            import warnings
            warnings.warn(f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead')
            spacy = spacy.load(OLD_MODEL_SHORTCUTS[language])
        return partial(_spacy_tokenize, spacy=spacy)
    except ImportError:
        print("Please install SpaCy. "
              "See the docs at https://spacy.io for more information.")
        raise
    except AttributeError:
        print("Please install SpaCy and the SpaCy {} tokenizer. "
              "See the docs at https://spacy.io for more "
              "information.".format(language))
        raise
elif tokenizer == "moses":
    try:
        from sacremoses import MosesTokenizer
        moses_tokenizer = MosesTokenizer()
        return moses_tokenizer.tokenize
    except ImportError:
        print("Please install SacreMoses. "
              "See the docs at https://github.com/alvations/sacremoses "
              "for more information.")
        raise
elif tokenizer == "toktok":
    try:
        from nltk.tokenize.toktok import ToktokTokenizer
        toktok = ToktokTokenizer()
        return toktok.tokenize
    except ImportError:
        print("Please install NLTK. "
              "See the docs at https://nltk.org for more information.")
        raise
elif tokenizer == 'revtok':
    try:
        import revtok
        return revtok.tokenize
    except ImportError:
        print("Please install revtok.")
        raise
elif tokenizer == 'subword':
    try:
        import revtok
        return partial(revtok.tokenize, decap=True)
    except ImportError:
        print("Please install revtok.")
        raise
raise ValueError("Requested tokenizer {}, valid choices are a "
                 "callable that takes a single string as input, "
                 "\"revtok\" for the revtok reversible tokenizer, "
                 "\"subword\" for the revtok caps-aware tokenizer, "
                 "\"spacy\" for the SpaCy English tokenizer, or "
                 "\"moses\" for the NLTK port of the Moses tokenization "
                 "script.".format(tokenizer))
```
@GeorgeS2019 -- in light of more recent discussions about tokenization, I'm closing this issue as outdated. If you disagree, please reopen with an explanation of what you think needs to be tracked.
The recent example SequenceToSequence.cs is an excellent implementation of [Tutorial 1]: Sequence-to-Sequence Modeling with nn.Transformer and TorchText.
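One piece that tutorial pairs with nn.Transformer is a sinusoidal positional encoding, since the attention layers themselves are order-agnostic. The formula from "Attention Is All You Need" can be sketched in plain Python; this is an illustration of the math, not the TorchSharp code:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Returns a seq_len x d_model table added to the token embeddings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dims: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
```

Each position gets a distinct pattern of sines and cosines at geometrically spaced frequencies, so relative offsets are expressible as linear functions of the encodings.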
Together with the SequenceToSequence Modeling example, a number of TorchText classes were implemented:
Organized according to TorchText namespaces
It would make TorchSharp more complete in terms of TorchText if the following feature and example were implemented.