Spell checker/corrector? #315
There's currently no text normalization or spelling correction in spaCy. We'd like to get this built, though.
What would be the recommended approach? I'm thinking of using some spell checker based on the vocab. Using n-gram features would be great too, as would allowing an additional custom dictionary (or some way to give more weight to our own dictionary). For the actual auto-correction, I guess something like https://github.com/gfairchild/pyxDamerauLevenshtein could be used, where the allowed distance grows with the length of the token.
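A rough pure-Python sketch of that idea (pyxDamerauLevenshtein provides a fast Cython implementation of the distance itself; the `allowed_distance` thresholds below are an illustrative assumption, not something from the thread, and this implements the restricted "optimal string alignment" variant):

```python
def damerau_levenshtein(a, b):
    """Restricted (optimal string alignment) Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # Transposition of two adjacent characters.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def allowed_distance(token):
    # Illustrative policy: longer tokens tolerate more edits.
    return 1 if len(token) <= 4 else 2 if len(token) <= 8 else 3

def correct(token, vocab):
    """Return the closest in-vocabulary word within the allowed distance."""
    if token in vocab:
        return token
    dist, best = min((damerau_levenshtein(token, w), w) for w in vocab)
    return best if dist <= allowed_distance(token) else None
```

For example, `correct("teh", {"the", "cat"})` finds "the" at distance 1 (one transposition), which is within the budget for a 3-character token.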
I might be missing something entirely here, but I've been trying to understand how spaCy treats misspellings in its lemmatization/tokenization. As near as I can tell, the current behavior is to take misspelled words and insert them into the list, bumping all following tokens down. This was pretty confusing when word.lemma was returning different values depending on whether or not my data contained misspellings. For the work I'm doing, I don't want to correct the spellings; I just want to know that the misspellings are there and be able to extract them. A good first step from my end might be to simply flag misspellings/words not in the lemma lists as such in some way (optionally?). Am I totally out to lunch?
@lucasjfriesen For a simple temporary solution, I think you could just check whether the token is
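The comment is cut off here, but the check being suggested is presumably an out-of-vocabulary test (spaCy exposes this on each token as `token.is_oov`). The idea can be sketched without spaCy at all, using a plain set as a toy stand-in for the vocabulary:

```python
# Toy stand-in for a real vocabulary; spaCy's Vocab is far larger, and
# the real check would be `token.is_oov` on each spaCy Token.
VOCAB = {"the", "quick", "brown", "fox", "jumps"}

def flag_misspellings(tokens):
    """Return tokens absent from the vocabulary, without correcting them."""
    return [t for t in tokens if t.lower() not in VOCAB]

flag_misspellings("the qiuck brown fox".split())  # -> ["qiuck"]
```

This matches the request above: the misspellings are only flagged and extracted, never altered.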
Good thought @kootenpv – thanks! I'll see what I can do with that. Edit FWIW to anyone else reading this:
Any update on this? Is there going to be a context-aware spell checker for spaCy? Ideally, we'd like to provide our own context (training dataset). Thank you!
Quick update: this might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!
Adding on to this: Hunspell is the most widely used spell checker, and it has a Python binding that could be a good starting point: https://github.com/blatinier/pyhunspell
@pavillet Thanks, this is a great suggestion! Just had a look at the API and felt inspired, so here's some untested, semi-pseudocode for a possible spaCy component.
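The code from that comment didn't survive the scrape, so here is a dependency-free sketch of the same idea (the `SpellCheckComponent` and `DictChecker` names are illustrative, not from the original). A pipeline component is a callable that receives a doc and processes each token; `checker` can be any object with hunspell-style `spell`/`suggest` methods, which pyhunspell's `HunSpell` objects provide:

```python
import difflib

class SpellCheckComponent:
    """Sketch of a pipeline component that spell-checks each token.

    `checker` is any object with hunspell-style methods:
    spell(word) -> bool and suggest(word) -> list of str.
    """
    name = "spell_check"

    def __init__(self, checker):
        self.checker = checker

    def __call__(self, doc):
        # In a real spaCy v2 component, these results would be written to
        # custom extension attributes registered via Token.set_extension,
        # and `doc` would be returned so the pipeline can continue. Here
        # we just return a dict keyed by token index.
        results = {}
        for i, token in enumerate(doc):
            text = token if isinstance(token, str) else token.text
            ok = self.checker.spell(text)
            results[i] = (ok, [] if ok else self.checker.suggest(text))
        return results

class DictChecker:
    """Tiny dictionary-backed stand-in for hunspell.HunSpell (demo only)."""

    def __init__(self, words):
        self.words = set(words)

    def spell(self, word):
        return word.lower() in self.words

    def suggest(self, word):
        # Naive suggestions via stdlib fuzzy matching.
        return difflib.get_close_matches(word.lower(), self.words, n=3)
```

With pyhunspell installed, a real `hunspell.HunSpell(dic_path, aff_path)` object could be passed in place of `DictChecker`, and the component added to the pipeline with `nlp.add_pipe`.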
Took a stab at it here: https://github.com/tokestermw/spacy_hunspell The hardest part was installing
@tokestermw Ah, this is really cool – can't wait to try it! Also, let me know if/when it's ready to be shared, so we can post it on Twitter and add it to the extensions on the resources page.
@ines I think it's mostly ready: https://github.com/tokestermw/spacy_hunspell/releases I haven't thoroughly tested on various platforms, and the installation may need some work, but the plugin itself is straightforward. I have a couple of other ideas for plugins, so I'll be working on those too. 👍
Just added it to the resources and shared it on Twitter 🎉 Will close this issue, since there's now a plugin, plus other ideas and suggestions further up in the thread. Of course, this doesn't mean there can't be more than one spell checker for spaCy 😉 So if anyone was going to build their own, feel free to share it – it'd definitely be a great addition to our (still very small) collection of community plugins!
https://github.com/atpaino/deep-text-corrector may be helpful. Best regards!
Does spaCy use any text normalizer to resolve spelling errors? Are there any plans for it? Or do I need a separate step before passing the text string to spaCy?

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.