Spell checker/corrector? #315
There's currently no text normalization or spelling correction in spaCy. We'd like to get this built, though.
What would be the recommended approach? I'm thinking of using some spell checker based on the vocab. Using n-gram features would be great too, as would allowing an additional custom dictionary (or some way to give more weight to our own dictionary). For the actual auto-correction, I guess something like https://github.com/gfairchild/pyxDamerauLevenshtein could be used, where the allowed distance grows with the length of the token.
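A rough pure-Python sketch of that idea (pyxDamerauLevenshtein provides a fast Cython implementation of the distance itself; the `allowed_distance` thresholds below are an illustrative assumption, not something from the thread, and this implements the restricted "optimal string alignment" variant):

```python
def damerau_levenshtein(a, b):
    """Restricted (optimal string alignment) Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # Transposition of two adjacent characters.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def allowed_distance(token):
    # Illustrative policy: longer tokens tolerate more edits.
    return 1 if len(token) <= 4 else 2 if len(token) <= 8 else 3

def correct(token, vocab):
    """Return the closest in-vocabulary word within the allowed distance."""
    if token in vocab:
        return token
    dist, best = min((damerau_levenshtein(token, w), w) for w in vocab)
    return best if dist <= allowed_distance(token) else None
```

For example, `correct("teh", {"the", "cat"})` finds "the" at distance 1 (one transposition), which is within the budget for a 3-character token.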
I might be missing something entirely here, but I've been trying to understand how spaCy treats misspellings in its lemmatization/tokenization. As near as I can tell, the current behavior is to take misspelled words and insert them into the list, bumping all following tokens down. This was pretty confusing when word.lemma was returning different values depending on whether or not my data contained misspellings. For the work I'm doing, I don't want to correct the spellings; I just want to know that the misspellings are there and be able to extract them. A good first step from my end might be to simply flag misspellings/words not in the lemma lists as such in some way (optionally?). Am I totally out to lunch?
@lucasjfriesen For a simple temporary solution, I think you could just check whether the token is
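The comment is cut off here, but the check being suggested is presumably an out-of-vocabulary test (spaCy exposes this on each token as `token.is_oov`). The idea can be sketched without spaCy at all, using a plain set as a toy stand-in for the vocabulary:

```python
# Toy stand-in for a real vocabulary; spaCy's Vocab is far larger, and
# the real check would be `token.is_oov` on each spaCy Token.
VOCAB = {"the", "quick", "brown", "fox", "jumps"}

def flag_misspellings(tokens):
    """Return tokens absent from the vocabulary, without correcting them."""
    return [t for t in tokens if t.lower() not in VOCAB]

flag_misspellings("the qiuck brown fox".split())  # -> ["qiuck"]
```

This matches the request above: the misspellings are only flagged and extracted, never altered.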
Good thought @kootenpv – thanks! I'll see what I can do with that. Edit FWIW to anyone else reading this:
Any update on this? Is there going to be a context-aware spell checker for spaCy? Ideally, we'd like to provide our own context (training dataset). Thank you!
Quick update: this might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!
Adding on to this: Hunspell is the most widely used spell checker, and it has a Python binding that could be a good starting point: https://github.com/blatinier/pyhunspell
@pavillet Thanks, this is a great suggestion! Just had a look at the API and felt inspired, so here's some untested, semi-pseudocode for a possible spaCy component.
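The code from that comment didn't survive the scrape, so here is a dependency-free sketch of the same idea (the `SpellCheckComponent` and `DictChecker` names are illustrative, not from the original). A pipeline component is a callable that receives a doc and processes each token; `checker` can be any object with hunspell-style `spell`/`suggest` methods, which pyhunspell's `HunSpell` objects provide:

```python
import difflib

class SpellCheckComponent:
    """Sketch of a pipeline component that spell-checks each token.

    `checker` is any object with hunspell-style methods:
    spell(word) -> bool and suggest(word) -> list of str.
    """
    name = "spell_check"

    def __init__(self, checker):
        self.checker = checker

    def __call__(self, doc):
        # In a real spaCy v2 component, these results would be written to
        # custom extension attributes registered via Token.set_extension,
        # and `doc` would be returned so the pipeline can continue. Here
        # we just return a dict keyed by token index.
        results = {}
        for i, token in enumerate(doc):
            text = token if isinstance(token, str) else token.text
            ok = self.checker.spell(text)
            results[i] = (ok, [] if ok else self.checker.suggest(text))
        return results

class DictChecker:
    """Tiny dictionary-backed stand-in for hunspell.HunSpell (demo only)."""

    def __init__(self, words):
        self.words = set(words)

    def spell(self, word):
        return word.lower() in self.words

    def suggest(self, word):
        # Naive suggestions via stdlib fuzzy matching.
        return difflib.get_close_matches(word.lower(), self.words, n=3)
```

With pyhunspell installed, a real `hunspell.HunSpell(dic_path, aff_path)` object could be passed in place of `DictChecker`, and the component added to the pipeline with `nlp.add_pipe`.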
Took a stab at it here: https://github.com/tokestermw/spacy_hunspell The hardest part was installing
@tokestermw Ah, this is really cool – can't wait to try it! Also, let me know if/when it's ready to be shared, so we can post it on Twitter and add it to the extensions on the resources page.
@ines I think it's mostly ready: https://github.com/tokestermw/spacy_hunspell/releases I haven't thoroughly tested on various platforms, and the installation may need some work, but the plugin itself is straightforward. I have a couple of other ideas for plugins, so I'll be working on those too. 👍
Just added it to the resources and shared it on Twitter 🎉 Will close this issue, since there's now a plugin, plus other ideas and suggestions further up in the thread. Of course, this doesn't mean there can't be more than one spell checker for spaCy 😉 So if anyone was going to build their own, feel free to share it – it'd definitely be a great addition to our (still very small) collection of community plugins!
https://github.com/atpaino/deep-text-corrector may be helpful. Best regards!
Does spaCy use any text normalizer to resolve spelling errors? Are there any plans for it? Or do I need a separate step before passing the text string to spaCy?

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.