Word2Vec model to generate synonyms on the fly #12168

dantuzi · 2023-02-24T08:30:52Z

Description

To expand queries or documents with synonyms in Apache Lucene, you need a predefined file listing the terms that share the same meaning.
A basic synonym list is not always easy to find for a given language and, even when one exists, it does not necessarily match your domain.
For example, in operating system articles the term "daemon" is not a synonym of "devil"; it is closer to the term "process".

Word2Vec is a two-layer neural network that takes a text corpus as input and outputs a vector representation for each word in its vocabulary.
Words with similar meanings are mapped to vectors that lie close to each other.
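
To make "close to each other" concrete, here is a minimal sketch that uses cosine similarity as the closeness measure. The three-dimensional vectors are hand-made for illustration and do not come from a trained model; the class and field names are hypothetical.

```java
import java.util.Map;

public class CosineSimilarityDemo {

    // Hand-made toy vectors, invented for illustration only.
    // A real Word2Vec model typically uses hundreds of dimensions.
    static final Map<String, float[]> TOY_VECTORS = Map.of(
        "daemon", new float[] {0.9f, 0.1f, 0.2f},
        "process", new float[] {0.8f, 0.2f, 0.3f},
        "devil", new float[] {0.1f, 0.9f, 0.1f});

    // Cosine similarity: dot(a, b) / (|a| * |b|), a value in [-1, 1].
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] daemon = TOY_VECTORS.get("daemon");
        System.out.println("daemon ~ process: " + cosine(daemon, TOY_VECTORS.get("process")));
        System.out.println("daemon ~ devil:   " + cosine(daemon, TOY_VECTORS.get("devil")));
    }
}
```

With these toy values, "daemon" scores much higher against "process" than against "devil", which is the behaviour one would hope a model trained on operating system articles shows.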

This contribution integrates the technique into the text analysis pipeline: it generates synonyms on the fly from a Word2Vec model generated with the DL4J library.
Please see our presentation at the Berlin Buzzwords conference: https://pretalx.com/bbuzz22/talk/UYZAUX/
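
The idea of generating synonyms on the fly can be sketched roughly as follows: look up the vector of the incoming term, score every other word in the vocabulary, and keep the best matches. This is not the code from the pull request and does not use the Lucene or DL4J APIs; the class name, the `maxSynonyms` and `minSimilarity` parameters, and the toy vectors are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToySynonymExpander {

    // Hand-made toy vectors standing in for a trained model (illustration only).
    static final Map<String, float[]> TOY_VECTORS = Map.of(
        "daemon", new float[] {0.9f, 0.1f, 0.2f},
        "process", new float[] {0.8f, 0.2f, 0.3f},
        "service", new float[] {0.7f, 0.3f, 0.3f},
        "devil", new float[] {0.1f, 0.9f, 0.1f});

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to maxSynonyms words whose similarity to the term is at
    // least minSimilarity, most similar first. Both parameters are hypothetical.
    static List<String> expand(String term, Map<String, float[]> vectors,
                               int maxSynonyms, double minSimilarity) {
        float[] query = vectors.get(term);
        if (query == null) {
            return List.of(); // term not in the vocabulary: nothing to add
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (e.getKey().equals(term)) {
                continue; // skip the term itself
            }
            double sim = cosine(query, e.getValue());
            if (sim >= minSimilarity) {
                scored.add(Map.entry(e.getKey(), Double.valueOf(sim)));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(maxSynonyms, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [process, service]
        System.out.println(expand("daemon", TOY_VECTORS, 2, 0.7));
    }
}
```

This brute-force scan over the whole vocabulary is only meant to show the principle; a production filter would more likely rely on an approximate nearest-neighbour index.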

We also created a tool to generate a Word2Vec model from a Lucene index: https://github.com/SeaseLtd/LuceneWord2VecModelTrainer

dantuzi added the type:enhancement label Feb 24, 2023

alessandrobenedetti linked a pull request Feb 24, 2023 that will close this issue

Introduced the Word2VecSynonymFilter #12169

Merged

dantuzi mentioned this issue Feb 24, 2023

Introduced the Word2VecSynonymFilter #12169

Merged

alessandrobenedetti assigned dantuzi Feb 24, 2023

alessandrobenedetti closed this as completed in #12169 Apr 24, 2023
