Word2Vec model to generate synonyms on the fly #12168

Closed
dantuzi opened this issue Feb 24, 2023 · 0 comments · Fixed by #12169
dantuzi commented Feb 24, 2023

Description

If you want to expand your queries or documents with synonyms in Apache Lucene, you need a predefined file listing the terms that share the same meaning.
It's not always easy to find a list of basic synonyms for a language and, even if you find one, it doesn't necessarily match your domain.
For example, in operating system articles the term "daemon" is not a synonym of "devil"; it is much closer to the term "process".

Word2Vec is a two-layer neural network that takes a text corpus as input and outputs a vector representation for each word in its vocabulary.
Words with similar meanings are mapped to vectors that lie close to each other.
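
To make "close to each other" concrete, here is a minimal sketch that compares word vectors by cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors and the class name `EmbeddingSimilarity` are made up for illustration; a trained model would have far more dimensions, and the measure used by the actual contribution may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EmbeddingSimilarity {

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Values near 1.0 mean the vectors point in nearly the same direction.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical toy vectors (assumption: not taken from any real model).
    static Map<String, float[]> toyVectors() {
        Map<String, float[]> v = new LinkedHashMap<>();
        v.put("daemon", new float[] {0.9f, 0.1f, 0.0f});
        v.put("process", new float[] {0.8f, 0.2f, 0.1f});
        v.put("devil", new float[] {0.0f, 0.1f, 0.9f});
        return v;
    }

    public static void main(String[] args) {
        Map<String, float[]> v = toyVectors();
        System.out.printf("daemon ~ process: %.3f%n", cosine(v.get("daemon"), v.get("process")));
        System.out.printf("daemon ~ devil:   %.3f%n", cosine(v.get("daemon"), v.get("devil")));
    }
}
```

With these toy values, "daemon" scores much higher against "process" than against "devil", which is the behaviour a model trained on operating system articles would be expected to show.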

This contribution integrates the technique into Lucene's text analysis pipeline: synonyms are generated on the fly from a Word2Vec model trained with the DL4J library.
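
As a rough illustration of the idea (not the code from the pull request), the sketch below expands a token list with the nearest neighbours of each term. The class `SynonymExpansionSketch`, the parameters `minSimilarity` and `maxPerTerm`, and the precomputed neighbour map are all hypothetical. A real Lucene token filter would also emit each synonym at the same position as the original token, which this flat list ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SynonymExpansionSketch {

    // Expands each token with up to maxPerTerm neighbours whose similarity
    // is at least minSimilarity. The neighbour map is assumed to list
    // candidates in descending order of similarity.
    static List<String> expand(List<String> tokens,
                               Map<String, Map<String, Double>> neighbours,
                               double minSimilarity,
                               int maxPerTerm) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token); // always keep the original term
            int added = 0;
            Map<String, Double> candidates = neighbours.getOrDefault(token, Map.of());
            for (Map.Entry<String, Double> e : candidates.entrySet()) {
                if (added == maxPerTerm) {
                    break;
                }
                if (e.getValue() >= minSimilarity) {
                    out.add(e.getKey());
                    added++;
                }
            }
        }
        return out;
    }

    // Hypothetical neighbours with made-up similarity scores.
    static Map<String, Map<String, Double>> toyNeighbours() {
        Map<String, Double> daemon = new LinkedHashMap<>();
        daemon.put("process", 0.95);
        daemon.put("service", 0.85);
        daemon.put("devil", 0.10);
        Map<String, Map<String, Double>> model = new LinkedHashMap<>();
        model.put("daemon", daemon);
        return model;
    }

    public static void main(String[] args) {
        List<String> expanded = expand(List.of("restart", "daemon"), toyNeighbours(), 0.8, 2);
        System.out.println(expanded);
    }
}
```

The threshold keeps weakly related terms such as "devil" out of the expansion, and the cap bounds how much each term can grow the query or document.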
Please see our presentation at the Berlin Buzzwords conference: https://pretalx.com/bbuzz22/talk/UYZAUX/

We also created a tool to generate a Word2Vec model from a Lucene index: https://github.com/SeaseLtd/LuceneWord2VecModelTrainer
