Machine Learning project of Semester VI students (Group 3) at the School of Engineering and Applied Science, Ahmedabad University.
Machine Learning has rapidly found its place in the technological world over the past few years. One of its applications is plagiarism checking, which builds on Text Semantic Similarity: a measure of the degree of semantic equivalence between two pieces of text.
How do we know whether a document that we are reading is original? Are students copying content and ideas from other sources, or is the work their own?
In this project, we build one or more algorithms and analyse their suitability for plagiarism-checking software, applying concepts of Machine Learning we have already studied.
1) Aneri Sheth - 1401072
2) Himanshu Budhia - 1401039
3) Raj Shah - 1401050
4) Twinkle Vaghela - 1401106
Natural Language Processing is a wide domain covering concepts of Computer Science, Artificial Intelligence and Machine Learning. It is used to analyze text and how humans speak. One of the applications of NLP is Semantic Analysis (understanding the meaning of text).
This approach uses semantically annotated corpora to train Machine Learning algorithms to decide which word to use in which context. Corpus-based methods are supervised learning approaches in which the algorithms are trained on the annotated data. The corpus and lexical resource used here is WordNet.
Sentence 1 - A cemetery is a place where dead people’s bodies or their ashes are buried.
Sentence 2 - A graveyard is an area of land, sometimes near a church, where dead people are buried.
Tokenization splits the body of text into sentences and words. Words are separated by whitespace, i.e. after every word there is a space, and punctuation is counted as a separate token.
Tokenize.py
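As an illustration, here is a minimal sketch of tokenization with NLTK (an assumption about the approach; the repo's Tokenize.py may differ):

```python
# Minimal tokenization sketch using NLTK (illustrative; Tokenize.py may differ).
import nltk
nltk.download('punkt')  # tokenizer models, needed once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "A cemetery is a place where dead people are buried. It is quiet."
print(sent_tokenize(text))  # splits the text into two sentences
print(word_tokenize(text))  # punctuation such as '.' becomes its own token
```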
Sometimes, extremely common words which appear to be of little value in helping select documents matching a user's need are excluded from the vocabulary entirely. These words are called stop words.
Stop words can be filtered from the text to be processed.
StopWords.py
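A minimal sketch of stop-word filtering with NLTK's built-in English stop-word list (the repo's StopWords.py may differ):

```python
# Minimal stop-word filtering sketch using NLTK (illustrative).
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("I was given a card by her in the garden")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['given', 'card', 'garden']
```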
The goal of lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
By default, the lemmatizer treats a word as a noun and attempts to find its closest noun base form.
Lemmatizing.py
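A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (the repo's Lemmatizing.py may differ):

```python
# Minimal lemmatization sketch using NLTK's WordNetLemmatizer (illustrative).
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("bodies"))          # 'body'  (noun is the default POS)
print(lemmatizer.lemmatize("given", pos="v"))  # 'give'  (other POS must be passed explicitly)
```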
WordNet is a lexical database for the English language, and is part of the NLTK corpus. We can use WordNet alongside the NLTK module to find the meaning of words, synonyms, antonyms and more.
Wordnet.py
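A minimal sketch of querying WordNet for definitions, synonyms and antonyms (the repo's Wordnet.py may differ):

```python
# Minimal WordNet lookup sketch using NLTK (illustrative).
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet

print(wordnet.synsets("good")[0].definition())  # gloss of the first sense

synonyms, antonyms = set(), set()
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())
print(synonyms, antonyms)  # antonyms include 'bad' and 'evil'
```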
- Let S1 be "I was given a card by her in the garden" and S2 be "In the garden, she gave me a card."
- For semantic analysis, two phrases or sentences are taken; the task is to decide whether they are similar, somewhat similar, or not similar.
- After that, a set of stop words is defined for the English language.
- After eliminating the special characters and punctuation, removing all the stop words, and lemmatizing, we get S1 = {I, given, card, garden} and S2 = {In, garden, gave, card}.
- After lemmatizing, we look up the synonym sets (synsets) of the lemmatized words. Then we compare the first word of S1 with all the words of S2, continue this iteratively, and find the similarity index of each word with the words of S2.
- We take the mean of the computed similarity indexes, and this mean is the semantic similarity score of the two sentences.
- If the similarity index is less than 0.60, the sentences are labeled 'Not Similar'; if it is between 0.60 and 0.80, they are labeled 'Somewhat Similar'; and if it is above 0.80, they are 'Similar'. A sketch of this pipeline follows the link below.
[Final.py](https://github.com/budhiahimanshu96/Text-Semantic-Similarity-MachineLearning/blob/master/NLTK/Final.py)
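A minimal end-to-end sketch of the scoring described above, using Wu-Palmer similarity between synsets as the word-level measure (an assumption: Final.py may use a different similarity function, but the preprocessing, averaging, and thresholds follow the steps listed):

```python
# Minimal sentence-similarity sketch (illustrative; Final.py may differ).
import nltk
for pkg in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(pkg)

from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # keep alphabetic tokens, drop stop words, lemmatize the rest
    tokens = [w.lower() for w in word_tokenize(sentence) if w.isalpha()]
    return [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]

def word_similarity(w1, w2):
    # best Wu-Palmer similarity over all synset pairs of the two words
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wordnet.synsets(w1) for s2 in wordnet.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(s1, s2):
    # for each word of S1, take its best match in S2, then average
    words1, words2 = preprocess(s1), preprocess(s2)
    scores = [max((word_similarity(w1, w2) for w2 in words2), default=0.0)
              for w1 in words1]
    return sum(scores) / len(scores) if scores else 0.0

score = sentence_similarity("I was given a card by her in the garden",
                            "In the garden, she gave me a card")
label = ("Not Similar" if score < 0.60
         else "Somewhat Similar" if score <= 0.80
         else "Similar")
print(round(score, 2), label)
```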
- Semantic similarity has been implemented for sentences and phrases. However, paragraphs and short texts will need more complex algorithms for splitting the text into sentences and combining their semantic similarities.
- We find similarity word by word, so we may get false positives and false negatives.
- We would try to decrease the false positive and false negative rates by using sentence-to-sentence similarity instead of word-to-word similarity.
- Our implementation does not account for spelling variations. The Longest Common Subsequence (LCS) algorithm could be used to handle that; a sketch follows this list.
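A minimal sketch of the LCS length computed with dynamic programming, which could flag near-matches despite spelling errors (an illustration of the suggested extension, not part of the current implementation):

```python
# Minimal LCS-length sketch via dynamic programming (illustrative).
def lcs_length(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

print(lcs_length("cemetery", "cemetary"))  # 7 of 8 characters match in order
```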