Machine Learning project of Semester VI students (Group 3) at the School of Engineering and Applied Science, Ahmedabad University.
Machine Learning has rapidly found its place in the technological world over the past few years. One of its applications is plagiarism checking, which builds on Text Semantic Similarity: a measure of the degree of semantic equivalence between two pieces of text.
How do we know whether a document that we are reading is original? Are students copying content and ideas from other sources, or is the work their own?
In this project, we build one or more algorithms and analyse their suitability for plagiarism-checking software, applying concepts of Machine Learning we have already studied.
1) Aneri Sheth - 1401072
2) Himanshu Budhia - 1401039
3) Raj Shah - 1401050
4) Twinkle Vaghela - 1401106
Natural Language Processing is a wide domain covering concepts of Computer Science, Artificial Intelligence and Machine Learning. It is used to analyze text and how humans speak. One of the applications of NLP is Semantic Analysis (understanding the meaning of text).
This approach uses semantically annotated corpora to train Machine Learning algorithms to decide which word to use in which context. Corpus-based methods are supervised learning approaches in which the algorithms are trained on the annotated data. The corpus and lexical resource used here is WordNet.
Sentence 1 - A cemetery is a place where dead people’s bodies or their ashes are buried.
Sentence 2 - A graveyard is an area of land, sometimes near a church, where dead people are buried.
Tokenization splits the body of text into sentences and words. Words are separated by whitespace, i.e. after every word there is a space, and punctuation is counted as a separate token.
Tokenize.py
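As an illustration, here is a minimal sketch of tokenization with NLTK (an assumption about the approach; the repo's Tokenize.py may differ):

```python
# Minimal tokenization sketch using NLTK (illustrative; Tokenize.py may differ).
import nltk
nltk.download('punkt')  # tokenizer models, needed once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "A cemetery is a place where dead people are buried. It is quiet."
print(sent_tokenize(text))  # splits the text into two sentences
print(word_tokenize(text))  # punctuation such as '.' becomes its own token
```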
Sometimes, extremely common words which appear to be of little value in helping select documents matching a user's need are excluded from the vocabulary entirely. These words are called stop words.
Stop words can be filtered from the text to be processed.
StopWords.py
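A minimal sketch of stop-word filtering with NLTK's built-in English stop-word list (the repo's StopWords.py may differ):

```python
# Minimal stop-word filtering sketch using NLTK (illustrative).
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("I was given a card by her in the garden")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['given', 'card', 'garden']
```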
The goal of lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
By default, the lemmatizer treats a word as a noun and attempts to find its closest noun base form.
Lemmatizing.py
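A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (the repo's Lemmatizing.py may differ):

```python
# Minimal lemmatization sketch using NLTK's WordNetLemmatizer (illustrative).
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("bodies"))          # 'body'  (noun is the default POS)
print(lemmatizer.lemmatize("given", pos="v"))  # 'give'  (other POS must be passed explicitly)
```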
WordNet is a lexical database for the English language, and is part of the NLTK corpus. We can use WordNet alongside the NLTK module to find the meaning of words, synonyms, antonyms and more.
Wordnet.py
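A minimal sketch of querying WordNet for definitions, synonyms and antonyms (the repo's Wordnet.py may differ):

```python
# Minimal WordNet lookup sketch using NLTK (illustrative).
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet

print(wordnet.synsets("good")[0].definition())  # gloss of the first sense

synonyms, antonyms = set(), set()
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())
print(synonyms, antonyms)  # antonyms include 'bad' and 'evil'
```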
- Let S1 be "I was given a card by her in the garden" and S2 be "In the garden, she gave me a card."
- For semantic analysis, two phrases or sentences are taken; the task is to decide whether they are similar, somewhat similar, or not similar.
- After that, a set of stop words is defined for the English language.
- After eliminating the special characters and punctuation, removing all the stop words, and lemmatizing, we get S1 = {I, given, card, garden} and S2 = {In, garden, gave, card}.
- After lemmatizing, we look up the synonym sets (synsets) of the lemmatized words. Then we compare the first word of S1 with all the words of S2, continue this iteratively, and find the similarity index of each word with the words of S2.
- We take the mean of the computed similarity indexes, and this mean is the semantic similarity score of the two sentences.
- If the similarity index is less than 0.60, the sentences are labeled 'Not Similar'; if it is between 0.60 and 0.80, they are labeled 'Somewhat Similar'; and if it is above 0.80, they are 'Similar'. A sketch of this pipeline follows the link below.
[Final.py](https://github.com/budhiahimanshu96/Text-Semantic-Similarity-MachineLearning/blob/master/NLTK/Final.py)
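A minimal end-to-end sketch of the scoring described above, using Wu-Palmer similarity between synsets as the word-level measure (an assumption: Final.py may use a different similarity function, but the preprocessing, averaging, and thresholds follow the steps listed):

```python
# Minimal sentence-similarity sketch (illustrative; Final.py may differ).
import nltk
for pkg in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(pkg)

from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # keep alphabetic tokens, drop stop words, lemmatize the rest
    tokens = [w.lower() for w in word_tokenize(sentence) if w.isalpha()]
    return [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]

def word_similarity(w1, w2):
    # best Wu-Palmer similarity over all synset pairs of the two words
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wordnet.synsets(w1) for s2 in wordnet.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(s1, s2):
    # for each word of S1, take its best match in S2, then average
    words1, words2 = preprocess(s1), preprocess(s2)
    scores = [max((word_similarity(w1, w2) for w2 in words2), default=0.0)
              for w1 in words1]
    return sum(scores) / len(scores) if scores else 0.0

score = sentence_similarity("I was given a card by her in the garden",
                            "In the garden, she gave me a card")
label = ("Not Similar" if score < 0.60
         else "Somewhat Similar" if score <= 0.80
         else "Similar")
print(round(score, 2), label)
```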
- Semantic similarity has been implemented for sentences and phrases. However, paragraphs and short texts will need more complex algorithms for splitting the text into sentences and combining their semantic similarities.
- We find similarity word by word, so we may get false positives and false negatives.
- We would try to decrease the false positive and false negative rates by using sentence-to-sentence similarity instead of word-to-word similarity.
- Our implementation does not account for spelling variations. The Longest Common Subsequence (LCS) algorithm could be used to handle that; a sketch follows this list.
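A minimal sketch of the LCS length computed with dynamic programming, which could flag near-matches despite spelling errors (an illustration of the suggested extension, not part of the current implementation):

```python
# Minimal LCS-length sketch via dynamic programming (illustrative).
def lcs_length(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

print(lcs_length("cemetery", "cemetary"))  # 7 of 8 characters match in order
```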