GitHub - Anshupriya2694/Plagiarism-Detection-on-AWS: In this project, a plagiarism detector is built. The detector examines a text file and performs binary classification; labeling that file as either plagiarized or not, depending on how similar the text file is when compared to a provided source text.

Plagiarism Detection on AWS

In this project, the task is to build a plagiarism detector that examines a text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text.

Defining Features

One way to detect plagiarism is to compute similarity features that measure how similar a given text file is to an original source text. You can develop as many features as needed; this project requires defining a couple, as outlined in this paper (which is also linked in the Lesson Resources tab). In the paper, researchers created features called containment and longest common subsequence.
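The longest common subsequence feature is only named above, so here is a minimal sketch of how such a feature could be computed. It assumes a word-level comparison and normalizes by the number of words in the submitted text; both choices, and the function name `lcs_norm_word`, are illustrative assumptions rather than details confirmed by this README.

```python
def lcs_norm_word(answer_text: str, source_text: str) -> float:
    """Word-level longest common subsequence, divided by the answer's word count.

    Assumption: whitespace tokenization and lowercasing; the project's own
    preprocessing may differ.
    """
    answer_words = answer_text.lower().split()
    source_words = source_text.lower().split()
    if not answer_words:
        return 0.0

    # dp[i][j] holds the LCS length of answer_words[:i] and source_words[:j].
    dp = [[0] * (len(source_words) + 1) for _ in range(len(answer_words) + 1)]
    for i, a_word in enumerate(answer_words, start=1):
        for j, s_word in enumerate(source_words, start=1):
            if a_word == s_word:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    return dp[-1][-1] / len(answer_words)


# Shared subsequence is "a b c e": 4 of the 5 answer words.
lcs_example = lcs_norm_word("a b c d e", "a x b c y e")
```

A value near 1 means most of the submitted text appears, in order, in the source; a value near 0 means very little does.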

I have defined a few different similarity features to compare the two texts. Once the relevant features are extracted, a LinearSVC model is used to perform classification.
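As a rough illustration of the classification step, the sketch below fits scikit-learn's `LinearSVC` on a tiny, made-up feature matrix. The feature values, labels, and variable names are hypothetical placeholders, not data from this project; the actual training script and hyperparameters are not shown in this README.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical rows: [containment, normalized LCS] for six text files.
train_x = np.array([
    [0.90, 0.80],
    [0.85, 0.90],
    [0.95, 0.70],
    [0.10, 0.05],
    [0.20, 0.10],
    [0.05, 0.15],
])
# Hypothetical labels: 1 = plagiarized, 0 = not plagiarized.
train_y = np.array([1, 1, 1, 0, 0, 0])

# random_state and max_iter are illustrative settings, not the project's.
model = LinearSVC(random_state=0, max_iter=10000)
model.fit(train_x, train_y)

# Classify two unseen (also made-up) feature rows.
new_files = np.array([[0.90, 0.90], [0.05, 0.05]])
predictions = model.predict(new_files)
```

A linear model is a reasonable fit when the similarity features already separate the two classes well, as they do in this toy data.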

Containment

One of the first tasks is to create containment features. These first look at the whole body of text (counting the occurrences of words across several text files) and then compare a submitted text with its source text, relative to the traits of that whole body of text.

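Count vectorization is used to calculate n-gram counts, and containment is then computed as the number of n-grams the submitted text shares with the source text, divided by the total number of n-grams in the submitted text.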

If the two texts have no n-grams in common, the containment will be 0; if all of their n-grams intersect, the containment will be 1. Intuitively, having longer n-grams in common can be an indication of cut-and-paste plagiarism.
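The containment calculation can be sketched with scikit-learn's `CountVectorizer`. This is a minimal example under stated assumptions: the function name `containment`, its arguments, and the choice to count shared n-grams as the element-wise minimum of the two count vectors are illustrative, and the project's own helper may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def containment(answer_text: str, source_text: str, n: int) -> float:
    """Share of the answer's word n-grams that also occur in the source."""
    # Count word n-grams of exactly length n over both texts.
    vectorizer = CountVectorizer(analyzer="word", ngram_range=(n, n))
    counts = vectorizer.fit_transform([answer_text, source_text]).toarray()
    answer_counts, source_counts = counts[0], counts[1]

    total = answer_counts.sum()
    if total == 0:
        return 0.0  # The answer has no n-grams of this length.

    # An n-gram is shared as many times as it appears in both texts.
    shared = np.minimum(answer_counts, source_counts).sum()
    return float(shared / total)


source = "the quick brown fox jumps over the lazy dog"
full_overlap = containment("the quick brown fox", source, n=2)
partial_overlap = containment("the quick red fox", source, n=1)
no_overlap = containment("hello world again", "completely different text here", n=1)
```

Here `full_overlap` is 1 because every bigram of the answer occurs in the source, `partial_overlap` is 0.75 because three of the four answer words occur in the source, and `no_overlap` is 0.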

