This repository contains the implementation of a machine learning-based code plagiarism detection tool. The model flags plagiarized submissions by comparing pairs of code files and classifying each pair with a Random Forest classifier. The project includes preprocessing, data augmentation, model training, and evaluation steps.
- notebook.ipynb: Jupyter notebook with the complete implementation.
- data/: Directory containing C++ code files used for training and testing.
- README.md: This README file.
- alternate/: Directory containing previous versions of the model and other alternate methods tried.
The objective of this project is to develop a plagiarism detection tool that checks students' code submissions for similarities and identifies the plagiarized ones.
- Data Collection: Pairs of code files labeled as either plagiarized (1) or not plagiarized (0).
- Preprocessing: Removing comments, includes, and other non-essential parts of the code to focus on logic.
- Data Augmentation: Creating additional samples by shuffling code lines to increase the dataset size.
- Feature Extraction: Using TF-IDF to convert code samples into numerical vectors.
- Model Training: Using a Random Forest classifier with hyperparameter tuning.
- Evaluation: Assessing the model using cross-validation and classification metrics.
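
The preprocessing and augmentation steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names `preprocess` and `augment_by_shuffling`, the regular expressions, and the sample program are all assumptions made for the example.

```python
import random
import re
from typing import List


def preprocess(source: str) -> str:
    """Strip comments, #include lines, and blank lines from C++ source."""
    # Remove /* ... */ block comments, which may span several lines.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Remove // line comments. Simplification: this would also cut a "//"
    # that appears inside a string literal.
    source = re.sub(r"//[^\n]*", "", source)
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        # Drop empty lines and include directives.
        if not stripped or re.match(r"#\s*include\b", stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def augment_by_shuffling(code: str, n_copies: int = 3, seed: int = 0) -> List[str]:
    """Return n_copies variants of `code` with the line order shuffled."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    lines = code.splitlines()
    variants = []
    for _ in range(n_copies):
        shuffled = lines[:]
        rng.shuffle(shuffled)
        variants.append("\n".join(shuffled))
    return variants


# Hypothetical sample input, not taken from the data/ directory.
sample = """#include <iostream>
// entry point
int main() {
    int total = 0; /* running sum */
    for (int i = 0; i < 3; ++i) total += i;
    std::cout << total;
}
"""

cleaned = preprocess(sample)
variants = augment_by_shuffling(cleaned, n_copies=2)
print(cleaned)
```

Shuffled variants keep exactly the same lines as the cleaned source, so a bag-of-words representation such as TF-IDF sees them as near-identical to the original, which is presumably why line shuffling is usable as augmentation here.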
- Best Parameters:
  - max_depth: None
  - min_samples_leaf: 1
  - min_samples_split: 2
  - n_estimators: 100
- Cross-Validation Scores: [0.859, 0.953, 0.671]
- Mean Cross-Validation Score: 0.828
- Classification Report:
  - Precision (Class 0): 1.00
  - Recall (Class 0): 0.83
  - F1-Score (Class 0): 0.91
  - Precision (Class 1): 0.87
  - Recall (Class 1): 1.00
  - F1-Score (Class 1): 0.93
  - Overall Accuracy: 0.92
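
Results of this kind (best parameters, per-fold scores, and a per-class report) are what scikit-learn's `GridSearchCV`, `cross_val_score`, and `classification_report` produce. The sketch below shows one way the feature extraction, tuning, and evaluation stages could fit together. Everything specific in it is an assumption for illustration: the toy snippets, the `rename_loop_var` disguise, the parameter grid, the tokenizer pattern, and the choice to represent a pair as the absolute difference of its two TF-IDF vectors. It will not reproduce the numbers reported above.

```python
import re
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Toy snippets standing in for the preprocessed C++ files (hypothetical data).
snippets = [
    "int sum = 0; for (int i = 0; i < n; ++i) sum += a[i]; return sum;",
    "int best = a[0]; for (int i = 1; i < n; ++i) if (a[i] > best) best = a[i];",
    "while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] < key) lo = mid + 1; else hi = mid; }",
    "std::string out; for (char c : s) out.insert(out.begin(), c); return out;",
    "if (n <= 1) return 1; return n * factorial(n - 1);",
    "std::sort(v.begin(), v.end()); v.erase(std::unique(v.begin(), v.end()), v.end());",
]


def rename_loop_var(code):
    """Crude disguise for a copied file: rename the identifier i to idx."""
    return re.sub(r"\bi\b", "idx", code)


# Label 1: a snippet paired with its disguised copy. Label 0: two different snippets.
pairs = []
for s in snippets:
    copy = rename_loop_var(s)
    pairs += [(s, copy, 1), (copy, s, 1)]
pairs += [(a, b, 0) for a, b in combinations(snippets, 2)]

pairs_train, pairs_test = train_test_split(
    pairs, test_size=0.25, stratify=[p[2] for p in pairs], random_state=0
)

# Fit TF-IDF on training code only; the pattern keeps identifiers and operators.
vectorizer = TfidfVectorizer(token_pattern=r"\w+|[^\w\s]")
vectorizer.fit([code for left, right, _ in pairs_train for code in (left, right)])


def featurize(pair_list):
    """Represent each pair as |tfidf(left) - tfidf(right)| (an assumed encoding)."""
    left = vectorizer.transform([p[0] for p in pair_list])
    right = vectorizer.transform([p[1] for p in pair_list])
    X = np.abs((left - right).toarray())
    y = np.array([p[2] for p in pair_list])
    return X, y


X_train, y_train = featurize(pairs_train)
X_test, y_test = featurize(pairs_test)

# Hypothetical grid; it merely contains the best values reported above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

scores = cross_val_score(search.best_estimator_, X_train, y_train, cv=3)
print("Best parameters:", search.best_params_)
print("Cross-validation scores:", scores, "mean:", scores.mean())
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```

With three folds, a spread like the one reported above (0.671 to 0.953) suggests the per-fold estimate is sensitive to which pairs land in each fold, so the mean is best read alongside the individual scores.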
To use this project, follow these steps:
- Clone the repository:
git clone https://github.com/yourusername/plagiarism-detection.git
- Navigate to the project directory:
cd plagiarism-detection
- Open the Jupyter notebook:
jupyter notebook notebook.ipynb
The notebook requires the following:
- Python 3.6+
- scikit-learn
- imbalanced-learn
- NumPy
- pandas
- Jupyter Notebook