This project classifies text as either human-generated or AI-generated, using a range of natural language processing (NLP) features and machine learning algorithms.
The following features are extracted from the provided dataset:
- Basic NLP Features:
  - Character count, word count, word density, punctuation count, title-word count, upper-case word count, noun count, adverb count, verb count, adjective count, pronoun count.
- Term Frequencies and N-grams:
  - Count vectorizer with 35,742 features.
  - Word bigrams (5,000 features).
  - Word trigrams (5,000 features).
  - Character bigrams and trigrams (5,000 features).
- Topic Modeling:
  - NeuralLDA with 20 topics.
- Others:
  - Readability score, Named Entity Recognition (NER) count, text error length, and lexical diversity.
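As a rough illustration of the simpler surface features listed above, the sketch below computes a few of them with the Python standard library only. The function name `basic_features` is hypothetical, and the definitions of word density (characters per word) and lexical diversity (type-token ratio) are assumptions, since the project's exact formulas are not stated here. Part-of-speech counts, NER counts, and readability scores need an external NLP library and are left out.

```python
import string


def basic_features(text: str) -> dict:
    """Compute a handful of surface features for one document (illustrative only)."""
    words = text.split()
    char_count = len(text)
    word_count = len(words)

    # Lower-cased tokens with surrounding punctuation stripped, used for diversity.
    tokens = [w.strip(string.punctuation).lower() for w in words]
    tokens = [t for t in tokens if t]

    return {
        "char_count": char_count,
        "word_count": word_count,
        # Assumption: word density = characters per whitespace-separated word.
        "word_density": char_count / word_count if word_count else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "title_word_count": sum(w.istitle() for w in words),
        "upper_case_count": sum(w.isupper() for w in words),
        # Assumption: lexical diversity = unique tokens / total tokens.
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }
```

Calling `basic_features("Hello World, this is GPT!")` returns one dictionary per document, which can then be stacked into a feature table alongside the n-gram and topic features.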
After feature extraction, Principal Component Analysis (PCA) with n_components set to 256 is applied to reduce the dimensionality of the combined feature set.
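The reduction step might look like the following minimal sketch, assuming scikit-learn's `PCA` is the implementation used. The feature matrix here is synthetic random data standing in for the real extracted features, and its shape is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the extracted feature matrix: 400 samples x 1000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1000))

# Project onto the first 256 principal components.
pca = PCA(n_components=256, random_state=0)
X_reduced = pca.fit_transform(X)  # shape: (400, 256)

# Fraction of total variance retained by the 256 components.
explained = float(pca.explained_variance_ratio_.sum())
```

PCA needs at least 256 samples and 256 input features to produce 256 components. If the combined feature matrix is sparse, as count-vectorizer output usually is, `TruncatedSVD` from the same scikit-learn module is a common alternative.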
Five algorithms are trained and evaluated:
- Random Forest
- Support Vector Machine (SVM)
- XGBoost
- Gradient Boosting
- Logistic Regression
Of the five algorithms tested, Gradient Boosting performed best, giving accurate classifications at prediction time.
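A comparison like this can be sketched as below, assuming scikit-learn estimators and a synthetic dataset from `make_classification` in place of the project's features. The hyperparameters and the accuracy metric are assumptions, and XGBoost is included only if the `xgboost` package is installed. Scores on synthetic data say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the PCA-reduced features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# XGBoost is an optional third-party dependency; skip it if unavailable.
try:
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(n_estimators=50, random_state=0)
except ImportError:
    pass

# Fit each model on the training split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

Sorting `scores` by value gives a quick ranking of the models on the held-out split.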
A simple Flask application demonstrates the AI-generated text detection model. Users enter text, and the application classifies it as either human-generated or AI-generated.
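A minimal sketch of such an application is shown below. The `/predict` route, the JSON request shape, and the `classify` helper are all hypothetical and are not taken from the repository. `classify` is a stub that returns a fixed label so the example runs on its own; a real app would run feature extraction and the trained model there.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(text: str) -> str:
    """Hypothetical stub: a real app would call the trained model here."""
    # Fixed label so the sketch runs end to end; NOT the project's classifier.
    return "human-generated"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"text": "..."} (an assumed request format).
    payload = request.get_json(silent=True)
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    if not isinstance(text, str) or not text.strip():
        return jsonify(error="No text provided"), 400
    return jsonify(label=classify(text))
```

Assuming the file is saved as `app.py`, it can be started with `flask --app app run` and queried by sending a POST request with a JSON body to `/predict`.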
To use the project:
- Clone the repository from GitHub.
- Install the required dependencies.
- Run the Flask application.
- Input text to classify whether it is human-generated or AI-generated.
- Habeeb Moosa - Project Lead & Developer
- Hanisha Musangi - Frontend Developer
This project is licensed under the MIT License.