- NumPy
- Pandas
- Matplotlib
- Scikit-Learn
- Feature Extraction : Count Vectorizer and TF-IDF Vectorizer
- Model Selection : Train-Test Split, Grid Search Cross-Validation and K-Fold Cross-Validation
- Ensembles : Random Forest Classifier and Gradient Boosting Classifier
- NLTK ( Corpus, Stopwords, Porter Stemmer, WordNet Lemmatizer )
- String ( Punctuation )
- RE : Regular Expression
- SMS ( Messages Extracted in form of TSV )
- Labels ( Spam or Ham )
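A minimal sketch of loading the labelled SMS data with Pandas. The inline sample below stands in for the real TSV file (whose filename is not given in these notes); in practice the same call would point `pd.read_csv` at the file on disk with `sep="\t"`.

```python
import io

import pandas as pd

# Inline sample standing in for the real TSV file of SMS messages.
# Each row: label (spam/ham), then the raw message text, tab-separated.
sample_tsv = (
    "ham\tGo until jurong point, crazy..\n"
    "spam\tFree entry in 2 a wkly comp\n"
    "ham\tOk lar... Joking wif u oni\n"
)

# No header row in the raw data, so column names are supplied here.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None,
                 names=["label", "message"])

print(df.shape)                                # (3, 2)
print(df["label"].value_counts().to_dict())    # class balance of the sample
```

Checking `value_counts()` early is useful here because spam datasets are typically imbalanced, which affects how the models below should be evaluated.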
- Remove Punctuation.
- Change to Lowercase.
- Tokenization : Splitting a Phrase | Sentence into List of Individual Words called Tokens.
- Remove Stopwords : Most Common Words ( Filtered out before processing Natural Language Data )
- Stemming ( Speed ) | Lemmatization ( Accuracy ) : Reduce the Word to its Base | Stem Form.
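The cleaning steps above can be sketched as one function. This is a sketch, not the project's exact code: the stopword list here is a tiny hardcoded stand-in (in practice NLTK's `stopwords.words("english")` corpus would be used after `nltk.download("stopwords")`), and stemming is shown with NLTK's `PorterStemmer`, the speed-oriented choice named above.

```python
import re
import string

from nltk.stem import PorterStemmer

# Tiny stand-in stopword list; the real pipeline would use NLTK's corpus.
STOPWORDS = {"a", "the", "is", "in", "to", "of", "and", "you", "for", "have"}

stemmer = PorterStemmer()

def clean_text(text):
    # 1. Remove punctuation.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Change to lowercase.
    text = text.lower()
    # 3. Tokenize: split the sentence into individual word tokens.
    tokens = re.split(r"\W+", text)
    # 4. Remove stopwords (and empty tokens), then 5. stem each token.
    return [stemmer.stem(tok) for tok in tokens if tok and tok not in STOPWORDS]

print(clean_text("Winner!! You have WON a free prize, claim it now."))
```

Swapping `PorterStemmer` for `WordNetLemmatizer().lemmatize` gives the accuracy-oriented variant at the cost of a dictionary lookup per token.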
- Vectorization : Convert Text to Numbers ( Feature Vectors ) that an ML Algorithm can Understand and Learn from.
- CountVectorizer : Extract Features from Text ( Count Occurrence of each Word in the Corpus and Consider each as a Feature | Column )
- Bag of Words : Count Occurrence of the Word in each Document and each Word becomes a Feature Represented by a Vector.
- TF-IDF : Represents Importance of the Word in the Document.
- Term Frequency : Number of Times the Term Appears in the Document.
- Inverse Document Frequency : Downweights Words that Appear in Many Documents ( Log of Total Documents over Documents Containing the Word ).
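A small sketch of both vectorizers on a toy corpus (the messages here are illustrative, not from the real dataset), showing that each learned word becomes one column and each document one row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "free prize claim now",
    "are you coming home now",
    "free free entry win prize",
]

# Bag of Words: each column is a word, each cell its raw count in a message.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(corpus)
print(sorted(count_vec.vocabulary_))   # the learned feature names
print(bow.toarray())                   # 3 documents x vocabulary-size counts

# TF-IDF: same shape, but counts are reweighted so that words appearing
# in many documents contribute less than rare, discriminative words.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.shape)                     # same (documents, vocabulary) shape
```

Note the word "free" gets a count of 2 in the third row of the Bag-of-Words matrix, while in the TF-IDF matrix its weight is tempered because it also appears in the first message.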
- Split the Data into Training Set and Test Set.
- Train Vectorizers on Training Set and Use that to Transform Test Set.
- Fit Best Random Forest Model and Best Gradient Boosting Model on Training Set and Predict on Test Set.
- Evaluate Results of these Two Models to Select Best Model.
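The four modelling steps above can be sketched end to end. This is a toy walkthrough, not the project's tuned code: the corpus and labels are invented, the data is repeated only so the cross-validation folds have enough samples, and the parameter grids are deliberately tiny. The key detail from the notes is preserved: the vectorizer is fit on the training set only and merely transforms the test set.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy corpus standing in for the cleaned SMS data.
messages = [
    "free prize winner claim now", "win cash entry free offer",
    "are you coming home tonight", "ok see you at lunch then",
    "urgent free ringtone claim code", "lets meet for coffee tomorrow",
    "congratulations you won a free trip", "can you call me later",
] * 5  # repeated so the cross-validation folds are large enough
labels = ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"] * 5

# 1. Split into training set and test set.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42)

# 2. Fit the vectorizer on the training set only, then transform the test
#    set with it, so no test-set vocabulary leaks into training.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 3. Grid-search each ensemble with K-fold cross-validation (illustrative
#    grids; a real search would try more parameters and values).
results = {}
for name, model, grid in [
    ("random_forest", RandomForestClassifier(random_state=42),
     {"n_estimators": [10, 50]}),
    ("gradient_boosting", GradientBoostingClassifier(random_state=42),
     {"n_estimators": [10, 50]}),
]:
    search = GridSearchCV(model, grid, cv=5)   # 5-fold cross-validation
    search.fit(X_train_vec, y_train)
    preds = search.predict(X_test_vec)
    results[name] = accuracy_score(y_test, preds)

# 4. Compare the two models' test accuracy to select the best one.
print(results)
```

Accuracy alone is a weak criterion on imbalanced spam data; in practice precision and recall on the spam class would also factor into the final model choice.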