- numpy
- pandas
- matplotlib
- sklearn
- re
- string
- nltk
Part-A consists of the following topics:
- Dimensionality Reduction Introduction
- Missing Value Ratio
- Low Variance Filter
- High Correlation Filter
- Random Forest
- Backward Feature Elimination
- Forward Feature Selection
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
Dataset Used is Fashion MNIST Training Dataset
Ans. In a dataset, the number of variables or features present is known as its dimensionality.
If data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features.
When the dataset is huge, there will be a lot of variables and features, and thus a higher dimensionality.
When we deal with high-dimensional data, predictive modeling becomes more challenging, the data is harder to visualize, and the output of a machine learning algorithm can suffer because there may be garbage, irrelevant, missing or repetitive data. This is also known as the "Curse of Dimensionality". It also increases the cost of storing the data and consumes a huge amount of resources.
Dimensionality reduction can be defined as the process or technique of removing the 'uninformative information' while retaining the informative data. In layman's terms, it is the process of removing input variables from the dataset such that their removal does not change the model's output by a large amount.
Ans. Dimensionality reduction is required to remove the 'uninformative information' from the data, which in turn does the following:
- It reduces the size of the dataset, since its dimensionality is reduced.
- Fewer dimensions lead to less computation and training time.
- Some algorithms do not perform well when the data has a large number of dimensions, so reducing the dimensions is needed for those algorithms to be useful.
- It is hard for us to visualize data of high dimensionality, so dimensionality reduction helps us visualize the data.
- After dimensionality reduction, the output of the machine learning model is generally more accurate than on the uncleaned high-dimensional dataset.
When a dataset with a huge number of dimensions is given to us, there is a chance that some values are missing.
Here the question arises: if the dataset has missing values for a variable, should we impute the variable or drop it?
To decide whether to impute or drop the variable, we calculate its 'Missing Value Ratio' and compare it against a threshold x (say 50%). If the proportion of missing values is below x we keep the variable and impute the missing values; otherwise we drop the variable.
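A minimal sketch of this check with pandas (the file name and the 50% threshold are only illustrative):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file name

# Percentage of missing values in each column
missing_ratio = df.isnull().sum() / len(df) * 100

threshold = 50  # say 50%
# Keep (and later impute) columns whose missing ratio is below the threshold; drop the rest
keep_cols = missing_ratio[missing_ratio < threshold].index
df_reduced = df[keep_cols]
```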
Say some variable in our dataset has the same value (say z) in every row; if we include that variable, would our model improve? Maybe...
If the variance of that variable is 0, it would not be very helpful for us. So, we remove variables with zero or very low variance.
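A small sketch of the low variance filter with pandas (the threshold is only an example; since variance is scale-dependent, the data is usually normalized first):

```python
import pandas as pd

df = pd.read_csv("train.csv")            # hypothetical file name
numeric = df.select_dtypes(include="number")

variances = numeric.var()                # variance of each numeric column

threshold = 0.01                         # example threshold
low_variance_cols = variances[variances < threshold].index
df_reduced = df.drop(columns=low_variance_cols)
```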
High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically. We can calculate the correlation between independent numerical variables.
If the correlation coefficient crosses some threshold value (say 0.8), we drop one of the two variables. Since dropping a variable is never ideal, we choose which one to drop by taking the domain into consideration.
df.corr() gives us the pairwise correlation of the variables.
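A rough sketch of the high correlation filter (the file name and the 0.8 threshold are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")                        # hypothetical file name
corr = df.select_dtypes(include="number").corr().abs()

# Upper triangle only, so each pair of variables is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
```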
Random forest is an algorithm that can be used for feature selection, helping us pick a smaller subset of features.
We can implement it by importing RandomForestRegressor from sklearn.ensemble. Since RandomForestRegressor only takes numerical values, we first convert the data into numeric form.
After fitting the RandomForestRegressor, we visualize the importance of each variable by plotting it: model.feature_importances_ gives us the importance of each feature, and we plot these importances.
We can then use SelectFromModel from sklearn.feature_selection, which selects features based on their importance weights.
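A condensed sketch of this pipeline (the file name and "target" column are placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

df = pd.read_csv("train.csv")                        # hypothetical file name
X = pd.get_dummies(df.drop(columns="target"))        # convert features to numeric form
y = df["target"]

model = RandomForestRegressor(random_state=1)
model.fit(X, y)

# Importance of each feature, e.g. for a bar plot
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep only features whose importance is above the threshold (mean importance by default)
selector = SelectFromModel(model)
X_reduced = selector.fit_transform(X, y)
```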
Backward Feature Elimination is a method to reduce the dimensionality by following some simple steps:
Step - 1: First we take all n variables in the training set, train the model on them, and measure its performance.
Step - 2: We then remove one variable at a time (leaving n-1 variables), retrain, and compare the performance with the previous one.
Step - 3: We permanently drop the variable whose removal improves the performance the most.
We continue removing variables one by one as long as this improves our criterion.
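A compact sketch using scikit-learn's SequentialFeatureSelector (available in sklearn >= 0.24) with direction="backward"; the estimator, example dataset, and number of features to keep are only assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)      # example data, not the notebook's dataset

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,   # example value
                                direction="backward")     # backward elimination
X_reduced = sfs.fit_transform(X, y)
```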
Forward Feature Selection is a method to reduce the dimensionality by following some simple steps:
Step - 1: First we start with an empty set of features (containing 0 elements).
Step - 2: We add one feature to this set and measure the model's performance.
Step - 3: We then try adding each of the remaining variables one at a time (giving i+1 variables) and compare the performance with the previous one.
Step - 4: We keep the variable whose addition improves the performance the most.
We continue adding variables one by one as long as this improves our criterion.
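The same SequentialFeatureSelector sketch works for forward selection by changing the direction (again, the estimator, example data, and feature count are only examples):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)      # example data

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,   # example value
                                direction="forward")      # forward selection
X_reduced = sfs.fit_transform(X, y)
```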
SVD or Singular Value Decomposition is a dimensionality reduction technique which decomposes the matrix formed by the variables and observations into three smaller matrices.
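A small sketch with scikit-learn's TruncatedSVD (the data matrix and number of components are arbitrary choices):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(100, 50)                # placeholder data matrix

svd = TruncatedSVD(n_components=3)         # keep 3 components
X_reduced = svd.fit_transform(X)           # shape (100, 3)
print(svd.explained_variance_ratio_)
```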
PCA or Principal Component Analysis is an unsupervised dimensionality reduction technique in which a principal component is a linear combination of the original variables. These principal components are extracted in such a way that the first principal component explains the maximum variance in the dataset. The second principal component tries to explain the remaining variance in the dataset and is uncorrelated with the first principal component. The third principal component tries to explain the variance which is not explained by the first two principal components, and so on.
It can easily be done by importing PCA from sklearn.decomposition and applying pca.fit() on the dataframe.
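A minimal sketch (the data matrix and the number of components are only examples):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)                # placeholder data matrix

pca = PCA(n_components=3)
pca.fit(X)                                 # learn the principal components
X_reduced = pca.transform(X)               # project onto the 3 components
print(pca.explained_variance_ratio_)
```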
LDA or Linear Discriminant Analysis is very similar to Principal Component Analysis, but in addition to finding the component axes that maximize the variance of our data (PCA), it also finds the axes that maximize the separation between multiple classes.
We can easily apply LDA by importing LinearDiscriminantAnalysis from the library sklearn.discriminant_analysis and applying sklearn_lda.fit_transform() over the dataframe.
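A minimal sketch (LDA is supervised, so it also needs the class labels; the example dataset and variable names are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)          # example labelled data

sklearn_lda = LinearDiscriminantAnalysis(n_components=2)   # at most (n_classes - 1) components
X_lda = sklearn_lda.fit_transform(X, y)
```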
Part-B consists of the following topics:
- Manual tokenization
- Tokenization and Cleaning with NLTK
- Word Counts with CountVectorizer
- TF(Term Frequency) - IDF(Inverse Document Frequency)
- Word Frequencies with TfidfVectorizer
- The Bag-of-Words Model
As we know, machine learning algorithms need numbers to work with. Raw text is messy, which is why machines cannot work directly on text. So we convert or process the text into a format that a machine can understand. This process of conversion and cleaning is called feature extraction on text.
We load the data using the open() function, passing the file's path and the mode we need to open it with. Here we use 'r+', which is read and write.
At last we close() the file to save the changes to it.
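A small sketch of this step (the file name is a placeholder):

```python
# Open the file for reading and writing, read its contents, then close it
file = open("sample.txt", "r+")            # hypothetical file name
text = file.read()
file.close()
```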
A text file can be very messy, without a proper format. So to bring uniformity we split the text on whitespace, which separates the words at spaces, newlines, tabs, etc.
We can do this in Python with the split() function on the loaded string.
An alternative approach is to use the regex module (re) and split the document into words by selecting strings of alphanumeric characters. First we import re; then we use the function re.split() to split the data.
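A sketch of both approaches (the sample string is made up):

```python
import re

text = "The quick brown fox\njumped over\tthe lazy dog."   # made-up sample

words = text.split()                       # split on spaces, newlines, tabs
words_re = re.split(r"\W+", text)          # alternative: split on non-word characters
```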
We want words without punctuation (like ',', '.', '$', etc.), so we import punctuation from string and use a regex for character filtering. The sub() function replaces the punctuation with nothing.
We can normalize the case by calling the lower() function on each word.
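A sketch of this filtering (the word list is made up):

```python
import re
import string

words = ["The", "quick", "brown", "fox,", "jumped", "over", "the", "lazy", "dog."]

# Regex that matches any punctuation character
re_punc = re.compile("[%s]" % re.escape(string.punctuation))

stripped = [re_punc.sub("", w) for w in words]   # remove punctuation from each word
lowered = [w.lower() for w in stripped]          # normalize the case
```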
The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text. We import the nltk library and download 'punkt' (the pretrained tokenizer models) and 'stopwords' (the list of stop words).
NLTK provides the sent_tokenize() function to split text into sentences; it will split paragraphs into sentences.
NLTK also provides a function called word_tokenize() for splitting strings into tokens, and we normalize the case of every word.
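A compact sketch of both tokenizers (the sample text is made up):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")                     # sentence/word tokenizer models
nltk.download("stopwords")                 # stop word lists

text = "This movie is very scary. It is also quite long."   # made-up sample

sentences = sent_tokenize(text)            # split into sentences
tokens = [t.lower() for t in word_tokenize(text)]            # word tokens, lower-cased
```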
Punctuation marks are the characters that mark the end of a sentence or separate its parts. The most common ones are: '.', ',', ':', ';', '"', '-', etc.
Stop words are those words that do not contribute to the deeper meaning of the phrase. They are the most common words, such as: the, a, is, etc.
Words like 'Finding', 'found', 'finds', etc. all reduce to the stem 'find'.
We can easily reduce words to their stems using NLTK via the PorterStemmer class.
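A short sketch of stop word filtering and stemming (the token list is made up; it assumes nltk.download("stopwords") has already been run):

```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

tokens = ["this", "movie", "is", "finding", "its", "audience"]   # made-up token list

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]   # drop stop words

porter = PorterStemmer()
stemmed = [porter.stem(t) for t in tokens]            # e.g. 'finding' -> 'find'
```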
The CountVectorizer provides a simple way to tokenize a collection of text documents and build a vocabulary of known words, and also to encode new documents using that vocabulary.
You can use it as follows:
- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector.
An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document. Because these vectors will contain a lot of zeros, we call them sparse.
Python provides an efficient way of handling sparse vectors in the scipy.sparse package.
The vectors returned from a call to transform() will be sparse vectors, and you can transform them back to NumPy arrays to inspect and better understand what is going on by calling the toarray() function.
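A minimal end-to-end sketch (the example document is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog."]   # made-up document

vectorizer = CountVectorizer()
vectorizer.fit(docs)                       # learn the vocabulary
vector = vectorizer.transform(docs)        # encode as a sparse count vector

print(vectorizer.vocabulary_)              # word -> column index
print(vector.toarray())                    # dense view of the counts
```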
TF or Term Frequency can be defined as a measure of how frequently a term occurs in a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF or Inverse Document Frequency can be defined as a measure of how important a term is.
IDF(t) = ln(Total number of documents / Number of documents with term t in it)
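For example (with made-up counts): if a document contains 100 terms and the term 'scary' appears 3 times, then TF('scary') = 3 / 100 = 0.03. If there are 10 documents and 'scary' appears in 2 of them, then IDF('scary') = ln(10 / 2) ≈ 1.61, so the TF-IDF score is 0.03 × 1.61 ≈ 0.048.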
Words like 'The' will appear many times and their large counts will not be very meaningful in the encoded vectors.
An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an abbreviation that stands for Term Frequency - Inverse Document Frequency, which are the components of the resulting scores assigned to each word.
- Term Frequency - This summarizes how often a given word appears within a document.
- Inverse Document Frequency - This downscales words that appear a lot across documents.
The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.
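A minimal sketch (the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The dog.",
        "The fox.",
        "The quick brown fox jumped over the lazy dog."]    # made-up documents

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)                       # learn vocabulary and IDF weights
vector = vectorizer.transform([docs[0]])

print(vectorizer.idf_)                     # per-term IDF weights
print(vector.toarray())                    # TF-IDF scores for the first document
```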
The Bag of Words (BoW) model is the simplest form of text representation in numbers. As the name suggests, we represent a sentence as a bag-of-words vector (a string of numbers).
Let's take three example movie reviews:
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
We will first build a vocabulary from all the unique words in the above three reviews.
The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
We can now take each of these words and mark the number of times it occurs in the three movie reviews above. This gives us 3 vectors for the 3 reviews:
Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]
Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
In the above example, each vector has a length of 11. However, we start facing issues when we come across new sentences:
- If the new sentences contain new words, then our vocabulary size would increase and thereby, the length of the vectors would increase too.
- Additionally, the vectors would also contain many 0s, thereby resulting in a sparse matrix (which is what we would like to avoid)
- We are retaining no information on the grammar of the sentences nor on the ordering of the words in the text.
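As a rough cross-check, CountVectorizer can build the same kind of count vectors for these three reviews; note that it lowercases the text and orders the vocabulary alphabetically, so the columns will not be in the same order as the manual vocabulary above:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This movie is very scary and long",
           "This movie is not scary and is slow",
           "This movie is spooky and good"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # alphabetically ordered vocabulary
print(bow.toarray())                       # one count vector per review
```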