Machine-Learning-Using-Python

This repository contains machine learning programs in the Python programming language.


About Python Programming

  • Python is a high-level, general-purpose, and very popular programming language.
  • Python (currently Python 3) is used in web development, machine learning applications, and many other cutting-edge areas of the software industry.
  • Python is available across widely used platforms like Windows, Linux, and macOS.
  • One of Python's biggest strengths is its large standard library.

Mode of Execution Used: PyCharm

PyCharm

  • Visit the official PyCharm website: PyCharm
  • Download the installer for your platform: Linux, macOS, or Windows.
  • Two versions of PyCharm are available:

  1. Community version
  • The Community version is open source and can be used for free, without a paid plan.
  • It can be downloaded from the bottom of the PyCharm website.
  • After downloading it, follow the setup wizard to complete the installation.

  2. Professional version
  • This is available at the top of the website and can be downloaded directly from there.
  • After downloading it, follow the setup wizard and sign up for the free trial, or continue with the paid version.

Using PyCharm

  • PyCharm uses the concept of a virtual environment, in which we can install all the required libraries or frameworks.
  • Each project has its own virtual environment, so that requirements such as libraries or frameworks are installed for that project only.
  • Next, we can create a new file. PyCharm supports various file types, such as script files, text files, and Jupyter Notebooks.
  • After selecting the required file type, we can run the file by saving it and pressing Shift+F10 (on Windows).
  • Output is shown in the console, while installation happens in the terminal in PyCharm.

Machine learning 🤖 🛠🧠

  • Machine learning is a method of data analysis that automates analytical model building.
  • It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
  • A machine learning algorithm is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Steps of Machine learning


Types of Machine Learning


  • Supervised learning is when we teach or train the machine using data that is well labeled.
  • This means the data is already tagged with the correct answer.
  • The machine is then given a new set of examples, and the supervised learning algorithm uses what it learned from the labeled training data to produce the correct outcome for them.

  • K-Nearest Neighbours (KNN) is one of the most basic yet essential classification algorithms in machine learning.
  • It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining, and intrusion detection.
  • The algorithm assigns a category to a new data point based on the categories of its nearest neighbors.
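The neighbor-based idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in this repository: the function name `knn_predict`, the toy points, and the choice of Euclidean distance with k=3 are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours and take a majority vote over their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two small groups of 2-D points (made up for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.1, 5.0)))
```

In practice a library class such as scikit-learn's `KNeighborsClassifier` would normally be used instead of a hand-written loop.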

  • The main idea behind Support Vector Machines (SVMs) is to find a hyperplane that maximally separates the different classes in the training data.
  • This is done by finding the hyperplane that has the largest margin, which is defined as the distance between the hyperplane and the closest data points from each class.
  • Once the hyperplane is determined, new data can be classified by determining on which side of the hyperplane it falls.
  • SVMs are particularly useful when the data has many features, and/or when there is a clear margin of separation in the data.
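The sketch below illustrates only the last two steps described above: measuring the margin of a given hyperplane and classifying a new point by the side it falls on. It does not train an SVM; finding the maximum-margin hyperplane is an optimization problem usually left to a library. The hyperplane `w`, `b` and the toy points are hypothetical values chosen for the example.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(w, b, x):
    """Label a point by the side of the hyperplane it falls on."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


def margin(w, b, points):
    """Distance from the hyperplane to the closest of the given points."""
    return min(abs(signed_distance(w, b, p)) for p in points)


# Hypothetical separating hyperplane x1 + x2 - 6 = 0 and toy training points.
w, b = (1.0, 1.0), -6.0
training_points = [(1.0, 1.0), (2.0, 1.0), (5.0, 5.0), (4.0, 5.0)]

print(classify(w, b, (1.5, 1.5)))
print(classify(w, b, (6.0, 4.0)))
print(round(margin(w, b, training_points), 3))
```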

  • Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
  • It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
  • The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.
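As a rough sketch of the independence assumption, the example below scores each class as its prior probability multiplied by the per-feature likelihoods. It handles categorical features only and applies no smoothing, so an unseen feature value drives a class score to zero. The function names and the toy weather data are made up for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, label)][value] = number of occurrences
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(index, label)][value] += 1
    return class_counts, feature_counts


def predict(class_counts, feature_counts, row):
    """Pick the class maximising P(class) * product of P(feature | class)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for index, value in enumerate(row):
            # Independence assumption: multiply the per-feature likelihoods.
            score *= feature_counts[(index, label)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy categorical data (outlook, temperature) -> play, made up for illustration.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
print(predict(class_counts, feature_counts, ("rainy", "mild")))
print(predict(class_counts, feature_counts, ("sunny", "hot")))
```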

  • A decision tree builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
  • The tree is constructed by recursively splitting the training data into subsets based on the values of the attributes until a stopping criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split a node.
  • At each split, the goal is to find the attribute that maximizes the information gain or the reduction in impurity after the split.
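The split criterion above can be made concrete with entropy-based information gain. The sketch below computes the gain for each attribute of a tiny made-up dataset and picks the best one; it covers a single split only, not the recursive construction of a whole tree, and the helper names are assumptions for the example.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute_index):
    """Entropy before the split minus the weighted entropy after it."""
    # Group the labels by the value each row has for the chosen attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    weighted = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted


def best_attribute(rows, labels):
    """Index of the attribute with the highest information gain."""
    return max(
        range(len(rows[0])), key=lambda i: information_gain(rows, labels, i)
    )


# Toy data: (outlook, wind) -> play, made up for illustration.
rows = [("sunny", "weak"), ("sunny", "strong"), ("rainy", "weak"), ("rainy", "strong")]
labels = ["no", "no", "yes", "yes"]

print(information_gain(rows, labels, 0))
print(information_gain(rows, labels, 1))
print(best_attribute(rows, labels))
```

Here the first attribute separates the classes perfectly, so it has the higher gain and would be chosen for the root split.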

  • Random forest is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.
  • Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.
  • A larger number of trees generally improves accuracy and reduces the risk of overfitting, although the gains level off as more trees are added.
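The majority-vote step can be sketched as follows. The three "trees" here are stand-ins written as one-rule functions with made-up thresholds; a real random forest trains each tree on a bootstrap sample of the data with a random subset of features, which this example does not attempt.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Collect one prediction per tree and return the majority class."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-ins for trained decision trees: each is a hypothetical one-rule
# classifier that looks at a different feature of the sample.
trees = [
    lambda s: "good" if s["alcohol"] > 10.5 else "bad",
    lambda s: "good" if s["sulphates"] > 0.6 else "bad",
    lambda s: "good" if s["volatile_acidity"] < 0.5 else "bad",
]

# Hypothetical sample; two of the three rules vote "good".
sample = {"alcohol": 11.0, "sulphates": 0.5, "volatile_acidity": 0.4}
print(forest_predict(trees, sample))
```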

  • Regression predicts a continuous output variable based on independent input variables, such as predicting house prices from parameters like house age, distance from the main road, location, and area.
  • Linear regression computes the linear relationship between a dependent variable and one or more independent features.
  • The goal of the algorithm is to find the best linear equation that can predict the value of the dependent variable based on the independent variables.
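For a single feature, the best-fitting line has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies this to made-up experience and salary values; the function name and the numbers are assumptions for illustration, not the Salary dataset itself.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data following salary = 30 + 5 * years exactly.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]

slope, intercept = fit_simple_linear_regression(years, salary)
print(slope, intercept)
print(slope * 6.0 + intercept)  # prediction for 6 years of experience
```

With more than one feature, the same idea generalizes to solving a system of equations, which is what a library class such as scikit-learn's `LinearRegression` handles.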

Types of Linear Regression

1. Univariate/Simple Linear regression

  • When there is only one independent feature, it is known as univariate linear regression.

2. Multivariate/Multiple Linear regression

  • When there is more than one feature, it is known as multivariate linear regression.

  • Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.
  • It is a statistical algorithm that analyzes the relationship between a set of independent variables and a binary dependent variable.
  • It is a powerful tool for decision-making.
  • For example, it can decide whether an email is spam or not.
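The probability prediction described above comes from passing a weighted sum of the features through the sigmoid function and then applying a threshold. The sketch below shows only this prediction step; the weights, bias, and the two spam features (number of links, number of suspicious words) are hypothetical values, not parameters learned from data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability that the instance belongs to the positive class."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 (e.g. spam) if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0


# Hypothetical model: features are (number of links, number of suspicious words).
weights, bias = (0.8, 1.2), -3.0

print(round(predict_probability(weights, bias, (0, 1)), 3))
print(predict_class(weights, bias, (0, 1)))
print(predict_class(weights, bias, (3, 2)))
```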

Types of Logistic Regression

1. Binomial Logistic regression

  • In binomial logistic regression, the dependent variable can have only two possible values, such as 0 or 1, Pass or Fail, etc.

2. Multinomial Logistic regression

  • In multinomial logistic regression, the dependent variable can have 3 or more possible unordered values, such as “cat”, “dog”, or “sheep”.

3. Ordinal Logistic regression

  • In ordinal logistic regression, the dependent variable can have 3 or more possible ordered values, such as “low”, “medium”, or “high”.
  • Unsupervised learning is the training of a machine using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance.
  • Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
  • Unsupervised learning models are used for three main tasks: association, clustering, and dimensionality reduction.
  • Association rule learning is a rule-based method for finding relationships between variables in a given dataset.
  • These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products.
  • Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines.
  • Examples of this can be seen in Amazon’s “Customers Who Bought This Item Also Bought” or Spotify’s "Discover Weekly" playlist.

Types of Association Rules

  • Apriori is an algorithm for frequent item set mining and association rule learning over relational databases.
  • It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
  • The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database.
  • This has applications in domains such as market basket analysis.
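The level-by-level procedure described for Apriori can be sketched as below: count candidate item sets, keep the frequent ones, and join them into larger candidates whose subsets are all frequent. This is a minimal illustration that stops at frequent item sets and does not derive association rules; the function name, the baskets, and the support threshold are made up for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate item set.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Join frequent sets into candidates one item larger, keeping only
        # those whose every subset of the current size is itself frequent.
        size += 1
        joined = {a | b for a in current for b in current if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
    return frequent


# Toy shopping baskets, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```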
  • Clustering is a data mining technique which groups unlabeled data based on their similarities or differences.
  • Clustering algorithms are used to process raw, unclassified data objects into groups represented by structures or patterns in the information.
  • Clustering algorithms can be categorized into a few types, specifically exclusive, overlapping, hierarchical, and probabilistic.

Types of Clustering

1. K Means Clustering

  • K-Means Clustering is an unsupervised machine learning algorithm.
  • Its objective is to group data points into K clusters to minimize the variance within each cluster.
  • The process involves iteratively assigning data points to the nearest cluster centroid and updating the centroids until convergence.
  • K-Means is commonly applied in various domains such as customer segmentation, image compression, and anomaly detection.
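The assign-and-update loop described above can be sketched in plain Python. To keep the run reproducible, this example starts from fixed initial centroids instead of random ones, which is an assumption of the sketch; the function name and the toy points are also made up.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid changed
        centroids = new_centroids
    return centroids, clusters


# Toy 2-D points forming two visible groups, made up for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(1.0, 1.0), (9.0, 9.0)])

print(centroids)
print(clusters)
```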

iii) Dimensionality Reduction

  • Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high.
  • It reduces the number of data inputs to a manageable size while also preserving the integrity of the dataset as much as possible.
  • It is commonly used in the data preprocessing stage.

Types of Dimensionality Reduction

1. Principal component analysis

  • Principal component analysis (PCA) is a type of dimensionality reduction algorithm which is used to reduce redundancies and to compress datasets through feature extraction.
  • This method uses a linear transformation to create a new data representation, yielding a set of "principal components."
  • The first principal component is the direction which maximizes the variance of the dataset.
  • The second principal component captures the maximum remaining variance in the data while being completely uncorrelated with the first principal component, yielding a direction that is perpendicular, or orthogonal, to the first component.
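For data with only two features, the principal directions can be found without a linear algebra library, because the direction of maximum variance of a 2x2 covariance matrix has a closed-form angle. The sketch below is restricted to that two-feature case and uses made-up points; for real datasets a library class such as scikit-learn's `PCA` would normally be used.

```python
import math


def principal_components_2d(points):
    """Return the first and second principal directions of 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Angle of the maximum-variance direction for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    first = (math.cos(theta), math.sin(theta))
    # The second component is the first one rotated by 90 degrees.
    second = (-math.sin(theta), math.cos(theta))
    return first, second


# Toy points scattered roughly along the line y = x, made up for illustration.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
first, second = principal_components_2d(points)

print(first)
print(second)
print(first[0] * second[0] + first[1] * second[1])  # dot product of the two directions
```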

Dataset Used

Iris Dataset

  • The Iris dataset is part of the sklearn library.
  • Sklearn comes loaded with datasets for practicing machine learning techniques, and Iris is one of them.
  • Iris has 4 numerical features and a three-class target variable.
  • This dataset can be used for classification as well as clustering.
  • The 4 features are sepal length, sepal width, petal length, and petal width, and the target variable has 3 classes: ‘setosa’, ‘versicolor’, and ‘virginica’.
  • The objective for a multiclass classifier is to predict the target class given the values of the four features.
  • The dataset is already clean; no preprocessing is required.
  • K-Nearest Neighbor and Support Vector Machine are implemented on this dataset.
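As a rough sketch of how this dataset is typically used with scikit-learn, the example below loads Iris, holds out a test split, and fits a K-Nearest Neighbor classifier. The split size, `random_state`, and `n_neighbors=3` are assumptions chosen for illustration and may differ from the settings used in this repository's programs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled dataset: X holds the 4 features, y the class index.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the samples to evaluate the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(X.shape, list(iris.target_names))
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `KNeighborsClassifier` for `sklearn.svm.SVC` gives the Support Vector Machine variant with the same fit and score calls.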

Breast Cancer Dataset

  • The breast cancer dataset is a classification dataset that contains 569 samples of malignant and benign tumor cells.
  • The samples are described by 30 features such as mean radius, texture, perimeter, area, smoothness, etc.
  • The target variable has 2 classes: ‘benign’ and ‘malignant’.
  • The objective for a binary classifier is to predict the target class given the values of the features.
  • The dataset is already clean; no preprocessing is required.
  • K-Nearest Neighbor and Support Vector Machine are implemented on this dataset.

Wine Dataset

  • The wine dataset is a classic and very easy multi-class classification dataset that is available in the sklearn library.
  • It contains 178 samples of wine with 13 features and 3 classes.
  • The goal is to predict the class of wine based on the features.
  • The dataset is already clean; no preprocessing is required.
  • K-Nearest Neighbor and Support Vector Machine are implemented on this dataset.

Naive bayes classification data

  • Dataset is taken from:
  • Contains diabetes data for classification.
  • The dataset has 3 columns (glucose, blood pressure, and diabetes) and 995 entries.
  • The glucose and blood pressure columns are used to classify whether the patient has diabetes or not.
  • The dataset is already clean; no preprocessing is required.
  • A Naive Bayes classifier is implemented on this dataset.

Red wine Quality Dataset

  • Dataset is taken from: Red wine Quality Dataset

  • Input variables (based on physicochemical tests):

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

  • Output variable (based on sensory data):

  12. quality (score between 0 and 10)

  • The dataset is already clean; no preprocessing is required.
  • Decision Tree and Random Forest are implemented on this dataset.

Cars Evaluation Dataset

  • Dataset is taken from: Cars Evaluation Dataset
  • Contains information about cars, described by the following attributes and their values:

  1. buying: v-high, high, med, low
  2. maint: v-high, high, med, low
  3. doors: 2, 3, 4, 5-more
  4. persons: 2, 4, more
  5. lug_boot: small, med, big
  6. safety: low, med, high

  • Target categories are:

  1. unacc: 1210 (70.023 %)
  2. acc: 384 (22.222 %)
  3. good: 69 (3.993 %)
  4. v-good: 65 (3.762 %)

  • Contains values in string format.
  • The dataset is not clean; preprocessing is required.
  • Random Forest is implemented on this dataset.

Census/Adult Dataset

  • Dataset is taken from: Census/Adult Dataset

  • Contains population data across various parameters like employment, marital status, gender, ethnicity, etc.

  • The model needs to predict whether income is greater than 50K or not.

  • Contains values in string format.

  • The dataset is not clean; preprocessing is required.

  • Naive bayes classifier is implemented on this dataset.

Salary Dataset

  • Dataset is taken from: Salary Dataset

  • Contains Salary data for Regression.

  • The dataset has 2 columns (Years of Experience and Salary) and 30 entries.

  • The Years of Experience column is used to fit a regression for Salary.

  • The dataset is already clean; no preprocessing is required.

  • Linear Regression is implemented on this dataset.

USA Housing Dataset

  • Dataset is taken from: Housing Dataset

  • Contains Housing data for Regression.

  • This dataset has multiple columns, such as Area Population and Address, along with Price (the output), and 5000 entries.

  • The rest of the columns are used to fit a regression for Price.

  • The dataset is already clean; no preprocessing is required.

  • Linear Regression and Principal Component Analysis are implemented on this dataset.

Credit Card Fraud Dataset

  • Dataset is taken from: Credit Card Fraud Dataset
  • Contains Fraud data for Classification.
  • The dataset has 31 columns.
  • The dataset is already clean; no preprocessing is required.
  • Logistic Regression is implemented on this dataset.

Market Basket Optimization Dataset

  • Dataset is taken from: Market Basket Optimization Dataset
  • Contains various product data for Apriori or association algorithm.
  • The dataset has 20 columns of data about various products.
  • The dataset is already clean; no preprocessing is required.
  • Apriori Algorithm is implemented on this dataset.

CIFAR-10 Dataset

  • CIFAR-10 is a dataset used in computer vision tasks.
  • It consists of 60,000 color images.
  • These images are divided into 10 different classes.
  • Each class contains 6,000 images.
  • The dataset is typically split into 50,000 training images and 10,000 test images.
  • Common classes in CIFAR-10 include airplanes, automobiles, birds, cats, dogs, and more.
  • The primary purpose of CIFAR-10 is for image classification and object recognition.
  • Researchers and developers often use it to benchmark and evaluate machine learning and deep learning algorithms.
  • A neural network is implemented on this dataset.

Mall Customers Dataset

  • Dataset is taken from: Mall Customers Dataset
  • Contains Mall Customers data for Clustering.
  • Gender, Age, Annual Income (k$) and Spending Score (1-100) columns are used to cluster data points.
  • The dataset is already clean; no preprocessing is required.
  • K Means Clustering is implemented on this dataset.

  • Deep learning is a subset of machine learning built on neural networks with three or more layers.
  • These neural networks attempt to simulate the behavior of the human brain, albeit far from matching its ability, allowing them to “learn” from large amounts of data.
  • While a neural network with a single layer can still make approximate predictions, additional hidden layers can help to optimize and refine for accuracy.
  • Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention.
  • Deep learning technology lies behind everyday products and services (such as digital assistants, voice-enabled TV remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).

Libraries Used 📚 💻

Short description of all libraries used.

To install a Python library, use this command:

pip install library_name
  • NumPy (Numerical Python) - Provides a collection of mathematical functions to operate on arrays and matrices.
  • Pandas (Panel Data/Python Data Analysis) - This library is mostly used for analyzing, cleaning, exploring, and manipulating data.
  • Matplotlib - It is a data visualization and graphical plotting library.
  • Scikit-learn - It is a machine learning library that provides tools for many machine learning tasks such as classification, prediction, etc.
  • Mlxtend (machine learning extensions) - It is a library of extension and helper modules for Python's data analysis and machine learning libraries.
  • TensorFlow (tf) - TensorFlow is an open-source machine learning framework developed by Google.
  • Keras - Keras is an open-source deep learning framework that serves as an interface for TensorFlow and other backends, making it easier to build and train neural networks.

Additional Resources 🧮📚📓🌐

  1. p2j - This Python library is used to convert Python script files to Jupyter notebooks. The syntax is
p2j python_script.py

Where python_script.py is the name of the script file.

  • After executing this command in the console or command prompt at the file's location, the Jupyter notebook is written to the same location.
  2. Flask - This Python framework is used to deploy machine learning models.

    If you want to see introductory code for Flask, visit my repository: https://github.com/madhurimarawat/Machine-Learning-Projects-In-Python

  3. Streamlit - This framework is used to create websites using Python without having to worry about the frontend.

    I deployed the ML models that I made in this repository using Streamlit:
    Visit the website: ML Algorithms on Inbuilt and Kaggle Datasets

    To see the code: https://github.com/madhurimarawat/ML-Model-Datasets-Using-Streamlits

    Also, if you want to see introductory code for Streamlit, visit my repository: https://github.com/madhurimarawat/Streamlit-Programs


Thanks for Visiting 😄

  • Drop a 🌟 if you find this repository useful.

  • If you have any doubts or suggestions, feel free to reach out to me.

    📫 How to reach me:   Linkedin Badge     Mail Illustration📫

  • Contribute and Discuss: Feel free to open issues 🐛, submit pull requests 🛠️, or start discussions 💬 to help improve this repository!