This is coursework from the Applied Data Science Master's program, which offers a comprehensive exploration of the major algorithms used in machine learning. The graduate-level machine learning course covers supervised and unsupervised learning techniques, regression methods, resampling methods (including cross-validation and the bootstrap), decision trees, dimensionality reduction, regularization, clustering, and kernel methods. The class also covers advanced topics such as hidden Markov models, graphical models, feedforward and recurrent neural networks, and deep learning. Below is a description of the projects:
Task Number | Description | Algorithm |
---|---|---|
1 | The assignment involves working with a biomedical dataset containing biomechanical attributes for a binary classification task. It includes pre-processing, exploratory data analysis, and implementing the k-nearest neighbors (KNN) algorithm with various distance metrics. The task also involves evaluating different distance metrics, exploring weighted decision-making, and analyzing training error rates. | KNN |
2 | The task involves predicting the net hourly electrical energy output of a Combined Cycle Power Plant. It includes exploratory data analysis, fitting linear regression models, analyzing nonlinear associations, and implementing KNN regression for comparison with the linear regression models. | Regression |
3 | The task involves applying logistic regression and penalized logistic regression for binary classification, as well as multinomial regression and Naive Bayes for multi-class classification. It includes exploring time-domain features, estimating standard deviations, and building models to classify human activities based on time series data. | Logistic Regression, Penalized Logistic Regression, Naive Bayes |
4 | The task involves implementing decision trees as interpretable models on the Acute Inflammations dataset and converting the resulting decision rules into a set of IF-THEN rules. Additionally, LASSO and boosting techniques are applied for regression on the Communities and Crime dataset. This part includes handling missing values, calculating the Coefficient of Variation for the features, fitting regression models such as ridge regression, LASSO, and PCR, and implementing an L1-penalized gradient boosting tree using XGBoost, with the regularization term determined through cross-validation. | Decision Trees, LASSO and Ridge Regression |
5 | The task involves applying tree-based methods, such as random forests and XGBoost, to the APS Failure dataset, addressing missing values, analyzing feature significance, and evaluating class imbalance. Additionally, SMOTE (Synthetic Minority Over-sampling Technique) is used to compensate for class imbalance, and the performance of the uncompensated case is compared with the SMOTE case using an appropriate cross-validation method. | Random Forests, SMOTE, XGBoost |
6 | The task involves multi-class and multi-label classification using Support Vector Machines (SVM) on the Anuran Calls (MFCCs) dataset, focusing on evaluating classifiers, training an SVM for each label, addressing class imbalance, and studying the Classifier Chain method. Additionally, it includes performing k-means clustering on the entire Anuran Calls dataset, determining the majority label for each cluster, and calculating the average Hamming distance, Hamming score, and Hamming loss between the true and assigned labels. | SVM |
Final Grade on all Tasks: 99%
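
As an illustration of the cluster evaluation described in Task 6, the following is a minimal sketch of how the average Hamming distance, loss, and score could be computed once every sample has a cluster assignment. It is not the code submitted for the assignment. The function names, the toy cluster assignments, and the placeholder labels (`F1`, `G1`, `S1`, ...) are all hypothetical, and the sketch assumes that the Hamming loss is the average distance divided by the number of labels and that the Hamming score is one minus the loss; other definitions of the score exist.

```python
from collections import Counter


def majority_labels(cluster_ids, labels):
    """Map each cluster id to the tuple of per-label majority values."""
    members = {}
    for cid, row in zip(cluster_ids, labels):
        members.setdefault(cid, []).append(row)
    return {
        cid: tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))
        for cid, rows in members.items()
    }


def hamming_metrics(cluster_ids, labels):
    """Return (average Hamming distance, Hamming loss, Hamming score)."""
    majority = majority_labels(cluster_ids, labels)
    n_labels = len(labels[0])
    # Hamming distance of a sample: number of labels that differ from
    # the majority labels of the cluster it was assigned to.
    distances = [
        sum(true != assigned for true, assigned in zip(row, majority[cid]))
        for cid, row in zip(cluster_ids, labels)
    ]
    avg_distance = sum(distances) / len(distances)
    loss = avg_distance / n_labels  # assumed definition: fraction of wrong labels
    score = 1.0 - loss              # assumed definition: fraction of correct labels
    return avg_distance, loss, score


# Hypothetical toy data: cluster assignments (e.g. from k-means) and
# three labels per sample standing in for (family, genus, species).
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = [
    ("F1", "G1", "S1"),
    ("F1", "G1", "S1"),
    ("F1", "G2", "S2"),
    ("F2", "G3", "S3"),
    ("F2", "G3", "S3"),
    ("F1", "G1", "S1"),
]

avg_distance, loss, score = hamming_metrics(cluster_ids, labels)
print(f"average Hamming distance: {avg_distance:.4f}")
print(f"Hamming loss: {loss:.4f}")
print(f"Hamming score: {score:.4f}")
```

In practice the cluster ids would come from a fitted k-means model and the label tuples from the dataset's label columns; the plain-Python version here is only meant to make the metric definitions explicit.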