Text-Clustering

Python and ML project to cluster the descriptions of Siena's courses.

Project Overview

The goal of this project is to cluster the descriptions of Siena's courses. The first step will be vectorizing the text appropriately.

Four different groupings will be created:

A clustering with three groups (corresponding to Siena's three schools)
A clustering with 33 groups (corresponding to departments)
A clustering with 57 groups (corresponding to course prefixes)
A clustering with the optimal number of groups, based on silhouette score

The first three of these have a fixed number of clusters, so I decided to use k-means, agglomerative clustering, and LDA to compare which is best by adjusting parameters to maximize the ARI when compared to the ground truth.

For the fourth case, I used the best parameter settings from above and tried out several different numbers of groups, ranging from 2 up to over 80 groups. I also tried DBSCAN with a few different parameter settings in order to see how many groups it suggested as well. My optimal grouping is the one with the highest silhouette score.

Data files:

descriptions.txt- The text of the course descriptions.
school_codes.txt- The school each ID is associated with.
dept_codes.txt- The departments each ID is associated with.
prefix_codes.txt- The course prefix each ID is associated with.
SCRTXT.xlsx- Process of generating the data files. Also, the "SchoolMap" tab in the Excel workbook shows the mapping from course prefixes to school/department/prefix codes.

Notes on Data Files

Reminder: The models are built only from the descriptions. The other three files are there so that I could compute an ARI score.
All of the files are tab-delimited text; each file has only two columns - the ID and the content described.
All of the IDs and labels have been randomized - there is no information to be mined from them.
Any label of 0 indicates that the course does not fit cleanly into one of the schools/departments (e.g. military studies courses for ROTC students).

Setup

To setup your computer, follow these steps:

Install Spyder: https://docs.spyder-ide.org/current/installation.html
If you need to install any of the packages I used, follow the instructions in this video: https://www.youtube.com/watch?v=i7Njb3xO4Fw
Clone this repo using the command:

git clone "https://github.com/lwoluke/Text-Clustering.git"

Blog post I made on the creation of this project: https://python.plainenglish.io/python-ml-text-clustering-siena-college-course-descriptions-b34b633ff123

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
data		data
results		results
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text-Clustering

Project Overview

Data files:

Notes on Data Files

Setup

Blog post I made on the creation of this project: https://python.plainenglish.io/python-ml-text-clustering-siena-college-course-descriptions-b34b633ff123

About

Releases

Packages

Languages

lwoluke/Text-Clustering

Folders and files

Latest commit

History

Repository files navigation

Text-Clustering

Project Overview

Data files:

Notes on Data Files

Setup

Blog post I made on the creation of this project: https://python.plainenglish.io/python-ml-text-clustering-siena-college-course-descriptions-b34b633ff123

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages