ThingiRec is a content-based recommendation system for thingiverse.com users. The web app can be accessed here.
ThingiRec uses item data from thingiverse.com to recommend other users to connect with and parts that a user may be interested in building. Thingiverse.com is a 3D printing hobbyist website where users share their 3D-printed creations. User recommendations are made by content-based filtering: cosine similarity is calculated between each of the user's parts and all other parts in the database. Once the most similar items are found, the users associated with them are recommended to the input user.
The goal in using content-based filtering is to connect users based on printing challenges they may have in common. For example, User A, who is interested in ornate iPhone cases, and User B, who is interested in automotive transmissions, may not connect based on their outwardly stated interests, but both are interested in functional gears. Content-based filtering may match them.
User A's iPhone Case | User B's Automotive Transmission |
---|---|
The project can be broken down into four steps, each detailed below:
- Data Collection
- Data Transformation
- Model Creation/Code Refactoring
- App Creation and Deployment
Item ids on thingiverse.com run up to ~1,500,000, representing ~1.5 million items that have been uploaded to the site. All potential item pages were inspected, and the scraping yielded ~500,000 records; the gap reflects the many items that have been deleted or hidden from the site since its inception. The item id, name, description, and associated username were scraped from each page using BeautifulSoup, requests, and pandas, and were stored in a PostgreSQL database using psycopg2.
The script used for scraping is /thingiscrape/item_scrape_thingiverse.py. In practice, the scraping was parallelized over 3 AWS instances to speed up collection.
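One simple way to divide the work across instances is to give each one a contiguous block of item ids. The sketch below is an assumption about how the id space could be partitioned, not a description of what the scraping script does; `split_id_range` is a hypothetical helper.

```python
def split_id_range(max_id, n_workers):
    """Split item ids 1..max_id into n_workers contiguous, near-equal ranges."""
    base, extra = divmod(max_id, n_workers)
    ranges = []
    start = 1
    for i in range(n_workers):
        # The first `extra` workers take one additional id each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Hypothetical usage: three instances covering ids 1..1,500,000.
ranges = split_id_range(1_500_000, 3)
for worker, ids in enumerate(ranges):
    print(f"instance {worker}: ids {ids[0]} to {ids[-1]}")
```

Each instance would then scrape only its own range and write to the shared database.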
Upon launch of the web app, all of the part names and descriptions are vectorized by the sklearn TfidfVectorizer. The number of features is limited to 1,000 to increase the speed of recommendation.
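A minimal sketch of this vectorization step, assuming each part's name and description have been joined into one string. The `item_texts` list is made-up sample data, and every setting other than `max_features=1000` is left at the scikit-learn default, which may differ from what the app uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for "name + description" strings from the database.
item_texts = [
    "planetary gear set for a gearbox model",
    "iphone case with a working gear mechanism",
    "low poly flower vase",
]

# Cap the vocabulary at 1000 terms, as described above.
vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(item_texts)  # sparse matrix, one row per part

print(tfidf.shape)
```

The result is a sparse matrix, so parts with short descriptions cost very little memory even when the vocabulary cap is reached.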
When a username is entered, cosine similarity is calculated between each of the user's parts and all other parts in the database. The usernames associated with the most similar parts are then recommended for connection.
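The recommendation step can be sketched as follows. This is an illustration under stated assumptions, not the app's code: the `items` data, the `recommend_users` function, and the choice to rank other users by their single best-matching part are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (username, part text) pairs standing in for the database.
items = [
    ("alice", "iphone case with decorative gear mechanism"),
    ("alice", "flower vase"),
    ("bob", "transmission model with planetary gear mechanism"),
    ("carol", "cookie cutter star shape"),
    ("dave", "gear bearing toy"),
]
owners = np.array([user for user, _ in items])
tfidf = TfidfVectorizer(max_features=1000).fit_transform([text for _, text in items])


def recommend_users(username, top_n=2):
    """Return up to top_n usernames whose parts best match the user's parts."""
    mine = np.flatnonzero(owners == username)
    others = np.flatnonzero(owners != username)
    # Rows: the user's parts. Columns: everyone else's parts.
    sims = cosine_similarity(tfidf[mine], tfidf[others])
    best = sims.max(axis=0)  # best score each other part gets against any user part
    recommendations = []
    for idx in np.argsort(best)[::-1]:
        owner = str(owners[others[idx]])
        # Skip parts with no overlap and owners that are already listed.
        if best[idx] > 0 and owner not in recommendations:
            recommendations.append(owner)
        if len(recommendations) == top_n:
            break
    return recommendations


print(recommend_users("alice"))
```

In this toy data, alice's phone case shares the gear vocabulary with bob's transmission model, which mirrors the User A and User B example above.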
The most challenging aspect of the project was creating recommendations in a manner that is both memory- and time-efficient. Through many iterations of code refactoring, the memory required for recommendations was reduced from <64 GB to <16 GB, and the time required for a baseline recommendation was reduced from 20 minutes to 7 seconds.
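The specific refactorings are not described here, so the following is only one common way to bound memory in this kind of computation: keep the TF-IDF matrix sparse, compare against the corpus in chunks, and retain only a running top-k instead of the full similarity matrix. The function name `top_matches_chunked` and its parameters are hypothetical, and the rows are assumed to be L2-normalized (the TfidfVectorizer default) so that a dot product equals cosine similarity.

```python
import numpy as np
from scipy import sparse


def top_matches_chunked(query, corpus, top_k=3, chunk_size=2):
    """Return (indices, scores) of the top_k corpus rows most similar to any query row.

    Both inputs are sparse matrices with L2-normalized rows.
    """
    best_scores = np.empty(0)
    best_idx = np.empty(0, dtype=int)
    for start in range(0, corpus.shape[0], chunk_size):
        chunk = corpus[start:start + chunk_size]
        # Dense block of size (n_query, chunk_size) only; never the full matrix.
        sims = (query @ chunk.T).toarray().max(axis=0)
        idx = np.arange(start, start + chunk.shape[0])
        best_scores = np.concatenate([best_scores, sims])
        best_idx = np.concatenate([best_idx, idx])
        # Keep only the current top_k candidates before moving on.
        keep = np.argsort(best_scores)[::-1][:top_k]
        best_scores, best_idx = best_scores[keep], best_idx[keep]
    return best_idx, best_scores


# Tiny hand-made example with unit-length rows.
query = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0]]))
corpus = sparse.csr_matrix(np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]))
indices, scores = top_matches_chunked(query, corpus, top_k=2)
print(indices, scores)
```

Peak memory then scales with the chunk size rather than with the number of parts in the database.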
The web app is written in Python using the Flask framework and is designed with a Start Bootstrap theme. The app is hosted on AWS.
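A minimal Flask sketch of how a recommendation endpoint might be wired up. The route, the JSON response shape, and the placeholder `recommend_users` function are assumptions for illustration; the actual app renders pages with a Start Bootstrap theme, and its routes may differ.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def recommend_users(username):
    # Placeholder: a real implementation would run the similarity search.
    return ["example_user_1", "example_user_2"]


@app.route("/recommend/<username>")
def recommend(username):
    # Hypothetical JSON endpoint returning recommendations for one user.
    return jsonify(username=username, recommendations=recommend_users(username))


if __name__ == "__main__":
    app.run()
```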