This project implements an inverted index with Apache Spark (PySpark) and stores it in a relational database (SQLite) for 19,000 Reuters news articles. Storing the index in a database lets the project rely on the B-tree data structure that the relational database already provides, instead of building one from scratch.
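As a rough illustration of the storage idea, the sketch below keeps postings in a SQLite table whose composite primary key is backed by a B-tree index, so term lookups do not scan the whole table. The table name `postings`, its columns, and the helpers `create_index_db` and `lookup` are hypothetical names chosen for this example and are not taken from the project's code.

```python
import sqlite3


def create_index_db(path=":memory:"):
    """Open a SQLite database and create a (hypothetical) postings table."""
    conn = sqlite3.connect(path)
    # SQLite backs the PRIMARY KEY with a B-tree index on (term, doc_id),
    # so lookups by term can use that index.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " term TEXT NOT NULL,"
        " doc_id TEXT NOT NULL,"
        " tfidf REAL NOT NULL,"
        " PRIMARY KEY (term, doc_id)"
        ")"
    )
    return conn


def lookup(conn, term):
    """Return (doc_id, tfidf) pairs for a term, highest weight first."""
    rows = conn.execute(
        "SELECT doc_id, tfidf FROM postings"
        " WHERE term = ? ORDER BY tfidf DESC, doc_id",
        (term,),
    )
    return rows.fetchall()


# Made-up weights, used only to exercise the table.
conn = create_index_db()
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [("oil", "doc1", 0.42), ("oil", "doc2", 0.17), ("wheat", "doc2", 0.30)],
)
conn.commit()
```

Calling `lookup(conn, "oil")` returns the two matching documents ordered by weight; a term that is not in the table returns an empty list.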
Natural language processing is applied to clean the text and convert the HTML text files into a tf-idf inverted index, using Python libraries (nltk, re, bs4, collections). Two datasets are provided: the real one from Reuters, which contains more than 19,000 documents, and a small sample of 5 documents for testing the code.
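The cleaning and weighting steps can be sketched in plain Python as follows. This is a simplified stand-in, not the project's pipeline: it strips tags with a regular expression instead of bs4, skips the nltk steps, runs without Spark, and assumes one common tf-idf variant (term count divided by document length, times the natural log of N over document frequency). The function names and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag stripper (assumption: simple markup)
TOKEN_RE = re.compile(r"[a-z]+")   # keep runs of lowercase ASCII letters only


def tokenize(html):
    """Strip tags, lowercase the text, and split it into word tokens."""
    text = TAG_RE.sub(" ", html).lower()
    return TOKEN_RE.findall(text)


def build_tfidf_index(docs):
    """Map each term to {doc_id: tf-idf weight} for a dict of HTML documents."""
    term_counts = {doc_id: Counter(tokenize(html)) for doc_id, html in docs.items()}

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)


# Tiny made-up sample in the spirit of the 5-document test set.
sample_docs = {
    "d1": "<p>Oil prices rise</p>",
    "d2": "<p>Wheat prices fall</p>",
    "d3": "<p>Oil exports rise</p>",
}
index = build_tfidf_index(sample_docs)
```

Each entry of `index` is a postings list, for example `index["oil"]` holds weights for `d1` and `d3` only. In a Spark version the same counting would typically be expressed as map and reduce steps over an RDD or DataFrame.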