Big Data and NLP: Inverted Index Database for 19,000 Reuters News Articles

This project implements an inverted index with Apache Spark (PySpark) and stores it in a relational database (SQLite) for 19,000 Reuters news articles. Storing the index in a database lets it use the B-tree data structure that the relational database already provides, instead of building one from scratch.
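As a rough illustration of the B-tree point, the sketch below stores postings in SQLite and indexes them by term. The table name `postings`, its columns, the index name, and the sample scores are hypothetical and are not taken from the notebook; it uses only Python's built-in sqlite3 module and leaves out the Spark step.

```python
import sqlite3

# Hypothetical schema: one row per (term, document) posting.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (term TEXT NOT NULL, doc_id TEXT NOT NULL, tfidf REAL NOT NULL)"
)
# SQLite implements this index as a B-tree keyed on the term.
conn.execute("CREATE INDEX idx_postings_term ON postings (term)")

# Made-up sample postings, for illustration only.
conn.executemany(
    "INSERT INTO postings (term, doc_id, tfidf) VALUES (?, ?, ?)",
    [("oil", "d1", 0.2), ("oil", "d2", 0.1), ("wheat", "d3", 0.5)],
)

# Look up every document that contains a term.
rows = conn.execute(
    "SELECT doc_id, tfidf FROM postings WHERE term = ? ORDER BY tfidf DESC",
    ("oil",),
).fetchall()

# Ask SQLite how it would run the lookup; the plan should mention the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT doc_id, tfidf FROM postings WHERE term = ?",
    ("oil",),
).fetchall()
uses_index = any("idx_postings_term" in row[-1] for row in plan)
```

With the index in place, a term lookup is a B-tree search rather than a scan of every posting, which is the benefit described above.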

Natural language processing is applied to clean the text and convert the HTML text files into a TF-IDF index using Python libraries (nltk, re, bs4, collections). Two datasets are provided: a real one from Reuters containing more than 19,000 documents, and a small sample of 5 documents to help with testing the code.
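The cleaning and indexing steps can be sketched roughly as follows. This is a simplified stand-in, not the notebook's code: it strips tags with a regular expression instead of bs4, uses a tiny hand-written stop-word set instead of nltk, and assumes the common weighting tf = count / document length and idf = ln(N / df). The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # crude tag stripper (stand-in for bs4)
TOKEN_RE = re.compile(r"[a-z]+")     # keep alphabetic tokens only
STOPWORDS = {"the", "a", "of", "in", "and", "to"}  # tiny stand-in for nltk's list


def tokenize(html):
    """Strip tags, lowercase, and drop stop words."""
    text = TAG_RE.sub(" ", html).lower()
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]


def build_tfidf_index(docs):
    """Return {term: {doc_id: tf-idf}} for a {doc_id: html} mapping."""
    term_counts = {doc_id: Counter(tokenize(html)) for doc_id, html in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, {})[doc_id] = tf * idf
    return index


# Made-up sample documents, for illustration only.
docs = {
    "d1": "<p>Oil prices rise</p>",
    "d2": "<p>Oil exports fall</p>",
    "d3": "<p>Wheat prices fall</p>",
}
index = build_tfidf_index(docs)
```

The result is inverted in the sense that it maps each term to the documents containing it, so `index["oil"]` lists only `d1` and `d2`.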

Interface for keyword searching and ranking the most relevant results by TF-IDF
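One way such a keyword search could rank results is to sum the TF-IDF scores of the matching query terms per document. The sketch below assumes a hypothetical `postings` table and a hypothetical `search` helper; the scoring rule and the sample scores are illustrative assumptions, not the notebook's confirmed method.

```python
import sqlite3

# Hypothetical postings table with made-up scores.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (term TEXT NOT NULL, doc_id TEXT NOT NULL, tfidf REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO postings (term, doc_id, tfidf) VALUES (?, ?, ?)",
    [
        ("oil", "d1", 0.2),
        ("oil", "d2", 0.1),
        ("prices", "d1", 0.3),
        ("prices", "d3", 0.4),
        ("wheat", "d3", 0.5),
    ],
)


def search(conn, query, limit=10):
    """Rank documents by the summed TF-IDF of the query terms they contain."""
    terms = sorted(set(query.lower().split()))
    if not terms:
        return []
    # One placeholder per term keeps the query parameterized.
    placeholders = ",".join("?" for _ in terms)
    sql = (
        "SELECT doc_id, SUM(tfidf) AS score FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY doc_id ORDER BY score DESC, doc_id LIMIT ?"
    )
    return conn.execute(sql, (*terms, limit)).fetchall()


results = search(conn, "oil prices")
ranked_ids = [doc_id for doc_id, _ in results]
```

Here `d1` ranks first because it matches both query terms, followed by `d3` and `d2`, which each match one.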

JennyYu2017/Big-Data-and-NLP-Inverted-Index-Database-for-19-000-Reuters-News-Articles
