This repository houses a basic search engine implementation that uses Hadoop's MapReduce framework to process a large text corpus efficiently. The dataset used for this project is a subset of the English Wikipedia dump, totaling 5.2 GB. The project focuses on implementing a naive search algorithm to address core challenges in information retrieval.
We started by dividing the 5.2 GB Wikipedia dataset into smaller, manageable chunks to make processing and analysis easier. This split was only temporary; the full dataset is used for the final search engine.
Our code cleaned and standardized the text data, removing stopwords and normalizing terms for consistency across the dataset.
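A minimal sketch of that preprocessing step, assuming NLTK's English stopword list and a Porter stemmer for normalization (the exact cleaning steps in this repo may differ):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes nltk.download('stopwords') has been run once.
STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]
```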
The implementation calculates Term Frequency (TF) and Inverse Document Frequency (IDF) scores to evaluate the importance of words within documents relative to the entire dataset. It then applies the Vector Space Model, representing both documents and queries as TF-IDF vectors whose similarity is measured for ranking.
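For illustration, a small sketch of the weighting and similarity computation, assuming the common `tf * log(N / df)` weighting and cosine similarity as the relevance measure (the exact formula variants used here may differ):

```python
import math
import numpy as np

def tfidf_vector(term_counts, doc_freq, n_docs, vocab):
    """Build a TF-IDF vector over a fixed vocabulary ordering."""
    vec = np.zeros(len(vocab))
    for i, term in enumerate(vocab):
        tf = term_counts.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf and df:
            vec[i] = tf * math.log(n_docs / df)  # IDF = log(N / df)
    return vec

def cosine_similarity(a, b):
    """Similarity between a query vector and a document vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```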
To run this Hadoop MapReduce search engine, you'll need the following:
- Apache Hadoop
- Python
- NLTK
- pandas
- numpy
- Dataset (download link)
Ensure this software and these libraries are installed on your system before proceeding.
- Efficient Indexing: MapReduce jobs analyze the entire corpus to generate unique word IDs, calculate Inverse Document Frequency (IDF), and build a consolidated vocabulary (see the streaming sketch after this list).
- Vectorized Representation: The Indexer computes a machine-readable representation of the entire document corpus using TF-IDF weighting.
- Relevance Analysis: The Ranker Engine builds a vectorized representation of the user query and computes a relevance score between the query and each document, returning a list of documents sorted by that score (a ranking sketch follows the indexing example below).
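As a rough sketch of the indexing step under Hadoop Streaming, assuming a hypothetical `doc_id<TAB>text` input layout (the repo's actual mappers, reducers, and file formats may differ): the mapper emits each distinct term once per document, and the reducer sums those emissions into document frequencies, from which IDF follows as `log(N / df)`.

```python
#!/usr/bin/env python3
# df_mapper.py -- emit "term<TAB>1" once per document the term appears in
import sys

for line in sys.stdin:
    try:
        doc_id, text = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue  # skip malformed lines
    for term in set(text.lower().split()):
        print(f"{term}\t1")
```

```python
#!/usr/bin/env python3
# df_reducer.py -- sum per-term counts into document frequencies
# (Hadoop Streaming delivers mapper output sorted by key, so terms arrive grouped.)
import sys

current_term, count = None, 0
for line in sys.stdin:
    term, value = line.rstrip("\n").split("\t")
    if term != current_term:
        if current_term is not None:
            print(f"{current_term}\t{count}")
        current_term, count = term, 0
    count += int(value)
if current_term is not None:
    print(f"{current_term}\t{count}")
```

Scripts like these would typically be launched with the Hadoop Streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files df_mapper.py,df_reducer.py -mapper df_mapper.py -reducer df_reducer.py -input <input_dir> -output <output_dir>`, where the input and output paths are placeholders.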
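A corresponding sketch of the Ranker side, reusing the hypothetical `preprocess`, `tfidf_vector`, and `cosine_similarity` helpers from the snippets above (in the actual system this scoring runs as a MapReduce job over the indexed document vectors):

```python
from collections import Counter

def rank(query, doc_vectors, doc_freq, n_docs, vocab, top_k=10):
    """Score every document vector against the query and return the best matches."""
    terms = preprocess(query)  # normalize the query the same way as the corpus
    q_vec = tfidf_vector(Counter(terms), doc_freq, n_docs, vocab)
    scores = {doc_id: cosine_similarity(q_vec, vec)
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```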