This repository houses a basic search engine implementation that uses Hadoop's MapReduce framework to process a large text corpus efficiently. The dataset used for this project is a subset of the English Wikipedia dump, totaling 5.2 GB. The project focuses on implementing a naive search algorithm to address core challenges in information retrieval.
We started by dividing the 5.2 GB Wikipedia dataset into smaller, manageable chunks to make processing and analysis easier. This split was only temporary; the full dataset is used for the final search engine.
Our code cleaned and standardized the text data, removing stopwords and normalizing terms for consistency across the dataset.
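A minimal sketch of that preprocessing step, assuming NLTK's English stopword list and a Porter stemmer for normalization (the exact cleaning steps in this repo may differ):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes nltk.download('stopwords') has been run once.
STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]
```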
The implementation calculates Term Frequency (TF) and Inverse Document Frequency (IDF) scores to evaluate the importance of words within documents relative to the entire dataset. It then applies the Vector Space Model, representing both documents and queries as TF-IDF vectors whose similarity is measured for ranking.
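For illustration, a small sketch of the weighting and similarity computation, assuming the common `tf * log(N / df)` weighting and cosine similarity as the relevance measure (the exact formula variants used here may differ):

```python
import math
import numpy as np

def tfidf_vector(term_counts, doc_freq, n_docs, vocab):
    """Build a TF-IDF vector over a fixed vocabulary ordering."""
    vec = np.zeros(len(vocab))
    for i, term in enumerate(vocab):
        tf = term_counts.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf and df:
            vec[i] = tf * math.log(n_docs / df)  # IDF = log(N / df)
    return vec

def cosine_similarity(a, b):
    """Similarity between a query vector and a document vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```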
To run this Hadoop MapReduce search engine, you'll need the following:
- Apache Hadoop
- Python
- NLTK
- pandas
- numpy
- Dataset (download link)
Ensure this software and these libraries are installed on your system before proceeding.
- Efficient Indexing: MapReduce jobs analyze the entire corpus to generate unique word IDs, calculate Inverse Document Frequency (IDF), and build a consolidated vocabulary (see the streaming sketch after this list).
- Vectorized Representation: The Indexer computes a machine-readable representation of the entire document corpus using TF-IDF weighting.
- Relevance Analysis: The Ranker Engine builds a vectorized representation of the user query and computes a relevance score between the query and each document, returning a list of documents sorted by that score (a ranking sketch follows the indexing example below).
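As a rough sketch of the indexing step under Hadoop Streaming, assuming a hypothetical `doc_id<TAB>text` input layout (the repo's actual mappers, reducers, and file formats may differ): the mapper emits each distinct term once per document, and the reducer sums those emissions into document frequencies, from which IDF follows as `log(N / df)`.

```python
#!/usr/bin/env python3
# df_mapper.py -- emit "term<TAB>1" once per document the term appears in
import sys

for line in sys.stdin:
    try:
        doc_id, text = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue  # skip malformed lines
    for term in set(text.lower().split()):
        print(f"{term}\t1")
```

```python
#!/usr/bin/env python3
# df_reducer.py -- sum per-term counts into document frequencies
# (Hadoop Streaming delivers mapper output sorted by key, so terms arrive grouped.)
import sys

current_term, count = None, 0
for line in sys.stdin:
    term, value = line.rstrip("\n").split("\t")
    if term != current_term:
        if current_term is not None:
            print(f"{current_term}\t{count}")
        current_term, count = term, 0
    count += int(value)
if current_term is not None:
    print(f"{current_term}\t{count}")
```

Scripts like these would typically be launched with the Hadoop Streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files df_mapper.py,df_reducer.py -mapper df_mapper.py -reducer df_reducer.py -input <input_dir> -output <output_dir>`, where the input and output paths are placeholders.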
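A corresponding sketch of the Ranker side, reusing the hypothetical `preprocess`, `tfidf_vector`, and `cosine_similarity` helpers from the snippets above (in the actual system this scoring runs as a MapReduce job over the indexed document vectors):

```python
from collections import Counter

def rank(query, doc_vectors, doc_freq, n_docs, vocab, top_k=10):
    """Score every document vector against the query and return the best matches."""
    terms = preprocess(query)  # normalize the query the same way as the corpus
    q_vec = tfidf_vector(Counter(terms), doc_freq, n_docs, vocab)
    scores = {doc_id: cosine_similarity(q_vec, vec)
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```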