Apache Hadoop for big data processing and analysis

kkowenn/Simple-Hadoop


Hadoop Project

Project Overview

This project is designed to leverage Apache Hadoop for big data processing and analysis. It provides a framework for distributed storage and processing of large datasets across clusters of computers using simple programming models.

What This Project Can Do

  • Data Processing: Efficiently process large datasets using Hadoop's MapReduce framework.
  • Data Storage: Store vast amounts of data in a distributed file system (HDFS).
  • Scalability: Scale horizontally by adding more nodes to the cluster.
  • Fault Tolerance: Ensure high availability of data with built-in redundancy.

Requirements

Before you begin, ensure you have met the following requirements:

  • Java Development Kit (JDK) version 8 or higher
  • Apache Hadoop version 3.x installed
  • A configured Hadoop cluster or a single-node setup
  • Appropriate access permissions for HDFS

Installation

  1. Install Java:

    • Make sure you have Java installed. You can check by running:

      java -version
    • If Java is not installed, you can download and install it from Oracle's official website.

  2. Install Apache Hadoop:

    • Download Hadoop from the Apache Hadoop releases page.

    • Extract the downloaded archive:

      tar -xzf hadoop-x.x.x.tar.gz
    • Move it to your desired installation directory:

      mv hadoop-x.x.x /usr/local/hadoop
    • Add Hadoop to your PATH by editing your .bashrc or .zshrc file (sbin is needed so the start-dfs.sh and start-yarn.sh scripts can be found later):

      export HADOOP_HOME=/usr/local/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    • Reload the file so the changes take effect:

      source ~/.bashrc
  3. Configure Hadoop:

    • Edit the configuration files located in $HADOOP_HOME/etc/hadoop to suit your setup. Key files include:

      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • yarn-site.xml
    • Format the NameNode (required once, before the first start):

      hdfs namenode -format
    • Start the Hadoop services:

      start-dfs.sh
      start-yarn.sh
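
A minimal example of the first two configuration files, as typically set for a single-node (pseudo-distributed) setup. The hostname, port, and replication factor below are common defaults for local testing, not values this project requires:

```xml
<!-- core-site.xml: URI of the default filesystem -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a replication factor of 1 is usual on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```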

Running the Project

  1. Upload Data to HDFS:

    • Use the following command to upload a file to HDFS:
      hdfs dfs -put /local/path/to/your/data.txt /hdfs/path/to/data.txt
  2. Execute MapReduce Job:

    • To run a MapReduce job, use the following command:
      yarn jar /path/to/your/hadoop-project.jar com.example.YourMainClass /hdfs/path/to/data.txt /hdfs/output/path
  3. View Output:

    • After the job completes, you can view the output stored in HDFS:
      hdfs dfs -cat /hdfs/output/path/part-r-00000

Project

Project1: Java classes for word counting (driver, mapper, and reducer). It processes beer reviews from a CSV file, generates word counts, and writes the results to a designated output directory. The job is packaged as a JAR file and executed through Hadoop's command-line interface.
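
The project's actual classes extend Hadoop's Mapper and Reducer APIs; as a rough in-memory sketch of the same map/shuffle/reduce flow in plain Java (class and method names here are illustrative, not the project's own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the word-count MapReduce flow used by Project1. */
public class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every token in a line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle + reduce phase: group pairs by word and sum the counts. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Each input line would be one record from the reviews CSV.
        pairs.addAll(map("a hoppy beer"));
        pairs.addAll(map("a dark beer"));
        System.out.println(reduce(pairs)); // prints {a=2, beer=2, dark=1, hoppy=1}
    }
}
```

In the real job, Hadoop performs the shuffle between the map and reduce phases across the cluster; the summed counts land in files such as part-r-00000 in the output directory.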
