Day1 Task1

genomewalker edited this page May 26, 2024 · 5 revisions

A simple BASH script to count words and characters

Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day1/task1
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/data/count
Environment: day1

This Bash script takes an input file as a command-line argument, counts the number of words and characters in the file, and outputs the results to the console. If the input file does not exist, is empty, or there is an error while counting the words and characters, the script displays an appropriate error message to the user.

Here's a step-by-step breakdown of what the script does:

  1. Defines four custom error messages to display to the user if an error occurs.
  2. Defines a function to handle errors and exits the script with a non-zero exit code if an error occurs.
  3. Sets up a trap to catch errors that occur during the script execution and calls the error handling function.
  4. Checks whether an input file has been provided and calls exit 2 if not.
  5. Checks that the input file exists and is not empty, calling exit 141 if it does not exist or exit 142 if it is empty.
  6. Counts the number of words and characters in the input file and saves the counts in two variables.
  7. Checks whether either count is empty and calls exit 143 if so.
  8. Outputs the file name, word count, and character count to the console.

Detailed explanation

First, let's look at the line at the top of the script:

#!/bin/bash

This is called a shebang line. It specifies which interpreter executes the script: when the script is run directly (for example as ./count.sh), the operating system starts the Bash interpreter at /bin/bash and passes it the script.

Next, the script defines four custom error messages using variables:

no_input_file_msg="Please provide an input file"
nonexistent_file_msg="The file does not exist"
file_is_empty_msg="The file is empty"
no_results_msg="The file is empty"

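These messages are displayed to the user if an error occurs during script execution. Note that no_results_msg reuses the text of file_is_empty_msg, so a failed count (code 143) is reported with the same wording as an empty file (code 142).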

After that, the script defines a function called handle_error:

handle_error() {
    if [ "${1}" -ne 0 ]; then
        if [ "${1}" -eq 2 ]; then
            echo "${no_input_file_msg}"
        elif [ "${1}" -eq 141 ]; then
            echo "${nonexistent_file_msg}"
        elif [ "${1}" -eq 142 ]; then
            echo "${file_is_empty_msg}"
        elif [ "${1}" -eq 143 ]; then
            echo "${no_results_msg}"
        else
            echo "An unexpected error occurred"
        fi
        exit 1
    fi
}

This function handles errors in the script. It takes one argument, the exit code of the previous command. If the exit code is not 0, the function prints the message that matches the code (or a generic message for any other non-zero code) and then exits with status 1. As a result, the calling shell always sees exit status 1 on failure; the codes 2, 141, 142, and 143 are used only inside the script to select the message.

The script then registers this function as a handler using the trap command:

trap 'handle_error $?' ERR EXIT

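This registers handle_error for two conditions. ERR fires when a command fails with a non-zero status (outside of conditions such as if tests). EXIT fires whenever the script exits, including through an explicit exit. The EXIT trap is what passes the custom codes (2, 141, 142, 143) to handle_error; on a normal exit the status is 0 and the function does nothing.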
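To see how an EXIT trap interacts with an explicit exit, here is a minimal sketch that is not part of count.sh. The function name report and the variables demo_output and demo_status are made up for illustration. It runs a tiny script in a child Bash process and records both what the handler received and what the caller saw.

```shell
#!/bin/bash
# Illustrative sketch only: "report" is a hypothetical stand-in for handle_error.
demo_status=0
demo_output=$(bash -c '
    report() {
        echo "handler saw status ${1}"
        exit 1            # like handle_error, always exit with 1
    }
    trap "report \$?" EXIT
    exit 141              # explicit exit: fires the EXIT trap with $? set to 141
') || demo_status=$?

echo "${demo_output}"
echo "caller saw status ${demo_status}"
```

The handler receives the code passed to exit, while the caller sees the status that the handler itself exits with.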

Next, the script checks if an input file has been provided:

INPUT_FILE="${1}"

if [ -z "${INPUT_FILE}" ]; then
    exit 2
fi

If no input file has been provided, the script calls exit 2. The EXIT trap then passes this code to handle_error, which prints the "Please provide an input file" message.
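The same check can be sketched as a small function, which makes it easy to try in isolation. This is an assumption-laden variant rather than the script's own code: the function name check_args is hypothetical, and it uses "${1:-}" so the test also works when the shell is run with set -u.

```shell
#!/bin/bash
# Hypothetical helper, not part of count.sh.
check_args() {
    local input_file="${1:-}"     # empty string when no argument was given
    if [ -z "${input_file}" ]; then
        echo "Please provide an input file" >&2
        return 2
    fi
    return 0
}

# Call once without and once with an argument, recording the status each time.
no_arg_status=0
check_args 2>/dev/null || no_arg_status=$?
with_arg_status=0
check_args "example1.txt" || with_arg_status=$?

echo "without argument: ${no_arg_status}; with argument: ${with_arg_status}"
```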

The script then checks if the input file exists and is not empty:

if [ ! -s "${INPUT_FILE}" ]; then
    # if file does not exist or is empty
    if [ ! -e "${INPUT_FILE}" ]; then
        exit 141
    else
        exit 142
    fi
fi

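The -s test is true only if the file exists and has a size greater than zero, so the outer condition catches both failure cases. The inner -e test then tells them apart: the script calls exit 141 if the file does not exist and exit 142 if it exists but is empty.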
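The behaviour of the two file tests can be checked with a short sketch on temporary files. The function name classify_file and the file names are invented for this illustration and do not appear in count.sh.

```shell
#!/bin/bash
# Hypothetical helper that mirrors the -e / -s logic described above.
classify_file() {
    if [ ! -e "${1}" ]; then
        echo "missing"        # path does not exist
    elif [ ! -s "${1}" ]; then
        echo "empty"          # exists, but has size zero
    else
        echo "ok"             # exists and has content
    fi
}

demo_dir=$(mktemp -d)
: > "${demo_dir}/empty.txt"                       # create a zero-byte file
printf 'some text\n' > "${demo_dir}/full.txt"     # create a non-empty file

missing_result=$(classify_file "${demo_dir}/nope.txt")
empty_result=$(classify_file "${demo_dir}/empty.txt")
full_result=$(classify_file "${demo_dir}/full.txt")
rm -rf "${demo_dir}"

echo "${missing_result} ${empty_result} ${full_result}"
```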

If the input file exists and is not empty, the script counts the number of words and characters in the file using the wc command and saves the counts in two variables:

words=$(wc -w "${INPUT_FILE}" | awk '{print $1}')
characters=$(wc -c "${INPUT_FILE}" | awk '{print $1}')

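The awk command extracts the first field of the wc output, which contains the count; the second field is the file name. Note that wc -c counts bytes, which matches the number of characters only for single-byte encodings such as ASCII; wc -m counts characters according to the current locale.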
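As a small illustration of the counting step, the sketch below applies the same wc and awk pipeline to a temporary ASCII file. The variable names sample_file, byte_count, and char_count are assumptions made for this example.

```shell
#!/bin/bash
# Illustrative sketch: count words, bytes, and characters in a temporary file.
sample_file=$(mktemp)
printf 'one two three\n' > "${sample_file}"    # 3 words, 14 bytes (ASCII only)

# wc prints "<count> <file name>"; awk keeps only the count.
word_count=$(wc -w "${sample_file}" | awk '{print $1}')
byte_count=$(wc -c "${sample_file}" | awk '{print $1}')
char_count=$(wc -m "${sample_file}" | awk '{print $1}')
rm -f "${sample_file}"

printf 'Words: %s; Bytes: %s; Characters: %s\n' \
    "${word_count}" "${byte_count}" "${char_count}"
```

For pure ASCII text the byte and character counts agree; they can differ for files that contain multi-byte characters.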

Next, the script checks if the word and character counts are empty:

if [ -z "${words}" ] || [ -z "${characters}" ]; then
    exit 143
fi

If either count is empty, the script exits with an exit code of 143.

Finally, the script outputs the file name, word count, and character count to the console:

printf 'File: %s; Word count: %s; Character count: %s\n' "${INPUT_FILE}" "${words}" "${characters}"

To run the script, first move into the folder where it is located:

cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day1/task1

and activate the conda environment:

conda activate day1

You can execute the script as follows:

  1. Set the script to be executable
    chmod +x count.sh
  2. Then do:
    ./count.sh ../../data/count/example1.txt

How to process multiple files

If we would like to process multiple files, such as the example files in ../../data/count, we can use a for loop in Bash to iterate over them and run the script once per file:

for file in ../../data/count/example*txt; do ./count.sh "${file}"; done

This command iterates over every file in ../../data/count whose name matches example*txt and runs the script on each one, passing the file path as the argument. The script processes each file and prints its result to the console.
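One detail worth knowing is that, by default, Bash leaves a glob pattern unexpanded when nothing matches, so the loop body would run once with the literal pattern. The sketch below guards against that. It uses a temporary directory and a plain wc call as a stand-in for ./count.sh; the names work_dir, processed, and counts are assumptions for this example.

```shell
#!/bin/bash
# Illustrative sketch: loop over matching files and skip an unmatched pattern.
work_dir=$(mktemp -d)
printf 'a b c\n' > "${work_dir}/example1.txt"
printf 'd e\n' > "${work_dir}/example2.txt"

processed=0
counts=""
for file in "${work_dir}"/example*txt; do
    [ -e "${file}" ] || continue      # nothing matched: skip the literal pattern
    n=$(wc -w "${file}" | awk '{print $1}')   # stand-in for ./count.sh "${file}"
    counts="${counts}${n} "
    processed=$((processed + 1))
done
rm -rf "${work_dir}"

echo "processed ${processed} files; word counts: ${counts}"
```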

If we want to speed up the process, we can run it in parallel using GNU parallel. First let's install it in our environment:

mamba install parallel

Then we can run the following command:

parallel -j 2 ./count.sh {} ::: ../../data/count/example*

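The -j 2 option tells GNU parallel to run at most two jobs at a time. The {} placeholder is replaced by each argument listed after :::, so every file matched by example* is passed to count.sh in turn. Because the jobs run concurrently, the output lines may not appear in the same order as the input files.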

You can find many examples on how to use GNU parallel in this tutorial