Day1 Task1
Source: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/day1/task1
Data: https://github.com/GeoGenetics/data-analysis-2024/tree/main/reproducible-data-analysis/data/count
Environment: day1
This Bash script takes an input file as a command-line argument, counts the number of words and characters in the file, and outputs the results to the console. If the input file does not exist, is empty, or there is an error while counting the words and characters, the script displays an appropriate error message to the user.
Here's a step-by-step breakdown of what the script does:
- Defines four custom error messages to display to the user if an error occurs.
- Defines a function to handle errors and exits the script with a non-zero exit code if an error occurs.
- Sets up a trap to catch errors that occur during the script execution and calls the error handling function.
- Checks if an input file has been provided and exits with an exit code of 2 if it has not.
- Checks if the input file exists and is not empty, exiting with exit code 141 if the file does not exist or 142 if it is empty.
- Counts the number of words and characters in the input file and saves the counts in two variables.
- Checks if the word and character counts are empty, exits with exit code 143 if either count is empty.
- Outputs the file name, word count, and character count to the console.
First, let's look at the line at the top of the script:
#!/bin/bash
This is called a shebang line. When the script is run as an executable, it tells the operating system which interpreter to use, in this case Bash at /bin/bash.
Next, the script defines four custom error messages using variables:
no_input_file_msg="Please provide an input file"
nonexistent_file_msg="The file does not exist"
file_is_empty_msg="The file is empty"
no_results_msg="The word or character count could not be computed"
These messages will be displayed to the user in case an error occurs during the script execution.
After that, the script defines a function called handle_error:
handle_error() {
    if [ "${1}" -ne 0 ]; then
        if [ "${1}" -eq 2 ]; then
            echo "${no_input_file_msg}"
        elif [ "${1}" -eq 141 ]; then
            echo "${nonexistent_file_msg}"
        elif [ "${1}" -eq 142 ]; then
            echo "${file_is_empty_msg}"
        elif [ "${1}" -eq 143 ]; then
            echo "${no_results_msg}"
        else
            echo "An unexpected error occurred"
        fi
        exit 1
    fi
}
This function handles errors in the script. It takes one argument, which is the exit code of the previous command. If the exit code is not 0, the function checks which error occurred and displays the corresponding error message to the user. Finally, the function exits with exit code 1. This means the script's final exit status is 1 for every error; the specific codes (2, 141, 142, 143) only select which message is shown.
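If a caller needs to tell the error cases apart by exit status, one option is to keep the original code instead of collapsing everything to 1. The following is a minimal sketch of that idea, not the course script: describe_error is a hypothetical name, the message texts are written out inline, and the text for code 143 is an assumption.

```shell
# Hypothetical variant of handle_error: prints a message for the given
# status code and returns that same code instead of always exiting with 1.
describe_error() {
    case "${1}" in
        0)   return 0 ;;
        2)   echo "Please provide an input file" ;;
        141) echo "The file does not exist" ;;
        142) echo "The file is empty" ;;
        143) echo "The word or character count could not be computed" ;;  # assumed text
        *)   echo "An unexpected error occurred" ;;
    esac
    return "${1}"
}

# Capture both the message and the status of one call.
status=0
msg=$(describe_error 141) || status=$?
echo "message: ${msg}; status: ${status}"
```

A case statement keeps the code-to-message mapping in one place, which is easier to extend than a chain of elif tests.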
The script also registers a trap using the trap command:
trap 'handle_error $?' ERR EXIT
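This registers handle_error for two conditions: ERR, which fires when a command returns a non-zero status, and EXIT, which fires when the script terminates. In both cases $? holds the most recent exit status, so the explicit exit calls later in the script reach handle_error through the EXIT condition.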
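To see how an exit status reaches a trap handler, here is a small sketch that can be run on its own. It uses only the EXIT condition, and report_status is a hypothetical helper that stands in for handle_error.

```shell
# Hypothetical helper: prints the status it receives from the trap.
report_status() {
    echo "trap saw exit status ${1}"
}

# Run in a subshell so that `exit` ends only the subshell, not this demo.
output=$(
    trap 'report_status $?' EXIT
    exit 141
) || true

echo "${output}"
```

The single quotes around the trap command matter: they delay the expansion of $? until the trap fires, so the handler receives the status at exit time rather than the status at the time the trap was registered.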
Next, the script checks if an input file has been provided:
INPUT_FILE="${1}"
if [ -z "${INPUT_FILE}" ]; then
    exit 2
fi
If no input file has been provided, the script exits with an exit code of 2.
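The same check can be wrapped in a function to try it out without ending the shell. This is a sketch under assumptions: check_input is a hypothetical name, and it returns 2 where the script would exit with 2. The "${1:-}" form supplies an empty default, which also keeps the check working if the script is ever run with set -u.

```shell
# Hypothetical helper: returns 2 when no (or an empty) argument is given.
check_input() {
    local input_file="${1:-}"   # ':-' gives an empty default if $1 is unset
    if [ -z "${input_file}" ]; then
        return 2
    fi
    return 0
}

status=0
check_input || status=$?
echo "without argument: ${status}"

status=0
check_input "example1.txt" || status=$?
echo "with argument: ${status}"
```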
The script then checks if the input file exists and is not empty:
if [ ! -s "${INPUT_FILE}" ]; then
    # if file does not exist or is empty
    if [ ! -e "${INPUT_FILE}" ]; then
        exit 141
    else
        exit 142
    fi
fi
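The -s test succeeds only if the file exists and has a size greater than zero, while -e only tests for existence. If the file does not exist, the script exits with exit code 141; if it exists but is empty, the script exits with exit code 142.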
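The difference between the two file tests can be checked with temporary files. In this sketch, classify_file is a hypothetical helper that returns the same codes the script exits with, and the file names are made up for the demonstration.

```shell
# Hypothetical helper: 141 if the path is missing, 142 if the file is
# empty, 0 if the file exists and has content.
classify_file() {
    if [ ! -e "${1}" ]; then
        return 141      # path does not exist
    elif [ ! -s "${1}" ]; then
        return 142      # exists but has size zero
    fi
    return 0
}

tmpdir=$(mktemp -d)
: > "${tmpdir}/empty.txt"                        # create an empty file
printf 'hello world\n' > "${tmpdir}/full.txt"    # create a non-empty file

for name in missing.txt empty.txt full.txt; do
    status=0
    classify_file "${tmpdir}/${name}" || status=$?
    echo "${name}: ${status}"
done

rm -rf "${tmpdir}"
```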
If the input file exists and is not empty, the script counts the number of words and characters in the file using the wc command and saves the counts in two variables:
words=$(wc -w "${INPUT_FILE}" | awk '{print $1}')
characters=$(wc -c "${INPUT_FILE}" | awk '{print $1}')
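The awk command is used to extract the first field of the wc output, which contains the word or character count. Note that wc -c counts bytes, which equals the number of characters only for single-byte encodings such as ASCII; wc -m counts characters according to the current locale.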
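The byte versus character distinction is easy to see with a file that contains accented letters. This sketch writes the UTF-8 text "naïve café" to a temporary file (the file and its content are made up for illustration) and counts it three ways. The result of wc -m depends on the locale, so it may equal the byte count in a non-UTF-8 locale.

```shell
tmpfile=$(mktemp)
# "naïve café" in UTF-8; ï and é take two bytes each.
printf 'na\303\257ve caf\303\251\n' > "${tmpfile}"

words=$(wc -w "${tmpfile}" | awk '{print $1}')
bytes=$(wc -c "${tmpfile}" | awk '{print $1}')   # what the script calls characters
chars=$(wc -m "${tmpfile}" | awk '{print $1}')   # locale-dependent

echo "words=${words} bytes=${bytes} chars=${chars}"
rm -f "${tmpfile}"
```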
Next, the script checks if the word and character counts are empty:
if [ -z "${words}" ] || [ -z "${characters}" ]; then
    exit 143
fi
If either count is empty, the script exits with an exit code of 143.
Finally, the script outputs the file name, word count, and character count to the console:
printf 'File: %s; Word count: %s; Character count: %s\n' "${INPUT_FILE}" "${words}" "${characters}"
First we will move into the folder where the script is located:
cd ~/course/wdir/data-analysis-2024/reproducible-data-analysis/day1/task1
and activate the conda environment:
conda activate day1
You can execute the script as follows:
- Set the script to be executable
chmod +x count.sh
- Then do:
./count.sh ../../data/count/example1.txt
If we would like to process multiple files in a folder, such as ../../data/count, we can use a for loop in Bash to iterate over the matching files and execute the script for each file individually. Here's how you can do it:
for file in ../../data/count/example*txt; do ./count.sh "${file}"; done
This command iterates over every file in ../../data/count whose name matches example*txt and executes the script once per file, passing the file path as an argument. The script processes each file and outputs the results to the console.
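For longer runs it can help to guard against a pattern that matches nothing and to keep going when one file fails. The following is a sketch of that pattern: process_file is a hypothetical stand-in for ./count.sh so the example runs anywhere, and the temporary files are made up for the demonstration.

```shell
# Hypothetical stand-in for ./count.sh: fails on empty or missing files.
process_file() {
    [ -s "${1}" ] && printf 'File: %s; Word count: %s\n' "${1}" "$(wc -w < "${1}" | tr -d ' ')"
}

tmpdir=$(mktemp -d)
printf 'one two three\n' > "${tmpdir}/example1.txt"
: > "${tmpdir}/example2.txt"       # empty, so process_file fails on it

failed=0
for file in "${tmpdir}"/example*txt; do
    [ -e "${file}" ] || continue   # an unmatched glob is left as literal text
    process_file "${file}" || failed=$((failed + 1))
done

echo "failed: ${failed}"
rm -rf "${tmpdir}"
```

Counting failures instead of stopping at the first one lets the loop report on every file in a single pass.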
If we want to speed up the process, we can run it in parallel using GNU parallel. First let's install it in our environment:
mamba install parallel
Then we can run the following command:
parallel -j 2 ./count.sh {} ::: ../../data/count/example*
The -j 2 option runs at most 2 jobs at the same time, and {} is replaced by each file path listed after :::.
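If GNU parallel is not available, xargs with its -P option gives a similar effect. This is a different tool, not the approach used in the course, and the sketch below makes some assumptions: it uses wc -w as a stand-in for ./count.sh, creates its own temporary files, and relies on the -0 and -P options found in GNU and BSD xargs.

```shell
tmpdir=$(mktemp -d)
printf 'one two\n' > "${tmpdir}/example1.txt"
printf 'three four five\n' > "${tmpdir}/example2.txt"

# Pass NUL-separated paths to xargs; -n 1 runs one file per job and
# -P 2 runs at most two jobs at once. Output order is not guaranteed,
# so the results are sorted by count afterwards.
results=$(printf '%s\0' "${tmpdir}"/example*.txt | xargs -0 -n 1 -P 2 wc -w | sort -n)

echo "${results}"
rm -rf "${tmpdir}"
```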
You can find many examples on how to use GNU parallel in this tutorial