Include healthcheck logic for helper scripts running as sidecars #1842

Draft
wants to merge 3 commits into
base: alpha

Conversation

TrevorBenson
Collaborator

Description

Enhances the healthcheck.sh script to work for checking permissions on sidecar containers (helper scripts) via the ENTRYPOINT_PROCESS.

Where should the reviewer start?

  1. Copy the updated healthcheck.sh script into the sidecar container in /home/guild/.scripts/.
  2. Wait for the 5-minute check interval to elapse and confirm that the sidecar is now shown as healthy.

Testing different CPU Usage values

  • Set the CPU_THRESHOLD environment variable (defaults to 80%) to the CPU usage above which the container should be marked unhealthy.

Testing different numbers of retries (internal to the healthcheck.sh script)

  • Set the RETRIES environment variable (defaults to 20) to the number of retries to perform while CPU usage is above the CPU_THRESHOLD value before exiting non-zero.

There is currently a 3-second delay between checks, so 20 retries means up to 60 seconds before the healthcheck exits as unhealthy due to CPU load.

Testing different healthcheck values (external to healthcheck.sh script).

The current HEALTHCHECK of the container image is:

  • 5 minutes start period
  • 5 minutes interval
  • 100 seconds timeout
  • 3 retries (default value from being undefined)

Reducing the start period and interval to something more appropriate for the sidecar script results in a much shorter time to determine the sidecar container's health.
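A minimal sketch of how an operator might override the image's HEALTHCHECK timing at run time, assuming the script lives at /home/guild/.scripts/healthcheck.sh as in the steps above. The helper name healthcheck_flags and the example values are hypothetical and not part of this PR; the --health-* flags are the standard docker run options.

```shell
# Sketch only: the helper name and the default values are illustrative assumptions.
# Prints one docker run health flag per line.
healthcheck_flags() {
    local interval="${1:-5m}" start_period="${2:-5m}" timeout="${3:-100s}" retries="${4:-3}"
    printf '%s\n' \
        "--health-cmd=/home/guild/.scripts/healthcheck.sh" \
        "--health-interval=${interval}" \
        "--health-start-period=${start_period}" \
        "--health-timeout=${timeout}" \
        "--health-retries=${retries}"
}

# Example (not executed here; <image> is a placeholder):
#   docker run -d $(healthcheck_flags 1m 1m 100s 3) <image>
```

Called without arguments, the helper reproduces the current image settings listed above, so only the values being tested need to be passed.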

Keep RETRIES * 3 (seconds) below the container healthcheck timeout. Otherwise, during periods of high CPU load, the container can be marked unhealthy before the script has returned.
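The constraint above can be checked with a few lines of shell. This is a sketch, assuming the 3-second delay mentioned in the description; the names HEALTHCHECK_DELAY and fits_in_timeout are hypothetical and do not appear in healthcheck.sh.

```shell
HEALTHCHECK_DELAY=3   # seconds between CPU checks, per the PR description (assumed fixed)

# Returns 0 when the worst-case retry time (retries * delay) is below the timeout.
fits_in_timeout() {
    local retries="$1" timeout_s="$2"
    [ $(( retries * HEALTHCHECK_DELAY )) -lt "$timeout_s" ]
}

fits_in_timeout 20 100 && echo "RETRIES=20 fits within a 100 s timeout"
fits_in_timeout 40 100 || echo "RETRIES=40 does not fit within a 100 s timeout"
```

With the default 100-second timeout, the largest RETRIES value that satisfies the inequality is 33 (99 seconds).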

Motivation and context

Issue #1841

Which issue it fixes?

Closes #1841

How has this been tested?

  1. docker cp the script into the preview network cncli sync, validate, and leaderlog containers and wait until the interval runs the script
  2. Execute the script with docker exec to confirm it reports healthy
  3. Monitor the containers until the healthcheck interval occurs and confirm that they are marked healthy.

Additional Details

There is a SLEEPING_SCRIPTS array which is used for validate and leaderlog to still check for the cncli binary, but not consider a sleep period for validate and leaderlog to be unhealthy. Not 100% sure this is the best approach, but with sleep periods being variable I felt it was likely an acceptable middle ground.
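A rough sketch of the idea, not the code in this PR: the array name SLEEPING_SCRIPTS comes from the description, but its entries, the function names, and the use of pgrep are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical entries; the real array contents in healthcheck.sh may differ.
SLEEPING_SCRIPTS=("validate" "leaderlog")

# Returns 0 if the given helper-script mode is allowed to be idle (sleeping).
is_sleeping_script() {
    local mode="$1" entry
    for entry in "${SLEEPING_SCRIPTS[@]}"; do
        [[ "$entry" == "$mode" ]] && return 0
    done
    return 1
}

# Healthy if the binary exists and either its process is running
# or the mode is one that is expected to sleep between runs.
check_sidecar() {
    local mode="$1" binary="$2"
    command -v "$binary" >/dev/null 2>&1 || { echo "unhealthy: $binary not found"; return 1; }
    if pgrep -x "$binary" >/dev/null 2>&1 || is_sleeping_script "$mode"; then
        echo "healthy: $mode"
        return 0
    fi
    echo "unhealthy: $binary not running for $mode"
    return 1
}
```

The trade-off is the one noted above: a sleeping mode is reported healthy whenever the binary is present, so a hung sleep would not be detected.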

Please do not hesitate to suggest an alternative approach to handling healthchecks for sleeping sidecars if you think you have an improvement.

@adamsthws if you could please copy this into your sidecar containers (and your pool) and report back any results. I am marking this as a draft PR for the time being until testing is completed, after which if things look good I will mark it for review and get feedback from others.

Thanks

@TrevorBenson
Collaborator Author

FWIW, here are my preview network pool and cncli containers showing healthy once the script was copied in and the healthcheck interval was reached:

# podman ps --filter 'name=preview-cncli-[slv]' --filter 'name=preview-ccio-pool' --format '{{ .Names }}\t{{ .Status }}'
preview-ccio-pool	Up 4 weeks (healthy)
preview-cncli-sync	Up 3 days (healthy)
preview-cncli-validate	Up 3 days (healthy)
preview-cncli-leaderlog	Up 3 days (healthy)

@adamsthws
Contributor

adamsthws commented Jan 7, 2025

Looks good!
How I have tested...

cp the script into cncli sync, validate, leaderlog, pt-send-slots, pt-send-tip containers

Execute the script with docker exec.

  • Result: exit 0. 'We're healthy - cncli'

Monitor the containers until the healthcheck interval occurs and confirm that they are marked healthy.

  • Result: Docker ps shows containers are healthy.

Adjusted RETRIES

  • Note
    • Setting RETRIES=0 results in exit 127, 'Max retries reached for cncli'.
    • Setting RETRIES=1 results in exit 0. 'We're healthy - cncli'

Adjusted CPU_THRESHOLD

  • Note
    • The cncli process uses very little CPU, so even with the threshold set as low as 1% I was unable to intentionally get the healthcheck to fail.
# docker ps --filter 'name=cncli' --format '{{ .Names }}\t{{ .Status }}' | column -t
cncli-pt-send-tip    Up  33  minutes  (healthy)
cncli-pt-send-slots  Up  33  minutes  (healthy)
cncli-sync           Up  33  minutes  (healthy)
cncli-validate       Up  33  minutes  (healthy)
cncli-leaderlog      Up  33  minutes  (healthy)

@adamsthws
Contributor

adamsthws commented Jan 7, 2025

Further testing...

I was able to test with higher cpu load after deleting the cncli db and re-syncing.

Result

# ./healthcheck.sh
Checking health for process: cncli 
./healthcheck.sh: line 44: ((: 67.9: syntax error: invalid arithmetic operator (error token is ".9")
We're healthy - cncli
# exit 0

Line 44 of healthcheck.sh:
The (( CPU_USAGE > cpu_threshold )) construct in Bash is used for arithmetic evaluation, but it only supports integer arithmetic. It doesn't handle floating-point numbers.

This seems to fix it...
Line 41 (round float to nearest integer):

CPU_USAGE=$(ps -C "$process_name" -o %cpu= | awk '{s+=$1} END {print int(s + 0.5)}')

With the above change, when cpu load is higher than CPU_THRESHOLD, this is the result:

 # ./healthcheck.sh
Checking health for process: cncli
Warning: High CPU usage detected for 'cncli' (68%)
Max retries reached for cncli
# exit 1
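As an alternative to rounding, the comparison itself could be delegated to awk, which handles fractional values. This is only a sketch; the function name cpu_exceeds is hypothetical and is not part of healthcheck.sh.

```shell
# Returns 0 when usage (possibly fractional, e.g. 67.9) is strictly above the limit.
# awk compares the two values numerically, which Bash's (( )) cannot do for floats.
cpu_exceeds() {
    awk -v usage="$1" -v limit="$2" 'BEGIN { exit !(usage > limit) }'
}

if cpu_exceeds 67.9 80; then
    echo "above threshold"
else
    echo "within threshold"
fi
```

Rounding to an integer, as suggested above, keeps the rest of the script unchanged; the awk comparison avoids losing the fractional part when the usage sits right at the threshold.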

@TrevorBenson
Collaborator Author

TrevorBenson commented Jan 8, 2025

Yeah, there are rare instances where the cncli CPU percentage can be high, but this tends to be when resyncing the entire db and/or when a cncli init is running. Occasionally, if there is an issue with the node process itself, such as getting stuck in chainsync/blockfetch and never completing, I have also seen cncli reach a high percentage, but otherwise it's quite rare to see it increase.

I figured with mithril-signer or db-sync, it might be more useful.

@TrevorBenson
Collaborator Author

@adamsthws Feel free to submit suggestions to adjust the SCRIPT_TO_BINARY_MAP entries. Otherwise this week when I look at some other issues I'll go through each helper script and update the map and set this PR to ready to review.

Thanks for the testing.

@adamsthws
Contributor

adamsthws commented Jan 8, 2025

Testing revealed that setting RETRIES=0 results in script exit 1 without running the loop... it would be preferable to run the loop once when RETRIES=0.

Suggestion - Modify the loop condition to handle RETRIES=0 by changing line 39 to the following:

    for (( CHECK=0; CHECK<=RETRIES; CHECK++ )); do
    
    # 'RETRIES=3' results in the loop running a total of 4 times
    # 'RETRIES=0' results in the loop running a total of 1 time

Or...

    for (( CHECK=1; CHECK<=RETRIES || (RETRIES==0 && CHECK==1); CHECK++ )); do
    
    # 'RETRIES=3' results in the loop running a total of 3 times
    # 'RETRIES=0' results in the loop running a total of 1 time
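The two loop conditions can be compared side by side by counting iterations. The wrapper functions below are hypothetical and exist only for this illustration; the loop headers are the ones suggested above.

```shell
#!/usr/bin/env bash
# Count iterations of: for (( CHECK=0; CHECK<=RETRIES; CHECK++ ))
count_inclusive() {
    local RETRIES="$1" CHECK n=0
    for (( CHECK=0; CHECK<=RETRIES; CHECK++ )); do n=$(( n + 1 )); done
    echo "$n"
}

# Count iterations of: for (( CHECK=1; CHECK<=RETRIES || (RETRIES==0 && CHECK==1); CHECK++ ))
count_at_least_once() {
    local RETRIES="$1" CHECK n=0
    for (( CHECK=1; CHECK<=RETRIES || (RETRIES==0 && CHECK==1); CHECK++ )); do n=$(( n + 1 )); done
    echo "$n"
}

for r in 0 1 3; do
    echo "RETRIES=$r inclusive=$(count_inclusive "$r") at_least_once=$(count_at_least_once "$r")"
done
```

The first form treats RETRIES as the number of retries after an initial attempt, so every value runs one more check than before. The second keeps RETRIES as the total number of attempts and only special-cases zero.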

@adamsthws
Contributor

I started thinking about a cncli-specific check. The following function is an idea for checking cncli status...

# Function to check cncli status
check_cncli_status() {
    CNCLI=$(which cncli)

    for (( CHECK=1; CHECK<=RETRIES || (RETRIES==0 && CHECK==1); CHECK++ )); do
        CNCLI_OUTPUT=$($CNCLI status \
            --byron-genesis "/opt/cardano/cnode/files/byron-genesis.json" \
            --shelley-genesis "/opt/cardano/cnode/files/shelley-genesis.json" \
            --db "/opt/cardano/cnode/guild-db/cncli/cncli.db")

        CNCLI_STATUS=$(echo "$CNCLI_OUTPUT" | jq -r '.status')
        ERROR_MESSAGE=$(echo "$CNCLI_OUTPUT" | jq -r '.errorMessage')

        if [ "$CNCLI_STATUS" == "ok" ]; then
            echo "We're healthy - cncli status is ok and synced."
            return 0  # Return 0 if the status is ok
        elif [ "$CNCLI_STATUS" == "error" ]; then
            if [ "$ERROR_MESSAGE" == "db not fully synced!" ]; then
                echo "cncli's sqlite database is not fully synced. Attempt $CHECK. Retrying in 3 minutes."
                sleep 180  # Wait 3 minutes then retry to allow time for the database to sync
            elif [ "$ERROR_MESSAGE" == "database not found!" ]; then
                echo "cncli's sqlite database not found. Attempt $CHECK. Retrying in 3 minutes."
                sleep 180  # Wait 3 minutes then retry to allow time for the database to be created
            else
                echo "Error - cncli status: $ERROR_MESSAGE. Attempt $CHECK. Retrying in 3 seconds."
                sleep 3  # Wait 3 seconds then retry for other errors
            fi
        else # If status is not "ok" or "error"
            echo "cncli status: $CNCLI_STATUS. Attempt $CHECK. Retrying in 3 seconds."
            sleep 3  # Wait 3 seconds then retry
        fi
    done

    echo "cncli status check failed after $RETRIES attempts."
    return 1  # Return 1 if retries are exhausted
}

Perhaps it would be improved further by also checking whether the sync is incrementing, so the healthcheck doesn't fail during initial sync.

How would you feel about adding me as a commit co-author if you decide to use this?

@TrevorBenson
Collaborator Author

TrevorBenson commented Jan 11, 2025

@adamsthws I'm happy to make you a co-author even for something simple. For example, if you know how to submit a suggestion, go ahead and apply one for the line for (( CHECK=0; CHECK<=RETRIES; CHECK++ )); do to modify the PR and I'll merge it as a commit.

@TrevorBenson
Collaborator Author

TrevorBenson commented Jan 11, 2025

Regarding the larger block for cncli checks: first, it is clear that a lot of thought went into it.

This portion:

            if [ "$ERROR_MESSAGE" == "db not fully synced!" ]; then
                echo "cncli's sqlite database is not fully synced. Attempt $CHECK. Retrying in 3 minutes."
                sleep 180  # Wait 3 minutes then retry to allow time for the database to sync
            elif [ "$ERROR_MESSAGE" == "database not found!" ]; then
                echo "cncli's sqlite database not found. Attempt $CHECK. Retrying in 3 minutes."
                sleep 180  # Wait 3 minutes then retry to allow time for the database to be created

Sleeps of 180 seconds exceed the current timeout period of 100 seconds. Options:

  • Additional documentation. My gut feeling is that it would lead to additional support requests from operators who don't read the docs. It also makes the monolithic container slightly more complex.
  • Increasing the container's timeout. This could reduce observability for the node or other processes.

With container settings of 3 retries, a 5-minute interval, and a 100-second timeout, it takes 15 minutes from the last healthy response, or 10 minutes from the first unhealthy response, before the container exhausts its retries and is marked unhealthy. I think this covers the two 180-second sleeps, even if the operator reduces the interval and timeout when not running the node.
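The 15-minute and 10-minute figures follow from the interval and retry count. A small sketch of that arithmetic, assuming every check fails immediately and ignoring the start period and the time each check takes to run; the function name is hypothetical.

```shell
# Approximate seconds until a container is marked unhealthy.
# Simplification: each check fails instantly, so only the interval matters.
seconds_to_unhealthy() {
    local interval_s="$1" retries="$2"
    echo "from_last_healthy=$(( interval_s * retries )) from_first_failure=$(( interval_s * (retries - 1) ))"
}

seconds_to_unhealthy 300 3
```

For the current settings (300-second interval, 3 retries) this gives 900 and 600 seconds. In practice each failing check can add up to the timeout before the next interval begins, so the real figures can be somewhat longer.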

Separately, conversations outside of this PR and thread have pointed to some of the logic used in KOIOS for db-sync, which could also be used for checking the sqlite DB for cncli.

#!/bin/bash

export PGPASSWORD=${POSTGRES_PASSWORD}
[[ $(( $(date +%s) - $(date --date="$(psql -U ${POSTGRES_USER} -d ${POSTGRES_DB} -qt -c 'select time from block order by id desc limit 1;')" +%s) )) -lt 3600 ]] || exit 1

I haven't examined what the common drift might be for a db-sync instance from the last block produced, and for cncli I suspect we could make it shorter than 1 hour.
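The freshness test in the one-liner above can be separated from the database query so that the allowed drift becomes a parameter. This is a sketch under assumptions: the function name is_fresh and the 600-second example are hypothetical, and how the last block time would be read from the cncli sqlite DB is deliberately left out because it is not shown in this thread.

```shell
# Returns 0 if the last block is newer than max_age_s seconds.
# Arguments: last block time (epoch seconds), max age (default 3600), now (default: current time).
is_fresh() {
    local last_block_epoch="$1" max_age_s="${2:-3600}" now_epoch="${3:-$(date +%s)}"
    [ $(( now_epoch - last_block_epoch )) -lt "$max_age_s" ]
}

# Example use inside a healthcheck (last_block_time is assumed to come from a DB query):
#   is_fresh "$(date --date="$last_block_time" +%s)" 600 || exit 1
```

Keeping the threshold as an argument would let db-sync and cncli use different drift limits once typical values are known.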


These are just my thoughts. If you think that I overlooked some aspect please don't hesitate to continue the discussion.

Development

Successfully merging this pull request may close these issues.

Docker container healthcheck for CNCLI usage
3 participants