Issue684 service timeout and logging #685

kbenne · 2024-09-27T17:02:15Z

This PR addresses #684

* Within worker, add logs when a message is received and immediatly after a response has been sent. * Include HOSTNAME in the test metadata that is stored in Redis. For a K8s deployment, HOSTNAME will correspond to the name of the pod running the test. This will make it possible to retreive worker logs for a misbehaving test. * When a test is complete, the worker and the associated logs, may no longer exist, however the logs will still be available in the log file contained within the test payload that is pushed to long term storage. * These changes pertain to worker, however there is an existing log message within the web implementation that logs when a message is sent to the worker, but no response is received. Additionally, each message between web and worker is given a unique ID, therefore with all of this togethor there will be breadcrumbs if a message is dropped.

* The default message timeout is now 20 minutes, and the value is configurable using the BOPTEST_MESSAGE_TIMEOUT environment variable.

The message subscription timeout is now configured as BOPTEST_MESSAGE_TIMEOUT + 60 seconds

kbenne · 2024-09-27T17:06:39Z

docker-compose.yml

@@ -31,6 +31,7 @@ services:
      - BOPTEST_REDIS_HOST
      - BOPTEST_REGION
      - BOPTEST_TIMEOUT


BOPTEST_TIMEOUT is really "idle" timeout. In other words how long there can be no interactions with a running test before it will automatically stop. It might be wise to rename and document these config variables a little better.

kbenne · 2024-09-27T17:08:33Z

service/docs/redis.md

+| `tests:${testid}` | `status`    | `Queued \| Running`                                                                      |
+| `tests:${testid}` | `timestamp` | Epoch seconds when test is queued.                                                       |
+| `tests:${testid}` | `user`      | The OAuth user "sub", or undefined for anonymous tests.                                  |
+| `tests:${testid}` | `host`      | The value of HOSTNAME for the worker running the test. On K8s this will be the pod name. |


There are a number of diffs here, due to formatting adjustment, but the addition of host is the actual change. For an instance deployed on Kubernetes, host will be the worker pod name. This is really useful for connecting to a pod that may be running a test and doing diagnostics if there is an issue.

kbenne · 2024-09-27T17:09:18Z

service/test/test_testcase_api.py

@@ -56,6 +56,9 @@ def test_ibpsa_boptest_testcase():
    response = requests.get(f"{host}/name/{testid}")
    check.equal(response.status_code, 200)

+    response = requests.post(f"{host}/advance/{testid}")


A lot of API tests in this file, but there was not actaully a test for advance, which is a useful assertion when working on timeout issues.

kbenne · 2024-09-27T17:10:55Z

service/web/server/src/lib/messaging.js

-        reject(new Error(`Timeout while sending command '${method}' to testid '${workerID}'`))
+        reject(
+          new Error(
+            `Timeout for request: '${requestID}', with method:'${method}', sent to testid: '${workerID}'`,


requestID can be (should be) matched to a message received by a worker running a test. Using this log (which will show up in server logs) we can track down, was a message recieved by a worker, and was a response sent. If not, we get this log and can figure out where in the chain it was dropped.

kbenne · 2024-09-27T17:11:47Z

service/worker/jobs/boptest_run_test/job.py

@@ -91,7 +99,7 @@ def run(self):
    def check_idle_time(self):
        idle_time = datetime.now() - self.last_message_time
        if idle_time.total_seconds() > self.timeout:
-            print("Testid '%s' is terminating due to inactivity." % self.testid)
+            self.logger.info("Testid '%s' is terminating due to inactivity." % self.testid)


Using the logger for this stuff will ensure that we get these messages in the log file that clients are able to download.

kbenne · 2024-09-27T17:12:28Z

service/worker/logger.py

        self.logger = logging.getLogger('worker')
-        self.formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+        self.logger.setLevel(os.environ.get("BOPTEST_LOGLEVEL", "INFO"))
+        fmt = '%(asctime)s UTC\t%(name)-20s%(levelname)s\t%(message)s'


This is designed to match the log formatting used in testcase.py. It would be wise to consoldiate the logger implementation at some future point.

kbenne · 2024-09-27T17:15:43Z

docker-compose.yml

@@ -31,6 +31,7 @@ services:
      - BOPTEST_REDIS_HOST
      - BOPTEST_REGION
      - BOPTEST_TIMEOUT
+      - BOPTEST_MESSAGE_TIMEOUT


On the other hand. BOPTEST_MESSAGE_TIMEOUT is how long the web server waits for a response from the worker running a test.

kbenne added 3 commits September 27, 2024 11:51

Adjust message timeout between web and worker

c6ea785

* The default message timeout is now 20 minutes, and the value is configurable using the BOPTEST_MESSAGE_TIMEOUT environment variable.

Further adjust messaging timeout

38bd2b6

The message subscription timeout is now configured as BOPTEST_MESSAGE_TIMEOUT + 60 seconds

kbenne commented Sep 27, 2024

View reviewed changes

dhblum merged commit 891f99b into issue617_refactorService_with_history2 Sep 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue684 service timeout and logging #685

Issue684 service timeout and logging #685

kbenne commented Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

kbenne Sep 27, 2024

Issue684 service timeout and logging #685

Issue684 service timeout and logging #685

Conversation

kbenne commented Sep 27, 2024

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment

kbenne Sep 27, 2024

Choose a reason for hiding this comment