Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue684 service timeout and logging #685

Conversation

kbenne
Copy link
Contributor

@kbenne kbenne commented Sep 27, 2024

This PR addresses #684

* Within worker, add logs when a message is received and immediatly
  after a response has been sent.

* Include HOSTNAME in the test metadata that is stored in Redis. For a
  K8s deployment, HOSTNAME will correspond to the name of the pod
  running the test. This will make it possible to retreive worker logs
  for a misbehaving test.

* When a test is complete, the worker and the associated logs, may no
  longer exist, however the logs will still be available in the log file
  contained within the test payload that is pushed to long term storage.

* These changes pertain to worker, however there is an existing log
  message within the web implementation that logs when a message is sent to
  the worker, but no response is received. Additionally, each message
  between web and worker is given a unique ID, therefore with all of
  this togethor there will be breadcrumbs if a message is dropped.
* The default message timeout is now 20 minutes, and the value is configurable using the
BOPTEST_MESSAGE_TIMEOUT environment variable.
The message subscription timeout is now configured as
BOPTEST_MESSAGE_TIMEOUT + 60 seconds
@@ -31,6 +31,7 @@ services:
- BOPTEST_REDIS_HOST
- BOPTEST_REGION
- BOPTEST_TIMEOUT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BOPTEST_TIMEOUT is really "idle" timeout. In other words how long there can be no interactions with a running test before it will automatically stop. It might be wise to rename and document these config variables a little better.

| `tests:${testid}` | `status` | `Queued \| Running` |
| `tests:${testid}` | `timestamp` | Epoch seconds when test is queued. |
| `tests:${testid}` | `user` | The OAuth user "sub", or undefined for anonymous tests. |
| `tests:${testid}` | `host` | The value of HOSTNAME for the worker running the test. On K8s this will be the pod name. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of diffs here, due to formatting adjustment, but the addition of host is the actual change. For an instance deployed on Kubernetes, host will be the worker pod name. This is really useful for connecting to a pod that may be running a test and doing diagnostics if there is an issue.

@@ -56,6 +56,9 @@ def test_ibpsa_boptest_testcase():
response = requests.get(f"{host}/name/{testid}")
check.equal(response.status_code, 200)

response = requests.post(f"{host}/advance/{testid}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of API tests in this file, but there was not actaully a test for advance, which is a useful assertion when working on timeout issues.

reject(new Error(`Timeout while sending command '${method}' to testid '${workerID}'`))
reject(
new Error(
`Timeout for request: '${requestID}', with method:'${method}', sent to testid: '${workerID}'`,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestID can be (should be) matched to a message received by a worker running a test. Using this log (which will show up in server logs) we can track down, was a message recieved by a worker, and was a response sent. If not, we get this log and can figure out where in the chain it was dropped.

@@ -91,7 +99,7 @@ def run(self):
def check_idle_time(self):
idle_time = datetime.now() - self.last_message_time
if idle_time.total_seconds() > self.timeout:
print("Testid '%s' is terminating due to inactivity." % self.testid)
self.logger.info("Testid '%s' is terminating due to inactivity." % self.testid)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the logger for this stuff will ensure that we get these messages in the log file that clients are able to download.

self.logger = logging.getLogger('worker')
self.formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
self.logger.setLevel(os.environ.get("BOPTEST_LOGLEVEL", "INFO"))
fmt = '%(asctime)s UTC\t%(name)-20s%(levelname)s\t%(message)s'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is designed to match the log formatting used in testcase.py. It would be wise to consoldiate the logger implementation at some future point.

@@ -31,6 +31,7 @@ services:
- BOPTEST_REDIS_HOST
- BOPTEST_REGION
- BOPTEST_TIMEOUT
- BOPTEST_MESSAGE_TIMEOUT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand. BOPTEST_MESSAGE_TIMEOUT is how long the web server waits for a response from the worker running a test.

@dhblum dhblum merged commit 891f99b into issue617_refactorService_with_history2 Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants