-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue684 service timeout and logging #685
Issue684 service timeout and logging #685
Conversation
* Within worker, add logs when a message is received and immediatly after a response has been sent. * Include HOSTNAME in the test metadata that is stored in Redis. For a K8s deployment, HOSTNAME will correspond to the name of the pod running the test. This will make it possible to retreive worker logs for a misbehaving test. * When a test is complete, the worker and the associated logs, may no longer exist, however the logs will still be available in the log file contained within the test payload that is pushed to long term storage. * These changes pertain to worker, however there is an existing log message within the web implementation that logs when a message is sent to the worker, but no response is received. Additionally, each message between web and worker is given a unique ID, therefore with all of this togethor there will be breadcrumbs if a message is dropped.
* The default message timeout is now 20 minutes, and the value is configurable using the BOPTEST_MESSAGE_TIMEOUT environment variable.
The message subscription timeout is now configured as BOPTEST_MESSAGE_TIMEOUT + 60 seconds
@@ -31,6 +31,7 @@ services: | |||
- BOPTEST_REDIS_HOST | |||
- BOPTEST_REGION | |||
- BOPTEST_TIMEOUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOPTEST_TIMEOUT is really "idle" timeout. In other words how long there can be no interactions with a running test before it will automatically stop. It might be wise to rename and document these config variables a little better.
| `tests:${testid}` | `status` | `Queued \| Running` | | ||
| `tests:${testid}` | `timestamp` | Epoch seconds when test is queued. | | ||
| `tests:${testid}` | `user` | The OAuth user "sub", or undefined for anonymous tests. | | ||
| `tests:${testid}` | `host` | The value of HOSTNAME for the worker running the test. On K8s this will be the pod name. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a number of diffs here, due to formatting adjustment, but the addition of host
is the actual change. For an instance deployed on Kubernetes, host will be the worker pod name. This is really useful for connecting to a pod that may be running a test and doing diagnostics if there is an issue.
@@ -56,6 +56,9 @@ def test_ibpsa_boptest_testcase(): | |||
response = requests.get(f"{host}/name/{testid}") | |||
check.equal(response.status_code, 200) | |||
|
|||
response = requests.post(f"{host}/advance/{testid}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of API tests in this file, but there was not actaully a test for advance, which is a useful assertion when working on timeout issues.
reject(new Error(`Timeout while sending command '${method}' to testid '${workerID}'`)) | ||
reject( | ||
new Error( | ||
`Timeout for request: '${requestID}', with method:'${method}', sent to testid: '${workerID}'`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requestID can be (should be) matched to a message received by a worker running a test. Using this log (which will show up in server logs) we can track down, was a message recieved by a worker, and was a response sent. If not, we get this log and can figure out where in the chain it was dropped.
@@ -91,7 +99,7 @@ def run(self): | |||
def check_idle_time(self): | |||
idle_time = datetime.now() - self.last_message_time | |||
if idle_time.total_seconds() > self.timeout: | |||
print("Testid '%s' is terminating due to inactivity." % self.testid) | |||
self.logger.info("Testid '%s' is terminating due to inactivity." % self.testid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the logger for this stuff will ensure that we get these messages in the log file that clients are able to download.
self.logger = logging.getLogger('worker') | ||
self.formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') | ||
self.logger.setLevel(os.environ.get("BOPTEST_LOGLEVEL", "INFO")) | ||
fmt = '%(asctime)s UTC\t%(name)-20s%(levelname)s\t%(message)s' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is designed to match the log formatting used in testcase.py. It would be wise to consoldiate the logger implementation at some future point.
@@ -31,6 +31,7 @@ services: | |||
- BOPTEST_REDIS_HOST | |||
- BOPTEST_REGION | |||
- BOPTEST_TIMEOUT | |||
- BOPTEST_MESSAGE_TIMEOUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand. BOPTEST_MESSAGE_TIMEOUT is how long the web server waits for a response from the worker running a test.
This PR addresses #684