Fix KSM job metrics #4224
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4224      +/-   ##
==========================================
+ Coverage   78.46%   84.93%    +6.46%
==========================================
  Files         163        4      -159
  Lines        8610      438     -8172
  Branches     1052       80      -972
==========================================
- Hits         6756      372     -6384
+ Misses       1620       45     -1575
+ Partials      234       21      -213
""" | ||
Extract timestamp of job names if they match -(\\^.+\\-) - match everything until a `-` | ||
""" | ||
pattern = r"(^.+\-)" |
Could we find something more performant than a regex for this? E.g. splitting the name on `-`, checking `.isdigit()` on the last element, and returning it if True.
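A sketch of what that could look like; the function name and the `0` fallback (mirroring the `job_ts != 0` guards elsewhere in this diff) are illustrative, not the check's actual code:

```python
def extract_job_timestamp(name):
    # CronJob-generated job names end in a numeric timestamp suffix,
    # e.g. 'hello-1565000000'. Take the last '-' separated segment and
    # return it as an int only when it is purely numeric.
    last_part = name.split('-')[-1]
    if last_part.isdigit():
        return int(last_part)
    return 0  # no timestamp suffix found

print(extract_job_timestamp('hello-1565000000'))  # 1565000000
print(extract_job_timestamp('hello-job'))         # 0
```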
Could you also add a unit test for this function?
kubernetes_state/datadog_checks/kubernetes_state/kubernetes_state.py
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.succeeded_job_counts[frozenset(tags)].last_job_ts:
ditto
Let's test this in alerting1 as soon as it's ready for QA. They need the feature for a cronjob.
if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to fail")
    self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    self.failed_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
That's better because this gives an accurate count, but now this `last_jobs_ts` grows unbounded, increasing memory usage over time. To highlight that, you can add a `print(len(self.failed_job_counts[frozenset(tags)].last_jobs_ts))` and run the tests with new job executions. I think the list will grow every time the cronjob triggers and you get a new timestamp.
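A standalone illustration of that growth, using a stand-in `JobCount` rather than the check's real class:

```python
from collections import defaultdict


class JobCount(object):
    def __init__(self):
        self.count = 0
        self.last_jobs_ts = []  # one entry per job execution, never pruned


failed_job_counts = defaultdict(JobCount)
tags = frozenset(['kube_job:hello', 'namespace:default'])

# Three cron triggers, each with a new timestamp: every one appends to the list.
for job_ts in (1565000000, 1565000060, 1565000120):
    if job_ts != 0 and job_ts not in failed_job_counts[tags].last_jobs_ts:
        failed_job_counts[tags].count += 1
        failed_job_counts[tags].last_jobs_ts.append(job_ts)

print(len(failed_job_counts[tags].last_jobs_ts))  # 3, and one more per future run
```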
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to fail")
That's fine for testing, but let's remove it before merging.
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to success")
same
if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to success")
    self.succeeded_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    self.succeeded_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
same issue with mem usage
…ing out of bounds array
…the current run and over all the runs
Some minor questions and nits.
self.monotonic_count(scraper_config['namespace'] + '.job.failed', job.count, list(job_tags))
if job.current_run_max_ts > 0:
    job.previous_run_max_ts = job.current_run_max_ts
    job.current_run_max_ts = 0
Maybe write a small method on `JobCount` to do those 3 lines.
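For example, a hypothetical helper along those lines (the method name is an assumption, not necessarily what was merged):

```python
class JobCount(object):
    def __init__(self):
        self.count = 0
        self.previous_run_max_ts = 0
        self.current_run_max_ts = 0

    def update_previous_run_max_ts(self):
        # Promote the max timestamp seen during this check run and reset
        # the per-run tracker before the next scrape.
        if self.current_run_max_ts > 0:
            self.previous_run_max_ts = self.current_run_max_ts
            self.current_run_max_ts = 0
```

The flush loop above would then just call `job.update_previous_run_max_ts()` after the `monotonic_count` submission.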
    return int(ts)
else:
    msg = 'Cannot extract ts from job name {}'.format(name)
    self.log.debug(msg)
It's better to do `log.debug(msg, name)` so the message isn't formatted when the logging level isn't debug.
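For reference, a minimal sketch of that difference outside the check, using the standard logging module (nothing here comes from the PR itself):

```python
import logging

log = logging.getLogger(__name__)
name = 'hello-job'

# Eager formatting: the string is built even when DEBUG is disabled.
log.debug('Cannot extract ts from job name {}'.format(name))

# Lazy formatting: the logger interpolates %s only if DEBUG is enabled.
log.debug('Cannot extract ts from job name %s', name)
```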
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
Can you do `job_count = self.failed_job_counts[frozenset(tags)]`? It would clarify the code a bit.
Not sure I understand this comment.
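For illustration, here are the same lines with the lookup bound once; this is only a sketch of the suggestion, and the surrounding context (`self`, `sample`, `tags`) is the same as in the quoted diff:

```python
# Bind the per-tagset JobCount once instead of repeating the frozenset lookup.
job_count = self.failed_job_counts[frozenset(tags)]
if job_ts != 0 and job_ts > job_count.previous_run_max_ts:
    job_count.count += sample[self.SAMPLE_VALUE]
    if job_ts > job_count.current_run_max_ts:
        job_count.current_run_max_ts = job_ts
```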
if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
    self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    if job_ts > self.failed_job_counts[frozenset(tags)].current_run_max_ts:
        self.failed_job_counts[frozenset(tags)].current_run_max_ts = job_ts
Maybe job_count.current_run_max_ts = max(job_ts, job_count.current_run_max_ts)
Indeed, thanks. Modifying it.
)

# Re-run check to make sure we don't count the same jobs
for _ in range(1):
You should remove that range call.
👍
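A hedged sketch of what this step could look like without the `range()` wrapper; `check` and `instance` are the test's existing fixtures and are assumed here rather than shown in the diff:

```python
# Re-run check to make sure we don't count the same jobs: the second run
# sees the same job timestamps, so the counts must not grow.
check.check(instance)
```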
)

check.poll = mock.MagicMock(return_value=MockResponse(payload, 'text/plain'))
for _ in range(1):
Same.
👍
Hey, just left a few comments about syntax and structure. Feel free to agree/disagree :)
What does this PR do?
This PR fixes the KSM job succeeded and failed metrics by keeping the last timestamp of each job so the same job execution is not counted multiple times.
It also adds a test that checks the values of these metrics after multiple runs.
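A minimal, self-contained sketch of that idea (illustrative names only, not the check's real attributes):

```python
from collections import defaultdict

# Last job timestamp seen per tag set, and the running counter per tag set.
last_job_ts = defaultdict(int)
job_counts = defaultdict(float)


def record_job_sample(tags, job_ts, value):
    # Only count a sample whose timestamp is newer than the last one
    # recorded for this tag set; re-scraped executions are skipped.
    key = frozenset(tags)
    if job_ts != 0 and job_ts > last_job_ts[key]:
        job_counts[key] += value
        last_job_ts[key] = job_ts


record_job_sample(['kube_job:hello'], 1565000000, 1)
record_job_sample(['kube_job:hello'], 1565000000, 1)  # same execution scraped again: not re-counted
record_job_sample(['kube_job:hello'], 1565000060, 1)  # new execution: counted
print(job_counts[frozenset(['kube_job:hello'])])  # 2.0
```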
Motivation
Customers have reported multiple times that these metrics were broken.
Review checklist (to be filled by reviewers)
- changelog/ and integration/ labels attached