Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix KSM job metrics #4224

Merged
merged 18 commits into from
Aug 19, 2019
Merged

Fix KSM job metrics #4224

merged 18 commits into from
Aug 19, 2019

Conversation

Simwar
Copy link
Contributor

@Simwar Simwar commented Jul 29, 2019

What does this PR do?

This PR fixes the KSM job succeeded and failed metrics by keeping the last timestamp of each job to not count them multiple times.
It also adds a test to test values of these metrics after multiple runs.

Motivation

Customers reported these metrics were broken multiple times.

Review checklist (to be filled by reviewers)

  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached
  • Feature or bugfix must have tests
  • Git history must be clean
  • If PR adds a configuration option, it must be added to the configuration file.

@Simwar Simwar requested review from a team as code owners July 29, 2019 14:00
@Simwar Simwar changed the title fix job metrics and added an extra test fix KSM job metrics Jul 29, 2019
@codecov
Copy link

codecov bot commented Jul 29, 2019

Codecov Report

Merging #4224 into master will increase coverage by 6.46%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4224      +/-   ##
==========================================
+ Coverage   78.46%   84.93%   +6.46%     
==========================================
  Files         163        4     -159     
  Lines        8610      438    -8172     
  Branches     1052       80     -972     
==========================================
- Hits         6756      372    -6384     
+ Misses       1620       45    -1575     
+ Partials      234       21     -213

"""
Extract timestamp of job names if they match -(\\^.+\\-) - match everything until a `-`
"""
pattern = r"(^.+\-)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we find something more performant than a regex for this? Eg: splitting name over -, checking .isdigit() on the last element, and returning it if True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a unit test for this function?

tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.succeeded_job_counts[frozenset(tags)].last_job_ts:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@hithwen hithwen requested a review from a team August 1, 2019 10:18
@hkaj
Copy link
Member

hkaj commented Aug 12, 2019

Let's test this in alerting1 as soon as it's ready for QA. They need the feature for a cronjob.

if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
print("Add value to fail")
self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
self.failed_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's better because this gives an accurate count, but now this last_jobs_ts grows unbounded, increasing memory usage over time.

To highlight that you can add a print(len(self.failed_job_counts[frozenset(tags)].last_jobs_ts)) and run the tests with new job executions. I think the list will grow every time the cronjob triggers and you get a new timestamp.

tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
print("Add value to fail")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine for testing, but let's remove it before merging.

tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
print("Add value to success")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
print("Add value to success")
self.succeeded_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
self.succeeded_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue with mem usage

hkaj
hkaj previously approved these changes Aug 13, 2019
Copy link
Contributor

@therve therve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor questions and nits.

self.monotonic_count(scraper_config['namespace'] + '.job.failed', job.count, list(job_tags))
if job.current_run_max_ts > 0:
job.previous_run_max_ts = job.current_run_max_ts
job.current_run_max_ts = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe write a small method on JobCount to do those 3 lines.

return int(ts)
else:
msg = 'Cannot extract ts from job name {}'.format(name)
self.log.debug(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to do log.debug(msg, name) to not format when the logging level isn't debug.

tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do job_count = self.failed_job_counts[frozenset(tags)] ? It would clarify the code a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to understand this comment

if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
if job_ts > self.failed_job_counts[frozenset(tags)].current_run_max_ts:
self.failed_job_counts[frozenset(tags)].current_run_max_ts = job_ts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe job_count.current_run_max_ts = max(job_ts, job_count.current_run_max_ts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks. Modifying

)

# Re-run check to make sure we don't count the same jobs
for _ in range(1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove that range call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

)

check.poll = mock.MagicMock(return_value=MockResponse(payload, 'text/plain'))
for _ in range(1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@AlexandreYang AlexandreYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, just left few comments about syntax and structure. Feel free to agree/disagree :)

@therve therve merged commit 9ef31f7 into master Aug 19, 2019
@therve therve deleted the fix_job_metrics_ksm branch August 19, 2019 09:54
@therve therve restored the fix_job_metrics_ksm branch August 19, 2019 09:54
@therve therve deleted the fix_job_metrics_ksm branch August 19, 2019 09:54
@ofek ofek changed the title fix KSM job metrics Fix KSM job metrics Aug 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants