Fix KSM job metrics #4224
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4224      +/-   ##
==========================================
+ Coverage   78.46%   84.93%    +6.46%
==========================================
  Files         163        4      -159
  Lines        8610      438     -8172
  Branches     1052       80      -972
==========================================
- Hits         6756      372     -6384
+ Misses       1620       45     -1575
+ Partials      234       21      -213
""" | ||
Extract timestamp of job names if they match -(\\^.+\\-) - match everything until a `-` | ||
""" | ||
pattern = r"(^.+\-)" |
Could we find something more performant than a regex for this? E.g. splitting the name on `-`, checking `.isdigit()` on the last element, and returning it if True.
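A sketch of what that could look like; the function name and the `0` fallback (mirroring the `job_ts != 0` guards elsewhere in this diff) are illustrative, not the check's actual code:

```python
def extract_job_timestamp(name):
    # CronJob-generated job names end in a numeric timestamp suffix,
    # e.g. 'hello-1565000000'. Take the last '-' separated segment and
    # return it as an int only when it is purely numeric.
    last_part = name.split('-')[-1]
    if last_part.isdigit():
        return int(last_part)
    return 0  # no timestamp suffix found

print(extract_job_timestamp('hello-1565000000'))  # 1565000000
print(extract_job_timestamp('hello-job'))         # 0
```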
Could you also add a unit test for this function?
kubernetes_state/datadog_checks/kubernetes_state/kubernetes_state.py
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.succeeded_job_counts[frozenset(tags)].last_job_ts:
ditto
Let's test this in alerting1 as soon as it's ready for QA. They need the feature for a cronjob.
if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to fail")
    self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    self.failed_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
That's better because this gives an accurate count, but now this `last_jobs_ts` grows unbounded, increasing memory usage over time. To highlight that, you can add a `print(len(self.failed_job_counts[frozenset(tags)].last_jobs_ts))` and run the tests with new job executions. I think the list will grow every time the cronjob triggers and you get a new timestamp.
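A standalone illustration of that growth, using a stand-in `JobCount` rather than the check's real class:

```python
from collections import defaultdict


class JobCount(object):
    def __init__(self):
        self.count = 0
        self.last_jobs_ts = []  # one entry per job execution, never pruned


failed_job_counts = defaultdict(JobCount)
tags = frozenset(['kube_job:hello', 'namespace:default'])

# Three cron triggers, each with a new timestamp: every one appends to the list.
for job_ts in (1565000000, 1565000060, 1565000120):
    if job_ts != 0 and job_ts not in failed_job_counts[tags].last_jobs_ts:
        failed_job_counts[tags].count += 1
        failed_job_counts[tags].last_jobs_ts.append(job_ts)

print(len(failed_job_counts[tags].last_jobs_ts))  # 3, and one more per future run
```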
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.failed_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to fail")
That's fine for testing, but let's remove it before merging.
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_succeeded_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to success")
same
if job_ts != 0 and job_ts not in self.succeeded_job_counts[frozenset(tags)].last_jobs_ts:
    print("Add value to success")
    self.succeeded_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    self.succeeded_job_counts[frozenset(tags)].last_jobs_ts.append(job_ts)
same issue with mem usage
…ing out of bounds array
…the current run and over all the runs
Some minor questions and nits.
self.monotonic_count(scraper_config['namespace'] + '.job.failed', job.count, list(job_tags))
if job.current_run_max_ts > 0:
    job.previous_run_max_ts = job.current_run_max_ts
    job.current_run_max_ts = 0
Maybe write a small method on `JobCount` to do those 3 lines.
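For example, a hypothetical helper along those lines (the method name is an assumption, not necessarily what was merged):

```python
class JobCount(object):
    def __init__(self):
        self.count = 0
        self.previous_run_max_ts = 0
        self.current_run_max_ts = 0

    def update_previous_run_max_ts(self):
        # Promote the max timestamp seen during this check run and reset
        # the per-run tracker before the next scrape.
        if self.current_run_max_ts > 0:
            self.previous_run_max_ts = self.current_run_max_ts
            self.current_run_max_ts = 0
```

The flush loop above would then just call `job.update_previous_run_max_ts()` after the `monotonic_count` submission.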
    return int(ts)
else:
    msg = 'Cannot extract ts from job name {}'.format(name)
    self.log.debug(msg)
It's better to do `log.debug(msg, name)` so the message isn't formatted when the logging level isn't debug.
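For reference, a minimal sketch of that difference outside the check, using the standard logging module (nothing here comes from the PR itself):

```python
import logging

log = logging.getLogger(__name__)
name = 'hello-job'

# Eager formatting: the string is built even when DEBUG is disabled.
log.debug('Cannot extract ts from job name {}'.format(name))

# Lazy formatting: the logger interpolates %s only if DEBUG is enabled.
log.debug('Cannot extract ts from job name %s', name)
```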
    tags.append(self._format_tag(label_name, trimmed_job, scraper_config))
else:
    tags.append(self._format_tag(label_name, label_value, scraper_config))
self.job_failed_count[frozenset(tags)] += sample[self.SAMPLE_VALUE]
if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
Can you do `job_count = self.failed_job_counts[frozenset(tags)]`? It would clarify the code a bit.
Not sure I understand this comment.
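For illustration, here are the same lines with the lookup bound once; this is only a sketch of the suggestion, and the surrounding context (`self`, `sample`, `tags`) is the same as in the quoted diff:

```python
# Bind the per-tagset JobCount once instead of repeating the frozenset lookup.
job_count = self.failed_job_counts[frozenset(tags)]
if job_ts != 0 and job_ts > job_count.previous_run_max_ts:
    job_count.count += sample[self.SAMPLE_VALUE]
    if job_ts > job_count.current_run_max_ts:
        job_count.current_run_max_ts = job_ts
```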
if job_ts != 0 and job_ts > self.failed_job_counts[frozenset(tags)].previous_run_max_ts:
    self.failed_job_counts[frozenset(tags)].count += sample[self.SAMPLE_VALUE]
    if job_ts > self.failed_job_counts[frozenset(tags)].current_run_max_ts:
        self.failed_job_counts[frozenset(tags)].current_run_max_ts = job_ts
Maybe job_count.current_run_max_ts = max(job_ts, job_count.current_run_max_ts)
Indeed, thanks. Modifying it.
)

# Re-run check to make sure we don't count the same jobs
for _ in range(1):
You should remove that range call.
👍
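A hedged sketch of what this step could look like without the `range()` wrapper; `check` and `instance` are the test's existing fixtures and are assumed here rather than shown in the diff:

```python
# Re-run check to make sure we don't count the same jobs: the second run
# sees the same job timestamps, so the counts must not grow.
check.check(instance)
```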
)

check.poll = mock.MagicMock(return_value=MockResponse(payload, 'text/plain'))
for _ in range(1):
Same.
👍
Hey, just left a few comments about syntax and structure. Feel free to agree/disagree :)
What does this PR do?
This PR fixes the KSM job succeeded and failed metrics by keeping the last timestamp of each job so the same job execution is not counted multiple times.
It also adds a test that checks the values of these metrics after multiple runs.
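A minimal, self-contained sketch of that idea (illustrative names only, not the check's real attributes):

```python
from collections import defaultdict

# Last job timestamp seen per tag set, and the running counter per tag set.
last_job_ts = defaultdict(int)
job_counts = defaultdict(float)


def record_job_sample(tags, job_ts, value):
    # Only count a sample whose timestamp is newer than the last one
    # recorded for this tag set; re-scraped executions are skipped.
    key = frozenset(tags)
    if job_ts != 0 and job_ts > last_job_ts[key]:
        job_counts[key] += value
        last_job_ts[key] = job_ts


record_job_sample(['kube_job:hello'], 1565000000, 1)
record_job_sample(['kube_job:hello'], 1565000000, 1)  # same execution scraped again: not re-counted
record_job_sample(['kube_job:hello'], 1565000060, 1)  # new execution: counted
print(job_counts[frozenset(['kube_job:hello'])])  # 2.0
```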
Motivation
Customers have reported multiple times that these metrics were broken.
Review checklist (to be filled by reviewers)
- changelog/ and integration/ labels attached