Add sister gauge metrics to kubernetes_state pod service checks #1578
Conversation
Regarding docs, I think it'll need to be added as a metric to https://docs.datadoghq.com/integrations/kubernetes/. Do I need to make an issue for that?
Hi, thank you for your contribution! Regarding docs, if you're adding/changing metric parameters (names, types, etc.) you'll need to edit the metadata.csv file. Regarding the PR, this is potentially breaking as some users rely on the service check. I recommend adding a gauge metric rather than replacing the service check.
Force-pushed from 6a6fa5e to ef84137.
Whoops, I didn't even stop to consider the service check users. Sorry about that. I've added the new metric to the metadata.csv file and re-enabled the service checks.
kubernetes_state/metadata.csv (outdated)
@@ -52,6 +52,7 @@ kubernetes_state.hpa.target_cpu,gauge,,,,Target CPU percentage of pods managed b
kubernetes_state.hpa.desired_replicas,gauge,,,,Desired number of replicas of pods managed by this autoscaler,0,kubernetes,k8s_state.hpa.desired_replicas
kubernetes_state.pod.ready,gauge,,,,"In association with the `condition` tag, whether the pod is ready to serve requests, e.g. `condition:true` keeps the pods that are in a ready state",1,kubernetes,k8s_state.pod.ready
kubernetes_state.pod.scheduled,gauge,,,,Reports the status of the scheduling process for the pod with its tags,0,kubernetes,k8s_state.pod.scheduled
kubernetes_state.pod.status_phase,gauge,,,,Count of current pods with their phase as a tag,0,kubernetes,k8s_state.pod.status_phase
By convention, we make the value of the metric a numerical representation of the status of a pod, and then we would add a tag to the metric with the name of the pod. So maybe the value of the status metric would be 2 if the tagged pod was healthy, 1 if it was in a warning state, and 0 if it was down.
This would allow us to see the status history of every pod when displayed on a timeseries graph.
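For illustration only, a minimal sketch of that convention (the 2/1/0 mapping and the pod_name/pod_status variables are hypothetical, just to show the shape of the call):

# Hypothetical mapping from pod health to a numeric metric value.
POD_STATUS_VALUES = {'healthy': 2, 'warning': 1, 'down': 0}

# The pod's name travels as a tag, so graphing this gauge per pod shows
# each pod's status history over time.
self.gauge(self.NAMESPACE + '.pod.status_phase',
           POD_STATUS_VALUES[pod_status],
           tags=['pod:{}'.format(pod_name)] + self.custom_tags)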
Sure, sounds reasonable. I'll give that a try.
EDIT: After a quick think, I'll need more granularity than that. It matters to my team which of the unique phases a pod is in. "Failed" is important for detecting anomalies, "unknown" is an important distinction from "failed", "running" is obviously important, "successful" is an important distinction from "running", and "pending" is very useful for looking at cluster capacity.
Would you recommend that I break it down into a value for each of these and provide the mapping in the csv?
EDIT 2: Just going to give a quick insight into our use case for this metric in case it helps speed up the review process. We use it both to count the number of pods running in a namespace / the system as a whole, and also to filter and track those pods by the phase they are running in so that we can obtain accurate counts of pods in each phase (filtered by namespace sometimes too).
Ah, I was just thinking about this exact use case. You're right, you'll need to tag the status to each metric instance 👍
Also, I recommend that you send this metric per pod. That way, you can get even more granularity as you'll know exactly which pod has which status. (You'll also be able to use the Host Map and the like on the dashboards if you want to.)
Then, when you're graphing this metric, you can count (aggregate) the metric and filter by the status tag to get the total number of running/failed pods. As for the metric value when you're sending the metric, you can just set it to 0 (and add a comment explaining that the tags carry the information and 0 is just a token value).
So, all in all, a rough idea of what it could look like:
metric_name = self.NAMESPACE + '.pod.status_phase'
tags = [...]  # your pod, namespace, and phase tags + self.custom_tags
# Metric representing the status of a pod with the tag containing the status (0 is a token value)
self.gauge(metric_name, 0, tags)
Your .csv would then just need a slight edit :)
# Will submit a service check which status is given by its phase.
# More details about the phase in the message of the check.
check_basename = self.NAMESPACE + '.pod.phase'
for metric in message.metric:
    self._condition_to_tag_check(metric, check_basename, self.pod_phase_to_status,
                                 tags=[self._label_to_tag("pod", metric.label),
                                       self._label_to_tag("namespace", metric.label)] + self.custom_tags)
    # More verbose tagging for gauge vs service check
    tags = [self._format_tag(label.name, label.value) for label in metric.label] + self.custom_tags
Which additional tags are you looking to get?
The only ones I actually need are "pod", "namespace" and "phase". Happy to trim it down to just those. I added it like this to mirror the functionality of adding it to the whitelist of tags (which I believe would just add all tags from the metric?).
It might be helpful to just trim it down to those + the custom tags as that'll make it easier to maintain 👍
Alright, I've addressed the changes now. Instead of going with a value of 0 for the metric, I've gone with 1, both to be consistent with the kube-state-metrics output itself and with existing metrics within the "kubernetes_state" addon. Also stripped down the tags as you mentioned.
Awesome, and yes, I think going with 1 is great if it keeps us consistent with the metrics output itself 👍
I think the following change could be an alternative way to get the tags (because we're then using existing code) :)
tags = []
for label in metric.label:
    if label.name in ["pod", "namespace", "phase"]:
        tags += [self._format_tag(label.name, label.value)]
Doesn't _label_to_tag do the same thing as the for loop? Can we do something like this?
tags = [self._label_to_tag("pod", metric.label), self._label_to_tag("namespace", metric.label),
        self._label_to_tag("phase", metric.label)] + self.custom_tags
And then we can just do tags=tags for both the _condition_to_tag_check and gauge calls.
You're absolutely right, that's a much nicer way to do things. Updating now.
Force-pushed from 8e4705f to a9855b8.
@@ -329,6 +331,10 @@ def kube_pod_status_phase(self, message, **kwargs):
        tags=[self._label_to_tag("pod", metric.label),
              self._label_to_tag("namespace", metric.label)] + self.custom_tags)

    tags = [self._label_to_tag("pod", metric.label), self._label_to_tag("namespace", metric.label),
            self._label_to_tag("phase", metric.label)] + self.custom_tags
I'm nit-picking, but I think we can just combine all the tags into a single variable and provide that to both the self._condition_to_tag_check(metric, ...) and self.gauge(metric_name, ...) calls. This will also add the phase as a tag to the service checks, which may provide some extra information and filtering capabilities.
Thank you for your patience!
Seems like a reasonable idea. I wasn't sure what the overall effect on storage of adding the phase to the service check would be, so I just avoided touching that code. Will fix.
Force-pushed from a9855b8 to 76a866a.
Should be good to go now. I do see the failing test, but I'm not sure how to re-run it in case it was flaky. I've spent the last few hours trying to run the tests locally to no avail. Nearly there now, maybe, but with zero documentation on how to run them for the new agent this is pretty hard.
Finally got some tests running locally and worked out why it was failing. Should be fixed now.
Force-pushed from 76a866a to 97bd211.
@mwhittington21 indeed, I didn't notice your last commit. Thanks for the heads-up!
I don't really want to be pushy here because I understand how everyone's time is precious, but if there's no more work to be done here, my team would really appreciate it if this could get merged. We'd like the change to make it into the next agent release so we can unblock some internal teams waiting on us.
tags = [self._label_to_tag("pod", metric.label), self._label_to_tag("namespace", metric.label),
        self._label_to_tag("phase", metric.label)] + self.custom_tags
self._condition_to_tag_check(metric, check_basename, self.pod_phase_to_status, tags=tags)
self.gauge(metric_name, 1, tags)
You're reporting the metric even if the value is 0 in kube-state-metrics. If the value is 0 we should report 0 as well; otherwise you get all pods in all phases all the time.
Check out this Prometheus output for example:
kube_pod_status_phase{namespace="default",phase="Pending",pod="nginx-5876f47b65-992pl"} 0
kube_pod_status_phase{namespace="default",phase="Pending",pod="nginx-5876f47b65-kbxzm"} 0
kube_pod_status_phase{namespace="default",phase="Pending",pod="nginx-5876f47b65-wlrpr"} 0
kube_pod_status_phase{namespace="default",phase="Running",pod="nginx-5876f47b65-992pl"} 1
kube_pod_status_phase{namespace="default",phase="Running",pod="nginx-5876f47b65-kbxzm"} 1
kube_pod_status_phase{namespace="default",phase="Running",pod="nginx-5876f47b65-wlrpr"} 1
Alright I've addressed this by just transplanting the value from the metric. Verified it's working on a dashboard. Good catch.
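Concretely, "transplanting the value" can be as small as the following sketch (same names as in the diff above; the actual commit may differ slightly):

# Report the sample's own value (0 or 1) instead of a hard-coded 1,
# so only a pod's current phase shows up as non-zero.
self.gauge(metric_name, metric.gauge.value, tags)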
Good catch on the metric values. That was a silly mistake that I've fixed now.
Thanks @mwhittington21, this LGTM! @hkaj Do you concur?
I think that these new gauges might duplicate some of the data we currently collect under other metrics. However, since KSM sees pending/failed pods that the kubelet might not see, I think it's a good idea to add this information. To reduce the number of metrics collected and make the timeseries easier to navigate, I think counting pods per namespace and per phase would give enough visibility without overloading the kube_state_metrics service. You can see an example of this approach in other metrics of this check.
Force-pushed from 14d3af9 to f01e1b2.
I think I've done this right, although I was a little confused about some aspects, like making sure the tests were correct.
    self._label_to_tag("namespace", metric.label),
    self._label_to_tag("phase", metric.label)
] + self.custom_tags
self.count(metric_name, metric.gauge.value, tags)
Could you use increment instead of count? Count is for when we want to submit the count once; what we want to do here, I think, is to increment the counter for this metric name and set of tags each time we see it.
See https://docs.datadoghq.com/developers/metrics/counts/ for more details.
@hkaj are you sure?
self._log_deprecation("increment")
increment is deprecated.
The message is:
base.py 148 WARNING DEPRECATION NOTICE: `AgentCheck.increment`/`AgentCheck.decrement` are deprecated, please use `AgentCheck.gauge` or `AgentCheck.count` instead, with a different metric name
Oh, you're right, that function is not supported anymore in agent 6 :(
The issue is that calling count several times in the check doesn't work reliably for incrementing; you're supposed to call self.count once per run. I tested your change on my cluster and the value is wrong (it's not normalized by time).
The PR you linked actually suffers from the same problem, and is getting fixed by #1840.
Copying this, i.e. incrementing a collections.Counter and reporting its value at the end of the loop with a simple gauge, is the way to go, according to https://github.com/DataDog/datadog-agent/blob/7820efa6b45722ce141e6af46cb0d4ef966e4e7d/pkg/aggregator/sender.go#L214-L216
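To make that concrete, here's a rough sketch of the Counter-based approach, reusing the helper names from the diffs above (only the counting/gauge part is shown; the final commit may of course differ):

from collections import Counter

def kube_pod_status_phase(self, message, **kwargs):
    metric_name = self.NAMESPACE + '.pod.status_phase'
    # Accumulate pod counts per (namespace, phase) while looping over the
    # Prometheus samples, then report each total once with a plain gauge.
    status_phase_counter = Counter()

    for metric in message.metric:
        tags = tuple(sorted([
            self._label_to_tag("namespace", metric.label),
            self._label_to_tag("phase", metric.label),
        ] + self.custom_tags))
        # kube_pod_status_phase is 1 for the pod's current phase and 0 for
        # the others, so summing the sample values counts pods per tag set.
        status_phase_counter[tags] += metric.gauge.value

    for tags, count in status_phase_counter.items():
        self.gauge(metric_name, count, tags=list(tags))

Because the whole tag set is used as the Counter key, identical (namespace, phase) combinations collapse into a single value that is reported once per check run.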
Ah excellent. Thanks for the clarification. I'll give this a go!
It was previously only a status check which is not helpful if you want to visualize the status of your pods over time.
Reduce the cardinality of the tags for the pod.status_phase metric to increase scalability. Rework the pod.status_phase metadata docstring to better represent what the metric is and how to use it. Refactor the code to make use of existing constructs. Add a "phase" tag to the service check portion of the pod_status_phase Prometheus metric processing. Update tests to match.
Now submits an agent-local count tagged with the five pod phases and with the namespaces that contain pods. This should allow tracking aggregates of these metrics cluster-wide while retaining performance. Updated tests and description to match.
Force-pushed from 6355044 to 22933ee.
That seems much better, thanks for the guidance @hkaj
This looks great! Plus the tests make more sense now :)
Thanks for your patience @mwhittington21
What does this PR do?
Adds kube_pod_status_phase as gauge. It was previously a status check which is not helpful if you want to visualize the status of your pods over time.
Motivation
The aim is to fix #1433.
Review checklist
no-changelog label attached