Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[integrations-core][kubernetes_state] Metrics update #853

Merged
merged 21 commits into from
Nov 16, 2017

Conversation

CharlyF
Copy link
Contributor

@CharlyF CharlyF commented Nov 6, 2017

What does this PR do?

Adding new metrics from Kubernetes_state.

Motivation

We have received requests from the community.

Testing Guidelines

I will add unitests if the methods to collects data (e.g. terminated/waiting reason) are the way we chose to proceed.

Versioning

  • Bumped the version check in manifest.json
  • Updated CHANGELOG.md. Please use Unreleased as the date in the title
    for the new section.

@CharlyF CharlyF changed the title [WIP][integrations-core][kubernetes_state] Metrics update [integrations-core][kubernetes_state] Metrics update Nov 7, 2017
@CharlyF CharlyF requested a review from xvello November 7, 2017 18:04
Copy link
Contributor

@xvello xvello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good first iteration, please also update the metadata.csv

@@ -68,6 +72,8 @@ def __init__(self, name, init_config, agentConfig, instances=None):
'kube_pod_container_status_running': 'container.running',
'kube_pod_container_status_terminated': 'container.terminated',
'kube_pod_container_status_waiting': 'container.waiting',
'kube_pod_container_resource_requests_nvidia_gpu_devices': 'container.gpu.resource_request',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just rename them container.gpu.request / .limit like for cpu and memory

@@ -73,9 +82,9 @@ def assertMetricNotAllZeros(self, metric_name):

@mock.patch('checks.prometheus_check.PrometheusCheck.poll')
def test__update_kube_state_metrics(self, mock_poll):
f_name = os.path.join(os.path.dirname(__file__), 'ci', 'fixtures', 'prometheus', 'protobuf.bin')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the protobuf payload if unused, we can still retrieve it from the git history

tags = [self._format_tag(label.name, label.value) for label in metric.label]
self.gauge(metric_name, 0, tags)

def kube_cronjob_next_schedule_time(self, message, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great use case, we must test during RC phase that it recovers correctly (one could taint noschedule all the nodes, make sure the cronjob goes critical, then remove the taint and observe it recovering).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a message in the service check to compute the delay (now - scheduled)

@@ -231,6 +254,33 @@ def kube_pod_status_phase(self, message, **kwargs):
status = self.pod_phase_to_status.get(phase, self.UNKNOWN)
self.service_check(check_basename + phase, status, tags=tags)

def kube_cronjob_status_last_schedule_time(self, message, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems redundant with kube_cronjob_next_schedule_time, and the .delay name is misleading (I'm expecting it to be positive it scheduling is lagging behind). I'd rather remove it unless we have a clear use case for this gauge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I like the recommendation of adding a message in the SC kube_cronjob_next_schedule_time.

@@ -326,6 +376,22 @@ def kube_node_spec_unschedulable(self, message, **kwargs):
else:
self.log.error("Metric type %s unsupported for metric %s" % (message.type, message.name))

def kube_pod_container_status_waiting_reason(self, message, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these two metrics, I'm not sure whether we should tag by podname, or just increment a node counter. @hkaj what's your opinion on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the pod name, node name and reason for this metric. One reason a pod can be waiting is ErrImagePull, as a user I'd like to know if I fat-fingered the image name in my pod spec and all the pods of a same family are failing because of that.

I would also like to know if a node is making all its containers wait because it can't pull images or its containers are stuck in creation.

Now how we expose this info is up to discussion. I think the metric name should be self.NAMESPACE + '.pod_container_status_waiting.' + reason. Reasons are finite and hardcoded here: https://github.com/kubernetes/kube-state-metrics/blob/b26cfb26fe5e858f8033a641a321abef25496c62/collectors/pod.go#L38 so it should be fine. If we add pods and nodes as tags we should be able to group/filter across the cluster quite easily. I'm open to other options though @CharlyF & @xvello

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hkaj I liked the option of not adding reason in the metric name so that one could monitor the metric reason and be alerted if there are more than X pods waiting because of {{reason}} on {{node.name}}.
Less cumbersome to set for a user imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point, then what do you think of removing .reason from the metric name? pod.containers_waiting with the reason tag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think reason should remain in the name to distinguish the metric from

'kube_pod_container_status_waiting': 'container.waiting',
(so that both can be used)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm shouldn't we just add the reason tag to the existing metric then? Does the new one add any value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually get the point - Indeed the .reason will be at one, on the tag representing the reason whereas the regular is at 1 if the container is waiting.
https://github.com/kubernetes/kube-state-metrics/blob/master/collectors/pod_test.go#L284-L292.

Now, I can't just remove kube_pod_container_status_waiting and I don't know if using kube_pod_container_status_waiting_reason to override or even replace container.waiting is a good idea because this metric became available with 1.1.0 so people monitoring the metric already might start missing it. Thoughts ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we keep the existing metric, but instead of adding the new one we match its pod with the existing metric and just add the reason as a tag to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best solution I came up with is adding a meta crawl first DataDog/dd-agent#3583

Copy link
Contributor Author

@CharlyF CharlyF Nov 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did I implement it this way:
The prometheus process_metric method goes through all the items in ignored, metrics_mapper and then getattr (i.e. methods in k8s-state check with the metric name as a method name).
While processing the waiting_reason metric, we don't have access to the message of waiting (i.e. pod name).
The approach is: If we see the new waiting_reason, populate the old metric name and add the reason as a tag.
We can't use a boolean because we can't trust the order in which the metrics are coming from prometheus and hence, might miss the _reason and compute the old one before.
The solution:
We can't use a boolean for all the new metrics because the waiting_reason might be exposed but terminated_reason might not and we don't want to miss terminated.
Secondly, from my testings, the decision of whether to compute the new/old metrics or not has to be before processing the metrics in the kube-state check because getattr is non blocking and metrics are missed.
Using a dict of meta metrics will make this re-usable should we want to process several times the prometheus output to associate more metadata with the underlying pods.
cc @xvello and @hkaj

@CharlyF CharlyF force-pushed the charly/kube_state_metrics_update branch from 97d75ed to da26a02 Compare November 8, 2017 15:24
check_basename = self.NAMESPACE + '.cronjob.on_schedule_check'
curr_time = int(time.time())
for metric in message.metric:
on_schedule = int(metric.gauge.value) - curr_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure: there's no TZ issue with this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be a TZ issue as we rely on the epoch timestamp

@@ -326,6 +376,22 @@ def kube_node_spec_unschedulable(self, message, **kwargs):
else:
self.log.error("Metric type %s unsupported for metric %s" % (message.type, message.name))

def kube_pod_container_status_waiting_reason(self, message, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the pod name, node name and reason for this metric. One reason a pod can be waiting is ErrImagePull, as a user I'd like to know if I fat-fingered the image name in my pod spec and all the pods of a same family are failing because of that.

I would also like to know if a node is making all its containers wait because it can't pull images or its containers are stuck in creation.

Now how we expose this info is up to discussion. I think the metric name should be self.NAMESPACE + '.pod_container_status_waiting.' + reason. Reasons are finite and hardcoded here: https://github.com/kubernetes/kube-state-metrics/blob/b26cfb26fe5e858f8033a641a321abef25496c62/collectors/pod.go#L38 so it should be fine. If we add pods and nodes as tags we should be able to group/filter across the cluster quite easily. I'm open to other options though @CharlyF & @xvello

def kube_pod_container_status_waiting_reason(self, message, **kwargs):
""" Reason a container is in a waiting state """
# Capturing the pod name so the metric can be compared to kube_pod_container_status_waiting
metric_name = self.NAMESPACE + '.pod_container_status_waiting.reason'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this metric name, see my comment above

@CharlyF CharlyF force-pushed the charly/kube_state_metrics_update branch from e158e97 to c53fa39 Compare November 14, 2017 17:33
@CharlyF
Copy link
Contributor Author

CharlyF commented Nov 15, 2017

Sister PR: DataDog/dd-agent#3583

@CharlyF
Copy link
Contributor Author

CharlyF commented Nov 15, 2017

@hkaj I have rolled back my work on adding the reason tag to the legacy metric because of the following.
As it stands, the .waiting and the .waiting.reason (respectively .terminated and .terminated.reason) are not consistent.
This output will better explain the problem:
Run at T1

kube_pod_container_status_waiting{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp"} 1
[...]
kube_pod_container_status_waiting_reason{container="nginx",namespace="default",pod="nginx-58dd565ddf-lvpxt",reason="ErrImagePull"} 0
kube_pod_container_status_waiting_reason{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp",reason="ContainerCreating"} 0
kube_pod_container_status_waiting_reason{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp",reason="ErrImagePull"} 1

Run at T1 + 30seconds

kube_pod_container_status_waiting{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp"} 1
[...]
kube_pod_container_status_waiting_reason{container="nginx",namespace="default",pod="nginx-58dd565ddf-lvpxt",reason="ErrImagePull"} 0
kube_pod_container_status_waiting_reason{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp",reason="ContainerCreating"} 0
kube_pod_container_status_waiting_reason{container="nginxs",namespace="default",pod="nginxs-d78b8df7d-bc6vp",reason="ErrImagePull"} 0

As you can see, the .reason will be at 1 but only for a few seconds, but then will be back to 0. Whereas the pod is still waiting so the waiting metric is at 1.
This is probably due to the fact that the reason is at first ErrImagePull but then goes to ImagePullBackOff (which is not an available reason).

An image:
image 2017-11-15 at 6 34 32 pm

I am implementing these back into the check as counts so they can be used along with the regular .waiting metrics. The point being to see across the last hour, how many OOM have we had (with the termination.reason metric) etc:
image 2017-11-15 at 6 36 55 pm

cc: @xvello

@CharlyF CharlyF mentioned this pull request Nov 15, 2017
2 tasks
@@ -34,6 +36,9 @@ kubernetes_state.limitrange.memory.max_limit_request_ratio,gauge,,,,Maximum memo
kubernetes_state.node.cpu_capacity,gauge,,cpu,,The total CPU resources of the node,0,kubernetes,k8s_state.node.cpu_capacity
kubernetes_state.node.memory_capacity,gauge,,byte,,The total memory resources of the node,0,kubernetes,k8s_state.node.memory_capacity
kubernetes_state.node.pods_capacity,gauge,,,,The total pod resources of the node,0,kubernetes,k8s_state.node.pods_capacity
kubernetes_state.node.gpu.cards_allocatable,gauge,,,The Nvidia GPU resources of a node that are available for scheduling,0,kubernetes,k8s_state.node.gpu.cards_allocatable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the chameleon from our descriptions

@@ -273,6 +297,32 @@ def kube_pod_status_phase(self, message, **kwargs):
status = self.pod_phase_to_status.get(phase, self.UNKNOWN)
self.service_check(check_basename + phase, status, tags=tags)

def kube_pod_container_status_waiting_reason(self, message, **kwargs):
metric_name = self.NAMESPACE + '.container.status_report.count.waiting'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name LGTM, need to add it to the metadata though

def kube_pod_container_status_waiting_reason(self, message, **kwargs):
metric_name = self.NAMESPACE + '.container.status_report.count.waiting'
for metric in message.metric:
tags = [self._format_tag(label.name, label.value) for label in metric.label]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least remove the container tag from these two metrics, see https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/pod-metrics.md

I'm still thinking even pod cardinality is overkill for that metric, I'd remove the pod label too, especially as we'll be submitting counts even on normal operations (terminated OK, waiting on containercreating), that'll spam a lot of contexts.
@hkaj can you chime in on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename container in kube_container_name to be consistent with the kubernetes integration
Let's not submit the pod tag, and add a //TODO in agent6, use the tagger to get pod creator tags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//todo should also be here so users know that we purposefully skipped it but it's coming

Copy link
Contributor

@xvello xvello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimizations/nitpick, plus don't forget https://github.com/DataDog/integrations-core/pull/853/files#r151357487

Then we're good to go!

@@ -172,6 +173,8 @@ def check(self, instance):
self.job_succeeded_count = defaultdict(int)
self.job_failed_count = defaultdict(int)

self.whitelisted_reasons = {"waiting":["ErrImagePull"],"terminated":["OOMKilled","ContainerCannotRun","Error"]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They'll be created at every run. you can put them as module constants at the top, like METRIC_TYPES. You should assign two separate arrays instead of a dict for performance.

@@ -91,6 +91,7 @@ def __init__(self, name, init_config, agentConfig, instances=None):
'kube_replicationcontroller_status_replicas': 'replicationcontroller.replicas',
'kube_statefulset_replicas': 'statefulset.replicas_desired',
'kube_statefulset_status_replicas': 'statefulset.replicas',

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove that empty line

tags.append(self._format_tag(label.name, label.value))
else:
reason = False
report = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean skip_metric = True ?

You can also start with skip_metric = True then put it to false when encountering a valid reason. That way you don't need the else

Copy link
Contributor

@xvello xvello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:chameleon:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants