kubernetes_state: plumb more container waiting reasons #1763

Merged
merged 1 commit into from
Jul 4, 2018

stevvooe
Contributor

We'd like to create monitors that fire when containers are stuck waiting
for various reasons. Two particular reasons, ImagePullBackoff and
CrashLoopBackoff, can be used to detect bad or broken deployments. These
have been plumbed as of kube-state-metrics 1.3 but are not currently
whitelisted in the DataDog agent integration. The tests have also been
updated with fixture data.
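
As a rough illustration of the whitelisting described above, the sketch below filters waiting-reason samples by their `reason` label. The names `WHITELISTED_WAITING_REASONS` and `filter_waiting_samples`, and the sample dictionaries, are hypothetical and are not taken from the integration's code; the metric name `kube_pod_container_status_waiting_reason` is the one kube-state-metrics exposes. The comparison is case-insensitive on the assumption that spellings may differ (this description writes "ImagePullBackoff", while Kubernetes reports "ImagePullBackOff").

```python
# Hypothetical whitelist holding the two reasons named in this PR.
# Stored lower-cased so the comparison below is case-insensitive.
WHITELISTED_WAITING_REASONS = {"imagepullbackoff", "crashloopbackoff"}


def filter_waiting_samples(samples, whitelist=WHITELISTED_WAITING_REASONS):
    """Return only the samples whose `reason` label is in the whitelist."""
    kept = []
    for sample in samples:
        reason = sample.get("labels", {}).get("reason", "")
        if reason.lower() in whitelist:
            kept.append(sample)
    return kept


# Made-up samples shaped loosely like parsed kube-state-metrics output.
samples = [
    {
        "name": "kube_pod_container_status_waiting_reason",
        "labels": {"pod": "web-1", "container": "app", "reason": "CrashLoopBackOff"},
        "value": 1,
    },
    {
        "name": "kube_pod_container_status_waiting_reason",
        "labels": {"pod": "web-2", "container": "app", "reason": "ContainerCreating"},
        "value": 1,
    },
]

kept = filter_waiting_samples(samples)
for sample in kept:
    print(sample["labels"]["pod"], sample["labels"]["reason"])
```

With this whitelist, the `ContainerCreating` sample is dropped and only the `CrashLoopBackOff` sample is reported.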

Signed-off-by: Stephen Day <stephen.day@getcruise.com>

Member

@hkaj hkaj left a comment

that's great, thanks @stevvooe !

@hkaj hkaj added this to the 6.3.1 milestone Jun 20, 2018
@JulienBalestra JulienBalestra modified the milestones: 6.3.1, 6.4 Jun 25, 2018
@masci masci removed this from the 6.4 milestone Jun 25, 2018
@stevvooe
Contributor Author

@hkaj @masci What's the timeline for this getting merged and released?

Contributor

@masci masci left a comment

LGTM

@hkaj hkaj merged commit 41adce4 into DataDog:master Jul 4, 2018
@hkaj
Member

hkaj commented Jul 4, 2018

thanks @stevvooe ! This will go out with 6.4 (scheduled for end of July)

@stevvooe stevvooe deleted the sday-plumb-crashloopbackoff branch July 5, 2018 17:53
@deiwin
Contributor

deiwin commented Jul 25, 2018

Could ContainerCreating also be included? Need this to monitor for known issues with https://github.com/aws/amazon-vpc-cni-k8s.

Why's there a whitelist in the first place? I see some discussion in #853, but don't understand the reason for skipping metrics with new reasons instead of simply passing the reason through.

@stevvooe
Contributor Author

stevvooe commented Aug 1, 2018

@deiwin I think the whitelist is there to reduce the volume of metrics that would otherwise be ignored or unused.
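
To make the volume argument concrete, here is a small sketch that counts distinct tag combinations (roughly, time series) with and without a whitelist. The helper `count_series` and the sample tuples are made up for illustration; real volume depends entirely on the cluster.

```python
def count_series(samples, whitelist=None):
    """Count distinct (pod, container, reason) combinations.

    If `whitelist` is given, samples whose reason is not in it are skipped.
    """
    series = set()
    for pod, container, reason in samples:
        if whitelist is not None and reason.lower() not in whitelist:
            continue
        series.add((pod, container, reason))
    return len(series)


# Made-up observations: most containers pass through ContainerCreating,
# only a few end up in a failure state.
samples = [
    ("web-1", "app", "ContainerCreating"),
    ("web-1", "app", "CrashLoopBackOff"),
    ("web-2", "app", "ContainerCreating"),
    ("web-2", "app", "ImagePullBackOff"),
    ("web-3", "app", "ContainerCreating"),
]
failure_reasons = {"imagepullbackoff", "crashloopbackoff"}

passthrough_count = count_series(samples)
whitelisted_count = count_series(samples, failure_reasons)
print(passthrough_count, whitelisted_count)
```

In this toy data, passing every reason through yields five series, while whitelisting the two failure reasons yields two.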

I think you could easily add it with a PR like this one. I only focused on the failure scenarios, as those are the most problematic. What would be the use case of monitoring ContainerCreating?

@deiwin
Contributor

deiwin commented Aug 13, 2018

> What would be the use case of monitoring ContainerCreating?

With the CNI linked to above, pods can get stuck in the ContainerCreating phase when the CNI is unable to reserve an IP for them.
