Description
We're trying to create dashboards and alerts that capture transient states of Kubernetes containers. In particular, we're interested in tracking Error and OOMKilled termination states. AFAICT the New Relic integration is not always able to capture OOM kills correctly when the container restarts (compared to kube_pod_container_status_last_terminated_reason). By the time it scrapes the Kubelet, the container has already been restarted. Even though at some point between scrapes the status changed to Terminated with reason OOMKilled, that is no longer the current state, so it never gets reported.
My hope with the new containerOOMEventsDelta attribute was that the NRI integration would be able to capture those states and return the number of times containers had been OOM killed between scrapes. What I'm seeing is the following:
- Main container process is OOM Killed
- If the NRI integration manages to scrape the Kubelet when the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
- containerOOMEventsDelta remains at 0
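
To illustrate why the Terminated state is so easy to miss, here is a minimal simulation of a periodic scraper sampling a container's state timeline. Everything in it is hypothetical: the `StateChange` type, the `scrape` helper, the 15-second interval and the 2-second restart window are assumptions chosen for illustration, not values taken from nri-kubernetes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StateChange:
    at: float                     # seconds since the start of the timeline
    state: str                    # e.g. "Running" or "Terminated"
    reason: Optional[str] = None  # e.g. "OOMKilled"


def state_at(timeline: List[StateChange], t: float) -> StateChange:
    """Return the most recent state change at or before time t.

    Assumes the timeline is sorted by time and starts at t=0.
    """
    current = timeline[0]
    for change in timeline:
        if change.at <= t:
            current = change
    return current


def scrape(timeline: List[StateChange], interval: float,
           until: float) -> List[Tuple[float, str, Optional[str]]]:
    """Sample the current state every `interval` seconds, as a poller would."""
    samples = []
    t = 0.0
    while t <= until:
        change = state_at(timeline, t)
        samples.append((t, change.state, change.reason))
        t += interval
    return samples


# Hypothetical timeline: OOM kill at t=17s, container running again at t=19s.
timeline = [
    StateChange(0.0, "Running"),
    StateChange(17.0, "Terminated", "OOMKilled"),
    StateChange(19.0, "Running"),
]

# Scrapes at t=0, 15, 30 and 45 all fall outside the 17s-19s window.
samples = scrape(timeline, interval=15.0, until=45.0)
observed_oom = any(reason == "OOMKilled" for _, _, reason in samples)
```

With these assumed timings every sample reports Running, so `observed_oom` is false even though the container was OOM killed once. Only a scrape that happens to land between the kill and the restart would see it.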
I should mention that containerOOMEventsDelta works as expected when it is a child process that is killed rather than the main container process. This is a great addition, and something we'd been waiting for (as mentioned in https://www.netice9.com/blog/guide-to-oomkill-alerting-in-kubernetes-clusters, OOM kills in child processes can sometimes go unnoticed). I just hoped that containerOOMEventsDelta would also include kills of the main container process.
Expected Behavior
- Main container process is OOM Killed
- If the NRI integration manages to scrape the Kubelet when the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
- containerOOMEventsDelta is reported as 1
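
One way the expected delta could be derived even when the Terminated state is missed is to compare `restartCount` and `lastState.terminated` between two consecutive scrapes of the same container status. The sketch below is only an assumption about a possible approach, not how nri-kubernetes computes containerOOMEventsDelta. The function name `oom_kill_delta` is hypothetical; the field names follow the Kubernetes container status object.

```python
from typing import Any, Dict, Optional

ContainerStatus = Dict[str, Any]


def oom_kill_delta(previous: Optional[ContainerStatus],
                   current: ContainerStatus) -> int:
    """Return 1 if the container restarted after an OOM kill since `previous`.

    Both arguments are container status dicts as found under
    status.containerStatuses[] of a pod. This is a sketch: if a container
    restarts several times between two scrapes, only the most recent
    termination reason is visible, so at most one kill is counted.
    """
    if previous is None:
        # First scrape: there is no baseline, so do not count historical kills.
        return 0

    restarted = current.get("restartCount", 0) > previous.get("restartCount", 0)
    last_terminated = (current.get("lastState") or {}).get("terminated") or {}

    if restarted and last_terminated.get("reason") == "OOMKilled":
        return 1
    return 0


# The container is already Running again at scrape time, but lastState
# still records why the previous instance was terminated.
before = {"restartCount": 3, "state": {"running": {}}, "lastState": {}}
after = {
    "restartCount": 4,
    "state": {"running": {}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}
delta = oom_kill_delta(before, after)
```

Here `delta` is 1 although neither scrape observed the container in Terminated state. Comparing `after` with itself on the next scrape yields 0, so the same kill is not counted twice.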
Troubleshooting or NR Diag results
Provide any other relevant log data.
TIP: Scrub logs and diagnostic information for sensitive information
Steps to Reproduce
- Saturate memory on main container
- Wait for OOM kill
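
For the first step, a small script run as the main container process can be used to saturate memory. This is a hypothetical helper, not the exact workload used when filing this issue. The function name, chunk size and pause are arbitrary choices, and it assumes the container has a memory limit set so that the kernel OOM killer terminates the process.

```python
import time
from typing import List, Optional


def saturate_memory(chunk_mib: int = 64,
                    max_mib: Optional[int] = None,
                    pause_seconds: float = 0.1) -> int:
    """Allocate memory in chunks until `max_mib` is reached.

    With max_mib=None the loop never stops on its own, so inside a container
    with a memory limit the process is eventually OOM killed.
    Returns the number of MiB allocated (only reachable when max_mib is set).
    """
    chunks: List[bytearray] = []  # keep references so nothing is freed
    allocated = 0
    while max_mib is None or allocated < max_mib:
        block = bytearray(chunk_mib * 1024 * 1024)
        # Touch one byte per 4 KiB so the memory is actually committed,
        # not just reserved.
        for offset in range(0, len(block), 4096):
            block[offset] = 1
        chunks.append(block)
        allocated += chunk_mib
        time.sleep(pause_seconds)
    return allocated


# Inside the container under test, call it without a cap:
# saturate_memory()
```

Running it as PID 1 of the container matters here, since the problem described above only shows up when the main container process is the one that gets killed.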
Your Environment
Kubernetes 1.24
nri-kubernetes v3.15.1
Additional context
Add any other context about the problem here. For example, relevant community posts or support tickets.
For Maintainers Only or Hero Triaging this bug
Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):