Description
Issue #711 fixed a crash in the nrk8s-ksm pod related to the use of unsupported Custom Resource State metrics. However, when "info" type Custom Resource State metrics (defined in OpenMetrics spec v1.0.0) are used, the K8sReplicasetSample table appears to stop receiving any metrics from the clusters.
When we reported this issue, support opened a feature request on our behalf to support the info type metrics. However, since this appears to be a clear bug (see Expected Behavior below), I am reporting it separately from the feature request.
Expected Behavior
Whether custom resource state metrics are supported or just ignored by nri-kubernetes, the K8sReplicasetSample table should continue to receive new metrics.
Troubleshooting or NR Diag results
N/A
Steps to Reproduce
Start by running the query below and noting the age of the most recent metrics before doing the rest of the steps:
SELECT latest(timestamp)
FROM K8sReplicasetSample
WHERE clusterName = 'your-testing-cluster-name'
SINCE 5 DAYS AGO
After noting that down, set up KSM to generate custom resource state metrics with type "info".
Most of the details can likely be found in the feature request, FRB-00009393, but if anything mentioned below was not copied from our support case to the feature request, please check support case #00283300.
You will need FluxCD installed on a cluster and reconciling some custom resources (GitRepository, Kustomization, HelmRepository, HelmChart, HelmRelease). You will then need to configure KSM from the nri-bundle chart to generate the Custom Resource State metrics, as documented in one of the links I shared in the support case.
Once the nrk8s-ksm pod logs start showing the errors mentioned in the support case (which happen to be the same errors documented as expected in the PR that fixed #711), wait a few hours or a day, then query the K8sReplicasetSample table again. You should see that the latest timestamp keeps getting older and does not recover until you comment out the KSM configuration that generates the custom resource state metrics.
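For anyone reproducing this, the staleness check in the steps above can be sketched in a few lines of Python. This is only an illustrative helper, not part of nri-kubernetes: it assumes the value returned by `latest(timestamp)` is in epoch milliseconds, and the function names and the 15-minute threshold are hypothetical choices made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sample_age(latest_timestamp_ms: int, now: Optional[datetime] = None) -> timedelta:
    """Return how old the newest sample is.

    Assumes latest_timestamp_ms is an epoch-milliseconds value, as copied
    from the result of the latest(timestamp) query.
    """
    now = now or datetime.now(timezone.utc)
    latest = datetime.fromtimestamp(latest_timestamp_ms / 1000, tz=timezone.utc)
    return now - latest


def is_stale(
    latest_timestamp_ms: int,
    threshold: timedelta = timedelta(minutes=15),  # hypothetical threshold
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the newest sample is older than the threshold."""
    return sample_age(latest_timestamp_ms, now) > threshold
```

Running the query before and after enabling the custom resource state configuration and passing each result to `sample_age` should make the growing gap easy to compare.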
Your Environment
We're running nri-bundle helm chart version 6.0.10 in AKS with Kubernetes 1.33. We do not run our own KSM nor our own Prometheus; all of our monitoring is handled by New Relic, so we added the CustomResourceState configuration directly to the nri-bundle helm chart values.yaml.
kube-state-metrics:
  enabled: true
  rbac:
    extraRules:
      - apiGroups:
          - source.toolkit.fluxcd.io
          - kustomize.toolkit.fluxcd.io
          - helm.toolkit.fluxcd.io
          - notification.toolkit.fluxcd.io
          - image.toolkit.fluxcd.io
        resources:
          - gitrepositories
          - buckets
          - helmrepositories
          - helmcharts
          - ocirepositories
          - kustomizations
          - helmreleases
          - alerts
          - providers
          - receivers
          - imagerepositories
          - imagepolicies
          - imageupdateautomations
        verbs: [ "list", "watch" ]
  customResourceState:
    ... # the custom resource state configuration is rather large
    # so I've chosen not to paste it here since it's identical to that which is available in one of the links shared on the support case.
Additional context
For Maintainers Only or Hero Triaging this bug
Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):