When Custom Resource State "info" type metrics are used, the K8sReplicasetSample table stops receiving metrics from Kubernetes clusters #1293

@tspearconquest

Description

Issue #711 fixed a crash in the nrk8s-ksm pod caused by unsupported Custom Resource State metrics. However, when "info" type Custom Resource State metrics (defined in the OpenMetrics spec v1.0.0) are used, the K8sReplicasetSample table no longer receives any metrics from the affected clusters.

When we reported this issue, support opened a feature request on our behalf to add support for info type metrics. However, since this appears to be a clear bug (see Expected Behavior below), I am reporting it separately from the feature request.

Expected Behavior

Whether custom resource state metrics are supported or just ignored by nri-kubernetes, the K8sReplicasetSample table should continue to receive new metrics.

Troubleshooting or NR Diag results

N/A

Steps to Reproduce

Start by running the query below and noting the age of the most recent metrics before doing the rest of the steps:

    SELECT latest(timestamp)
    FROM K8sReplicasetSample
    WHERE clusterName = 'your-testing-cluster-name'
    SINCE 5 DAYS AGO

After noting that down, set up KSM to generate custom resource state metrics with type "info".

Most of the details can likely be found in the feature request, FRB-00009393, but if anything mentioned below was not copied from our support case to the feature request, please check support case #00283300.

You will need FluxCD installed on a cluster and reconciling some custom resources (GitRepository, Kustomization, HelmRepository, HelmChart, HelmRelease), and then you'll need to configure KSM from the nri-bundle chart to generate the Custom Resource State metrics as documented by one of the links that I shared in the support case.

When the nrk8s-ksm pod logs start showing the errors mentioned in the support case (which happen to be the same errors documented as expected in the PR that fixed #711), wait a few hours or a day, then query the K8sReplicasetSample table again. You should see that the latest timestamp keeps getting older and does not recover until you comment out the KSM configuration that generates the custom resource state metrics.

Your Environment

We're running nri-bundle helm chart version 6.0.10 in AKS with Kubernetes 1.33. We do not run our own KSM nor our own Prometheus; all of our monitoring is handled by New Relic, so we added the CustomResourceState configuration directly to the nri-bundle helm chart values.yaml.

    kube-state-metrics:
      enabled: true
      rbac:
        extraRules:
          - apiGroups:
              - source.toolkit.fluxcd.io
              - kustomize.toolkit.fluxcd.io
              - helm.toolkit.fluxcd.io
              - notification.toolkit.fluxcd.io
              - image.toolkit.fluxcd.io
            resources:
              - gitrepositories
              - buckets
              - helmrepositories
              - helmcharts
              - ocirepositories
              - kustomizations
              - helmreleases
              - alerts
              - providers
              - receivers
              - imagerepositories
              - imagepolicies
              - imageupdateautomations
            verbs: [ "list", "watch" ]
      customResourceState:
        ... # the custom resource state configuration is rather large
            # so I've chosen not to paste it here since it's identical to that which is available in one of the links shared on the support case.
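
For readers without access to the support case, here is a minimal sketch of what an "info" type Custom Resource State metric can look like in the same values file. This is not the configuration from the support case: the `gotk` prefix, the `resource_info` metric name, the help text, and the label paths are illustrative assumptions modeled on one of the FluxCD resources listed above. It assumes the `customResourceState.enabled` and `customResourceState.config` keys of the kube-state-metrics chart and a cluster serving the `v1` Kustomization API.

```yaml
kube-state-metrics:
  customResourceState:
    enabled: true
    config:
      kind: CustomResourceStateMetrics
      spec:
        resources:
          - groupVersionKind:
              group: kustomize.toolkit.fluxcd.io
              version: v1            # assumed API version
              kind: Kustomization
            metricNamePrefix: gotk   # hypothetical prefix
            metrics:
              - name: resource_info  # hypothetical metric name
                help: "State of a Flux Kustomization resource."
                each:
                  type: Info         # the OpenMetrics "info" type this issue is about
                  info:
                    labelsFromPath:
                      name: [ metadata, name ]
                labelsFromPath:
                  exported_namespace: [ metadata, namespace ]
```

A single resource with a single Info metric like this should be enough to check whether the K8sReplicasetSample gap depends on the metric type rather than on the size of the full configuration.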

Additional context

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):

Labels: bug
