Problem after node reboot: cluster network not healthy anymore #13141
Closed · maltelehmann started this conversation in General · Replies: 1 comment, 3 replies
-
Did you upgrade all your nodes? The Watch error you are seeing is related to upstream changes to the node authorizer in 1.32; as far as I can tell, you would only see these errors if you upgraded the server but forgot to upgrade the agents. Or do you have something else that is attempting to use the kubelet's kubeconfig to list nodes? I don't see any suggestion of what might cause pods on different nodes to fail to communicate. Flannel is pretty simple; I have generally not seen it break randomly when it was previously working fine.
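For example, a quick way to verify whether all nodes are running the same k3s version after an upgrade (the expected version string below is just the one mentioned later in this thread):

```bash
# List all nodes with their kubelet/k3s versions; the VERSION column should be
# identical on the server and every agent (e.g. v1.32.8+k3s1 after the upgrade).
kubectl get nodes -o wide
```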
-
Hi!
First of all, thank you very much for your awesome work!
I am encountering a problem that I cannot clearly pinpoint to either an environment issue or a k3s issue. Maybe you can help investigate this further?
The problem got my attention when a Kafka pod running on a VM called arch2 was not able to talk to the other Kafka brokers running on VMs called arch1 and arch3. I made sure that the network policies did not cause the problem, and I checked the Kubernetes network by starting a pod on each node in the cluster and letting each pod ping every other pod. The result was that pods on arch2 could not talk to arch1 and arch3, and vice versa.
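A minimal sketch of such a pod-to-pod ping check (pod names, the busybox image, and the pinning via nodeName are illustrative assumptions; only the node names are taken from this thread):

```bash
# Start one throwaway test pod pinned to each node.
NODES="arch1 arch2 arch3"
for node in $NODES; do
  kubectl run "netcheck-$node" --image=busybox:1.36 --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$node\"}}" \
    -- sleep 3600
done

# Wait until every test pod is ready.
for node in $NODES; do
  kubectl wait --for=condition=Ready "pod/netcheck-$node" --timeout=120s
done

# Ping every pod from every other pod; a failure here suggests a problem below
# the application layer (e.g. inter-node / flannel networking).
for src in $NODES; do
  for dst in $NODES; do
    [ "$src" = "$dst" ] && continue
    ip=$(kubectl get pod "netcheck-$dst" -o jsonpath='{.status.podIP}')
    echo "=== $src -> $dst ($ip) ==="
    kubectl exec "netcheck-$src" -- ping -c 2 -W 2 "$ip" || echo "FAILED: $src -> $dst"
  done
done

# Clean up the test pods.
for node in $NODES; do
  kubectl delete pod "netcheck-$node" --wait=false
done
```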
Next, I found that the connectivity issue started after arch2 had been rebooted on Nov 03 13:30 UTC. To get the system running again, I restarted the k3s-agents on arch1/2/3, and the network check was successful again.
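A minimal sketch of the agent restart and a follow-up log check, assuming the default k3s-agent systemd unit created by the k3s installer:

```bash
# On each agent node (arch1/2/3): restart the k3s agent service.
sudo systemctl restart k3s-agent

# Then follow the agent logs to see whether the per-minute errors keep appearing.
sudo journalctl -u k3s-agent -f
```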
To understand the root cause of the problem, I checked the k3s logs and found the following:
- arch2: I could only access logs from after the reboot, and they did not show any anomalies, except that the pods running on the node were crash looping.
- arch1/3: the nodes show the following error every minute until the restart of the k3s-agent; afterwards everything is fine. Since the restart fixes the problem, I believe there is no underlying persistent issue causing these messages. I found that these messages started after upgrading from v1.31.7+k3s1 to v1.32.8+k3s1.
The k3s config I use is:
The OS is:
Do you have any recommendations on how to find the root cause of the problem?
Thank you for your help!