Problem after node reboot: cluster network not healthy anymore #13141
Closed · maltelehmann started this conversation in General · Replies: 1 comment, 3 replies
-
Did you upgrade all your nodes? The Watch error you are seeing is related to upstream changes to the node authorizer in 1.32; as far as I can tell, you would only see these errors if you upgraded the server but forgot to upgrade the agents. Or do you have something else that is attempting to use the kubelet's kubeconfig to list nodes? I don't see any suggestion of what might cause pods on different nodes to fail to communicate. Flannel is pretty simple; I have generally not seen it break randomly when it was previously working fine.
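For example, a quick way to verify whether all nodes are running the same k3s version after an upgrade (the expected version string below is just the one mentioned later in this thread):

```bash
# List all nodes with their kubelet/k3s versions; the VERSION column should be
# identical on the server and every agent (e.g. v1.32.8+k3s1 after the upgrade).
kubectl get nodes -o wide
```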
-
Hi!
First of all, thank you very much for your awesome work!
I am encountering a problem that I cannot clearly pinpoint to either an environment issue or a k3s issue. Maybe you can help investigate this further?
The problem got my attention when a Kafka pod running on a VM called arch2 was not able to talk to the other Kafka brokers running on VMs called arch1 and arch3. I made sure that the network policies did not cause the problem, and I checked the Kubernetes network by starting a pod on each node in the cluster and letting each pod ping every other pod. The result was that pods on arch2 could not talk to arch1 and arch3, and vice versa.
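A minimal sketch of such a pod-to-pod ping check (pod names, the busybox image, and the pinning via nodeName are illustrative assumptions; only the node names are taken from this thread):

```bash
# Start one throwaway test pod pinned to each node.
NODES="arch1 arch2 arch3"
for node in $NODES; do
  kubectl run "netcheck-$node" --image=busybox:1.36 --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$node\"}}" \
    -- sleep 3600
done

# Wait until every test pod is ready.
for node in $NODES; do
  kubectl wait --for=condition=Ready "pod/netcheck-$node" --timeout=120s
done

# Ping every pod from every other pod; a failure here suggests a problem below
# the application layer (e.g. inter-node / flannel networking).
for src in $NODES; do
  for dst in $NODES; do
    [ "$src" = "$dst" ] && continue
    ip=$(kubectl get pod "netcheck-$dst" -o jsonpath='{.status.podIP}')
    echo "=== $src -> $dst ($ip) ==="
    kubectl exec "netcheck-$src" -- ping -c 2 -W 2 "$ip" || echo "FAILED: $src -> $dst"
  done
done

# Clean up the test pods.
for node in $NODES; do
  kubectl delete pod "netcheck-$node" --wait=false
done
```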
Next, I found that the connectivity issue started after arch2 had been rebooted on Nov 03 13:30 UTC. To get the system running again, I restarted the k3s-agents on arch1/2/3, and the network check was successful again.
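A minimal sketch of the agent restart and a follow-up log check, assuming the default k3s-agent systemd unit created by the k3s installer:

```bash
# On each agent node (arch1/2/3): restart the k3s agent service.
sudo systemctl restart k3s-agent

# Then follow the agent logs to see whether the per-minute errors keep appearing.
sudo journalctl -u k3s-agent -f
```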
To understand the root cause of the problem, I checked the k3s logs and found the following:
- arch2: I could only access logs from after the reboot, and they did not show any anomalies, except that the pods running on the node were crash looping.
- arch1/3: the nodes show the following error every minute until the restart of the k3s-agent; afterwards everything is fine. Since the restart fixes the problem, I believe there is no underlying persistent issue causing these messages. I found that these messages started after upgrading from v1.31.7+k3s1 to v1.32.8+k3s1.
The k3s config I use is:
The OS is:
Do you have any recommendations on how to find the root cause of the problem?
Thank you for your help!