Troubleshooting
Common errors encountered when setting up and operating the k3s cluster.
Node Issues
Node stays NotReady
Symptoms:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
vps1 NotReady control-plane,master 2m v1.32.2+k3s1

Possible causes and fixes:
1. k3s service not running
# On the affected node:
systemctl status k3s # master
systemctl status k3s-agent # worker
# If failed:
journalctl -u k3s -f --since "5 minutes ago"
systemctl restart k3s

2. ip_forward disabled
sysctl net.ipv4.ip_forward
# If 0:
sysctl -w net.ipv4.ip_forward=1
# Make permanent:
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.d/99-z-k3s.conf
sysctl --system

3. Kernel modules missing
lsmod | grep -E 'overlay|br_netfilter'
# If not listed:
modprobe overlay
modprobe br_netfilter
# Persist across reboots:
printf 'overlay\nbr_netfilter\n' > /etc/modules-load.d/k3s.conf

Worker joins but shows NotReady
Symptoms: Worker appears in kubectl get nodes but its status is NotReady.
Cause: the Flannel VXLAN tunnel between the nodes is not established, usually because UDP 8472 is blocked by the firewall. (A token mismatch would prevent the worker from joining at all, so if the node appears in kubectl get nodes, suspect the firewall first.)
# On master — verify VXLAN port is open to worker:
ufw status | grep 8472
# If missing:
ufw allow from <WORKER_IP> to any port 8472 proto udp comment 'flannel VXLAN from worker'
ufw reload
# Check Flannel pod logs on the worker node:
kubectl logs -n kube-system -l app=flannel --all-containers

--protect-kernel-defaults causes k3s to fail to start
Symptoms:
FATA[0000] Rancher k3s not supported on this system: ...
flag --protect-kernel-defaults set but sysctl vm.panic_on_oom=X ≠ required value

Fix: Apply the k3s sysctl file:
cat > /etc/sysctl.d/99-z-k3s.conf << 'EOF'
vm.panic_on_oom = 0
vm.overcommit_memory = 1
kernel.panic = 10
kernel.panic_on_oops = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
sysctl --system
systemctl restart k3s

Networking Issues
Pods can't communicate across nodes
Symptoms: Ping between pods on different nodes times out.
Diagnosis:
# Check Flannel pods (should be Running on every node)
kubectl get pods -n kube-system -l app=flannel -o wide
# Check VXLAN interface on each node
ip link show flannel.1
# Check routes
ip route | grep flannel

Fix: Ensure UDP 8472 is open between nodes in both directions:
# On master:
ufw allow from <WORKER_IP> to any port 8472 proto udp
# On worker:
ufw allow from <MASTER_IP> to any port 8472 proto udp
ufw reload

Traefik shows no EXTERNAL-IP
Symptoms:
kubectl get svc -n ingress
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
traefik LoadBalancer 10.43.x.x <pending> 80:xxx/TCP,443:xxx/TCP

Cause: ServiceLB (the k3s built-in load balancer) was disabled and nothing else assigns external IPs.
Fix: k3s's servicelb was intentionally disabled in this setup. Traefik gets its external IP differently: the worker node's IP is used directly once the pod is scheduled there. Check that the pod is running:
kubectl get pods -n ingress -o wide
# Verify it's on the worker node

If the service still shows <pending>, confirm that NodePort or host-network mode is not required and that the Helm values set service.type: LoadBalancer.
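For orientation, the relevant Helm values look roughly like the fragment below. The key names follow the official traefik/traefik chart, but verify them against your chart version; the node name is a placeholder:

```yaml
# Illustrative fragment of Traefik Helm values (not the full values file).
service:
  type: LoadBalancer        # stays <pending> unless something assigns an external IP
nodeSelector:
  kubernetes.io/hostname: <worker-node-name>   # pin Traefik to the worker
```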
DNS resolution not working inside pods
Symptoms: nslookup or curl to service names fails inside pods.
# Test from a pod:
kubectl run dns-test --image=alpine --restart=Never --rm -it -- \
nslookup kubernetes.default.svc.cluster.local

Check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Fix:
# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system

TLS / cert-manager Issues
Certificate stuck in False / not issuing
Diagnosis:
kubectl get certificate -A
kubectl describe certificate <name> -n apps
kubectl get certificaterequest -A
kubectl describe certificaterequest <name> -n apps
# Check cert-manager controller logs
kubectl logs -n cert-manager -l app=cert-manager -f

Common causes:
1. ClusterIssuer not Ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-production

If it is not Ready, check that the cert-manager pods are running and that the ACME account email is valid.
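For reference, a working ClusterIssuer for this kind of setup looks roughly like the sketch below; the email and secret name are placeholders, and the HTTP-01 solver assumes Traefik is the ingress class:

```yaml
# Illustrative ClusterIssuer sketch; adjust email and secret name.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                    # must be a valid address
    privateKeySecretRef:
      name: letsencrypt-production-account-key  # ACME account key secret
    solvers:
      - http01:
          ingress:
            class: traefik                      # assumes Traefik handles ingress
```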
2. HTTP-01 challenge failing
The ACME challenge requires that http://<domain>/.well-known/acme-challenge/ is reachable from the internet.
- DNS A record for the domain must point to the worker node's public IP
- Ports 80 and 443 must be open on the worker
- Traefik must be running and routing traffic
# Verify DNS (from local machine):
dig +short app.example.com
# Should return the worker's public IP
# Verify port 80 is reachable:
curl -I http://app.example.com/.well-known/acme-challenge/test

3. Let's Encrypt rate limit hit
Switch to the staging issuer first, confirm it issues a certificate, then switch to production:
cert-manager.io/cluster-issuer: letsencrypt-staging

Delete the existing secret to force re-issuance after switching:
kubectl delete secret <tls-secret-name> -n apps

TLS certificate using staging (untrusted) when production expected
# Check which issuer is annotated
kubectl get ingress <name> -n apps -o yaml | grep cluster-issuer
# Delete the staging secret and re-apply with production issuer
kubectl delete secret <tls-secret-name> -n apps
kubectl annotate ingress <name> -n apps \
cert-manager.io/cluster-issuer=letsencrypt-production --overwrite

Access Issues
kubectl can't reach the API server
Symptoms:
Unable to connect to the server: dial tcp <MASTER_IP>:6443: connect: connection refused

Check:
# Is k3s running on master?
ssh kevin@<MASTER_IP> "systemctl status k3s"
# Is port 6443 open?
nc -zv <MASTER_IP> 6443
# UFW on master:
ssh kevin@<MASTER_IP> "ufw status | grep 6443"

Fix:
# On master — ensure API port is open:
ufw allow 6443/tcp comment 'k3s API server'
ufw reload

kubeconfig has wrong IP / context
Symptoms: kubectl connects but shows wrong cluster or wrong nodes.
Fix: Re-fetch the kubeconfig:
make kubeconfig
# or:
./scripts/get-kubeconfig.sh <MASTER_IP> <SSH_USER> k3s-infra
kubectl config use-context k3s-infra
kubectl get nodes

Storage Issues
PVC stuck in Pending
Symptoms:
kubectl get pvc -n apps
NAME STATUS VOLUME CAPACITY ...
data Pending

Cause: the local-path StorageClass uses volumeBindingMode: WaitForFirstConsumer, so the PVC remains Pending until a pod consumes it.
Fix: Create/apply the pod/deployment that uses the PVC. Once the pod is scheduled, the PV is created and the PVC transitions to Bound.
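As a sketch of what "a pod that uses the PVC" means in practice (names, namespace, and size are illustrative):

```yaml
# Illustrative PVC plus a pod that consumes it; once the pod is
# scheduled, local-path creates the PV and the PVC becomes Bound.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: apps
spec:
  storageClassName: local-path
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: data-consumer
  namespace: apps
spec:
  containers:
    - name: app
      image: alpine
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data
```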
PVC stuck in Terminating
# Remove the finalizer blocking deletion:
kubectl patch pvc <name> -n apps -p '{"metadata":{"finalizers":null}}'

Useful Diagnostic Commands
# Overall cluster health
kubectl get nodes -o wide
kubectl get pods -A
kubectl top nodes # requires metrics-server
# Events (most useful for debugging)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
kubectl get events -n apps --sort-by='.lastTimestamp'
# Describe a failing pod
kubectl describe pod <name> -n <namespace>
kubectl logs <name> -n <namespace> --previous # previous container logs
# k3s logs on nodes
journalctl -u k3s -n 100 --no-pager # master
journalctl -u k3s-agent -n 100 --no-pager # worker
# Verify all components
kubectl get componentstatuses # deprecated since Kubernetes 1.19; may show empty or Unhealthy on newer clusters
kubectl cluster-info
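The NotReady checks above boil down to scanning the STATUS column of kubectl get nodes. As a small hypothetical helper (not part of the repo's scripts), that filtering can be done with awk:

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose STATUS column
# is not exactly "Ready". Pipe `kubectl get nodes` into it.
list_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Demonstration on canned output (no cluster needed):
printf '%s\n' \
  'NAME STATUS ROLES AGE VERSION' \
  'vps1 Ready control-plane,master 2m v1.32.2+k3s1' \
  'vps2 NotReady worker 2m v1.32.2+k3s1' | list_not_ready
```

With a live cluster this would be used as `kubectl get nodes | list_not_ready`, which prints nothing when every node is healthy.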