13 — Troubleshooting¶
A recipe-style operator manual for failures that actually came out of
building Stage 1. Every recipe is grounded in a real deviation
recorded in plans/PLAN-stage1-*.md; nothing here is speculative.
Each recipe has four blocks: Symptom (what you see), Likely
cause (observed root cause), Diagnose (read-only commands), and
Fix (smallest action that returns you to the canonical flow).
Plan citations (§N.x) point at the source-of-truth section. If a
recipe and the plan disagree, the plan wins.
Organised by phase boundary: Substrate (0..3) → Bootstrap (4) → Phase 5 helm / TF → Pivot (7) → kube-proxy / MetalLB → image / cloud-init → diagnostic recipes → reset.
Substrate (Phase 0..3) issues¶
LXD storage pool: "device contains existing filesystem signature"¶
Symptom. lxd_storage_pools fails with an mkfs.btrfs error like
appears to contain an existing filesystem (xfs / ext4 / LVM2_member /
…). No pool is created.
Likely cause. The block device in
lxd_storage_pools_pools[].source carries a stale signature. LXD runs
mkfs.btrfs without -f — the role contract is "hand LXD a clean
device".
Diagnose.
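A read-only signature scan of the target device (sketch; substitute the device path from lxd_storage_pools_pools[].source):

```shell
# wipefs with no options only reports signatures; it changes nothing.
sudo wipefs /dev/disk/by-id/<your-device>
```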
Any non-empty output means leftover signatures.
Fix. sudo wipefs -af /dev/disk/by-id/<your-device>, then re-run.
The Molecule shared task prepare-clean-disk.yml does this for the
local harness; prod hosts ship a clean disk via the installer image.
See plan §13.4 Deviations.
"Numerical result out of range" creating a libvirt bridge¶
Symptom. virsh net-define, make -C tests/vagrant/debian13 networks, or
the Vagrantfile's network triggers fail with error creating bridge
interface: Numerical result out of range.
The VM never comes up.
Likely cause. Linux caps interface names at 15 characters
(IFNAMSIZ, including the trailing NUL). Names like
virbr-k8slab-mgmt (17 chars) are rejected by
ioctl(SIOCBRADDBR).
Diagnose.
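The defined names can be dumped live with virsh net-dumpxml <net> | grep 'bridge name'. An offline length check (the names below are this repo's shipped set plus the known-bad example) looks like:

```shell
# Flag any bridge name exceeding the kernel limit (IFNAMSIZ incl. NUL -> 15 usable chars).
for name in virbr-k8slab-mgmt k8slab-mgmt k8slab-ext6 k8slab-probe; do
  if [ "${#name}" -gt 15 ]; then
    echo "TOO LONG: $name (${#name} chars)"
  fi
done
# -> TOO LONG: virbr-k8slab-mgmt (17 chars)
```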
Any name= longer than 15 chars is the offender.
Fix. Use the shipped short names (k8slab-mgmt, k8slab-ext6,
k8slab-probe — all ≤12 chars). XML files in tests/vagrant/debian13/
carry header comments documenting this. See plan §9.2 Step 1
deviation.
vagrant-libvirt fails on IPv6-only network: "undefined method 'to_range' for nil"¶
Symptom. vagrant up aborts with NoMethodError: undefined method
'to_range' for nil inside vagrant-libvirt's private_network
plumbing, specifically on networks intended to carry only IPv6.
Likely cause. The plugin's validator unconditionally computes a
DHCPv4 range. No <ip family="ipv4"> element → nil.to_range →
crash.
Diagnose. virsh net-dumpxml k8slab-ext6 | grep '<ip' — if you
only see family="ipv6", the plugin crashes.
Fix. Add a minimal RFC 5737 TEST-NET-1 (192.0.2.0/30) IPv4
block to the network XML — traffic-free, just enough for the
validator. Also pin type: "dhcp" on the IPv6-only NIC in the
Vagrantfile; otherwise the plugin's auto-configure crashes the same
way. See plan §9.2 Deviations.
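The IPv4 stanza added by the fix above might look like this (a sketch in libvirt network XML; the shipped files in tests/vagrant/debian13/ are authoritative):

```xml
<!-- Minimal IPv4 block (RFC 5737 TEST-NET-1) so vagrant-libvirt's
     validator has something to derive a DHCPv4 range from.
     No guest traffic uses this subnet. -->
<ip family="ipv4" address="192.0.2.1" prefix="30"/>
```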
synced_folder "/vagrant" requires interactive sudo on host¶
Symptom. make -C tests/vagrant/debian13 up blocks on an
interactive sudo prompt on the host when vagrant-libvirt
exports the NFS share backing
/vagrant. Unattended runs hang.
Likely cause. By default, vagrant-libvirt syncs the project root to
/vagrant via NFS, which edits /etc/exports via sudo. Nothing in
the VM reads /vagrant — Ansible drives via SSH — so the export is
pure cost.
Diagnose. grep -n synced_folder
tests/vagrant/debian13/Vagrantfile — if you do not see disabled:
true, the export is wired up.
Fix. Disable in the Vagrantfile:
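A minimal sketch of the change, using standard Vagrant syntax (the shipped Vagrantfile is authoritative):

```ruby
Vagrant.configure("2") do |config|
  # Nothing in the VM reads /vagrant; Ansible drives over SSH,
  # so the NFS export is pure cost.
  config.vm.synced_folder ".", "/vagrant", disabled: true
end
```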
Canonical change in Step 8 — saves first-boot time and removes the
host-side sudo prompt. See plan §9.2 Step 9.
Bootstrap (Phase 4) issues¶
clusterctl init fails with "already an instance of"¶
Symptom. Re-running bootstrap_clusterctl, or clusterctl init
manually, fails with Error: there is already an instance of "incus"
InfrastructureProvider installed in the "capn-system" namespace. The
play stops on the init task.
Likely cause. clusterctl init is not idempotent at the CLI
level — all-or-nothing.
Diagnose.
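Check for the provider Deployment the role's pre-check keys on (names as recorded in this chapter):

```shell
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
  -n capn-system get deploy capn-controller-manager
```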
If the Deployment exists, init has already happened.
Fix. Nothing to do — the role's kubernetes.core.k8s_info
pre-check against capn-system/capn-controller-manager skips the init
task on every subsequent run; canonical re-converge yields
changed=0. Do not delete providers by hand to "force a clean
init"; you will desync the rest of the role's state. See plan §13.10
Implementation notes.
Bootstrap kubeconfig points at 127.0.0.1¶
Symptom. kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get
nodes from the runner fails with dial tcp 127.0.0.1:6443: connect:
connection refused. The k3s server in the bootstrap LXC is healthy.
Likely cause. k3s' default server: https://127.0.0.1:6443 is
correct only from the container's own POV. The
bootstrap_clusterctl role is the component that rewrites this URL
when materialising the host-side kubeconfig.
Diagnose. grep server .artifacts/mgmt.kubeconfig. Expect either
https://<bootstrap-eth0-ipv4>:6443 (default) or
https://<host-ip>:16443 if lxd_bootstrap_instance_devices.k3s-api
is set (LXD proxy device).
Fix. Re-run bootstrap_clusterctl — its rewrite is byte-stable
and idempotent. For runner reach from outside the LXD host, set
export_artifacts_mgmt_api_server_url:
"https://<reachable>:16443" and add the k3s-api proxy device on
lxd_bootstrap_instance (plan §15.5). The kubeconfig already pins
tls-server-name: kubernetes.default.svc, so TLS verifies regardless
of which URL you use. See plan §13.10 Step 8 extensions.
CAPN identity Secret not found by CAPN controller¶
Symptom. A workload Cluster CR is stuck in Provisioning. CAPN
controller logs show repeated failed to get LXCCluster identity
Secret: secrets "<name>" not found. The Secret does exist —
elsewhere.
Likely cause. CAPN v1alpha2 looks up the identity Secret in the
same namespace as the LXCCluster CR. There is no cross-namespace
lookup. Placing it only in capn-system (as some older docs suggest)
will not work for a Cluster in capi-clusters.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get clusters,lxcclusters -A
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n <cluster-ns> get secret <k8s_lab_infrastructure_secret_name>
Missing in <cluster-ns> → this recipe.
Fix. Add every workload-cluster namespace to
k8s_lab_capn_identity_namespaces (default ["capi-clusters"]) and
re-run bootstrap_capn_secret. The role fans the Secret out via
server-side apply, idempotent on re-runs, and cleans stale Secrets at
teardown. See plan §13.11 and §15.4.
k3s dual-stack node-IP not picked up¶
Symptom. Bootstrap k3s crash-loops; kube-router logs say
Shutdown request received: failed to start networking: unable to
initialize network policy controller: IPv6 was enabled, but no IPv6
address was found on the node. systemd shows Restart=always
flapping.
Likely cause. kubelet auto-detects one address from the
default route — IPv4 only — even when eth0 has both v4 and global
v6. Under dual-stack --cluster-cidr, k3s' embedded controllers
refuse to start without a v6 node-IP.
Diagnose.
sudo lxc shell capi-bootstrap-0 --project capi-lab
ip -o addr show dev eth0
systemctl cat k3s.service # ExecStart should call k3s-server-launcher
journalctl -u k3s.service -n 200
ExecStart=k3s server … directly (not the launcher wrapper) → wrapper
is missing.
Fix. bootstrap_k3s drops a wrapper at
/usr/local/sbin/k3s-server-launcher that resolves eth0's global v4
+ v6 at start and exec's k3s server --node-ip=<v4>,<v6> "$@".
Re-converge the role to put it back. Do not try to set --node-ip
via EnvironmentFile / ExecStartPre — systemd loads
EnvironmentFile once before any Exec*, before the address is
known. See plan §13.9 --node-ip Implementation note.
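For orientation, the address-resolution step such a launcher performs can be sketched as below. The sample ip -o addr output and the addresses in it are hypothetical; the shipped /usr/local/sbin/k3s-server-launcher is the source of truth:

```shell
# Canned `ip -o addr show dev eth0` output (hypothetical addresses):
sample='2: eth0    inet 10.217.104.10/24 brd 10.217.104.255 scope global eth0
2: eth0    inet6 fd42:77:1::10/64 scope global
2: eth0    inet6 fe80::216:3eff:fe00:1/64 scope link'
# Pick the first global v4 and the first global (non-link-local) v6:
v4=$(printf '%s\n' "$sample" | awk '$3 == "inet" {sub(/\/.*/, "", $4); print $4; exit}')
v6=$(printf '%s\n' "$sample" | awk '$3 == "inet6" && / scope global/ {sub(/\/.*/, "", $4); print $4; exit}')
echo "--node-ip=${v4},${v6}"
# -> --node-ip=10.217.104.10,fd42:77:1::10
# The real wrapper would then: exec k3s server --node-ip="${v4},${v6}" "$@"
```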
disable-cloud-controller missing → bootstrap k3s crash-loops¶
Symptom. Bootstrap k3s in an unprivileged LXC crash-loops with
RBAC errors about extension-apiserver-authentication; systemd shows
activating → failed → activating.
Likely cause. k3s' embedded cloud-controller-manager hits a known
race (k3s-io/k3s#7328) in unprivileged LXC — CCM tries to read the
extension-apiserver-authentication configmap before
poststarthook/rbac/bootstrap-roles creates the RoleBinding. CCM
exits, k3s shuts down, systemd restarts, loop forever.
Diagnose.
sudo lxc shell capi-bootstrap-0 --project capi-lab
journalctl -u k3s.service -n 200 | grep -i 'cloud-controller\|extension-apiserver'
systemctl cat k3s.service | grep -- '--disable-cloud-controller'
Missing --disable-cloud-controller → this recipe.
Fix. The bootstrap_k3s template hardcodes
--disable-cloud-controller. Re-converge; do not strip this flag
via overrides. See plan §13.9 substrate-required flags.
Phase 5 / Helm / Terraform issues¶
helm upgrade rejected: "admission webhook denied: field is immutable"¶
Symptom. terraform apply (or manual helm upgrade) on
capi-cluster-class / capi-workload-cluster aborts with admission
webhook "validation.cluster.x-k8s.io" denied the request: ...:
Forbidden: field is immutable.
Likely cause. Once a Cluster CR has referenced a ClusterClass +
*Template set, CAPI's admission webhook forbids editing most fields.
Naïve helm upgrade with changed values hits the webhook.
Diagnose.
helm --kubeconfig=.artifacts/mgmt.kubeconfig list -A | grep -E 'class|cluster'
helm --kubeconfig=.artifacts/mgmt.kubeconfig get values <release>
Confirm the changed values land on a CAPI template field.
Fix. Bump Chart.yaml: version:. The chart-version-as-CR-name
pattern (metadata.name = "{prefix}-{Chart.Version | replace "." "-"}")
makes Helm create a fresh ClusterClass + *Templates under new names;
the workload Cluster CR derives its reference from the same slug by
reading the workload chart's
Chart.yaml.annotations.k8s-lab.io/capi-cluster-class-chart-version
pin. The Terraform module only echoes the rendered name in outputs.
Old objects live until cleanup; helm rollback restores the previous
pair. See plan §2.9 and §12.10.
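The slug derivation itself is just a dot-to-dash substitution; a sketch (prefix and chart version are example values, not the shipped ones):

```shell
prefix="capi-mgmt"        # example prefix
chart_version="0.3.1"     # example Chart.yaml version:
echo "${prefix}-$(printf '%s' "$chart_version" | tr '.' '-')"
# -> capi-mgmt-0-3-1
```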
helm install metallb-config: Pods Pending on NotReady nodes¶
Symptom. terraform apply hangs on the MetalLB releases; pods
remain Pending; nodes are NotReady with container runtime network
not ready.
Likely cause. CNI install (tigera-operator) is async: the
operator Deployment goes Available before it has reconciled the
Calico Installation CR + calico-node DaemonSet. helm install --wait
only blocks on the operator Deployment, not on the data plane. The
next Helm release schedules pods before nodes go Ready, and they
stick.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get nodes
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n calico-system get installation default -o yaml
Installation/default not Ready → CNI still coming up.
Fix. Insert an explicit Nodes-Ready poll
(kubernetes.core.k8s_info) between the CNI and MetalLB installs.
The e2e-local converge already does this; mirror it in custom
playbooks. The cni-calico Gate B helm-test hook is the secondary
barrier — let it run between releases. See plan §3 step 3.
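A sketch of such a barrier as an Ansible task (assumes the kubernetes.core collection and this repo's kubeconfig path; the exact until expression is illustrative — mirror the e2e-local converge):

```yaml
- name: Wait for every node to report Ready (barrier between CNI and MetalLB)
  kubernetes.core.k8s_info:
    kubeconfig: .artifacts/mgmt.kubeconfig
    kind: Node
  register: nodes
  retries: 30
  delay: 10
  until:
    - nodes.resources | length > 0
    - >-
      nodes.resources | map(attribute='status.conditions') | flatten
      | selectattr('type', 'equalto', 'Ready')
      | selectattr('status', 'equalto', 'True') | list | length
      == nodes.resources | length
```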
terraform apply fails on helm test ... Job exit 1 (Gate B / CNI)¶
Symptom. terraform apply runs the cni-calico install, then a
null_resource calling helm test cni-calico returns non-zero with
Error: pod cni-calico-test-* failed. Gate B has fired.
Likely cause. CNI bring-up is not viable. The unprivileged-LXC
substrate is fixed in this repo (plan §2.8); the variable is
almost always the CNI itself — kernel modules, BPF, NetworkPolicy
mode, IPv6 handling. The shipped Calico config covers the known cases
(vxlan + nf_tables data plane, natOutgoing: Enabled for both
families); local edits often regress one.
Diagnose.
KUBECONFIG=.artifacts/mgmt.kubeconfig kubectl -n calico-system \
logs ds/calico-node --tail=400
KUBECONFIG=.artifacts/mgmt.kubeconfig kubectl get pods -A -o wide \
| grep -v Running
sudo lxc info <node-LXC> --project capi-lab --show-log
Look for missing kernel modules (vxlan, nf_tables,
nf_conntrack), AppArmor denials, namespace permission errors.
Fix. Diagnose the root cause first. Common cases: a profile lost
linux.kernel_modules (vxlan, nf_tables) → fix lxd_profiles; a
consumer override removed nf_conntrack from base_system extras →
restore baseline; a stealth CNI swap → revert.
A real CNI swap is a deliberate design step: new wrapper chart under
charts/, plus a new cni_* input on the workload_cluster Terraform
module signature. Not a runtime toggle. See plan §2.8, §12.2,
§13.6.
External Gate A curl from VM fails¶
Symptom. The metallb-config helm-test hook passes, but the verify-side
curl http://[<VIP>]:80/ from the Vagrant VM fails with No route to
host / Network is unreachable. Only the external acceptance fails.
Likely cause. The external IPv6 segment is not actually carrying
traffic between the VIP-holding node and the probe. Two recorded
regressions:
- radvd (the in-VM RA source on ext6-ra-peer, plan §9.2 Step 9) is not running — node eth1 has no global IPv6 to bind the VIP to.
- br-ext6 is up but does not see the in-VM RA peer — the ext6-ra ↔ ext6-ra-peer veth is broken or never created.
Diagnose.
# On the worker that holds the MetalLB VIP:
sudo lxc shell <speaker-leader-node> --project capi-lab
ip -6 addr show dev eth1 # expect 2001:db8:42:100::/64
# On the Vagrant VM:
ip link show ext6-ra-peer
sudo systemctl status radvd
sudo tcpdump -i ext6-ra-peer 'icmp6 && ip6[40] == 134' -n -c 5
Only fe80::/64 on eth1 → RA reception is broken upstream.
Fix.
- Re-run tests/molecule/shared/tasks/ext6-ra-source.yml (auto-run by every scenario's prepare.yml); it recreates the veth and the radvd config, and starts the daemon.
- Confirm lxd_host_ext_bridge_uplink: ext6-ra in shared group_vars — br-ext6 enslaves ext6-ra so RAs reach every container's eth1.
- Prod: confirm the upstream router actually sends RAs (tcpdump on the host's uplink NIC).
See plan §9.2 Step 9 pivot.
CAPN simplestreams 404 on requested k8s version¶
Symptom. Workload Cluster never gets nodes; CAPN logs say Failed
getting image: The requested image couldn't be found for fingerprint
"kubeadm/v1.32.1".
Likely cause. CAPN's default simplestreams server
(https://images.linuxcontainers.org/capn/) publishes only <minor>.0
plus a small curated set of patch releases. Arbitrary patches 404.
Diagnose.
curl -sSL https://images.linuxcontainers.org/capn/streams/v1/images.json \
| jq '.products | keys[]' | grep kubeadm
Cross-reference against k8s_lab_kubernetes_version.
Fix. Pin k8s_lab_kubernetes_version to a published <minor>.0
(safest — plan §8a Verified version log), or stand up a private
simplestreams remote with your patch and override CAPN provider config
to point at it (consumer-repo concern). See plan §8a and §12.9.
Pivot (Phase 7) issues¶
clusterctl move fails: "ClusterClass not found"¶
Symptom. pivot_clusterctl_move fails on move with failed to
retrieve ClusterClass "<name>": ClusterClass.cluster.x-k8s.io "<name>"
not found. mgmt-1 nodes are Ready; bootstrap still has the Cluster
CR.
Likely cause. Step ordering. The mgmt-1 helm releases
(capi-cluster-class, capi-workload-cluster for mgmt-topology) must
complete on bootstrap before clusterctl move — move walks the CAPI
CR graph and refuses to relocate a Cluster whose ClusterClass is
missing. The canonical e2e-local converge enforces this; a custom
playbook that splits Phase 5 from Phase 7 may not.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get clusterclasses -A
helm --kubeconfig=.artifacts/mgmt.kubeconfig list -A
ClusterClass in Cluster.spec.topology.classRef absent → this recipe.
Fix. Re-run the Phase 5 helm releases on bootstrap, then retry the
pivot — idempotent. Reference: plan §3.1 (the nine ordered steps).
Post-pivot CAPI controller cannot reach LB¶
Symptom. Post-pivot, self-hosted CAPI on mgmt-1 reports failed to
dial control plane endpoint. Bootstrap k3s reached the same LB
without trouble. The LB LXC is up and listening.
Likely cause. Network surface asymmetry. On bootstrap, CAPI
controllers ran host-network as k3s server processes (source IP = the
bootstrap LXC's eth0 IPv6 on capi-int, native L3 to the LB). On
mgmt-1 they run as Pods with Calico-managed Pod IPv6 in
fd42:77:2::/56. Pod→substrate IPv6 SNAT must be on, or pod IPs
cannot reach the LB.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n calico-system get installation default -o yaml | grep -A2 natOutgoing
# Expect 'natOutgoing: Enabled' under both ipPools.
Fix. The cni-calico chart sets natOutgoing: Enabled on both
IPv4 and IPv6 IPPools by default. If an override turned it off,
restore the default and helm upgrade cni-calico. The mandatory pivot
exists precisely to exercise this surface change in every e2e run. See
plan §3.3 Network surface asymmetry.
cleanup_bootstrap removes the wrong runner-side directory¶
Symptom. cleanup_bootstrap deletes more than expected on the
runner. Files outside the project's .artifacts/ are gone.
Likely cause. cleanup_bootstrap_artifacts_root set to a
relative path (e.g. .artifacts resolved against a different
working directory) or a non-canonical path containing .. segments.
The role's preflight blocks / and //, but does not resolve relative
paths against an absolute base.
Diagnose.
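A quick sanity check on the configured value before converging (sketch; the value shown is an example of the unsafe case):

```shell
# Example value; substitute your cleanup_bootstrap_artifacts_root.
root=".artifacts"
case "$root" in
  /*) echo "absolute: $root" ;;
  *)  echo "RELATIVE (unsafe): would resolve against \$PWD as $(pwd)/$root" ;;
esac
```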
Fix. Either set cleanup_bootstrap_artifacts_root to an absolute,
project-scoped path (e.g. /home/<runner>/k8s-lab/.artifacts), or
leave it empty (default — file-deletion skipped; LXC instance still
removed). The Molecule scenario uses
<repo>/.artifacts-cleanup-bootstrap-test to keep the test isolated.
See plan §19.1 Implementation notes.
kube-proxy / NodePort / MetalLB¶
NodePort accepted on internal IP, not external¶
Symptom. curl http://<node-eth1-IPv6>:<nodeport>/ from an
external probe times out, but the same NodePort on <node-eth0-IP>
from the host succeeds.
Likely cause. kube-proxy by default listens on all node IPs. The
two-NIC node design (plan §4) requires NodePort bound only on the
external IPv6 segment. The cluster-config sets
--nodeport-addresses=<external IPv6 CIDR>; if a custom
KubeProxyConfiguration dropped that flag, kube-proxy is back to the
default.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n kube-system get configmap kube-proxy -o yaml | grep -A1 nodePortAddresses
Empty or absent → kube-proxy is listening everywhere.
Fix. Restore nodePortAddresses to the external IPv6 CIDR. The
shipped capi-cluster-class bakes this into KubeadmControlPlaneTemplate;
if a chart-version bump lost your values, re-apply them in the
consumer-side overlay. See plan §5.4.
MetalLB VIP allocated but not reachable¶
Symptom. kubectl get svc shows the LoadBalancer Service has an
external IPv6; kubectl describe svc shows MetalLB IpAllocated
events. But curl http://[<VIP>]:80/ times out.
Likely cause. L2 mode requires NDP on the announcing interface. If
L2Advertisement.spec.interfaces does not list eth1, MetalLB
announces on the wrong NIC (or none); nodeSelectors that exclude the
VIP-holding node mute NDP on it.
Diagnose.
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n metallb-system get l2advertisement -o yaml
# On the speaker leader:
sudo lxc shell <speaker-leader-node> --project capi-lab
tcpdump -i eth1 'icmp6 && ip6[40] >= 134 && ip6[40] <= 137' -n -c 10
# 134=RA, 135=NS, 136=NA, 137=Redirect — expect NS/NA for the VIP.
No NDP for the VIP → this recipe.
Fix. Chart-version-bump metallb-config with:
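For example (the values shape below is illustrative — match the key names to the shipped chart's schema):

```yaml
l2Advertisements:
  - name: external
    interfaces:
      - eth1          # announce only on the external NIC
    # nodeSelectors: omit, or ensure they match the VIP-holding node
```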
Then helm upgrade metallb-config. See plan §5.5.
Image / cloud-init issues¶
Worker node has no global IPv6 on eth1¶
Symptom. A new CAPN-spawned worker boots and joins, but
kubectl get nodes -o wide shows only a link-local IPv6 on eth1.
MetalLB cannot announce VIPs on it; NodePort on eth1 does not bind.
Likely cause. Two sub-causes (plan §2.10, §16.2):
- The node image is not cloud-init capable, so the KubeadmConfigSpec.files payload was never written.
- The image is cloud-init capable but preKubeadmCommands (sysctl --load + networkctl reload) did not run — cloud-init failed mid-flight.
Diagnose.
sudo lxc shell <node> --project capi-lab
cat /etc/sysctl.d/99-capi-ra.conf # accept_ra=2, accept_ra_defrtr=1
ls /etc/systemd/network/ | grep capi-ext # 30-capi-ext.network present
cloud-init status --long
journalctl -u cloud-final -n 200
Missing files → image not cloud-init capable. Files present but
sysctl --load never ran → cloud-init failed.
Fix. Switch to a cloud-init capable image (CAPN's
images:capi/kubeadm/<ver> line is the supported baseline; plan
§12.9). For an already-broken node, hand-run sudo sysctl --load
/etc/sysctl.d/99-capi-ra.conf && sudo networkctl reload as a bandage,
but recreate the Machine via CAPI for a real fix — kubeadm init may
have baked the wrong addresses into Node objects.
Note: runtime net.ipv6.conf.eth1.accept_ra = 0 is expected under
systemd-networkd — with IPv6AcceptRA=yes the daemon processes RAs in
user space, kernel sysctl stays 0. Confirm via ip -6 addr show dev
eth1, not the sysctl. See plan §9.2 Step 9 end-to-end proof.
Diagnostic recipes (always-useful)¶
All read-only, safe to run any time.
# Substrate inventory:
sudo lxc list --project capi-lab -f csv -c n,s,t,4,6
# Shell into a node:
sudo lxc shell <instance> --project capi-lab
# LXD daemon logs:
journalctl -u snap.lxd.daemon -n 200
# CAPI snapshot:
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
get clusters,machines,kubeadmcontrolplanes,machinedeployments -A
# CAPN reconciliation:
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n capn-system logs deploy/capn-controller-manager --tail=400
# CAPI core reconciliation:
kubectl --kubeconfig=.artifacts/mgmt.kubeconfig \
-n capi-system logs deploy/capi-controller-manager --tail=400
# Re-run a Gate test interactively:
helm test <release> --logs --kubeconfig=.artifacts/mgmt.kubeconfig
# RA traffic on the external segment:
sudo tcpdump -i br-ext6 'icmp6 && ip6[40] == 134' -n -c 5
When all else fails¶
If a recipe above does not match and you cannot find the cause in 30 minutes of focused investigation, reset and re-run from scratch — the canonical flow is designed to be cheap to recreate.
# Local harness (Vagrant VM):
make clean-local
# Production / consumer repo:
# Run the consumer-repo's destroy.yml playbook, which in turn invokes
# `cleanup_bootstrap`, `clusterctl delete --all`, and the LXD-side
# teardown roles in reverse canonical order.
Then bring everything back up:
make test-local-e2e # local: full Phase 0..9 cycle
# or
make deploy-workload # add a workload cluster on an
# already-self-hosted mgmt-1
If the failure reproduces on a clean slate:
- Capture the relevant logs (journalctl, kubectl logs, role converge output, terraform apply transcript).
- Identify the closest §N plan reference for the failing step (the table in 01-overview.md "Where the source of truth lives" maps questions to plan sections).
- File an issue with the logs + the §N reference.

The plan is the single source of truth — if the symptom contradicts a documented invariant, that is the bug to chase.
The plan says why. The documentation says how. The code says what. This chapter is the bridge between the three when something between them slips.