# 06 — Quickstart (local lab)
This chapter takes you from a fresh clone of the repository to a fully
working two-cluster Kubernetes lab on a single local Vagrant VM, without
touching any real hardware. It is the recommended first hands-on
exercise after 01-overview.md and 02-architecture.md (plan §9.1, §10.2).
The local lab is the only end-to-end driver shipped in this repository;
real environments are operated through a private consumer repository
(see 07-deployment-guide.md). What you
build here uses the same code paths the consumer repo would call.
## What you'll have at the end
After a successful run, the local lab will look like this:
- One local Vagrant VM running on libvirt/KVM (k8slab_host), attached to a single libvirt network k8slab-mgmt-nat.
- Inside the VM, an LXD substrate hosting:
  - the transient bootstrap k3s LXC (capi-bootstrap-0) — destroyed before the run finishes;
  - the mgmt-1 self-hosted CAPI management cluster (1 CP + 2 workers) running CAPI / CABPK / KCP / CAPN controllers as Pods;
  - the lab-default workload cluster (3 CP + 2 workers) managed by mgmt-1 via CAPN.
- Calico CNI on both clusters, with natOutgoing for IPv6 pod-to-substrate traffic.
- MetalLB on both clusters, announcing IPv6 VIPs from 2001:db8:42:100::/64 on the in-VM br-ext6 bridge (the local substitute for an upstream IPv6 segment — see plan §9.2).
- All Gate A and Gate B helm tests passing on both clusters, including an external curl from the VM to the workload's MetalLB VIP (plan §6, §17).
- A runner-side .artifacts/mgmt.kubeconfig pointing at the self-hosted mgmt-1 — usable directly with kubectl.
End-to-end runtime on a recent laptop with KVM acceleration:
- First run (cold image cache, fresh box, fresh apt + snap): 30–45 min.
- Subsequent run with make destroy-vm in between: ~20 min.
- Repeat test-local-e2e over an already-built VM: ~15–18 min.
The dominant cost is image and binary downloads.
## Prerequisites
The full checklist lives in 05-prerequisites.md.
At a glance:
- A Linux runner. macOS is not tested by this harness.
- A Python virtualenv with Ansible 2.16.x, Molecule, molecule-plugins[vagrant], ansible-lint, and yamllint active on $PATH. The Makefile does not own venv creation.
- Vagrant + libvirt + KVM with the runner user in the libvirt group and /dev/kvm accessible.
- Terraform, helm, kubectl on $PATH.
- ≈ 16 GB free RAM (VM defaults to 12 GB) and ≈ 50 GB free disk for the libvirt pool and in-VM LXD btrfs storage.
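A quick preflight along these lines catches most gaps before step 1 (version strings will differ; the libvirt group check matches the membership requirement above):

ansible --version | head -1     # expect ansible [core 2.16.x]
vagrant --version && terraform version && helm version --short
kubectl version --client
ls -l /dev/kvm                  # must be accessible to the runner user
id -nG | grep -qw libvirt && echo "libvirt group: OK"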
## 1. Clone and set up the runner
git clone <repo-url> k8s-lab
cd k8s-lab
python3 -m venv .venv
source .venv/bin/activate
pip install \
'ansible-core==2.16.*' \
molecule \
'molecule-plugins[vagrant]' \
ansible-lint \
yamllint
# Project-local Ansible collections (gitignored under ansible/collections).
make deps
make deps runs ansible-galaxy collection install --force -r
ansible/requirements.yml -p ansible/collections/. The --force flag
is deliberate — without it ansible-galaxy skips collections already
present in the venv's site-packages, leaving holes in the
project-local tree.
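To confirm the collections actually landed in the project-local tree (rather than only in the venv), point ansible-galaxy's standard list subcommand at it:

ansible-galaxy collection list -p ansible/collections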
Verify the runner is sane:

make lint
This runs lint-yaml, lint-ansible, lint-terraform, lint-helm
in sequence. A clean checkout passes all four. If lint-ansible
fails on a missing collection, re-run make deps.
## 2. Bring up the Vagrant VM and harness
make test-local-harness

This is two delegated steps under the hood:
- make -C tests/vagrant/debian13 up — defines and starts the k8slab-mgmt-nat libvirt network from tests/vagrant/debian13/libvirt-networks/mgmt-nat.xml, then vagrant up --provider=libvirt host. The Vagrantfile boots debian/trixie64 (configurable via K8SLAB_BOX), allocates 8 vCPUs, 12 GB RAM, and a 40 GB qcow2 data disk pinned to /dev/disk/by-id/virtio-k8slab-lxdpool for the LXD btrfs pool.
- make -C tests/molecule harness-smoke — runs molecule --version and lists the scenarios under tests/molecule/ that are wired (have a molecule.yml). It does not run any scenario; it confirms that the Python side of the harness is ready.
A successful run leaves you with the k8slab_host libvirt domain
running and the network active. The runner reaches it through the
inventory script at tests/vagrant/debian13/inventory.py.
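To double-check from the runner side, standard libvirt queries are enough; the --list flag on the inventory script is an assumption based on the usual Ansible dynamic-inventory convention:

virsh -c qemu:///system list --all   # expect k8slab_host running
virsh -c qemu:///system net-list     # expect k8slab-mgmt-nat active
python3 tests/vagrant/debian13/inventory.py --list   # assumed flag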
## 3. Run the full canonical flow
make test-local-e2e

This is the master target. It boots the VM if it is not already up,
then runs the e2e-local Molecule scenario through both converge
and verify. The scenario implements every step of the canonical
flow described in plan §3.1 and chapter
02-architecture.md §3 — bootstrap, mgmt-1,
pivot, cleanup, workload — in a single playbook with no dispatch
branches. Pivot is mandatory and runs every time (plan §3.3).
What happens, phase by phase, with rough timings on a warm-cache run:
| Phase | What happens | Time |
|---|---|---|
| 0–4. Substrate + bootstrap k3s | Roles base_system → binary_fetch → lxd_host → lxd_project → lxd_storage_pools → lxd_network_int_managed → lxd_profiles → lxd_bootstrap_instance → bootstrap_k3s → bootstrap_clusterctl → bootstrap_capn_secret → export_artifacts. Ends with .artifacts/mgmt.kubeconfig pointing at bootstrap k3s and clusterctl init complete. | ~5–7 min |
| 5. mgmt-1 Cluster CR on bootstrap | kubernetes.core.helm installs capi-cluster-class + capi-workload-cluster (mgmt topology, 1 CP + 2 W) on bootstrap. CAPN provisions LXC nodes and a haproxy LB instance. | ~3–4 min |
| 5b. CNI Calico on mgmt-1 | helm install cni-calico --wait plus an explicit k8s_info poll on all Nodes Ready (the Calico operator reconciles asynchronously, so --wait alone is not sufficient). | ~1–2 min |
| 5c. MetalLB on mgmt-1 | helm install metallb + metallb-config. | ~30 s |
| 5d. Gate A/B helm tests on mgmt-1 | helm test on capi-workload-cluster (cluster-ready), cni-calico (Gate B — CNI viability), metallb-config (Gate A — external L2). Failure here stops the run before pivot. | ~1 min |
| 6. Pivot | pivot_clusterctl_move — clusterctl init on mgmt-1, then clusterctl move from bootstrap → mgmt-1. CAPI CRs and the move-labelled CAPN identity Secret migrate. | ~2–3 min |
| 7. Re-emit mgmt.kubeconfig | A second include_role: export_artifacts with run_meta_chain: false overwrites .artifacts/mgmt.kubeconfig with mgmt-1 creds. | <10 s |
| 7b. cleanup_bootstrap | The capi-bootstrap-0 LXC is destroyed. Helm releases that lived on bootstrap go away with it. | ~30 s |
| 8. Workload Cluster on mgmt-1 | Helm install of capi-cluster-class + capi-workload-cluster again, this time with workload-topology values (3 CP + 2 W, name lab-default). | ~3–5 min |
| 8b. CNI + MetalLB on workload | Calico install → poll Nodes Ready → MetalLB + MetalLB config. | ~2 min |
| 9. Verify (Gate A/B on workload) | helm test on the three workload charts plus an external curl from the Vagrant VM via ext6-ra-peer to the announced MetalLB IPv6 VIP. Workload Nodes Ready=True via the rewritten kubeconfig. CAPI snapshot of the self-hosted mgmt-1. | ~1 min |
The run is verbose: molecule streams every Ansible task, every helm
install, and every helm test PASS/FAIL line. If anything fails, the
output points directly at the role / helm release / helm test that
broke. See 13-troubleshooting.md for the
common ones.
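To watch provisioning from a second terminal while converge runs, the exported kubeconfig is usable as soon as Phase 4 completes; Clusters and Machines are ordinary CAPI resources, so plain kubectl suffices:

watch kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get clusters,machines -A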
## 4. Inspect the result
After make test-local-e2e returns successfully, the runner-side
.artifacts/ directory holds the handoff bundle (plan §11.1):
.artifacts/
├── mgmt.kubeconfig # admin kubeconfig for self-hosted mgmt-1
├── mgmt.auto.tfvars.json # Phase 5 Terraform handoff
├── harness-vm-id # local harness VM identity
└── clusters/
└── lab-default.kubeconfig # workload kubeconfig (debug copy from verify.yml)
mgmt.kubeconfig is the same file used throughout the run: it pointed
at bootstrap k3s during Phase 4, was rewritten in place by the second
export_artifacts after pivot, and now holds mgmt-1 admin
credentials. Use it directly:

kubectl --kubeconfig=.artifacts/mgmt.kubeconfig get nodes -o wide
You should see three Ready nodes — mgmt-1-CP-0, mgmt-1-W-0,
mgmt-1-W-1 — with both an IPv4 (capi-int subnet 10.77.0.0/24) and
an external IPv6 in 2001:db8:42:100::/64. The CAPI controllers run
on mgmt-1 itself (self-hosted):
kubectl get pods -n capi-system
kubectl get pods -n capn-system
kubectl get clusters -A
# the workload Cluster CR shows up under capi-clusters/lab-default
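If clusterctl is also on the runner $PATH (the harness only needs it inside the VM, so this is optional), its describe view condenses the same state into a single health tree; the namespace follows the capi-clusters/lab-default location noted above:

clusterctl describe cluster lab-default -n capi-clusters \
  --kubeconfig .artifacts/mgmt.kubeconfig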
For the workload, verify already wrote a debug copy to
.artifacts/clusters/lab-default.kubeconfig. To regenerate it from
the Terraform fixture (after make deploy-workload, see §5), the
corresponding make target reads terraform output -raw kubeconfig from
the workload fixture and writes it to
.artifacts/clusters/<cluster_name>.kubeconfig with mode 0600. Use it
the same way:
kubectl --kubeconfig=.artifacts/clusters/lab-default.kubeconfig \
get nodes -o wide
# 3 CP + 2 W, all Ready, dual-stack as on mgmt-1
To poke around inside the VM:
make -C tests/vagrant/debian13 ssh
# inside the guest:
sudo lxc list -f csv -c n,s --project capi-lab
# capi-bootstrap-0 is gone after cleanup_bootstrap; you should see
# mgmt-1-CP-0, mgmt-1-W-0..1, mgmt-1-LB-0,
# lab-default-CP-0..2, lab-default-W-0..1, lab-default-LB-0
The tests/vagrant/debian13/Makefile also exposes status (vagrant
status) and inventory (Ansible-compatible JSON) for diagnostics.
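Both are thin wrappers, so they compose with ordinary shell tooling; the json.tool pipe below is just one way to pretty-print the inventory JSON (make -s suppresses make's own directory messages):

make -C tests/vagrant/debian13 status
make -s -C tests/vagrant/debian13 inventory | python3 -m json.tool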
## 5. Deploy an additional workload (optional)
After make test-local-e2e the runner has everything it needs to
drive the Terraform workload-cluster route (plan §3.2, §16.6).
The e2e run has already installed lab-default through direct
kubernetes.core.helm tasks, so exercise the Terraform route by
removing that workload first:

make destroy-workload
make deploy-workload

The deploy-workload target asserts .artifacts/mgmt.auto.tfvars.json exists, then
changes into tests/fixtures/terraform/workload-clusters/lab-default/
and runs terraform init -upgrade + terraform apply -auto-approve
-var-file=<repo>/.artifacts/mgmt.auto.tfvars.json. The -var-file
is needed because Terraform auto-loads *.auto.tfvars.json only from
cwd, not from the repo-level .artifacts/.
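For reference, the manual equivalent of the target, with the same paths as above and the repo root resolved via git:

cd tests/fixtures/terraform/workload-clusters/lab-default
terraform init -upgrade
terraform apply -auto-approve \
  -var-file="$(git rev-parse --show-toplevel)/.artifacts/mgmt.auto.tfvars.json"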
The fixture exists in this repo only as a wired test case for the
workload_cluster module. Defining additional, distinct workload
clusters is a consumer-repo activity (see
07-deployment-guide.md).
Both routes install the same Helm releases against the same CAPN substrate; only the orchestrator differs.
## 6. Tear down
The destroy / clean target graph is documented in plan §19.2 and
implemented in the root Makefile. The naming convention is:
- destroy-* operates on running infrastructure (Terraform state, helm releases, the VM, libvirt domains). Each destroy-* cascades the matching clean-* targets so the operator never has to remember which files to wipe afterwards.
- clean-* is file-only and idempotent; safe to run with no infrastructure at all.
- Compound targets (clean-local, reset-all) chain destroy + clean for "start over" scenarios.
### Destroy the workload cluster
make destroy-workload

This branches automatically on what state actually exists:
- Terraform state present (tests/fixtures/terraform/workload-clusters/lab-default/terraform.tfstate) plus .artifacts/mgmt.auto.tfvars.json → terraform destroy -auto-approve -var-file=.... This is the reverse of make deploy-workload.
- No TF state but .artifacts/mgmt.kubeconfig + tfvars present → the workload was installed by the Molecule e2e-local converge (which uses kubernetes.core.helm directly, no Terraform). The target falls back to two helm uninstall calls on mgmt-1, sketched after this list (<name> Cluster CR first → CAPI cascade-deletes Machines + LXC nodes → then <name>-class ClusterClass). Calico / MetalLB live inside the workload and disappear with it.
- Neither → nothing to destroy; the target only runs the clean-up.
In all three branches the target then runs clean-tfstate and
clean-workload-kubeconfig so the next deploy starts fresh.
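For orientation, a minimal sketch of the helm fallback branch; the release names follow the <name> / <name>-class pattern above, and the capi-clusters namespace is an assumption based on where the Cluster CR lives:

# Cluster CR first; CAPI then cascade-deletes Machines and LXC nodes
helm --kubeconfig .artifacts/mgmt.kubeconfig -n capi-clusters uninstall lab-default
# ClusterClass last, once the cluster itself is gone
helm --kubeconfig .artifacts/mgmt.kubeconfig -n capi-clusters uninstall lab-default-class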
### Destroy the Vagrant VM
make destroy-vm

This delegates to make -C tests/vagrant/debian13 destroy:
vagrant destroy -f plus a belt-and-braces sweep of any libvirt
domains and volumes the Vagrantfile is known to allocate (so a
SIGKILLed vagrant up cannot leave orphans), plus removal of stale
.vagrant/machines/*/libvirt/ directories whose UUID no longer maps
to a real domain. Libvirt networks are left in place; to wipe
them too use make -C tests/vagrant/debian13 wipe.
destroy-vm then cascades clean-mgmt-bundle,
clean-workload-kubeconfig, and clean-tfstate.
### Full local reset
make clean-local # "start over fast": destroy-vm + clean-molecule.
# No terraform destroy — the VM is already gone, so
# clean-tfstate (file delete) is sufficient.
make reset-all # "exercise the full destroy chain":
# destroy-workload (real terraform destroy) →
# destroy-vm → clean-molecule. Slower.
## 7. Re-running individual scenarios
tests/molecule/Makefile exposes pattern targets in the form
<scenario>-<driver>-<action>. To re-run a single role's full
Molecule cycle without touching the rest of the canonical flow:

make -C tests/molecule <scenario>-<driver>-<action>
# for example, a full cycle for the bootstrap-k3s scenario:
make -C tests/molecule bootstrap-k3s-delegated-test
Scenarios: base-system, binary-fetch, lxd-host, lxd-project,
lxd-storage-pools, lxd-network-int-managed, lxd-profiles,
lxd-bootstrap-instance, bootstrap-k3s, bootstrap-clusterctl,
bootstrap-capn-secret, export-artifacts, cleanup-bootstrap,
e2e-local. Each scenario name matches its role directory under
ansible/roles/ (snake_case internally, kebab-case as a scenario).
Actions: create, prepare, converge, idempotence, verify,
test, destroy, lint. The delegated and vagrant driver
tokens are interchangeable.
For the full testing model see 12-testing.md.
## Common first-run problems
A short list; the deeper catalogue is in
13-troubleshooting.md.
- libvirt permissions. virsh net-define "permission denied" or vagrant up complaining about qemu:///system means the runner user is not in the libvirt group or /dev/kvm is unreadable. Add the user to the group, log out and back in, verify with virsh -c qemu:///system net-list.
- Vagrant box not found. The Vagrantfile defaults to debian/trixie64 (override with K8SLAB_BOX=...). If missing, run vagrant box add debian/trixie64 --provider=libvirt.
- "Volume for domain is already created" on vagrant up after a killed run. make destroy-vm handles this via the virsh vol-delete + stale-state-dir sweep in the Vagrant Makefile. If it persists, make -C tests/vagrant/debian13 wipe.
- "btrfs pool device dirty" on the in-VM LXD storage pool. The Molecule shared prepare step wipes a loopback device, so this is rare on a clean Vagrant VM. Recovery: make destroy-vm && make test-local-harness.
- Out of memory mid-run. Full canonical flow + image pulls + helm test workloads need ~10 GB resident inside the VM. 16 GB free host RAM is the floor.
- make lint fails on a missing collection. Re-run make deps.
For deeper failure modes (pivot getting stuck, helm tests failing,
CAPN unable to reach the LXD daemon, Calico not reconciling) see
13-troubleshooting.md. For day-2
operations see 11-operations.md.