Monitoring & Observability
The observability stack provides metrics, dashboards, and centralized logging for the entire cluster.
Stack components
| Component | Role | Helm chart |
|---|---|---|
| kube-prometheus-stack | Prometheus + Grafana + Alertmanager + exporters | prometheus-community/kube-prometheus-stack |
| Loki | Centralized log storage | grafana/loki |
| Promtail | Log collector (DaemonSet) | grafana/promtail |
Architecture
```
┌──────────────────────────────────────────────────────────┐
│                   monitoring namespace                   │
│                                                          │
│  ┌──────────────┐    ┌───────────┐    ┌──────────────┐   │
│  │  Prometheus  │◄───│ Exporters │    │     Loki     │   │
│  │  (metrics)   │    │ (node,    │    │ (log store)  │   │
│  └──────┬───────┘    │  cadvisor)│    └──────▲───────┘   │
│         │            └───────────┘           │           │
│  ┌──────▼───────┐                    ┌───────┴──────┐    │
│  │   Grafana    │◄───────────────────│   Promtail   │    │
│  │ (dashboards) │                    │ (DaemonSet)  │    │
│  └──────────────┘                    └──────────────┘    │
│                                                          │
└──────────────────────────────────────────────────────────┘
           ▲ HTTPS via Traefik IngressRoute + cert-manager
```

Prerequisites
Before deploying monitoring:
- Base stack deployed (`make deploy`)
- Grafana admin secret created: `make deploy-grafana-secret`

This creates a `grafana-admin-secret` in the `monitoring` namespace with:
- `username`: admin
- `password`: `<GRAFANA_PASSWORD>` from `.env`
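To confirm the secret exists before deploying, a quick check (secret name and namespace as above):

```shell
# Decode the stored admin username from the secret
kubectl get secret grafana-admin-secret -n monitoring \
  -o jsonpath='{.data.username}' | base64 -d
```

Expect it to print `admin`; an error here means `make deploy-grafana-secret` has not been run yet.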
Deploy
```
make deploy-monitoring
```

This runs `scripts/deploy-monitoring.sh`, which:
- Adds the `prometheus-community` and `grafana` Helm repos
- Installs `kube-prometheus-stack` (Prometheus + Grafana + Alertmanager)
- Installs Loki (single-binary, filesystem storage)
- Installs Promtail (log collector DaemonSet on every node)
- Applies the Grafana `IngressRoute` + TLS `Certificate`
- Imports the Grafana logs dashboard
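Once the script completes, the stack should converge within a few minutes. A sketch of a post-deploy check (the Grafana deployment name assumes the chart is installed under the release name `kube-prometheus-stack`; adjust if yours differs):

```shell
# All monitoring pods should reach Running/Completed
kubectl get pods -n monitoring

# Wait for Grafana specifically
kubectl rollout status deployment/kube-prometheus-stack-grafana \
  -n monitoring --timeout=180s
```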
kube-prometheus-stack
The kube-prometheus-stack Helm chart installs:
- Prometheus — metrics collection and storage
- Grafana — visualization dashboards
- Alertmanager — alert routing and silencing
- kube-state-metrics — Kubernetes object metrics
- node-exporter — host-level metrics (CPU, memory, disk)
- Prometheus Operator — manages `ServiceMonitor` and `PrometheusRule` CRDs
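To illustrate what the operator consumes, here is a minimal `ServiceMonitor` sketch; the app name, namespace, and port name are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app            # hypothetical example
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app         # must match the target Service's labels
  endpoints:
    - port: metrics       # named port on the Service
      interval: 30s
```

Prometheus Operator watches for these objects and reconfigures Prometheus scrape targets automatically — no Prometheus restart or config edit required.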
Grafana access
URL: `https://<GRAFANA_DOMAIN>`
Username: `admin`
Password: `<GRAFANA_PASSWORD>`

Prometheus access (port-forward)

```
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open: http://localhost:9090
```

Alertmanager access (port-forward)

```
kubectl port-forward svc/alertmanager-operated -n monitoring 9093:9093
# Open: http://localhost:9093
```

Loki
Loki stores logs indexed by labels (no full-text indexing). It is queried from Grafana using LogQL.
Configuration (kubernetes/monitoring/loki-values.yaml): deployed in single-binary mode with filesystem storage — suitable for single-node homelab use.
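Loki also answers LogQL over plain HTTP, which is handy for debugging the Grafana datasource. A sketch using a port-forward (the service name matches the in-cluster endpoint used elsewhere in this doc; the selector is illustrative):

```shell
# Forward Loki's HTTP port locally
kubectl port-forward svc/loki -n monitoring 3100:3100 &

# Query recent log lines for a LogQL selector via Loki's HTTP API
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="apps"}' \
  --data-urlencode 'limit=10'
```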
Query logs in Grafana
- Go to Explore → select Loki datasource
- Use a LogQL query:

```
{namespace="apps"}
{namespace="ingress", job="traefik"} |= "error"
{app="my-app"} | json | level="error"
```

Loki service endpoint (in-cluster)

`http://loki.monitoring.svc.cluster.local:3100`

Promtail
Promtail is deployed as a DaemonSet — one pod per node. It:
- Reads container logs from `/var/log/pods/`
- Attaches Kubernetes labels (namespace, pod, container, app)
- Pushes log streams to Loki
Configuration (kubernetes/monitoring/promtail-values.yaml): uses the default pipeline stages to extract structured labels from Kubernetes metadata.
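If logs ever need extra parsing — for example, promoting a JSON `level` field to a Loki label — the chart accepts additional pipeline stages. A hedged sketch for `promtail-values.yaml` (the `cri`, `json`, and `labels` stages are standard Promtail stages; the `level` field is illustrative):

```yaml
config:
  snippets:
    pipelineStages:
      - cri: {}              # parse the container runtime log wrapper
      - json:
          expressions:
            level: level     # extract "level" from JSON log lines
      - labels:
          level:             # promote it to a queryable Loki label
```

High-cardinality labels hurt Loki's index, so promote fields sparingly.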
Grafana dashboards
The following dashboards are available after deploy:
| Dashboard | Source | What it shows |
|---|---|---|
| Kubernetes cluster overview | kube-prometheus built-in | Node CPU/memory, pod counts |
| Node exporter | kube-prometheus built-in | Host CPU, memory, disk, network |
| Traefik | ServiceMonitor auto-discovery | Request rates, latencies, errors |
| Logs — Errors | grafana-logs-dashboard.yaml | Error-focused log explorer |
Import additional dashboards
Grafana has a large community dashboard library. Import by ID from Dashboards → Import:
| ID | Name |
|---|---|
| 315 | Kubernetes cluster monitoring |
| 1860 | Node exporter full |
| 13713 | Loki log summary |
| 17501 | Traefik |
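Dashboard JSON can also be fetched by ID from grafana.com for offline import (the URL pattern is grafana.com's public download endpoint; `latest` resolves to the newest revision):

```shell
# Download the "Node exporter full" dashboard (ID 1860) as JSON
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json
```

The resulting file can then be uploaded via Dashboards → Import.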
Traefik metrics integration
Traefik exposes Prometheus metrics on port 9100. The `serviceMonitor` block in `traefik-values.yaml` creates a `ServiceMonitor` resource that tells Prometheus Operator to scrape Traefik automatically:
```
metrics:
  prometheus:
    serviceMonitor:
      enabled: true
      namespace: ingress
      jobLabel: traefik
      interval: 30s
```

Upgrade
```
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --version <NEW_VERSION> \
  --namespace monitoring \
  --values kubernetes/monitoring/kube-prometheus-values.yaml \
  --reuse-values
```

Update the version in `.env` (`KUBE_PROMETHEUS_VERSION`) and re-run `make deploy-monitoring`.
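To confirm which chart version is actually running after an upgrade:

```shell
# Shows the chart and app version for each release in the namespace
helm list -n monitoring
```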