Overview
The monitoring stack runs in the mon namespace and provides metrics, logs, and alerting for both the Kubernetes cluster and the Synology NAS.
Components
| Component | Purpose | Ingress |
|---|---|---|
| Prometheus | Metrics collection and short-term storage | prometheus.hdhomelab.com |
| Thanos | Long-term metrics storage, query federation | thanos.hdhomelab.com |
| SNMP Exporter | Synology NAS metrics via SNMPv3 | — |
| Loki | Log aggregation | — |
| Fluent-bit | Log shipping from K8s nodes (DaemonSet) | — |
| Promtail | Log shipping from NAS Docker containers | — |
| Grafana | Dashboards and alerting | grafana.hdhomelab.com |
Architecture
```mermaid
graph LR
    subgraph Sources
        K8s[K8s targets\nServiceMonitors]
        NAS[NAS\nnode-exporter · cAdvisor\nTraefik · Watchtower]
        SNMP[Synology\nSNMP]
        NASLogs[NAS Docker\ncontainers]
        K8sLogs[K8s containers\n+ Talos system logs]
    end
    subgraph Metrics
        Prom[Prometheus]
        Thanos[Thanos]
        Minio[(MinIO\nS3)]
    end
    subgraph Logs
        Fluent[Fluent-bit]
        Promtail[Promtail]
        Loki[Loki]
        LokiMinio[(MinIO\nS3)]
    end
    Grafana[Grafana]
    K8s --> Prom
    NAS --> Prom
    SNMP --> Prom
    Prom --> Thanos
    Thanos <--> Minio
    Thanos --> Grafana
    K8sLogs -->|tail + TCP| Fluent
    NASLogs --> Promtail
    Fluent --> Loki
    Promtail --> Loki
    Loki <--> LokiMinio
    Loki --> Grafana
```
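
To make the data flow in the diagram easier to reason about, the sketch below encodes its edges as a small adjacency map and walks from a source to Grafana. It is illustrative only: the node names are the diagram's identifiers, the function name `path_to_grafana` is hypothetical, and the bidirectional MinIO links are recorded in one direction to keep the walk simple.

```python
from collections import deque

# Directed edges copied from the architecture diagram.
# Assumption: the two-way Thanos/Loki <-> MinIO links are stored one-way here.
EDGES = {
    "K8s": ["Prom"],
    "NAS": ["Prom"],
    "SNMP": ["Prom"],
    "Prom": ["Thanos"],
    "Thanos": ["Minio", "Grafana"],
    "K8sLogs": ["Fluent"],
    "NASLogs": ["Promtail"],
    "Fluent": ["Loki"],
    "Promtail": ["Loki"],
    "Loki": ["LokiMinio", "Grafana"],
}


def path_to_grafana(source):
    """Breadth-first search for the shortest hop sequence from source to Grafana."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == "Grafana":
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route from this node to Grafana
```

For example, `path_to_grafana("SNMP")` walks through Prometheus and Thanos, while `path_to_grafana("NASLogs")` goes through Promtail and Loki. A walk like this is a quick way to see which components sit between a missing dashboard series and its source.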
Retention
| Data | Store | Retention |
|---|---|---|
| Prometheus metrics | local-path PVC | 3 days |
| Thanos raw | MinIO (S3) | 10 days |
| Thanos 5m downsampled | MinIO (S3) | 90 days |
| Thanos 1h downsampled | MinIO (S3) | 10 years |
| Loki logs | MinIO (S3) | 28 days |
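
The Thanos tiers in the table imply that older data is only available at coarser resolution. A minimal sketch of that lookup follows, assuming the retention values above and approximating 10 years as 3650 days. The name `finest_resolution` is hypothetical, and exact behavior near the boundaries depends on when the compactor actually runs.

```python
from datetime import timedelta

# Retention windows copied from the table, ordered finest to coarsest.
# Assumption: 10 years is approximated as 3650 days.
THANOS_RETENTION = [
    ("raw", timedelta(days=10)),
    ("5m", timedelta(days=90)),
    ("1h", timedelta(days=3650)),
]


def finest_resolution(age):
    """Return the finest Thanos resolution still retained for data of this age."""
    for name, keep in THANOS_RETENTION:
        if age <= keep:
            return name
    return None  # older than every retention window


print(finest_resolution(timedelta(days=30)))
```

In practice this means a dashboard panel spanning the last year can only expect 1h-resolution samples for most of its range.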
Long-range queries
For queries that reach back more than 3 days (the local Prometheus retention), use the Thanos datasource in Grafana (thanos-001) rather than Prometheus. Thanos Query automatically merges recent data from the Thanos sidecar attached to Prometheus with historical blocks from MinIO.
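
The datasource rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the datasource name `prometheus`, the `https` scheme, and the idea that the thanos.hdhomelab.com ingress fronts Thanos Query are not confirmed by this page. The `/api/v1/query_range` path is the standard Prometheus-compatible HTTP API that Thanos Query also serves. The code only builds the URL and does not send a request.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

PROMETHEUS_RETENTION = timedelta(days=3)  # from the retention table
THANOS_HOST = "thanos.hdhomelab.com"      # ingress listed under Components


def pick_datasource(start, now):
    """Choose the Grafana datasource for a query whose range begins at start."""
    # "prometheus" is an assumed datasource name; "thanos-001" is from this page.
    if now - start > PROMETHEUS_RETENTION:
        return "thanos-001"
    return "prometheus"


def query_range_url(expr, start, end, step_seconds):
    """Build a Prometheus-compatible range query URL (assumes HTTPS on the ingress)."""
    params = urlencode({
        "query": expr,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step_seconds,
    })
    return f"https://{THANOS_HOST}/api/v1/query_range?{params}"


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
start = now - timedelta(days=30)
print(pick_datasource(start, now))
print(query_range_url("up", start, now, 300))
```

A 30-day range selects the Thanos datasource here, while a one-hour range would stay on Prometheus.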