Metrics¶
kube-prometheus-stack¶
The Prometheus stack is deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, kube-state-metrics, and node-exporter into a single release.
Configuration is split between the HelmRelease values and ConfigMaps generated by Kustomize.
ConfigMap pattern¶
Rather than inlining large YAML blobs into the HelmRelease, configuration files (Grafana datasources, alert rules, contact points, notification policies, kube-state-metrics custom resources) are kept as standalone files and assembled into ConfigMaps by Kustomize's configMapGenerator:
# kustomization.yaml
configMapGenerator:
  - name: grafana-datasources
    files:
      - datasources.yaml
  - name: grafana-alert-rules
    files:
      - rules.yaml
  - name: flux-kube-state-metrics-config
    files:
      - kube-state-metrics-config.yaml
    options:
      disableNameSuffixHash: true
disableNameSuffixHash
By default, Kustomize appends a content hash to ConfigMap names (e.g., grafana-datasources-5f8b9c). disableNameSuffixHash: true keeps names stable so HelmRelease valuesFrom references don't break whenever a file's content changes.
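To illustrate why the stable name matters, here is a minimal sketch of a HelmRelease that reads values from one of the generated ConfigMaps. The release name, the namespace, and the idea that datasources.yaml holds a chart values fragment are assumptions made for the example; required fields such as chart and interval are left out, and older Flux versions use a beta apiVersion instead of v2.

```yaml
# Hypothetical HelmRelease fragment; metadata values are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  # chart, interval, and inline values omitted for brevity
  valuesFrom:
    - kind: ConfigMap
      # Must match the generated ConfigMap name exactly,
      # which is why the hash suffix is disabled.
      name: grafana-datasources
      # Key inside the ConfigMap; configMapGenerator uses the file name.
      valuesKey: datasources.yaml
```

If the hash suffix were left enabled, the name above would stop resolving as soon as the file changed.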
The same pattern is used for Grafana dashboards — each dashboard JSON file gets its own ConfigMap with the label grafana_dashboard: "1", which Grafana's sidecar picks up automatically.
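A dashboard entry might look roughly like the following. The ConfigMap and file names are hypothetical; the label is the one named above, attached through the generator's options.

```yaml
# kustomization.yaml (sketch; cluster-overview.json is a hypothetical file)
configMapGenerator:
  - name: grafana-dashboard-cluster-overview
    files:
      - cluster-overview.json
    options:
      disableNameSuffixHash: true
      labels:
        # Label the Grafana sidecar watches for
        grafana_dashboard: "1"
```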
Prometheus scraping¶
Prometheus uses three complementary scraping patterns:
ServiceMonitor and PodMonitor resources are the recommended pattern for in-cluster workloads. Apps that expose a /metrics endpoint define one of these resources, and Prometheus picks it up automatically.
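As a rough illustration, a ServiceMonitor for a hypothetical app could look like this. The app name, namespace, label, and port name are all assumptions; only the resource kind and the /metrics path come from the text above.

```yaml
# Hypothetical ServiceMonitor; names, labels, and port are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example
spec:
  # Selects the Service(s) whose endpoints should be scraped
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    # Refers to a named port on the Service
    - port: http-metrics
      path: /metrics
      interval: 30s
```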
The ServiceMonitor and PodMonitor selectors on the Prometheus resource are both set to watch all monitors cluster-wide, regardless of namespace or labels.
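With the kube-prometheus-stack chart, one common way to get this behaviour is to turn off the chart's default release-label filtering. The fragment below is a sketch of that approach and may not match the exact values used in this repository.

```yaml
# HelmRelease values fragment (sketch, not the repository's actual values)
prometheus:
  prometheusSpec:
    # When false, an unset selector matches every ServiceMonitor
    # instead of only those labelled with the Helm release name.
    serviceMonitorSelectorNilUsesHelmValues: false
    # Same behaviour for PodMonitors
    podMonitorSelectorNilUsesHelmValues: false
```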
For workloads that don't have a ServiceMonitor, Prometheus can scrape based on pod/service annotations.
Additional relabel configs attach namespace, pod, and cluster=noah labels to every metric.
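A minimal sketch of an annotation-driven scrape job is shown below. It assumes the conventional prometheus.io/scrape, prometheus.io/path, and prometheus.io/port annotations and a job name of kubernetes-pods; the actual annotation keys used here are not stated in this document. The last three rules add the namespace, pod, and cluster=noah labels described above.

```yaml
# Sketch of an additional scrape config; annotation keys are assumed.
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Honour a custom metrics path if prometheus.io/path is set
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      regex: (.+)
      target_label: __metrics_path__
    # Rewrite the scrape address to use the prometheus.io/port value
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: $1:$2
      target_label: __address__
    # Attach namespace, pod, and cluster labels to every series
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
    - target_label: cluster
      replacement: noah
```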
NAS services are not in the cluster and can't use ServiceMonitors. They are defined as static scrape configs:
| Job | Port | Interval | Notes |
|---|---|---|---|
| synology-node | 9100 | 5s | Node exporter |
| synology-cadvisor | 8380 | 5s | Container metrics |
| synology-traefik | 8080 | 30s | Reverse proxy metrics |
| synology-watchtower | 8480 | 10m | Docker update metrics, bearer token auth |
| synology-snmp | 9116 | 5s | Via SNMP exporter (SNMPv3) |
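Two of the jobs in the table might be written along these lines. The hostname nas.example.internal and the token file path are placeholders, and the Watchtower metrics path is an assumption that should be checked against the Watchtower documentation.

```yaml
# Sketch of static scrape configs; hostname and file path are placeholders.
- job_name: synology-node
  scrape_interval: 5s
  static_configs:
    - targets: ["nas.example.internal:9100"]

- job_name: synology-watchtower
  scrape_interval: 10m
  # Assumed metrics path; verify against the Watchtower docs
  metrics_path: /v1/metrics
  authorization:
    type: Bearer
    # Hypothetical path to a mounted secret holding the token
    credentials_file: /etc/prometheus/secrets/watchtower/token
  static_configs:
    - targets: ["nas.example.internal:8480"]
```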
kube-state-metrics Flux extension¶
kube-state-metrics is extended with a custom resource state config to expose Flux CD objects as Prometheus metrics. This is loaded via the flux-kube-state-metrics-config ConfigMap and generates gotk_resource_info metrics for every Flux resource type (Kustomization, HelmRelease, GitRepository, ImagePolicy, etc.), enabling the Flux dashboards in Grafana.
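For orientation, a custom resource state config that emits gotk_resource_info for a single Flux kind could look roughly like this. It is modelled on the general shape of kube-state-metrics custom resource state configs; the exact labels in this repository's kube-state-metrics-config.yaml may differ.

```yaml
# Sketch for one resource kind; label selection is an assumption.
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: kustomize.toolkit.fluxcd.io
        version: v1
        kind: Kustomization
      # Prefix + metric name below yields gotk_resource_info
      metricNamePrefix: gotk
      metrics:
        - name: resource_info
          help: The current state of a Flux Kustomization resource.
          each:
            type: Info
            info:
              labelsFromPath:
                name: [metadata, name]
          labelsFromPath:
            exported_namespace: [metadata, namespace]
            # Status of the Ready condition ("True" / "False")
            ready: [status, conditions, "[type=Ready]", status]
            suspended: [spec, suspend]
```

Each additional Flux kind (HelmRelease, GitRepository, and so on) would get its own entry under resources.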
Thanos¶
Deployment modes¶
Thanos supports several deployment architectures. The main trade-off is between operational complexity and query capability:
| Mode | Description | Trade-offs |
|---|---|---|
| Sidecar | Runs alongside Prometheus; uploads blocks to object storage | Simple; Prometheus remains the write path |
| Receiver | Accepts Prometheus remote write; Thanos owns the write path | More complex; designed for multi-tenant ingestion |
| Ruler | Standalone rule evaluation against Thanos Query | Decouples alerting from Prometheus |
Why sidecar?
The sidecar mode is the simplest option. Prometheus remains the authoritative data source for recent data; the sidecar uploads each completed 2-hour TSDB block to MinIO and serves recent data to Thanos Query over gRPC. The Receiver model is designed for multi-cluster or multi-tenant setups and adds operational overhead that isn't warranted here.
How it fits together¶
graph TB
subgraph Prometheus Pod
P[Prometheus\nTSDB]
S[Thanos Sidecar]
P -- local read --> S
P -- TSDB blocks every 2h --> S
end
M[(MinIO\nS3 bucket)]
S -- upload blocks --> M
SG[StoreGateway\nserves historical blocks]
M -- reads --> SG
C[Compactor\ndownsamples + deduplicates]
M -- reads/writes --> C
QF[QueryFrontend\ncaching + splitting]
Q[Query\nfan-out]
QF --> Q
Q -- recent data\ngrpc --> S
Q -- historical data\ngrpc --> SG
Grafana[Grafana] --> QF
Data flow:
- Prometheus scrapes and stores metrics locally in TSDB
- Every 2 hours, Prometheus flushes a block to disk; the sidecar uploads it to MinIO
- Prometheus retains 3 days locally as a buffer
- StoreGateway indexes and serves blocks from MinIO for anything older than what Prometheus holds
- Compactor runs in the background to downsample and deduplicate blocks
- Query fans out requests to both the sidecar (recent) and StoreGateway (historical), deduplicating results
- QueryFrontend sits in front of Query to cache and split long-range requests; this is the datasource Grafana uses
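The sidecar, StoreGateway, and Compactor all need to know where the bucket lives. A minimal sketch of a Thanos object storage config pointing at MinIO is shown below; the bucket name, endpoint, and credentials are placeholders, not values from this setup.

```yaml
# objstore.yml (sketch; every value here is a placeholder)
type: S3
config:
  bucket: thanos
  # In-cluster MinIO service address (hypothetical)
  endpoint: minio.minio.svc.cluster.local:9000
  access_key: REPLACE_ME
  secret_key: REPLACE_ME
  # Set to true only when MinIO is served over plain HTTP
  insecure: true
```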
Retention tiers¶
| Resolution | Retention | Use case |
|---|---|---|
| Raw (full fidelity) | 10 days | Short-term debugging |
| 5m downsampled | 90 days | Weekly/monthly trends |
| 1h downsampled | 10 years | Long-term capacity planning |
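These tiers correspond to the Compactor's per-resolution retention flags. The container args below are a sketch of how the table could be expressed; the config file path is an assumption, and the surrounding Deployment or chart values are omitted.

```yaml
# Compactor container args (sketch; objstore path is hypothetical)
args:
  - compact
  # Keep running and compact continuously instead of exiting
  - --wait
  - --objstore.config-file=/etc/thanos/objstore.yml
  # Retention per resolution, matching the table above
  - --retention.resolution-raw=10d
  - --retention.resolution-5m=90d
  - --retention.resolution-1h=10y
```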
SNMP Exporter¶
The SNMP exporter runs in-cluster and acts as a proxy: Prometheus scrapes it at :9116, and the exporter performs the actual SNMP walk against the NAS on each request. This avoids running any agent on the NAS itself.
How it works¶
sequenceDiagram
participant P as Prometheus
participant E as SNMP Exporter
participant N as Synology NAS
P->>E: GET /snmp?target=NAS&module=synology&auth=snmpv3
E->>N: SNMPv3 walk (authPriv, MD5/DES)
N-->>E: OID values
E-->>P: Prometheus metrics
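The request in the diagram is typically produced by relabeling: the NAS address becomes the target query parameter, and the scrape itself is sent to the exporter. The sketch below assumes a NAS hostname of nas.example.internal and an exporter Service named snmp-exporter in a monitoring namespace; both are placeholders.

```yaml
# Sketch of the synology-snmp scrape job; hostnames are placeholders.
- job_name: synology-snmp
  scrape_interval: 5s
  metrics_path: /snmp
  params:
    module: [synology]
    auth: [snmpv3]
  static_configs:
    # The device to walk, not the address Prometheus connects to
    - targets: ["nas.example.internal"]
  relabel_configs:
    # Pass the NAS address as ?target=...
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the NAS address as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Send the actual HTTP request to the exporter
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc:9116
```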
DSM setup¶
SNMPv3 must be enabled on the NAS before the exporter can connect. In DSM, go to Control Panel → Terminal & SNMP and enable:
- SNMP service
- SNMPv3
- SNMP privacy
Warning
The username and passwords configured in DSM must match the credentials stored in Vault (snmp/synology).
Authentication¶
The exporter uses the SNMPv3 authPriv security level, which requires both authentication and encryption:
| Setting | Value |
|---|---|
| Security level | authPriv |
| Auth protocol | MD5 |
| Privacy protocol | DES |
| Credentials | Pulled from Vault via ExternalSecret |
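In the exporter's snmp.yml, these settings would sit under an auths entry whose name matches the auth parameter Prometheus sends. The sketch below uses placeholder credentials; in this setup the real values come from Vault, and how they are templated into the file is not covered here.

```yaml
# snmp.yml auths section (sketch; credentials are placeholders)
auths:
  snmpv3:
    version: 3
    security_level: authPriv
    username: REPLACE_ME
    auth_protocol: MD5
    password: REPLACE_ME
    priv_protocol: DES
    priv_password: REPLACE_ME
```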
OID walks¶
The synology module walks six OID trees:
| OID | Description |
|---|---|
| 1.3.6.1.2.1.2 | Network interfaces (IF-MIB) |
| 1.3.6.1.2.1.31.1.1 | Extended interface counters (64-bit HC counters) |
| 1.3.6.1.4.1.6574.1 | Synology system status (temp, power, fans, model, DSM version) |
| 1.3.6.1.4.1.6574.2 | Disk info (status, temperature, model, type) |
| 1.3.6.1.4.1.6574.3 | RAID status (name, status, free/total size) |
| 1.3.6.1.4.1.6574.6 | Service user counts per service |
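If the module is built with the snmp_exporter generator, the walk list could be declared roughly as follows. This assumes the generator workflow is in use, which this document does not confirm, and the lookup index names are assumptions based on the standard IF-MIB and Synology MIBs.

```yaml
# generator.yml (sketch; lookups are assumptions based on the MIBs)
modules:
  synology:
    walk:
      - 1.3.6.1.2.1.2        # IF-MIB interfaces
      - 1.3.6.1.2.1.31.1.1   # 64-bit interface counters
      - 1.3.6.1.4.1.6574.1   # Synology system status
      - 1.3.6.1.4.1.6574.2   # Disk info
      - 1.3.6.1.4.1.6574.3   # RAID status
      - 1.3.6.1.4.1.6574.6   # Service user counts
    lookups:
      # Replace numeric indexes with readable label values
      - source_indexes: [ifIndex]
        lookup: ifName
      - source_indexes: [diskIndex]
        lookup: diskID
      - source_indexes: [raidIndex]
        lookup: raidName
      - source_indexes: [serviceInfoIndex]
        lookup: serviceName
```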
Collected metrics¶
From 1.3.6.1.4.1.6574.1.*:
| Metric | Description |
|---|---|
| systemStatus | Overall NAS health (1 = normal) |
| temperature | NAS chassis temperature (°C) |
| powerStatus | Power supply status |
| systemFanStatus | System fan status |
| cpuFanStatus | CPU fan status |
| upgradeAvailable | Whether a DSM update is available |
| modelName / version | NAS model and DSM version (info labels) |
From 1.3.6.1.4.1.6574.2.* — per disk, labeled by diskID:
| Metric | Description |
|---|---|
| diskStatus | 1=Normal, 2=Initialized, 3=NotInitialized, 4=SystemPartitionFailed, 5=Crashed |
| diskTemperature | Disk temperature (°C) |
| diskModel / diskType | Model name and type (SATA/SSD) |
From 1.3.6.1.4.1.6574.3.* — per volume, labeled by raidName:
| Metric | Description |
|---|---|
| raidStatus | RAID array health |
| raidFreeSize | Free space (bytes) |
| raidTotalSize | Total capacity (bytes) |
From 1.3.6.1.2.1.2.* + 1.3.6.1.2.1.31.* — per interface, labeled by ifName:
Includes ifInOctets, ifOutOctets, ifInErrors, ifOutErrors, ifOperStatus, ifHighSpeed, and 64-bit HC variants (ifHCInOctets, ifHCOutOctets).
From 1.3.6.1.4.1.6574.6.* — per service, labeled by serviceName:
| Metric | Description |
|---|---|
| serviceUsers | Number of active users per service (HTTP, FTP, CIFS, etc.) |
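As an example of putting these metrics to use, the following sketch shows alert rules in Prometheus rule-file format. The alert names, thresholds, and durations are illustrative assumptions, and this setup may define its alerts in Grafana instead.

```yaml
# Sketch of alert rules built on the metrics above; thresholds are assumptions.
groups:
  - name: synology
    rules:
      - alert: SynologySystemNotNormal
        # systemStatus is 1 when the NAS reports normal health
        expr: systemStatus != 1
        for: 5m
      - alert: SynologyDiskNotNormal
        # Any diskStatus other than 1 (Normal), per disk
        expr: diskStatus != 1
        for: 5m
      - alert: SynologyVolumeAlmostFull
        # Less than 10% free space on a volume
        expr: raidFreeSize / raidTotalSize < 0.10
        for: 30m
```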