Metrics

kube-prometheus-stack

The Prometheus stack is deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, kube-state-metrics, and node-exporter into a single release.

Configuration is split between the HelmRelease values and ConfigMaps generated by Kustomize.

ConfigMap pattern

Rather than inlining large YAML blobs into the HelmRelease, configuration files (Grafana datasources, alert rules, contact points, notification policies, kube-state-metrics custom resources) are kept as standalone files and assembled into ConfigMaps by Kustomize's configMapGenerator:

# kustomization.yaml
configMapGenerator:
- name: grafana-datasources
  files:
  - datasources.yaml
- name: grafana-alert-rules
  files:
  - rules.yaml
- name: flux-kube-state-metrics-config
  files:
  - kube-state-metrics-config.yaml
  options:
    disableNameSuffixHash: true

disableNameSuffixHash

By default, Kustomize appends a content hash to generated ConfigMap names (e.g., grafana-datasources-5f8b9c), so the name changes whenever the file contents change. disableNameSuffixHash: true keeps the name stable so HelmRelease valuesFrom references keep resolving after a configuration change.

The same pattern is used for Grafana dashboards — each dashboard JSON file gets its own ConfigMap with the label grafana_dashboard: "1", which Grafana's sidecar picks up automatically.
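
A dashboard entry following this pattern might look like the sketch below. The ConfigMap name and file path are hypothetical; options.labels is the Kustomize generator option that attaches the label the sidecar looks for.

```yaml
# kustomization.yaml (sketch; name and path are hypothetical)
configMapGenerator:
- name: grafana-dashboard-example
  files:
  - dashboards/example.json
  options:
    labels:
      grafana_dashboard: "1"   # label the Grafana sidecar watches for
```

Because the sidecar discovers dashboards by label rather than by name, a hashed name suffix does not get in the way here.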

Prometheus scraping

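Prometheus uses three complementary scraping patterns: ServiceMonitor and PodMonitor resources, annotation-based discovery, and static scrape configs.

ServiceMonitors and PodMonitors are the recommended pattern for in-cluster workloads. Apps that expose a /metrics endpoint define a ServiceMonitor or PodMonitor resource, and Prometheus picks it up automatically.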

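Both monitor selectors are left empty by setting the NilUsesHelmValues flags to false, so Prometheus watches all monitors cluster-wide regardless of namespace or labels, instead of only those carrying the Helm release label: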

prometheusSpec:
  serviceMonitorSelectorNilUsesHelmValues: false
  podMonitorSelectorNilUsesHelmValues: false
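
With the selectors open, an application only needs to ship a monitor next to its Service. A minimal sketch, assuming a hypothetical app whose Service carries the label app.kubernetes.io/name: example-app and exposes a port named http-metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app        # hypothetical
  namespace: example       # hypothetical
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match the Service's labels
  endpoints:
  - port: http-metrics     # name of the Service port, not its number
    path: /metrics
    interval: 30s
```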

For workloads that don't have a ServiceMonitor, Prometheus can scrape based on pod/service annotations:

prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"

Additional relabel configs attach namespace, pod, and cluster=noah labels to every metric.
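
The annotation handling and label attachment can be sketched as a scrape job like the one below. This follows the widely used Kubernetes pod discovery pattern; the job name is an assumption, and the actual config in this repository may differ.

```yaml
- job_name: kubernetes-pods          # hypothetical job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Use prometheus.io/path as the metrics path when set
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    regex: (.+)
    target_label: __metrics_path__
  # Rewrite the scrape address to use prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Attach namespace, pod, and a fixed cluster label
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - target_label: cluster
    replacement: noah
```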

NAS services are not in the cluster and can't use ServiceMonitors. They are defined as static scrape configs:

| Job | Port | Interval | Notes |
| --- | --- | --- | --- |
| synology-node | 9100 | 5s | Node exporter |
| synology-cadvisor | 8380 | 5s | Container metrics |
| synology-traefik | 8080 | 30s | Reverse proxy metrics |
| synology-watchtower | 8480 | 10m | Docker update metrics, bearer token auth |
| synology-snmp | 9116 | 5s | Via SNMP exporter (SNMPv3) |
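
Two of these jobs might be written as follows. This is a sketch only: the NAS hostname and the token file path are hypothetical, and the ports and intervals are the ones listed for these jobs.

```yaml
- job_name: synology-node
  scrape_interval: 5s
  static_configs:
  - targets: ["nas.example.internal:9100"]   # hypothetical hostname

- job_name: synology-watchtower
  scrape_interval: 10m
  authorization:
    type: Bearer
    # Hypothetical path to a mounted secret holding the token
    credentials_file: /etc/prometheus/secrets/watchtower/token
  static_configs:
  - targets: ["nas.example.internal:8480"]   # hypothetical hostname
```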

kube-state-metrics Flux extension

kube-state-metrics is extended with a custom resource state config to expose Flux CD objects as Prometheus metrics. This is loaded via the flux-kube-state-metrics-config ConfigMap and generates gotk_resource_info metrics for every Flux resource type (Kustomization, HelmRelease, GitRepository, ImagePolicy, etc.), enabling the Flux dashboards in Grafana.
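
One entry of such a config, for the Kustomization kind, might look like the sketch below. It is modeled on the custom resource state format of kube-state-metrics; the exact label set used in this repository is an assumption.

```yaml
# kube-state-metrics-config.yaml (sketch; one resource type shown)
kind: CustomResourceStateMetrics
spec:
  resources:
  - groupVersionKind:
      group: kustomize.toolkit.fluxcd.io
      version: v1
      kind: Kustomization
    metricNamePrefix: gotk          # prefix + name below = gotk_resource_info
    metrics:
    - name: resource_info
      help: The current state of a Flux Kustomization resource.
      each:
        type: Info
        info:
          labelsFromPath:
            name: [metadata, name]
      labelsFromPath:
        exported_namespace: [metadata, namespace]
        ready: [status, conditions, "[type=Ready]", status]
        suspended: [spec, suspend]
```

Each additional Flux kind gets its own entry under resources with the same metric name, so dashboards can query a single gotk_resource_info series across kinds.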


Thanos

Deployment modes

Thanos supports several deployment architectures. The main trade-off is between operational complexity and query capability:

| Mode | Description | Trade-offs |
| --- | --- | --- |
| Sidecar | Runs alongside Prometheus; uploads blocks to object storage | Simple; Prometheus remains the write path |
| Receiver | Receives samples via Prometheus remote write; Thanos owns the write path | More complex; designed for multi-tenant ingestion |
| Ruler | Standalone rule evaluation against Thanos Query | Decouples alerting from Prometheus |

Why sidecar?

The sidecar mode is the simplest option. Prometheus remains the authoritative data source for recent data; the sidecar uploads each completed 2-hour TSDB block to MinIO and exposes Prometheus's recent data to Thanos Query. The Receiver model is designed for multi-cluster or multi-tenant setups and adds operational overhead that isn't warranted here.

How it fits together

graph TB
  subgraph Prometheus Pod
    P[Prometheus\nTSDB]
    S[Thanos Sidecar]
    P -- local read --> S
    P -- TSDB blocks every 2h --> S
  end

  M[(MinIO\nS3 bucket)]
  S -- upload blocks --> M

  SG[StoreGateway\nserves historical blocks]
  M -- reads --> SG

  C[Compactor\ndownsamples + deduplicates]
  M -- reads/writes --> C

  QF[QueryFrontend\ncaching + splitting]
  Q[Query\nfan-out]

  QF --> Q
  Q -- recent data\ngrpc --> S
  Q -- historical data\ngrpc --> SG

  Grafana[Grafana] --> QF

Data flow:

  1. Prometheus scrapes and stores metrics locally in TSDB
  2. Every 2 hours, Prometheus flushes a block to disk; the sidecar uploads it to MinIO
  3. Prometheus retains 3 days locally as a buffer
  4. StoreGateway indexes and serves blocks from MinIO for anything older than what Prometheus holds
  5. Compactor runs in the background to compact, downsample, and deduplicate blocks, and applies the retention tiers
  6. Query fans out requests to both the sidecar (recent) and StoreGateway (historical), deduplicating results
  7. QueryFrontend sits in front of Query to cache and split long-range requests; this is the datasource Grafana uses
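
The sidecar, StoreGateway, and Compactor all reach MinIO through the same Thanos object storage configuration. A minimal sketch, assuming a hypothetical bucket name and endpoint, with credentials supplied from a secret:

```yaml
# objstore.yml (sketch; bucket and endpoint are hypothetical)
type: S3
config:
  bucket: thanos
  endpoint: minio.example.internal:9000
  access_key: <access key from secret>
  secret_key: <secret key from secret>
  insecure: true   # assumption: MinIO is served over plain HTTP
```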

Retention tiers

| Resolution | Retention | Use case |
| --- | --- | --- |
| Raw (full fidelity) | 10 days | Short-term debugging |
| 5m downsampled | 90 days | Weekly/monthly trends |
| 1h downsampled | 10 years | Long-term capacity planning |
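
These tiers correspond to the Compactor's per-resolution retention flags. The container args below are a sketch; how they are passed depends on the chart or manifest used to deploy Thanos, which is an assumption here.

```yaml
# Thanos Compactor container args (sketch)
args:
- compact
- --wait                              # keep running instead of exiting after one pass
- --retention.resolution-raw=10d
- --retention.resolution-5m=90d
- --retention.resolution-1h=10y
```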

SNMP Exporter

The SNMP exporter runs in-cluster and acts as a proxy: Prometheus scrapes it at :9116, and the exporter performs the actual SNMP walk against the NAS on each request. This avoids running any agent on the NAS itself.

How it works

sequenceDiagram
  participant P as Prometheus
  participant E as SNMP Exporter
  participant N as Synology NAS

  P->>E: GET /snmp?target=NAS&module=synology&auth=snmpv3
  E->>N: SNMPv3 walk (authPriv, MD5/DES)
  N-->>E: OID values
  E-->>P: Prometheus metrics
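
The request in the diagram is produced by a scrape job that passes the NAS address as a URL parameter and redirects the scrape itself to the exporter. A sketch using the standard snmp_exporter relabeling pattern; the NAS hostname and the exporter Service name are hypothetical.

```yaml
- job_name: synology-snmp
  scrape_interval: 5s
  metrics_path: /snmp
  params:
    module: [synology]
    auth: [snmpv3]
  static_configs:
  - targets: ["nas.example.internal"]   # hypothetical NAS address
  relabel_configs:
  # Pass the NAS address to the exporter as ?target=...
  - source_labels: [__address__]
    target_label: __param_target
  # Keep the NAS address as the instance label
  - source_labels: [__param_target]
    target_label: instance
  # Send the actual scrape to the exporter
  - target_label: __address__
    replacement: snmp-exporter.monitoring.svc:9116   # hypothetical Service name
```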

DSM setup

SNMPv3 must be enabled on the NAS before the exporter can connect. In DSM, go to Control Panel → Terminal & SNMP and enable:

  • SNMP service
  • SNMPv3
  • SNMP privacy

Warning

The username and passwords configured in DSM must match the credentials stored in Vault (snmp/synology).

Authentication

The exporter uses the SNMPv3 authPriv security level, which requires both authentication and encryption:

| Setting | Value |
| --- | --- |
| Security level | authPriv |
| Auth protocol | MD5 |
| Privacy protocol | DES |
| Credentials | Pulled from Vault via ExternalSecret |
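
In the exporter's configuration these settings map to an auth entry like the sketch below. The entry name snmpv3 matches the auth parameter Prometheus sends; the credential values are placeholders for what the ExternalSecret provides.

```yaml
# snmp.yml auth section (sketch; credentials are placeholders)
auths:
  snmpv3:
    version: 3
    security_level: authPriv
    username: <username from Vault>
    auth_protocol: MD5
    password: <auth password from Vault>
    priv_protocol: DES
    priv_password: <privacy password from Vault>
```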

OID walks

The synology module walks six OID trees:

| OID | Description |
| --- | --- |
| 1.3.6.1.2.1.2 | Network interfaces (IF-MIB) |
| 1.3.6.1.2.1.31.1.1 | Extended interface counters (64-bit HC counters) |
| 1.3.6.1.4.1.6574.1 | Synology system status (temp, power, fans, model, DSM version) |
| 1.3.6.1.4.1.6574.2 | Disk info (status, temperature, model, type) |
| 1.3.6.1.4.1.6574.3 | RAID status (name, status, free/total size) |
| 1.3.6.1.4.1.6574.6 | Service user counts per service |
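
If the module is produced with the snmp_exporter generator, the walk list would be declared roughly as follows. This sketch shows only the walk section and assumes the generator workflow is in use.

```yaml
# generator.yml (sketch; walk section only)
modules:
  synology:
    walk:
    - 1.3.6.1.2.1.2          # IF-MIB interfaces
    - 1.3.6.1.2.1.31.1.1     # ifXTable, 64-bit HC counters
    - 1.3.6.1.4.1.6574.1     # Synology system status
    - 1.3.6.1.4.1.6574.2     # Synology disks
    - 1.3.6.1.4.1.6574.3     # Synology RAID
    - 1.3.6.1.4.1.6574.6     # Synology services
```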

Collected metrics

From 1.3.6.1.4.1.6574.1.*:

| Metric | Description |
| --- | --- |
| systemStatus | Overall NAS health (1 = normal) |
| temperature | NAS chassis temperature (°C) |
| powerStatus | Power supply status |
| systemFanStatus | System fan status |
| cpuFanStatus | CPU fan status |
| upgradeAvailable | Whether a DSM update is available |
| modelName / version | NAS model and DSM version (info labels) |

From 1.3.6.1.4.1.6574.2.* — per disk, labeled by diskID:

| Metric | Description |
| --- | --- |
| diskStatus | 1=Normal, 2=Initialized, 3=NotInitialized, 4=SystemPartitionFailed, 5=Crashed |
| diskTemperature | Disk temperature (°C) |
| diskModel / diskType | Model name and type (SATA/SSD) |

From 1.3.6.1.4.1.6574.3.* — per volume, labeled by raidName:

| Metric | Description |
| --- | --- |
| raidStatus | RAID array health |
| raidFreeSize | Free space (bytes) |
| raidTotalSize | Total capacity (bytes) |

From 1.3.6.1.2.1.2.* + 1.3.6.1.2.1.31.* — per interface, labeled by ifName:

Includes ifInOctets, ifOutOctets, ifInErrors, ifOutErrors, ifOperStatus, ifHighSpeed, and 64-bit HC variants (ifHCInOctets, ifHCOutOctets).

From 1.3.6.1.4.1.6574.6.* — per service, labeled by serviceName:

| Metric | Description |
| --- | --- |
| serviceUsers | Number of active users per service (HTTP, FTP, CIFS, etc.) |
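
As an illustration of how these metrics can be queried, the sketch below expresses two PromQL expressions as a Prometheus rule group. The group, alert, and recording rule names are hypothetical, and the same expressions could equally back Grafana alert rules.

```yaml
groups:
- name: synology-example               # hypothetical group name
  rules:
  # Fires when any disk reports a status other than 1 (Normal)
  - alert: SynologyDiskNotNormal
    expr: diskStatus != 1
    for: 5m
  # Fraction of each RAID volume in use, per raidName
  - record: synology:raid_used:ratio
    expr: 1 - (raidFreeSize / raidTotalSize)
```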