Metrics

kube-prometheus-stack

The Prometheus stack is deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, kube-state-metrics, and node-exporter into a single release.

Configuration is split between the HelmRelease values and ConfigMaps generated by Kustomize.

ConfigMap pattern

Rather than inlining large YAML blobs into the HelmRelease, configuration files (Grafana datasources, alert rules, contact points, notification policies, kube-state-metrics custom resources) are kept as standalone files and assembled into ConfigMaps by Kustomize's configMapGenerator:

# kustomization.yaml
configMapGenerator:
- name: grafana-datasources
  files:
  - datasources.yaml
- name: grafana-alert-rules
  files:
  - rules.yaml
- name: flux-kube-state-metrics-config
  files:
  - kube-state-metrics-config.yaml
  options:
    disableNameSuffixHash: true

disableNameSuffixHash

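By default, Kustomize appends a content hash to generated ConfigMap names (e.g., grafana-datasources-5f8b9c), so the name changes whenever the file contents change. Setting disableNameSuffixHash: true keeps the name stable, so HelmRelease valuesFrom references keep resolving after a config change. In the kustomization.yaml above, the option is set only on flux-kube-state-metrics-config.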
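
As a rough sketch of why the stable name matters, a HelmRelease could reference the generated ConfigMap as shown here. The release name and apiVersion are assumptions for illustration (older Flux versions use a beta apiVersion), and the excerpt omits required fields such as the chart and interval:

```yaml
# Hypothetical HelmRelease excerpt; metadata.name is an assumption.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  valuesFrom:
  # The name must match the generated ConfigMap exactly, which is why
  # the hash suffix is disabled for this generator entry.
  - kind: ConfigMap
    name: flux-kube-state-metrics-config
    # Key inside the ConfigMap; configMapGenerator uses the file name as the key.
    valuesKey: kube-state-metrics-config.yaml
```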

The same configMapGenerator pattern is used for Grafana dashboards: each dashboard JSON file gets its own ConfigMap with the label grafana_dashboard: "1", which Grafana's sidecar discovers and loads automatically.
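
A dashboard generator entry might look like the following minimal sketch. The ConfigMap name and the JSON file name are hypothetical; only the label comes from the description above:

```yaml
# kustomization.yaml (sketch; dashboard and file names are hypothetical)
configMapGenerator:
- name: grafana-dashboard-flux-cluster
  files:
  - flux-cluster.json
  options:
    # Label the Grafana dashboard sidecar watches for.
    labels:
      grafana_dashboard: "1"
```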

Prometheus scraping

Prometheus uses three complementary scraping patterns:

ServiceMonitor / PodMonitor
Annotation-based
Static (NAS targets)

ServiceMonitor and PodMonitor resources are the recommended pattern for in-cluster workloads. Apps that expose a /metrics endpoint define a ServiceMonitor or PodMonitor, and Prometheus picks it up automatically.

Both selectors are set to watch all monitors cluster-wide regardless of namespace or labels:

prometheusSpec:
  serviceMonitorSelectorNilUsesHelmValues: false
  podMonitorSelectorNilUsesHelmValues: false
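
For illustration, a ServiceMonitor for an app could look roughly like this. The app name, namespace, label, and port name are all hypothetical; the field names follow the Prometheus Operator's monitoring.coreos.com/v1 API:

```yaml
# Hypothetical ServiceMonitor; names, labels, and the port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example
spec:
  # Selects the Service(s) to scrape by label.
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
  - port: http-metrics   # name of the Service port, not its number
    path: /metrics
    interval: 30s
```

Because both selectors are left unrestricted, no extra release label should be needed on the monitor for Prometheus to select it.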

For workloads that don't have a ServiceMonitor, Prometheus can scrape based on pod/service annotations:

prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"

Additional relabel configs attach namespace, pod, and cluster=noah labels to every metric.
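
The annotation handling and extra labels can be sketched with the commonly used pod-discovery relabeling below. This is an assumed shape, not the actual config from this setup; the job name is hypothetical, and it presumes the entry is supplied as an additional scrape config:

```yaml
# Sketch of an annotation-driven scrape job (job name is an assumption).
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only pods annotated with prometheus.io/scrape: "true".
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Use prometheus.io/path as the metrics path when it is set.
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    regex: (.+)
    target_label: __metrics_path__
  # Rewrite the scrape address to use the port from prometheus.io/port.
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: '([^:]+)(?::\d+)?;(\d+)'
    replacement: '$1:$2'
    target_label: __address__
  # Attach namespace, pod, and a static cluster label.
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - target_label: cluster
    replacement: noah
```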

NAS services are not in the cluster and can't use ServiceMonitors. They are defined as static scrape configs:

Job	Port	Interval	Notes
synology-node	`9100`	5s	Node exporter
synology-cadvisor	`8380`	5s	Container metrics
synology-traefik	`8080`	30s	Reverse proxy metrics
synology-watchtower	`8480`	10m	Docker update metrics, bearer token auth
synology-snmp	`9116`	5s	Via SNMP exporter (SNMPv3)
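
Two of these jobs might be written roughly as follows. The NAS hostname, the token file path, and the Watchtower metrics path are assumptions for illustration; job names, ports, and intervals come from the table:

```yaml
# Sketch of static scrape configs for NAS targets.
- job_name: synology-node
  scrape_interval: 5s
  static_configs:
  - targets:
    - nas.example.internal:9100   # hypothetical NAS hostname

- job_name: synology-watchtower
  scrape_interval: 10m
  metrics_path: /v1/metrics       # assumed Watchtower metrics endpoint
  authorization:
    type: Bearer
    # Hypothetical path to a mounted secret containing the API token.
    credentials_file: /etc/prometheus/secrets/watchtower/token
  static_configs:
  - targets:
    - nas.example.internal:8480
```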

kube-state-metrics Flux extension

kube-state-metrics is extended with a custom resource state config to expose Flux CD objects as Prometheus metrics. This is loaded via the flux-kube-state-metrics-config ConfigMap and generates gotk_resource_info metrics for every Flux resource type (Kustomization, HelmRelease, GitRepository, ImagePolicy, etc.), enabling the Flux dashboards in Grafana.
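
A trimmed sketch of what such a config could contain, covering a single Flux kind. It assumes the file holds Helm values for the kube-state-metrics subchart; the label selection is illustrative and the real file would repeat the resources entry for each Flux kind:

```yaml
# kube-state-metrics-config.yaml (trimmed sketch, one Flux kind only)
kube-state-metrics:
  rbac:
    extraRules:
    # Allow kube-state-metrics to list and watch the custom resource.
    - apiGroups: ["kustomize.toolkit.fluxcd.io"]
      resources: ["kustomizations"]
      verbs: ["list", "watch"]
  customResourceState:
    enabled: true
    config:
      spec:
        resources:
        - groupVersionKind:
            group: kustomize.toolkit.fluxcd.io
            version: v1
            kind: Kustomization
          # Prefix + metric name yields gotk_resource_info.
          metricNamePrefix: gotk
          metrics:
          - name: resource_info
            help: The current state of a Flux Kustomization resource.
            each:
              type: Info
              info:
                labelsFromPath:
                  name: [metadata, name]
            labelsFromPath:
              exported_namespace: [metadata, namespace]
              ready: [status, conditions, "[type=Ready]", status]
              suspended: [spec, suspend]
```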

Thanos

Deployment modes

Thanos supports several deployment architectures. The main trade-off is between operational complexity and query capability:

Mode	Description	Trade-offs
Sidecar	Runs alongside Prometheus; uploads blocks to object storage	Simple; Prometheus remains the write path
Receiver	Accepts Prometheus remote write; Thanos owns the write path	More complex; designed for multi-tenant ingestion
Ruler	Standalone rule evaluation against Thanos Query	Decouples alerting from Prometheus

Why sidecar?

The sidecar mode is the simplest option. Prometheus remains the authoritative data source for recent data; the sidecar uploads each completed 2-hour TSDB block to MinIO and exposes Prometheus's recent data to Thanos Query over gRPC. The Receiver model is designed for multi-cluster or multi-tenant setups and adds operational overhead that isn't warranted here.

How it fits together

graph TB
  subgraph Prometheus Pod
    P[Prometheus\nTSDB]
    S[Thanos Sidecar]
    P -- local read --> S
    P -- TSDB blocks every 2h --> S
  end

  M[(MinIO\nS3 bucket)]
  S -- upload blocks --> M

  SG[StoreGateway\nserves historical blocks]
  M -- reads --> SG

  C[Compactor\ndownsamples + deduplicates]
  M -- reads/writes --> C

  QF[QueryFrontend\ncaching + splitting]
  Q[Query\nfan-out]

  QF --> Q
  Q -- recent data\ngrpc --> S
  Q -- historical data\ngrpc --> SG

  Grafana[Grafana] --> QF


Data flow:

Prometheus scrapes and stores metrics locally in TSDB
Every 2 hours, Prometheus flushes a block to disk; the sidecar uploads it to MinIO
Prometheus retains 3 days locally as a buffer
StoreGateway indexes and serves blocks from MinIO for anything older than what Prometheus holds
Compactor runs in the background to downsample and deduplicate blocks
Query fans out requests to both the sidecar (recent) and StoreGateway (historical), deduplicating results
QueryFrontend sits in front of Query to cache and split long-range requests; this is the datasource Grafana uses
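
The two settings that drive this flow can be sketched as below: the local retention on Prometheus and the object storage config the sidecar, StoreGateway, and Compactor share. The bucket name, MinIO endpoint, placeholder credentials, and the insecure flag are assumptions; the objstore keys follow Thanos's S3 client format:

```yaml
# HelmRelease values excerpt (sketch)
prometheus:
  prometheusSpec:
    retention: 3d                     # local buffer described above
---
# objstore.yml (sketch): Thanos object storage config, normally kept in a Secret
type: S3
config:
  bucket: thanos                      # hypothetical bucket name
  endpoint: minio.example.svc:9000    # hypothetical MinIO address
  access_key: REPLACE_ME
  secret_key: REPLACE_ME
  insecure: true                      # assumes plain HTTP inside the cluster
```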

Retention tiers

Resolution	Retention	Use case
Raw (full fidelity)	10 days	Short-term debugging
5m downsampled	90 days	Weekly/monthly trends
1h downsampled	10 years	Long-term capacity planning
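
These tiers map onto the Thanos Compactor's per-resolution retention flags. A minimal sketch of the container arguments, assuming they are passed directly (a Helm chart may expose them under its own value names, and other required flags are omitted):

```yaml
# Compactor container args (sketch); values taken from the table above.
args:
- compact
- --retention.resolution-raw=10d   # raw samples
- --retention.resolution-5m=90d    # 5m downsampled
- --retention.resolution-1h=10y    # 1h downsampled
```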

SNMP Exporter

The SNMP exporter runs in-cluster and acts as a proxy: Prometheus scrapes it at :9116, and the exporter performs the actual SNMP walk against the NAS on each request. This avoids running any agent on the NAS itself.
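
The proxy pattern is usually expressed with relabeling: the NAS is listed as the target, then rewritten into a query parameter while the scrape address is pointed at the exporter. In this sketch the NAS hostname and the exporter Service address are hypothetical; the module and auth names assume the synology module and snmpv3 auth described in this section:

```yaml
# Sketch of the synology-snmp scrape job.
- job_name: synology-snmp
  scrape_interval: 5s
  metrics_path: /snmp
  params:
    module: [synology]
    auth: [snmpv3]
  static_configs:
  - targets:
    - nas.example.internal          # hypothetical NAS hostname
  relabel_configs:
  # Pass the NAS address to the exporter as ?target=...
  - source_labels: [__address__]
    target_label: __param_target
  # Keep the NAS address as the instance label.
  - source_labels: [__param_target]
    target_label: instance
  # Scrape the exporter itself (hypothetical Service address).
  - target_label: __address__
    replacement: snmp-exporter.monitoring.svc:9116
```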

How it works

sequenceDiagram
  participant P as Prometheus
  participant E as SNMP Exporter
  participant N as Synology NAS

  P->>E: GET /snmp?target=NAS&module=synology&auth=snmpv3
  E->>N: SNMPv3 walk (authPriv, MD5/DES)
  N-->>E: OID values
  E-->>P: Prometheus metrics


DSM setup

SNMPv3 must be enabled on the NAS before the exporter can connect. In DSM, go to Control Panel → Terminal & SNMP and enable:

SNMP service
SNMPv3
SNMP privacy

Warning

The username and passwords configured in DSM must match the credentials stored in Vault (snmp/synology).
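
One way the Vault credentials could be synced into the cluster is sketched below. The resource name, store name, store kind, refresh interval, and apiVersion are assumptions; only the Vault path comes from this page:

```yaml
# Hypothetical ExternalSecret; names and the secret store are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: snmp-exporter-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    # Name of the Kubernetes Secret to create.
    name: snmp-exporter-credentials
  dataFrom:
  # Copy every key stored at this Vault path into the Secret.
  - extract:
      key: snmp/synology
```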

Authentication

The exporter uses the SNMPv3 authPriv security level, which requires both authentication and encryption:

Setting	Value
Security level	authPriv
Auth protocol	MD5
Privacy protocol	DES
Credentials	Pulled from Vault via ExternalSecret
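
In the exporter's snmp.yml, these settings would correspond to an auths entry along the following lines. The auth name snmpv3 is assumed from the request shown in the diagram, and the placeholder values stand in for the credentials pulled from Vault:

```yaml
# snmp.yml excerpt (sketch); placeholder values are not real credentials.
auths:
  snmpv3:
    version: 3
    security_level: authPriv
    username: REPLACE_ME
    auth_protocol: MD5
    password: REPLACE_ME        # authentication passphrase
    priv_protocol: DES
    priv_password: REPLACE_ME   # privacy (encryption) passphrase
```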

OID walks

The synology module walks six OID trees:

OID	Description
`1.3.6.1.2.1.2`	Network interfaces (IF-MIB)
`1.3.6.1.2.1.31.1.1`	Extended interface counters (64-bit HC counters)
`1.3.6.1.4.1.6574.1`	Synology system status (temp, power, fans, model, DSM version)
`1.3.6.1.4.1.6574.2`	Disk info (status, temperature, model, type)
`1.3.6.1.4.1.6574.3`	RAID status (name, status, free/total size)
`1.3.6.1.4.1.6574.6`	Service user counts per service
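
If the module is built with the snmp_exporter generator, the walk list might be declared as in this sketch. It assumes a generator.yml is the source of the module; lookups and overrides are left out:

```yaml
# generator.yml excerpt (sketch): OIDs taken from the table above.
modules:
  synology:
    walk:
    - 1.3.6.1.2.1.2          # IF-MIB interfaces
    - 1.3.6.1.2.1.31.1.1     # 64-bit HC interface counters
    - 1.3.6.1.4.1.6574.1     # Synology system status
    - 1.3.6.1.4.1.6574.2     # Disk info
    - 1.3.6.1.4.1.6574.3     # RAID status
    - 1.3.6.1.4.1.6574.6     # Service user counts
```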

Collected metrics

Metrics are grouped into five categories: System, Disks, RAID, Network, and Services.

From 1.3.6.1.4.1.6574.1.*:

Metric	Description
`systemStatus`	Overall NAS health (1 = normal)
`temperature`	NAS chassis temperature (°C)
`powerStatus`	Power supply status
`systemFanStatus`	System fan status
`cpuFanStatus`	CPU fan status
`upgradeAvailable`	Whether a DSM update is available
`modelName` / `version`	NAS model and DSM version (info labels)

From 1.3.6.1.4.1.6574.2.* — per disk, labeled by diskID:

Metric	Description
`diskStatus`	1=Normal, 2=Initialized, 3=NotInitialized, 4=SystemPartitionFailed, 5=Crashed
`diskTemperature`	Disk temperature (°C)
`diskModel` / `diskType`	Model name and type (SATA/SSD)

From 1.3.6.1.4.1.6574.3.* — per volume, labeled by raidName:

Metric	Description
`raidStatus`	RAID array health
`raidFreeSize`	Free space (bytes)
`raidTotalSize`	Total capacity (bytes)

From 1.3.6.1.2.1.2.* + 1.3.6.1.2.1.31.* — per interface, labeled by ifName:

Includes ifInOctets, ifOutOctets, ifInErrors, ifOutErrors, ifOperStatus, ifHighSpeed, and 64-bit HC variants (ifHCInOctets, ifHCOutOctets).

From 1.3.6.1.4.1.6574.6.* — per service, labeled by serviceName:

Metric	Description
`serviceUsers`	Number of active users per service (HTTP, FTP, CIFS, etc.)