Grafana

Grafana runs as part of the kube-prometheus-stack HelmRelease in the mon namespace, backed by a PostgreSQL database and authenticated via Authentik OAuth2.

Configuration

Grafana is configured through three complementary mechanisms:

Mechanism	Used for
`grafana.ini` in HelmRelease values	Server settings, auth, database, feature flags
Environment variables (from Secrets)	Sensitive values injected at runtime
Mounted ConfigMaps / Secrets	Datasources, dashboards, alert rules, OAuth credentials

Environment variables

The HelmRelease uses envFromSecret to load every key of the grafana-secrets Secret as an environment variable. Grafana natively reads variables of the form GF_<SECTION>_<KEY>, each of which overrides the corresponding grafana.ini setting:

Variable	Source	Purpose
`GF_DATABASE_USER`	Vault → `psql/grafana`	PostgreSQL username
`GF_DATABASE_PASSWORD`	Vault → `psql/grafana`	PostgreSQL password
`GF_SECURITY_ADMIN_USER`	Vault → `grafana/admin`	Local admin username
`GF_SECURITY_ADMIN_PASSWORD`	Vault → `grafana/admin`	Local admin password
`TELEGRAM_BOT_TOKEN`	Vault → `grafana/telegram-bot`	Injected separately via `envValueFrom` for alerting
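
As a rough illustration, the two injection paths in the table could be expressed in the HelmRelease values as follows. This is a sketch rather than the repository's actual file: envFromSecret and envValueFrom are options of the Grafana Helm chart (nested under grafana: in kube-prometheus-stack), while the Secret key name token is an assumption.

```yaml
# Sketch of HelmRelease values (spec.values); not copied from the repository.
grafana:
  # Every key in this Secret becomes an environment variable
  # (GF_DATABASE_USER, GF_DATABASE_PASSWORD, GF_SECURITY_ADMIN_USER, ...).
  envFromSecret: grafana-secrets
  # A single variable can be sourced from a different Secret.
  envValueFrom:
    TELEGRAM_BOT_TOKEN:
      secretKeyRef:
        name: grafana-telegram-bot
        key: token  # hypothetical key name; use whatever the ExternalSecret writes
```

Keeping the Telegram token in its own Secret means it can be rotated without touching the database or admin credentials.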

Secret management

All secrets are managed by the External Secrets Operator, which pulls them from Vault. Three ExternalSecrets are defined: grafana-secrets (DB + admin creds), grafana-oauth (OIDC client credentials), and grafana-telegram-bot (Telegram bot token).
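
For orientation, an ExternalSecret that produces part of grafana-secrets might look roughly like the sketch below. The store name vault-backend, the store kind, the refresh interval, and the Vault property names username and password are all assumptions; only the Secret name, the namespace, and the Vault path psql/grafana come from this page. Newer operator releases serve the same resource under external-secrets.io/v1.

```yaml
# Sketch only; store name, refresh interval and property names are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-secrets
  namespace: mon
spec:
  refreshInterval: 1h            # assumed value
  secretStoreRef:
    kind: ClusterSecretStore     # assumed; could also be a namespaced SecretStore
    name: vault-backend          # hypothetical store name
  target:
    name: grafana-secrets        # Secret consumed by envFromSecret
  data:
    - secretKey: GF_DATABASE_USER
      remoteRef:
        key: psql/grafana
        property: username       # assumed property name in Vault
    - secretKey: GF_DATABASE_PASSWORD
      remoteRef:
        key: psql/grafana
        property: password       # assumed property name in Vault
```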

PostgreSQL backend

Grafana uses PostgreSQL at pg.hdhomelab.com for persistent state (dashboards created in the UI, users, preferences, alert history).

The grafana database is provisioned by OpenTofu (tf-deploy/psql) — not by the pod itself. This keeps database lifecycle management outside of Kubernetes and avoids Grafana needing elevated DB permissions at startup.

The database connection is configured in grafana.ini:

[database]
type = postgres
host = pg.hdhomelab.com
name = grafana

Credentials come from the GF_DATABASE_USER / GF_DATABASE_PASSWORD env vars, which Grafana picks up automatically.
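
Because grafana.ini lives in the HelmRelease values, the INI section above is normally written as nested YAML. The following sketch assumes the standard grafana.ini key of the Grafana Helm chart under kube-prometheus-stack's grafana: block; it restates the same three settings and nothing more.

```yaml
# Sketch of the same [database] section expressed as Helm values.
grafana:
  grafana.ini:
    database:
      type: postgres
      host: pg.hdhomelab.com
      name: grafana
      # user and password are intentionally absent here:
      # GF_DATABASE_USER / GF_DATABASE_PASSWORD supply them at runtime.
```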

OAuth2 (Authentik)

Users sign in to Grafana through Authentik using the generic OAuth2 provider. Because oauth_auto_login is false, the login screen still shows the built-in form, with an additional authentik button for SSO.

[auth]
signout_redirect_url = https://auth.hdhomelab.com/application/o/grafana/end-session/
oauth_auto_login = false

[auth.generic_oauth]
enabled = true
name = authentik
client_id = $__file{/etc/secrets/auth_generic_oauth/client_id}
client_secret = $__file{/etc/secrets/auth_generic_oauth/client_secret}
scopes = openid profile email
auth_url = https://auth.hdhomelab.com/application/o/authorize/
token_url = https://auth.hdhomelab.com/application/o/token/
api_url = https://auth.hdhomelab.com/application/o/userinfo/
role_attribute_path = contains(groups[*], 'grafana_admin') && 'GrafanaAdmin' || ...

The OAuth credentials are mounted as files from the grafana-oauth Secret (not env vars) to avoid them appearing in the process environment:

extraSecretMounts:
- name: auth-generic-oauth
  secretName: grafana-oauth
  mountPath: /etc/secrets/auth_generic_oauth
  readOnly: true

Role mapping

Grafana roles are derived from Authentik group membership via a JMESPath expression evaluated against the userinfo response:

Authentik Group	Grafana Role
`grafana_admin`	GrafanaAdmin
`grafana_editor`	Editor
`grafana_viewer`	Viewer
(no matching group)	None (login denied)
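
Two rows of this table depend on Grafana settings that the configuration excerpt above does not show. In Grafana, granting the GrafanaAdmin server role through OAuth requires allow_assign_grafana_admin, and rejecting a login when no role can be derived requires role_attribute_strict. Assuming both are set in the same HelmRelease values (an assumption, since the source does not list them), they would look roughly like this:

```yaml
# Sketch; these two keys are assumed to be present, the source does not show them.
grafana:
  grafana.ini:
    auth.generic_oauth:
      # Allow role_attribute_path to return 'GrafanaAdmin' (server admin).
      allow_assign_grafana_admin: true
      # Deny login when role_attribute_path yields no valid role.
      role_attribute_strict: true
```

Without the strict setting, Grafana falls back to its default auto-assigned role instead of denying the login.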

Sidecar

The Grafana sidecar (k8s-sidecar) watches ConfigMaps in the mon namespace and reloads Grafana's provisioning without a pod restart. Four provisioning channels are active; three are discovered by label, while datasources are mounted directly:

Channel	Label / Annotation	Mount path
Dashboards	`grafana_dashboard: "1"`	`/tmp/dashboards/`
Datasources	(mounted directly)	`/etc/grafana/provisioning/datasources/`
Alerts	`grafana_alert: "1"`	`/etc/grafana/provisioning/alerting/`
Notifiers	`grafana_notifier: "1"`	`/etc/grafana/provisioning/notifiers/`

Folder placement

The grafana_dashboard_folder annotation on a ConfigMap controls which Grafana folder the dashboard appears in. The sidecar is told to read this annotation through its folderAnnotation setting. If the annotation is absent, the dashboard lands in the default folder (kubernetes).
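
The dashboard channel and its folder handling could be configured along these lines. All keys shown exist in the Grafana Helm chart's sidecar.dashboards block, but the concrete values are inferred from this page rather than copied from the repository; in particular, using defaultFolderName to produce the kubernetes default is an assumption.

```yaml
# Sketch of sidecar-related HelmRelease values; values are inferred, not authoritative.
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard          # ConfigMaps carrying this label are collected
      labelValue: "1"
      folder: /tmp/dashboards           # where the sidecar writes the JSON files
      folderAnnotation: grafana_dashboard_folder
      defaultFolderName: kubernetes     # assumed source of the default folder
      provider:
        # Map sub-directories created from the annotation to Grafana folders.
        foldersFromFilesStructure: true
```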

ConfigMaps

All provisioning files are managed as ConfigMaps generated by Kustomize in flux/monitoring/noah/kube-prometheus-stack/:

configMapGenerator:
- name: grafana-datasources            # datasources.yaml
- name: grafana-alert-contact-points   # contactpoints.yaml
- name: grafana-alert-notification-policies  # policies.yaml
- name: grafana-alert-rules            # rules.yaml

All use disableNameSuffixHash: true so that ConfigMap names stay stable across reconciliations; the HelmRelease's extraConfigmapMounts entries reference them by name.
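
To make the name dependency concrete, here is a sketch of what two of those extraConfigmapMounts entries might look like. The field names (name, configMap, mountPath, subPath, readOnly) are those of the Grafana Helm chart; the exact mount paths and the use of subPath are assumptions based on the file names and provisioning directories mentioned on this page.

```yaml
# Sketch; mount paths and subPath usage are assumed.
grafana:
  extraConfigmapMounts:
    - name: grafana-datasources
      configMap: grafana-datasources   # must match the generated name exactly
      mountPath: /etc/grafana/provisioning/datasources/datasources.yaml
      subPath: datasources.yaml
      readOnly: true
    - name: grafana-alert-contact-points
      configMap: grafana-alert-contact-points
      mountPath: /etc/grafana/provisioning/alerting/contactpoints.yaml
      subPath: contactpoints.yaml
      readOnly: true
```

If Kustomize appended a content hash to the ConfigMap name, the configMap references here would stop resolving after every content change, which is why the suffix is disabled.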

Dashboard ConfigMaps live in a separate directory (flux/monitoring/noah/grafana-dashboards/) and are picked up by the sidecar automatically:

configMapGenerator:
- name: grafana-dashboard-traefik
  files: [traefik.json]
  options:
    disableNameSuffixHash: true
    labels:
      grafana_dashboard: "1"
    annotations:
      grafana_dashboard_folder: network

Datasources

Datasources are provisioned from datasources.yaml mounted at /etc/grafana/provisioning/datasources/:

Name	UID	Target	Notes
Prometheus	`prometheus-001`	`kube-prometheus-stack-prometheus:9090`	Short-term, high-fidelity; default datasource
Thanos	`thanos-001`	`thanos-query-frontend:9090`	Long-range queries; `max_source_resolution=auto`
Loki	`loki-001`	`loki-gateway:80`	Log queries
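
A datasources.yaml matching this table might look like the sketch below. It follows Grafana's datasource provisioning format; the http:// scheme, access: proxy, and the use of customQueryParameters to pass max_source_resolution are assumptions, while names, UIDs, and targets are taken from the table.

```yaml
# Sketch of datasources.yaml; scheme, access mode and jsonData are assumed.
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus-001
    type: prometheus
    access: proxy
    url: http://kube-prometheus-stack-prometheus:9090
    isDefault: true
  - name: Thanos
    uid: thanos-001
    type: prometheus               # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query-frontend:9090
    jsonData:
      customQueryParameters: max_source_resolution=auto
  - name: Loki
    uid: loki-001
    type: loki
    access: proxy
    url: http://loki-gateway:80
```

Fixed UIDs let dashboards and alert rules refer to a datasource by UID, so they keep working if the instance is rebuilt.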

Which datasource to use?

Use Prometheus for recent data (< 3 days) where full resolution matters. Use Thanos for anything older: its query layer merges live data from the Prometheus sidecar with historical blocks stored in MinIO.

Dashboards

Dashboards are stored as JSON files in flux/monitoring/noah/grafana-dashboards/, each wrapped in its own ConfigMap. The sidecar picks them up and reloads Grafana on changes.

Dashboard	Folder	Description
traefik	network	Edge proxy request rates, latencies, response codes
pihole	network	DNS query rates and ad-block stats
synology	(General)	NAS storage, RAID health, disk temps, system metrics
flux-cluster	flux	Flux reconciliation status across all kustomizations
flux-control-plane	flux	Flux controller CPU, memory, and queue depth
local-path-storage	kubernetes	Local-path provisioner PVC usage
minecraft	minecraft	JVM memory, GC, player counts
cs2	games	CS2 server metrics
cs2-demo-manager	games	Demo management service metrics
pinchflat	pinchflat	YouTube download queue and throughput

Adding a new dashboard

Export the dashboard JSON from the Grafana UI, save it to flux/monitoring/noah/grafana-dashboards/<name>.json, and add a configMapGenerator entry to that directory's kustomization.yaml with the grafana_dashboard: "1" label and the desired grafana_dashboard_folder annotation.

Alerting

Alerting is handled by Grafana Unified Alerting; the Prometheus Alertmanager is disabled. All alert rules, contact points, and routing policies are provisioned from ConfigMaps.

Contact points

The Telegram contact point reads the bot token from the TELEGRAM_BOT_TOKEN environment variable (injected from Vault) and uses a custom Go template to format messages:

# contactpoints.yaml
contactPoints:
- name: telegram-minecraft
  receivers:
  - uid: telegram-001
    type: telegram
    settings:
      bottoken: $TELEGRAM_BOT_TOKEN
      chatid: "<chat-id>"
      message: '{{ template "telegram.summary" . }}'

templates:
- name: "Custom Templates"
  template: |
    {{ define "telegram.summary" }}
    {{ if gt (len .Alerts.Firing) 0 }}{{ template "__summary_only_message" .Alerts.Firing }}{{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}{{ template "__summary_only_message" .Alerts.Resolved }}{{ end }}
    {{ end }}

    {{ define "__summary_only_message" }}{{ range .}}{{ .Annotations.summary }}{{ end }}{{ end }}

The template renders only the summary annotation, keeping messages concise.

Alert rules

Two alert groups are defined in rules.yaml, both in the minecraft folder:

Memory Usage

Fires when a Minecraft pod's memory working set exceeds 98% of its memory request. Evaluated every minute over a 10-minute window.

sum(container_memory_working_set_bytes{namespace="minecraft", ...}) by (pod)
/ sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_requests{namespace="minecraft"}) by (pod)
> 0.98

Annotation: {{ $labels.pod }} memory usage reaches {{ humanizePercentage $values.A.Value }}.

Players Online

Fires when the player count changes, so it detects both joins and leaves. Compares the current player count against the count 1 minute ago:

abs(
  minecraft_status_players_online_count{namespace="minecraft"}
  - minecraft_status_players_online_count{namespace="minecraft"} offset 1m
) > 0

Annotation: {{ $values.A.Value }} player(s) now online on {{ reReplaceAll "-mc-monitor" "" $labels.service }} (was {{ $values.B.Value }})

Notification policy

# policies.yaml
policies:
- receiver: grafana-default-email   # catch-all
  group_by: [grafana_folder, alertname]
  routes:
  - receiver: telegram-minecraft
    object_matchers:
    - [alertname, =~, "Minecraft.*"]
    group_wait: 5s
    group_interval: 5s
    repeat_interval: 4w            # re-send a still-firing alert at most every 4 weeks

4-week repeat interval

The long repeat interval prevents Telegram spam for persistent states (e.g., server stays at high memory). Alerts still fire immediately on state changes.