Grafana
Grafana runs as part of the kube-prometheus-stack HelmRelease in the mon namespace, backed by a PostgreSQL database and authenticated via Authentik OAuth2.
Configuration
Grafana is configured through three complementary mechanisms:
| Mechanism | Used for |
|---|---|
| grafana.ini in HelmRelease values | Server settings, auth, database, feature flags |
| Environment variables (from Secrets) | Sensitive values injected at runtime |
| Mounted ConfigMaps / Secrets | Datasources, dashboards, alert rules, OAuth credentials |
Environment variables
The HelmRelease uses envFromSecret to load the entire grafana-secrets Secret as environment variables. Grafana natively supports GF_* env vars that override any grafana.ini setting:
| Variable | Source | Purpose |
|---|---|---|
| GF_DATABASE_USER | Vault → psql/grafana | PostgreSQL username |
| GF_DATABASE_PASSWORD | Vault → psql/grafana | PostgreSQL password |
| GF_SECURITY_ADMIN_USER | Vault → grafana/admin | Local admin username |
| GF_SECURITY_ADMIN_PASSWORD | Vault → grafana/admin | Local admin password |
| TELEGRAM_BOT_TOKEN | Vault → grafana/telegram-bot | Injected separately via envValueFrom for alerting |
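As a rough sketch of how these variables might be wired into the HelmRelease values: envFromSecret and envValueFrom are the mechanisms named above, but the nesting under grafana:, the name of the Secret holding the Telegram token, and the key name token are assumptions rather than values taken from the actual manifest.

```yaml
# Sketch only: nesting assumes the Grafana subchart of kube-prometheus-stack.
grafana:
  # Every key in grafana-secrets becomes an environment variable,
  # e.g. GF_DATABASE_USER overrides the user key of [database] in grafana.ini.
  envFromSecret: grafana-secrets
  # Individual values pulled from a different Secret.
  envValueFrom:
    TELEGRAM_BOT_TOKEN:
      secretKeyRef:
        name: grafana-telegram-bot  # assumed to match the ExternalSecret name
        key: token                  # hypothetical key name
```

Grafana maps variables of the form GF_<SECTION>_<KEY> onto the matching grafana.ini section and key, which is why the Secret keys are named this way.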
Secret management
All secrets are managed by the External Secrets Operator, which pulls them from Vault. Three ExternalSecrets are defined: grafana-secrets (DB + admin creds), grafana-oauth (OIDC client credentials), and grafana-telegram-bot (Telegram bot token).
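For illustration, an ExternalSecret for the database credentials could look roughly like the following. The Vault path psql/grafana and the resulting variable names come from the table above; the API version, store name, refresh interval, and Vault property names are assumptions.

```yaml
# Sketch only: store name, refresh interval, and property names are assumptions.
apiVersion: external-secrets.io/v1beta1  # newer operator releases also serve v1
kind: ExternalSecret
metadata:
  name: grafana-secrets
  namespace: mon
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault  # hypothetical store name
  target:
    name: grafana-secrets  # Secret consumed by envFromSecret
  data:
    - secretKey: GF_DATABASE_USER
      remoteRef:
        key: psql/grafana
        property: username  # hypothetical Vault field
    - secretKey: GF_DATABASE_PASSWORD
      remoteRef:
        key: psql/grafana
        property: password  # hypothetical Vault field
```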
PostgreSQL backend
Grafana uses PostgreSQL at pg.hdhomelab.com for persistent state (dashboards created in UI, users, preferences, alert history).
The grafana database is provisioned by OpenTofu (tf-deploy/psql), not by the pod itself. This keeps database lifecycle management outside of Kubernetes and means Grafana does not need elevated DB permissions at startup.
The database connection is configured in grafana.ini:
Credentials come from the GF_DATABASE_USER / GF_DATABASE_PASSWORD env vars, which Grafana picks up automatically.
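A minimal sketch of what such a database section could look like when expressed under grafana.ini in the HelmRelease values. The host and database name come from the text above; the port and the nesting under grafana: are assumptions, and any TLS settings are left out because they are not described here.

```yaml
# Sketch only: port and value nesting are assumptions.
grafana:
  grafana.ini:
    database:
      type: postgres
      host: pg.hdhomelab.com:5432  # port assumed (PostgreSQL default)
      name: grafana
      # user and password are intentionally absent; they arrive via
      # GF_DATABASE_USER and GF_DATABASE_PASSWORD.
```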
OAuth2 (Authentik)
Grafana authenticates users through Authentik using the generic OAuth2 provider. Because oauth_auto_login is false, the login screen still shows the built-in form, with an authentik button available for SSO.
```ini
[auth]
signout_redirect_url = https://auth.hdhomelab.com/application/o/grafana/end-session/
oauth_auto_login = false

[auth.generic_oauth]
enabled = true
name = authentik
client_id = $__file{/etc/secrets/auth_generic_oauth/client_id}
client_secret = $__file{/etc/secrets/auth_generic_oauth/client_secret}
scopes = openid profile email
auth_url = https://auth.hdhomelab.com/application/o/authorize/
token_url = https://auth.hdhomelab.com/application/o/token/
api_url = https://auth.hdhomelab.com/application/o/userinfo/
role_attribute_path = contains(groups[*], 'grafana_admin') && 'GrafanaAdmin' || ...
```
The OAuth credentials are mounted as files from the grafana-oauth Secret (not env vars) to avoid them appearing in the process environment:
```yaml
extraSecretMounts:
  - name: auth-generic-oauth
    secretName: grafana-oauth
    mountPath: /etc/secrets/auth_generic_oauth
    readOnly: true
```
Role mapping
Grafana roles are derived from Authentik group membership via a JMESPath expression evaluated against the userinfo response:
| Authentik Group | Grafana Role |
|---|---|
| grafana_admin | GrafanaAdmin |
| grafana_editor | Editor |
| grafana_viewer | Viewer |
| (no matching group) | None (login denied) |
Sidecar
The Grafana sidecar (k8s-sidecar) watches for ConfigMaps in the mon namespace and hot-reloads Grafana without a restart. Four sidecar channels are active:
| Channel | Label / Annotation | Mount path |
|---|---|---|
| Dashboards | grafana_dashboard: "1" | /tmp/dashboards/ |
| Datasources | (mounted directly) | /etc/grafana/provisioning/datasources/ |
| Alerts | grafana_alert: "1" | /etc/grafana/provisioning/alerting/ |
| Notifiers | grafana_notifier: "1" | /etc/grafana/provisioning/notifiers/ |
Folder placement
The grafana_dashboard_folder annotation on a ConfigMap controls which Grafana folder the dashboard appears in. If the annotation is absent, the dashboard lands in the default folder (kubernetes).
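The dashboard channel and folder behaviour described above could be configured along these lines. The label, annotation name, and default folder come from this page; the exact keys, in particular defaultFolderName and foldersFromFilesStructure, are assumptions about how the Grafana chart values are set here.

```yaml
# Sketch only: key choices are assumptions about the chart values in use.
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      labelValue: "1"
      # Annotation the sidecar reads to pick a target folder.
      folderAnnotation: grafana_dashboard_folder
      # Assumed source of the fallback folder for unannotated ConfigMaps.
      defaultFolderName: kubernetes
      provider:
        # Lets the on-disk folder layout map to Grafana folders.
        foldersFromFilesStructure: true
```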
ConfigMaps
All provisioning files are managed as ConfigMaps generated by Kustomize in flux/monitoring/noah/kube-prometheus-stack/:
```yaml
configMapGenerator:
  - name: grafana-datasources                  # datasources.yaml
  - name: grafana-alert-contact-points         # contactpoints.yaml
  - name: grafana-alert-notification-policies  # policies.yaml
  - name: grafana-alert-rules                  # rules.yaml
```
All use disableNameSuffixHash: true so that ConfigMap names stay stable across reconciliations, because the HelmRelease's extraConfigmapMounts reference them by name.
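To show why stable names matter, here is a hedged sketch of how two of these ConfigMaps might be referenced from extraConfigmapMounts. The ConfigMap names, file names, and provisioning directories come from this page; the mount entry names, subPath usage, and nesting under grafana: are assumptions.

```yaml
# Sketch only: entry names and subPath usage are assumptions.
grafana:
  extraConfigmapMounts:
    - name: datasources               # hypothetical volume name
      configMap: grafana-datasources  # must match the generated name exactly
      mountPath: /etc/grafana/provisioning/datasources/datasources.yaml
      subPath: datasources.yaml
      readOnly: true
    - name: alert-rules               # hypothetical volume name
      configMap: grafana-alert-rules
      mountPath: /etc/grafana/provisioning/alerting/rules.yaml
      subPath: rules.yaml
      readOnly: true
```

With a hash suffix enabled, each content change would rename the ConfigMap and these references would point at a name that no longer exists.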
Dashboard ConfigMaps live in a separate directory (flux/monitoring/noah/grafana-dashboards/) and are picked up by the sidecar automatically:
```yaml
configMapGenerator:
  - name: grafana-dashboard-traefik
    files: [traefik.json]
    options:
      disableNameSuffixHash: true
      labels:
        grafana_dashboard: "1"
      annotations:
        grafana_dashboard_folder: network
```
Datasources
Datasources are provisioned from datasources.yaml mounted at /etc/grafana/provisioning/datasources/:
| Name | UID | Target | Notes |
|---|---|---|---|
| Prometheus | prometheus-001 | kube-prometheus-stack-prometheus:9090 | Short-term, high-fidelity; default datasource |
| Thanos | thanos-001 | thanos-query-frontend:9090 | Long-range queries; max_source_resolution=auto |
| Loki | loki-001 | loki-gateway:80 | Log queries |
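A datasources.yaml matching the table might look roughly like this. Names, UIDs, targets, and the max_source_resolution parameter are taken from the table; the http scheme, the access mode, and the use of customQueryParameters to pass the Thanos parameter are assumptions.

```yaml
# Sketch only: scheme, access mode, and jsonData usage are assumptions.
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus-001
    type: prometheus
    access: proxy
    url: http://kube-prometheus-stack-prometheus:9090
    isDefault: true
  - name: Thanos
    uid: thanos-001
    type: prometheus  # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query-frontend:9090
    jsonData:
      customQueryParameters: max_source_resolution=auto
  - name: Loki
    uid: loki-001
    type: loki
    access: proxy
    url: http://loki-gateway:80
```

Fixed UIDs let dashboards and alert rules refer to a datasource by a stable identifier instead of its display name.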
Which datasource to use?
Use Prometheus for recent data (< 3 days) where full resolution matters. Use Thanos for anything older; it automatically merges data from the Prometheus sidecar with the historical blocks stored in MinIO.
Dashboards
Dashboards are stored as JSON files in flux/monitoring/noah/grafana-dashboards/, each wrapped in its own ConfigMap. The sidecar picks them up and reloads Grafana on changes.
| Dashboard | Folder | Description |
|---|---|---|
| traefik | network | Edge proxy request rates, latencies, response codes |
| pihole | network | DNS query rates and ad-block stats |
| synology | (General) | NAS storage, RAID health, disk temps, system metrics |
| flux-cluster | flux | Flux reconciliation status across all kustomizations |
| flux-control-plane | flux | Flux controller CPU, memory, and queue depth |
| local-path-storage | kubernetes | Local-path provisioner PVC usage |
| minecraft | minecraft | JVM memory, GC, player counts |
| cs2 | games | CS2 server metrics |
| cs2-demo-manager | games | Demo management service metrics |
| pinchflat | pinchflat | YouTube download queue and throughput |
Adding a new dashboard
Export the dashboard JSON from Grafana UI, save it to flux/monitoring/noah/grafana-dashboards/<name>.json, and add a configMapGenerator entry in that directory's kustomization.yaml with the grafana_dashboard: "1" label and desired grafana_dashboard_folder annotation.
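As an illustration of the steps above, a new entry in that directory's kustomization.yaml could look like this. The dashboard name example and its file example.json are hypothetical placeholders; the label and annotation keys are the ones used by the existing dashboards.

```yaml
# Sketch only: "example" is a placeholder dashboard name.
configMapGenerator:
  - name: grafana-dashboard-example
    files: [example.json]
    options:
      disableNameSuffixHash: true
      labels:
        grafana_dashboard: "1"              # makes the sidecar pick it up
      annotations:
        grafana_dashboard_folder: network   # target folder in Grafana
```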
Alerting
Alerting is handled by Grafana Unified Alerting; the Prometheus Alertmanager is disabled. All alert rules, contact points, and routing policies are provisioned from ConfigMaps.
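In HelmRelease terms, this split might be expressed as follows. This is a sketch under the assumption that the standard kube-prometheus-stack toggle and the unified_alerting section of grafana.ini are used; the actual values are not shown on this page.

```yaml
# Sketch only: assumes the standard chart toggle and grafana.ini section.
alertmanager:
  enabled: false  # no Alertmanager deployed by the stack
grafana:
  grafana.ini:
    unified_alerting:
      enabled: true
```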
Contact points
The Telegram contact point reads the bot token from the TELEGRAM_BOT_TOKEN environment variable (injected from Vault) and uses a custom Go template to format messages:
```yaml
# contactpoints.yaml
contactPoints:
  - name: telegram-minecraft
    receivers:
      - uid: telegram-001
        type: telegram
        settings:
          bottoken: $TELEGRAM_BOT_TOKEN
          chatid: "<chat-id>"
          message: '{{ template "telegram.summary" . }}'
templates:
  - name: "Custom Templates"
    template: |
      {{ define "telegram.summary" }}
      {{ if gt (len .Alerts.Firing) 0 }}{{ template "__summary_only_message" .Alerts.Firing }}{{ end }}
      {{ if gt (len .Alerts.Resolved) 0 }}{{ template "__summary_only_message" .Alerts.Resolved }}{{ end }}
      {{ end }}
      {{ define "__summary_only_message" }}{{ range .}}{{ .Annotations.summary }}{{ end }}{{ end }}
```
The template renders only the summary annotation, keeping messages concise.
Alert rules
Two alert groups are defined in rules.yaml, both in the minecraft folder:
Fires when Minecraft pod memory working set exceeds 98% of its memory request. Evaluated every minute over a 10-minute window.
```promql
sum(container_memory_working_set_bytes{namespace="minecraft", ...}) by (pod)
  / sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_requests{namespace="minecraft"}) by (pod)
  > 0.98
```
Annotation: {{ $labels.pod }} memory usage reaches {{ humanizePercentage $values.A.Value }}.
Fires when the player count changes, which catches both joins and leaves. It compares the current player count against the count 1 minute ago:
```promql
abs(
  minecraft_status_players_online_count{namespace="minecraft"}
  - minecraft_status_players_online_count{namespace="minecraft"} offset 1m
) > 0
```
Annotation: {{ $values.A.Value }} player(s) now online on {{ reReplaceAll "-mc-monitor" "" $labels.service }} (was {{ $values.B.Value }})
Notification policy
```yaml
# policies.yaml
policies:
  - receiver: grafana-default-email  # catch-all
    group_by: [grafana_folder, alertname]
    routes:
      - receiver: telegram-minecraft
        object_matchers:
          - [alertname, =~, "Minecraft.*"]
        group_wait: 5s
        group_interval: 5s
        repeat_interval: 4w  # silence for 4 weeks after first notify
```
4-week repeat interval
The long repeat interval prevents Telegram spam for persistent states (e.g., server stays at high memory). Alerts still fire immediately on state changes.