Grafana

Grafana runs as part of the kube-prometheus-stack HelmRelease in the mon namespace, backed by a PostgreSQL database and authenticated via Authentik OAuth2.

Configuration

Grafana is configured through three complementary mechanisms:

| Mechanism | Used for |
| --- | --- |
| grafana.ini in HelmRelease values | Server settings, auth, database, feature flags |
| Environment variables (from Secrets) | Sensitive values injected at runtime |
| Mounted ConfigMaps / Secrets | Datasources, dashboards, alert rules, OAuth credentials |

Environment variables

The HelmRelease uses envFromSecret to load every key of the grafana-secrets Secret as an environment variable. Grafana natively reads GF_* variables, and each one overrides the corresponding grafana.ini setting:

| Variable | Source | Purpose |
| --- | --- | --- |
| GF_DATABASE_USER | Vault → psql/grafana | PostgreSQL username |
| GF_DATABASE_PASSWORD | Vault → psql/grafana | PostgreSQL password |
| GF_SECURITY_ADMIN_USER | Vault → grafana/admin | Local admin username |
| GF_SECURITY_ADMIN_PASSWORD | Vault → grafana/admin | Local admin password |
| TELEGRAM_BOT_TOKEN | Vault → grafana/telegram-bot | Injected separately via envValueFrom for alerting |
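
The two injection paths described here might look roughly like this in the HelmRelease values. This is a minimal sketch, not the actual manifest: it assumes the Grafana subchart values sit under a grafana: key, that the Telegram Secret is named grafana-telegram-bot, and that the token is stored under a key called token (the key name is hypothetical).

```yaml
# HelmRelease values (sketch; Secret name and key for the Telegram token are assumptions)
grafana:
  # Every key in grafana-secrets becomes an environment variable,
  # e.g. GF_DATABASE_USER overrides the user setting in [database].
  envFromSecret: grafana-secrets

  # A single variable sourced from one key of a separate Secret.
  envValueFrom:
    TELEGRAM_BOT_TOKEN:
      secretKeyRef:
        name: grafana-telegram-bot
        key: token   # hypothetical key name
```

Keeping the Telegram token out of grafana-secrets means it can be rotated independently of the database and admin credentials.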

Secret management

All secrets are managed by the External Secrets Operator, which pulls them from Vault. Three ExternalSecrets are defined: grafana-secrets (DB + admin creds), grafana-oauth (OIDC client credentials), and grafana-telegram-bot (Telegram bot token).
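
For orientation, an ExternalSecret for grafana-secrets could be shaped like the sketch below. The store name, the API version, the refresh interval, and the Vault property names (username, password) are assumptions for illustration; only the Secret name, the namespace, the variable names, and the Vault paths come from this page.

```yaml
# ExternalSecret sketch; store name, apiVersion, and property names are assumptions
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-secrets
  namespace: mon
spec:
  refreshInterval: 1h          # assumed value
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault                # hypothetical store name
  target:
    name: grafana-secrets      # the Secret consumed by envFromSecret
  data:
    - secretKey: GF_DATABASE_USER
      remoteRef:
        key: psql/grafana
        property: username     # hypothetical property
    - secretKey: GF_DATABASE_PASSWORD
      remoteRef:
        key: psql/grafana
        property: password     # hypothetical property
    - secretKey: GF_SECURITY_ADMIN_USER
      remoteRef:
        key: grafana/admin
        property: username     # hypothetical property
    - secretKey: GF_SECURITY_ADMIN_PASSWORD
      remoteRef:
        key: grafana/admin
        property: password     # hypothetical property
```

Naming the secretKey entries after the GF_* variables is what lets envFromSecret expose them to Grafana without any further mapping.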


PostgreSQL backend

Grafana uses PostgreSQL at pg.hdhomelab.com for persistent state (dashboards created in the UI, users, preferences, alert history).

The grafana database is provisioned by OpenTofu (tf-deploy/psql) — not by the pod itself. This keeps database lifecycle management outside of Kubernetes and avoids Grafana needing elevated DB permissions at startup.

The database connection is configured in grafana.ini:

[database]
type = postgres
host = pg.hdhomelab.com
name = grafana

Credentials come from the GF_DATABASE_USER / GF_DATABASE_PASSWORD env vars, which Grafana picks up automatically.


OAuth2 (Authentik)

Grafana authenticates users through Authentik using the generic OAuth2 provider. Because oauth_auto_login is false, the login screen still shows the built-in form, with an additional authentik button for SSO.

[auth]
signout_redirect_url = https://auth.hdhomelab.com/application/o/grafana/end-session/
oauth_auto_login = false

[auth.generic_oauth]
enabled = true
name = authentik
client_id = $__file{/etc/secrets/auth_generic_oauth/client_id}
client_secret = $__file{/etc/secrets/auth_generic_oauth/client_secret}
scopes = openid profile email
auth_url = https://auth.hdhomelab.com/application/o/authorize/
token_url = https://auth.hdhomelab.com/application/o/token/
api_url = https://auth.hdhomelab.com/application/o/userinfo/
role_attribute_path = contains(groups[*], 'grafana_admin') && 'GrafanaAdmin' || ...

The OAuth credentials are mounted as files from the grafana-oauth Secret (not env vars) to avoid them appearing in the process environment:

extraSecretMounts:
- name: auth-generic-oauth
  secretName: grafana-oauth
  mountPath: /etc/secrets/auth_generic_oauth
  readOnly: true

Role mapping

Grafana roles are derived from Authentik group membership via a JMESPath expression evaluated against the userinfo response:

| Authentik Group | Grafana Role |
| --- | --- |
| grafana_admin | GrafanaAdmin |
| grafana_editor | Editor |
| grafana_viewer | Viewer |
| (no matching group) | None (login denied) |

Sidecar

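The Grafana sidecar (k8s-sidecar) watches ConfigMaps in the mon namespace and reloads Grafana's provisioning when they change, so no pod restart is needed. Four provisioning channels are active; three are discovered by label, while datasources are mounted directly: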

| Channel | Label / Annotation | Mount path |
| --- | --- | --- |
| Dashboards | grafana_dashboard: "1" | /tmp/dashboards/ |
| Datasources | (mounted directly) | /etc/grafana/provisioning/datasources/ |
| Alerts | grafana_alert: "1" | /etc/grafana/provisioning/alerting/ |
| Notifiers | grafana_notifier: "1" | /etc/grafana/provisioning/notifiers/ |

Folder placement

The grafana_dashboard_folder annotation on a ConfigMap controls which Grafana folder the dashboard appears in. If the annotation is absent, the dashboard falls back to the default folder (kubernetes).
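
The label-driven channels and the folder annotation correspond to sidecar settings in the Grafana chart values. The sketch below shows how they could be wired; it assumes the values sit under a grafana: key and that foldersFromFilesStructure is enabled so the annotation turns into a Grafana folder. The actual HelmRelease may set these differently.

```yaml
# Sidecar values sketch; the real HelmRelease may differ
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard          # ConfigMaps must carry this label...
      labelValue: "1"                   # ...with this value
      folder: /tmp/dashboards           # where the sidecar writes dashboard JSON
      folderAnnotation: grafana_dashboard_folder
      provider:
        # Assumption: map the on-disk subfolders created from the
        # annotation to Grafana folders of the same name.
        foldersFromFilesStructure: true
    alerts:
      enabled: true
      label: grafana_alert
      labelValue: "1"
    notifiers:
      enabled: true
      label: grafana_notifier
      labelValue: "1"
```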


ConfigMaps

All provisioning files are managed as ConfigMaps generated by Kustomize in flux/monitoring/noah/kube-prometheus-stack/:

configMapGenerator:
- name: grafana-datasources            # datasources.yaml
- name: grafana-alert-contact-points   # contactpoints.yaml
- name: grafana-alert-notification-policies  # policies.yaml
- name: grafana-alert-rules            # rules.yaml

All use disableNameSuffixHash: true so ConfigMap names stay stable across reconciliations — the HelmRelease extraConfigmapMounts references them by name.
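
To illustrate why the stable name matters, a mount entry in the HelmRelease values could look like the following sketch. Mounting a single file via subPath is an assumption made for the example; the real values may mount the whole directory instead.

```yaml
# extraConfigmapMounts sketch; the subPath layout is an assumption
grafana:
  extraConfigmapMounts:
    - name: grafana-datasources        # volume name inside the pod
      configMap: grafana-datasources   # must match the generated ConfigMap name exactly
      mountPath: /etc/grafana/provisioning/datasources/datasources.yaml
      subPath: datasources.yaml
      readOnly: true
```

If Kustomize appended a content hash to the ConfigMap name, the configMap reference above would point at a name that no longer exists after the next change.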

Dashboard ConfigMaps live in a separate directory (flux/monitoring/noah/grafana-dashboards/) and are picked up by the sidecar automatically:

configMapGenerator:
- name: grafana-dashboard-traefik
  files: [traefik.json]
  options:
    disableNameSuffixHash: true
    labels:
      grafana_dashboard: "1"
    annotations:
      grafana_dashboard_folder: network

Datasources

Datasources are provisioned from datasources.yaml mounted at /etc/grafana/provisioning/datasources/:

| Name | UID | Target | Notes |
| --- | --- | --- | --- |
| Prometheus | prometheus-001 | kube-prometheus-stack-prometheus:9090 | Short-term, high-fidelity; default datasource |
| Thanos | thanos-001 | thanos-query-frontend:9090 | Long-range queries; max_source_resolution=auto |
| Loki | loki-001 | loki-gateway:80 | Log queries |
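
A datasources.yaml matching these three entries could look like the sketch below. The http scheme, the access: proxy mode, and passing max_source_resolution through customQueryParameters are assumptions; names, UIDs, and targets are taken from this page.

```yaml
# datasources.yaml sketch; URL scheme, access mode, and jsonData are assumptions
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus-001
    type: prometheus
    access: proxy
    url: http://kube-prometheus-stack-prometheus:9090
    isDefault: true
  - name: Thanos
    uid: thanos-001
    type: prometheus            # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query-frontend:9090
    jsonData:
      customQueryParameters: max_source_resolution=auto
  - name: Loki
    uid: loki-001
    type: loki
    access: proxy
    url: http://loki-gateway:80
```

Fixed UIDs let dashboards and alert rules reference a datasource by a value that survives re-provisioning.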

Which datasource to use?

Use Prometheus for recent data (< 3 days) where full resolution matters. Use Thanos for anything older: it merges data from the Prometheus sidecar with the historical blocks stored in MinIO.


Dashboards

Dashboards are stored as JSON files in flux/monitoring/noah/grafana-dashboards/, each wrapped in its own ConfigMap. The sidecar picks them up and reloads Grafana on changes.

| Dashboard | Folder | Description |
| --- | --- | --- |
| traefik | network | Edge proxy request rates, latencies, response codes |
| pihole | network | DNS query rates and ad-block stats |
| synology | (General) | NAS storage, RAID health, disk temps, system metrics |
| flux-cluster | flux | Flux reconciliation status across all kustomizations |
| flux-control-plane | flux | Flux controller CPU, memory, and queue depth |
| local-path-storage | kubernetes | Local-path provisioner PVC usage |
| minecraft | minecraft | JVM memory, GC, player counts |
| cs2 | games | CS2 server metrics |
| cs2-demo-manager | games | Demo management service metrics |
| pinchflat | pinchflat | YouTube download queue and throughput |

Adding a new dashboard

Export the dashboard JSON from Grafana UI, save it to flux/monitoring/noah/grafana-dashboards/<name>.json, and add a configMapGenerator entry in that directory's kustomization.yaml with the grafana_dashboard: "1" label and desired grafana_dashboard_folder annotation.


Alerting

Alerting is handled by Grafana Unified Alerting; the Prometheus Alertmanager is disabled. All alert rules, contact points, and routing policies are provisioned from ConfigMaps.

Contact points

The Telegram contact point reads the bot token from the TELEGRAM_BOT_TOKEN environment variable (injected from Vault) and uses a custom Go template to format messages:

# contactpoints.yaml
contactPoints:
- name: telegram-minecraft
  receivers:
  - uid: telegram-001
    type: telegram
    settings:
      bottoken: $TELEGRAM_BOT_TOKEN
      chatid: "<chat-id>"
      message: '{{ template "telegram.summary" . }}'

templates:
- name: "Custom Templates"
  template: |
    {{ define "telegram.summary" }}
    {{ if gt (len .Alerts.Firing) 0 }}{{ template "__summary_only_message" .Alerts.Firing }}{{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}{{ template "__summary_only_message" .Alerts.Resolved }}{{ end }}
    {{ end }}

    {{ define "__summary_only_message" }}{{ range .}}{{ .Annotations.summary }}{{ end }}{{ end }}

The template renders only the summary annotation, keeping messages concise.

Alert rules

Two alert groups are defined in rules.yaml, both in the minecraft folder:

The first group fires when a Minecraft pod's memory working set exceeds 98% of its memory request. It is evaluated every minute over a 10-minute window.

sum(container_memory_working_set_bytes{namespace="minecraft", ...}) by (pod)
/ sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_requests{namespace="minecraft"}) by (pod)
> 0.98

Annotation: {{ $labels.pod }} memory usage reaches {{ humanizePercentage $values.A.Value }}.

The second group fires when the player count changes, covering both joins and leaves. It compares the current player count against the count 1 minute earlier:

abs(
  minecraft_status_players_online_count{namespace="minecraft"}
  - minecraft_status_players_online_count{namespace="minecraft"} offset 1m
) > 0

Annotation: {{ $values.A.Value }} player(s) now online on {{ reReplaceAll "-mc-monitor" "" $labels.service }} (was {{ $values.B.Value }})

Notification policy

# policies.yaml
policies:
- receiver: grafana-default-email   # catch-all
  group_by: [grafana_folder, alertname]
  routes:
  - receiver: telegram-minecraft
    object_matchers:
    - [alertname, =~, "Minecraft.*"]
    group_wait: 5s
    group_interval: 5s
    repeat_interval: 4w            # re-send a still-firing alert at most every 4 weeks

4-week repeat interval

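The long repeat interval prevents Telegram spam while a condition persists (for example, a server that stays at high memory): a still-firing alert is re-sent at most once every 4 weeks. State changes, such as a new alert firing or an existing one resolving, are still notified within seconds, because group_wait and group_interval are both 5s.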