feat(observability): auto-provision GoodData-CN dashboards via Grafan…#104

Draft
nortonsk wants to merge 11 commits into gooddata:master from nortonsk:feat/grafana-dashboards
@nortonsk
Contributor

…a sidecar

Enable the Grafana sidecar in observability.tf so it watches the observability namespace for ConfigMaps labelled grafana_dashboard=1. foldersFromFilesStructure=true maps subdirectory names to Grafana folders.

Add grafana-dashboards.tf with kubernetes_config_map_v1 resources for gooddata-cn-overall-health and panther-overall dashboards. Terraform replace() substitutes GDMIMIR->prometheus and GDLOKI->loki at plan time so imported dashboards resolve to local datasources automatically.

Add modules/k8s-common/dashboards/ with the dashboard JSON files so the module is self-contained. Add dashboards/ tooling: export.sh, import.sh, Makefile (with sync target), docker-compose.test.yml for local testing.

No manual import step is needed after terraform apply: the dashboards appear in the GoodData-CN folder in Grafana within ~10 seconds.
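To make the substitution and labelling described above concrete, here is a minimal Python sketch of the same idea. It is not the Terraform code from this PR: the function names, the ConfigMap data key, and the two-panel sample dashboard are hypothetical. Only the `grafana_dashboard=1` label, the `observability` namespace, and the GDMIMIR->prometheus / GDLOKI->loki mapping come from the description.

```python
import json

# Placeholder UIDs from the PR description mapped to local datasource UIDs.
DATASOURCE_UIDS = {"GDMIMIR": "prometheus", "GDLOKI": "loki"}


def substitute_datasources(dashboard_json: str) -> str:
    """Replace each placeholder UID with its local equivalent (plain string replace)."""
    for placeholder, local_uid in DATASOURCE_UIDS.items():
        dashboard_json = dashboard_json.replace(placeholder, local_uid)
    return dashboard_json


def build_dashboard_configmap(name: str, dashboard_json: str) -> dict:
    """Build a ConfigMap manifest carrying the label the sidecar watches for."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": "observability",
            "labels": {"grafana_dashboard": "1"},
        },
        # Assumption: one JSON file per ConfigMap, keyed by dashboard name.
        "data": {f"{name}.json": substitute_datasources(dashboard_json)},
    }


# Hypothetical dashboard with one Prometheus panel and one Loki panel.
raw = json.dumps(
    {"panels": [{"datasource": {"uid": "GDMIMIR"}}, {"datasource": {"uid": "GDLOKI"}}]}
)
manifest = build_dashboard_configmap("gooddata-cn-overall-health", raw)
```

Because the replacement is a plain string substitution, it would also rewrite any other occurrence of `GDMIMIR` or `GDLOKI` in the JSON, for example in a panel title.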

@nortonsk nortonsk marked this pull request as draft March 26, 2026 13:27
@nortonsk nortonsk added the "do not merge" (Do not merge this yet) and "Testing" labels Mar 26, 2026
…a sidecar

Enable the Grafana sidecar in observability.tf so it watches the
observability namespace for ConfigMaps labelled grafana_dashboard=1.
foldersFromFilesStructure=true maps subdirectory names to Grafana folders.

Add grafana-dashboards.tf with a kubernetes_config_map_v1 resource for
the gooddata-cn-overall-health dashboard. Terraform replace() substitutes
GDMIMIR->prometheus and GDLOKI->loki at plan time so the dashboard
resolves to local datasources automatically.

No manual import step needed after terraform apply — the dashboard appears
in the GoodData-CN folder in Grafana within ~10 seconds of apply.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@nortonsk nortonsk force-pushed the feat/grafana-dashboards branch from b97d463 to 264c308 Compare March 26, 2026 14:18
…ions

Covers three deployment options: automatic via gooddata-cn-terraform,
Grafana UI import, and kubectl ConfigMap for any Kubernetes environment.
Documents datasource UID substitution and how to update the JSON.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@nortonsk nortonsk force-pushed the feat/grafana-dashboards branch 10 times, most recently from c041ccf to 39b46d4 Compare March 30, 2026 10:42
- Auto-provision dashboard via Grafana sidecar ConfigMap (grafana_dashboard=1 label,
  grafana_folder annotation) so it appears in GoodData-CN folder without manual import
- Alias GDMIMIR datasource UID to local Prometheus so all 80+ Prometheus panels work
  out of the box on local (k3d) deployments without a dedicated Mimir instance
- Set $cluster variable allValue=".*" + includeAll=true so local deployments (which
  have no cluster_name label) match all series when "All" is selected
- Fix all cluster_name filters to use regex match (=~) so the ".*" allValue works
  correctly; previously exact-match would return no data when "All" was selected
- Replace nginx ingress log/metric queries with api-gw container queries
- Replace removed forward_call_* metrics with OTel http_server_request_duration_seconds_*
  on API 5xx Error Rate, API Latency Distribution, and Gateway 5xx Error Count panels
- Keep forward_call_response_status_count_total on API Request Rate by Upstream Service
  (better upstream-host granularity); add migration note in panel description
- Fix OOM Kills stat panel to use max_over_time(…[$__range]) for full time-window view
- Remove Calcique metadata lookup times panel
- Pre-install prometheus-operator-crds before k8s-local so CNPG PodMonitor works
  without CRD-missing errors; kube-prometheus-stack skips CRDs (managed separately)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
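
The `cluster_name` fix in the commit above relies on the difference between exact and regex label matchers. The following Python sketch models that difference; it is an approximation, since Prometheus uses RE2 rather than Python's `re`, and the helper names and the sample cluster name are hypothetical. It assumes, as Prometheus does, that a series without a given label behaves as if the label were an empty string.

```python
import re


def matches_exact(label_value: str, selector: str) -> bool:
    """Model of a PromQL `=` matcher: plain string equality."""
    return label_value == selector


def matches_regex(label_value: str, selector: str) -> bool:
    """Model of a PromQL `=~` matcher: the regex must match the whole value."""
    return re.fullmatch(selector, label_value) is not None


# Value interpolated for $cluster when "All" is selected (allValue=".*").
ALL_VALUE = ".*"

# A local series has no cluster_name label, which behaves like an empty string.
local_series_cluster = ""
# Hypothetical series that does carry the label.
labelled_series_cluster = "prod-eu1"

exact_hits = [
    matches_exact(v, ALL_VALUE) for v in (local_series_cluster, labelled_series_cluster)
]
regex_hits = [
    matches_regex(v, ALL_VALUE) for v in (local_series_cluster, labelled_series_cluster)
]
```

With the exact matcher, the literal string `.*` equals neither value, so "All" selects nothing; with the regex matcher both the unlabelled and the labelled series are selected.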
@nortonsk nortonsk force-pushed the feat/grafana-dashboards branch 3 times, most recently from ff02e12 to 309b714 Compare March 30, 2026 14:18
… collision

- Remove duplicate isDefault=true from GD Loki datasource; Grafana rejects
  configs with more than one default datasource per org and crashed on startup
- Fix openssl passwd treating passwords starting with '-' as flags by adding
  '--' end-of-options sentinel in gooddata-orgs.tf password hash script

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
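
The `--` end-of-options convention used in the fix above is shared by many command-line parsers. As an analogy only (this is Python's `argparse`, not `openssl passwd`, and the `hash-password` CLI with its `-q` flag is hypothetical), the sketch below shows a value with a leading `-` being rejected as an option and then accepted once `--` precedes it.

```python
import argparse
from typing import List, Optional


def parse_password(argv: List[str]) -> Optional[str]:
    """Return the positional password, or None if parsing fails."""
    parser = argparse.ArgumentParser(prog="hash-password")  # hypothetical CLI
    parser.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("password")
    try:
        return parser.parse_args(argv).password
    except SystemExit:
        # argparse reports the error on stderr and exits; treat that as failure.
        return None


plain = parse_password(["hunter2"])
rejected = parse_password(["-secret"])        # parsed as an unknown option
accepted = parse_password(["--", "-secret"])  # everything after "--" is positional
```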
@nortonsk nortonsk force-pushed the feat/grafana-dashboards branch from 309b714 to 5ea6255 Compare March 30, 2026 15:06
nortonsk and others added 7 commits March 31, 2026 08:24
GoodData.CN 4.0.0 gen-ai uses alembic which builds its DB URL via Python
configparser — % is reserved for interpolation and causes a ValueError when
it appears in the password. Remove % from override_special so the generated
password is safe for use in URL/configparser contexts.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
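
The failure mode described in the commit above can be reproduced with the standard library alone. This is a minimal sketch of one way such a `ValueError` arises, not the gen-ai or alembic code itself: the section name, option name, and URL shape are hypothetical. `ConfigParser.set()` validates values against its default `%`-based interpolation syntax and rejects a bare `%`.

```python
import configparser


def can_store_url(password: str) -> bool:
    """Return True if a DB URL containing the password passes configparser validation."""
    parser = configparser.ConfigParser()  # default BasicInterpolation
    parser.add_section("alembic")
    # Hypothetical URL shape; only the password part matters here.
    url = f"postgresql://gooddata:{password}@db:5432/genai"
    try:
        parser.set("alembic", "sqlalchemy.url", url)
    except ValueError:
        # Raised for a bare "%" that is not part of "%%" or "%(name)s".
        return False
    return True


safe = can_store_url("s3cret!_value")
unsafe = can_store_url("pa%ss")
escaped = can_store_url("pa%%ss")
```

Doubling the percent sign makes the value valid again, but dropping `%` from the generated character set, as this commit does, avoids having to escape it at every call site.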
…n config

- gdcn-size-dev: add full JVM options and resource limits matching Agora
  (metadataApi, apiGateway, authService, calcique, pdfStaplerService,
  resultCache, scanModel, sqlExecutor, exportBuilder, visualExporterService,
  apiGw, redis-ha; all other services get resource limits/requests)
- gdcn-base: enable deployQuiverGeoCollections, enableGeoArea,
  enableNewGeoPushpin, mapIngestionJob, resultCache pulsar invalidation
- gdcn-local: add quiver geo collections S3 config via new SeaweedFS bucket
- k8s-local: add gooddata-geo-collections SeaweedFS bucket
- k8s-common: add local_s3_geo_collections_bucket variable

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…token secret)

The mapIngestionJob requires an external map token managed via Vault in
production. Local k3d installs don't have this secret, causing the job
to fail with BackoffLimitExceeded. Override to disabled in gdcn-local.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…and cluster_name label

- Deploy grafana-image-renderer as a sidecar service for panel/dashboard
  PNG export and dashboard image embedding
- Enable external snapshot sharing via snapshots.raintank.io
- Add nginx proxy-body-size: 50m on Grafana ingress to fix 413 on snapshot publish
- Add cluster_name external label to kube-prometheus-stack so dashboards
  using cluster_name label selector populate correctly

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
skip_crds=true caused ServiceMonitor CRD errors on fresh cluster installs
where the Prometheus Operator CRDs have never been deployed. Let Helm
manage the CRD lifecycle, which is the safe default for a fresh install.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>