Platform Monitoring & Observability
Introduction
This documentation describes the monitoring infrastructure and procedures within the HappyHorizon DevOps environment. The monitoring system is built on the Prometheus stack and provides observability across the platform, including metrics collection, alerting, and visualization through Grafana.
Monitoring Architecture
The monitoring infrastructure consists of several components that together form a complete observability solution:
Component Overview
- Prometheus Server - metrics collection and storage
- Alertmanager - alerting and notification management
- Grafana - visualization and dashboards
- Node Exporter - host-level metrics
- kube-state-metrics - Kubernetes object metrics
- Prometheus Operator - declarative management of Prometheus resources
Prometheus Configuration
1. Prometheus Rules
Prometheus rules define alerting rules and recording rules. Recording rules precompute frequently used or expensive expressions so that queries stay efficient:
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: node
      rules:
        - alert: HighCPUUsage
          # Node Exporter exposes no ready-made percentage metric, so derive usage from idle CPU time
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
    - name: kubernetes
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 15m
          labels:
            severity: warning
            team: application
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted multiple times in the last 15 minutes"
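The manifest above only contains alerting rules, although recording rules are mentioned as well. A minimal sketch of a recording rule and an alert built on top of it could look as follows. The resource name, group names, and the recorded series name `instance:node_cpu_utilisation:ratio` are assumptions chosen for illustration; only the `release: prometheus` label is taken from this document.

```yaml
# recording-rules.yaml (illustrative; names are assumptions, not part of the platform)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: node-recording
      interval: 1m
      rules:
        # Precompute CPU utilisation per instance as a ratio between 0 and 1
        - record: instance:node_cpu_utilisation:ratio
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
    - name: node-alerts-from-recording
      rules:
        # The alert reads the precomputed series instead of re-evaluating the rate
        - alert: HighCPUUsageRecorded
          expr: instance:node_cpu_utilisation:ratio > 0.8
          for: 5m
          labels:
            severity: warning
```

The recorded name follows the common `level:metric:operations` naming convention, which makes it easy to tell recorded series apart from raw exporter metrics in dashboards.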
2. Service Monitoring
ServiceMonitor resources define which Services are scraped and how their metrics are collected:
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: client-namespace
  labels:
    app: example
    release: prometheus
spec:
  selector:
    matchLabels:
      app: example
  namespaceSelector:
    matchNames:
      - client-namespace
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      scrapeTimeout: 10s
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
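The `release: prometheus` label on the ServiceMonitor only has an effect if the Prometheus resource selects on it. The excerpt below is a sketch of what such a selector might look like. The resource name `prometheus` and the choice to select on exactly this label are assumptions; the actual values depend on how the Prometheus Operator was installed.

```yaml
# prometheus.yaml (excerpt; resource name and selector values are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Pick up only monitors and rules that carry the release label
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  podMonitorSelector:
    matchLabels:
      release: prometheus
  ruleSelector:
    matchLabels:
      release: prometheus
  # An empty namespace selector matches all namespaces,
  # so monitors in client-namespace are discovered too
  serviceMonitorNamespaceSelector: {}
  podMonitorNamespaceSelector: {}
```

A monitor or rule whose labels do not match the selector is ignored without an error, so a missing label is a common reason for targets not showing up.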
3. Pod Monitoring
PodMonitor resources scrape Pods directly instead of going through a Service:
# pod-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pod-metrics
  namespace: client-namespace
spec:
  selector:
    matchLabels:
      app: example
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - client-namespace
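In a PodMonitor, `port` refers to the name of a container port, not to a port number, so the Pod template has to declare a port named `metrics`. The following Deployment excerpt is a sketch of a workload that the PodMonitor above would match. The Deployment name, container name, image, and port number 8080 are hypothetical placeholders.

```yaml
# deployment.yaml (excerpt; name, image, and port number are hypothetical)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: client-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example            # matched by the PodMonitor selector
    spec:
      containers:
        - name: app
          image: registry.example.com/example:1.0.0   # placeholder image
          ports:
            - name: metrics     # referenced by podMetricsEndpoints[].port
              containerPort: 8080
```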
Alerting Configuration
1. AlertManager Config
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: platform-alerts
  namespace: monitoring
  labels:
    alertmanagerConfig: main
spec:
  route:
    receiver: 'platform-team'
    groupBy: ['alertname', 'cluster', 'service']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      # The AlertmanagerConfig CRD uses a "matchers" list instead of the native "match" map
      - receiver: 'dev-team'
        matchers:
          - name: team
            value: application
            matchType: '='
          - name: severity
            value: warning
            matchType: '='
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
      - receiver: 'platform-team'
        matchers:
          - name: team
            value: platform
            matchType: '='
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 4h
      - receiver: 'critical-team'
        matchers:
          - name: severity
            value: critical
            matchType: '='
        groupWait: 0s
        groupInterval: 1m
        repeatInterval: 1h
  # The 'dev-team' and 'critical-team' receivers referenced above must also be defined here
  receivers:
    - name: 'platform-team'
      slackConfigs:
        - channel: '#platform-alerts'
          # The CRD expects a Secret reference, not a literal URL.
          # Assumption: a Secret "slack-webhook" with key "url" holds
          # https://hooks.slack.com/services/xxx/yyy/zzz
          apiURL:
            name: slack-webhook
            key: url
          username: 'Prometheus Alert Manager'
          iconEmoji: ':warning:'
          sendResolved: true
          title: '{{ template "slack.title" . }}'
          text: '{{ template "slack.text" . }}'
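The routes refer to the receivers `dev-team` and `critical-team`, but only `platform-team` is defined, and a route that points to an unknown receiver is rejected. The fragment below sketches how the two missing receivers could be added under `spec.receivers`, together with the Secret that the `apiURL` field of the AlertmanagerConfig CRD references. The channel names and the Secret name `slack-webhook` are hypothetical; the webhook URL is the placeholder used in this document.

```yaml
# Additional entries for spec.receivers in alertmanager-config.yaml
# (channel names and the Secret name are hypothetical)
receivers:
  - name: 'dev-team'
    slackConfigs:
      - channel: '#dev-alerts'
        apiURL:
          name: slack-webhook   # Secret in the same namespace as the AlertmanagerConfig
          key: url
        sendResolved: true
  - name: 'critical-team'
    slackConfigs:
      - channel: '#critical-alerts'
        apiURL:
          name: slack-webhook
          key: url
        sendResolved: true
---
# slack-webhook-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: slack-webhook
  namespace: monitoring
stringData:
  url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
```

Keeping the webhook URL in a Secret also keeps it out of the AlertmanagerConfig manifest, which is typically stored in version control.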
2. Grafana Dashboards
1. Platform Overview Dashboard
The Platform Overview dashboard shows:
- Cluster health indicators
- Node resource utilization
- Network traffic and latency
- Storage performance
- Critical alerts and events
2. Application Metrics Dashboard
Per-application dashboards include:
- Response times and error rates
- Throughput and request volumes
- Database query performance
- Custom application metrics
- Service dependencies
3. Dashboard Configuration
Dashboards are managed as code through Grafana provisioning:
# grafana-dashboard-provider.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
data:
  provider.yaml: |-
    apiVersion: 1
    providers:
      - name: 'platform'
        orgId: 1
        folder: 'Platform'
        type: file
        options:
          path: /var/lib/grafana/dashboards/platform

# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-platform-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "true"
data:
  platform-overview.json: |-
    {
      "annotations": { ... },
      "editable": true,
      "panels": [ ... ],
      "refresh": "10s",
      "schemaVersion": 16,
      "title": "Platform Overview",
      "uid": "platform-overview",
      "version": 1
    }
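For the provider to find the dashboard, both ConfigMaps have to be mounted into the Grafana container: the provider file into Grafana's provisioning directory, and the dashboard JSON at the `path` configured in `provider.yaml`. The excerpt below is a sketch that assumes Grafana runs as a plain Deployment named `grafana`. The Deployment name, the `app: grafana` label, and the volume names are assumptions. If Grafana is installed through a Helm chart with a dashboard sidecar, the `grafana_dashboard: "true"` label may already take care of loading the dashboard.

```yaml
# grafana-deployment.yaml (excerpt; assumes a plain Deployment named grafana)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana   # pin a specific version in practice
          volumeMounts:
            # Grafana reads dashboard providers from this provisioning directory
            - name: dashboard-provider
              mountPath: /etc/grafana/provisioning/dashboards
            # Must equal options.path in provider.yaml
            - name: platform-dashboard
              mountPath: /var/lib/grafana/dashboards/platform
      volumes:
        - name: dashboard-provider
          configMap:
            name: grafana-dashboard-provider
        - name: platform-dashboard
          configMap:
            name: grafana-platform-dashboard
```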
Monitoring for Custom Applications
1. Instrumentation Guidelines
For monitoring custom applications we recommend the following:
For Node.js applications:
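// Assumes an Express application; adjust to the web framework in use
const express = require('express');
const promClient = require('prom-client');
const app = express();

// Collect the default Node.js process metrics (recent prom-client versions take no timeout option)
promClient.collectDefaultMetrics();

// Custom metric: HTTP request duration histogram, in seconds
const httpRequestDurationSeconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

// Observe the duration of every request once its response has finished
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // Use the route pattern rather than the raw URL to keep label cardinality low
    end({ method: req.method, route: req.route ? req.route.path : 'unmatched', code: res.statusCode });
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});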
For Java applications (Spring Boot):
<!-- pom.xml -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.properties
management.endpoints.web.exposure.include=prometheus,health,info
management.metrics.tags.application=${spring.application.name}
management.metrics.distribution.percentiles-histogram.http.server.requests=true
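With this configuration, Spring Boot Actuator serves the Prometheus metrics under `/actuator/prometheus` by default rather than `/metrics`, so the scrape path has to be set accordingly. A sketch of a matching ServiceMonitor is shown below. The resource name `spring-app-metrics` is an assumption, and the example presumes that the application's Service exposes a port named `metrics` and carries the label `app: example`.

```yaml
# service-monitor-spring.yaml (illustrative; resource name is an assumption)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spring-app-metrics
  namespace: client-namespace
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: example
  endpoints:
    - port: metrics
      path: /actuator/prometheus   # Spring Boot Actuator default endpoint path
      interval: 30s
```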
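Service configuration:
# application-metrics.yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  namespace: client-namespace
  labels:
    app: example   # needed so that the ServiceMonitor selector matches this Service
  annotations:
    # Only evaluated by annotation-based scrape configs; ServiceMonitors ignore these
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
spec:
  selector:
    app: example
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080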