initial commit

2026-07-22 05:58:50 +00:00 · 2025-12-25 22:10:57 +00:00
commit 8adb52608f
46 changed files with 7570 additions and 0 deletions
@@ -0,0 +1,267 @@
+# KubeMirror Monitoring
+
+This directory contains observability resources for monitoring KubeMirror in production.
+
+## Overview
+
+KubeMirror exposes Prometheus metrics on port 8080 at `/metrics`. The monitoring stack includes:
+
+- **ServiceMonitor**: Prometheus Operator resource for automatic metric scraping
+- **PrometheusRule**: Alert rules for common operational issues
+- **Grafana Dashboard**: Comprehensive visualization of controller metrics
+
+## Prerequisites
+
+- Prometheus Operator installed in your cluster
+- Grafana (optional, for dashboards)
+
+```bash
+# Install Prometheus Operator (if not already installed)
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+helm install prometheus prometheus-community/kube-prometheus-stack \
+  --namespace monitoring \
+  --create-namespace
+```
+
+## Quick Start
+
+### Deploy Monitoring Resources
+
+```bash
+# Apply ServiceMonitor and PrometheusRule
+kubectl apply -f monitoring/servicemonitor.yaml
+kubectl apply -f monitoring/prometheusrule.yaml
+```
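+
+For orientation, a ServiceMonitor for the endpoint described in the Overview could look roughly like the sketch below. It is not a copy of `monitoring/servicemonitor.yaml`: the resource name, the `release` label, the Service label in `selector`, and the port name `metrics` are all assumptions and need to match your actual Service and Prometheus setup.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: kubemirror                         # hypothetical name
+  namespace: kubemirror-system
+  labels:
+    release: prometheus                    # assumption: Helm release name from Prerequisites
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: kubemirror   # assumption: label on the metrics Service
+  endpoints:
+    - port: metrics                        # assumption: name of the Service port mapped to 8080
+      path: /metrics
+      interval: 30s
+```
+
+Note that `port` refers to the Service port's name, not its number, so the Service must expose a named port.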
+
+### Import Grafana Dashboard
+
+1. **Via UI:**
+   - Open Grafana
+   - Go to Dashboards → Import
+   - Upload `monitoring/grafana-dashboard.json`
+   - Select your Prometheus datasource
+
+2. **Via ConfigMap (GitOps):**
+   ```bash
+   kubectl create configmap kubemirror-dashboard \
+     --from-file=dashboard.json=monitoring/grafana-dashboard.json \
+     -n monitoring \
+     --dry-run=client -o yaml | kubectl apply -f -
+
+   # Label for automatic discovery by the Grafana dashboard sidecar
+   # (--overwrite keeps the command re-runnable)
+   kubectl label configmap kubemirror-dashboard \
+     grafana_dashboard=1 \
+     -n monitoring \
+     --overwrite
+   ```
+
+## Available Metrics
+
+### Controller Runtime Metrics
+
+These metrics are provided by the controller-runtime framework:
+
+- `controller_runtime_reconcile_total` - Total reconciliations (by controller, result)
+- `controller_runtime_reconcile_errors_total` - Failed reconciliations
+- `controller_runtime_reconcile_time_seconds` - Reconciliation duration histogram
+- `workqueue_depth` - Current workqueue depth
+- `workqueue_adds_total` - Total items added to workqueue
+- `workqueue_retries_total` - Workqueue retry count
+
+### Leader Election Metrics
+
+- `leader_election_master_status` - Leader election status (1 = leader, 0 = follower)
+
+### Go Runtime and Process Metrics
+
+- `go_goroutines` - Current goroutine count
+- `go_memstats_alloc_bytes` - Bytes of heap memory allocated and still in use
+- `process_open_fds` - Open file descriptors
+- `process_cpu_seconds_total` - Total user and system CPU time, in seconds
+
+## Alert Rules
+
+The PrometheusRule defines alerts for:
+
+### Critical Alerts
+
+- **KubeMirrorControllerDown**: Controller pod is not running
+  - Severity: `critical`
+  - Fires after: 5 minutes
+
+### Warning Alerts
+
+- **KubeMirrorHighReconcileErrors**: High error rate in reconciliation
+  - Threshold: >10% error rate
+  - Fires after: 10 minutes
+
+- **KubeMirrorReconcileLatencyHigh**: Slow reconciliation loops
+  - Threshold: p99 latency > 5 seconds
+  - Fires after: 10 minutes
+
+- **KubeMirrorWorkqueueDepthHigh**: Work items piling up
+  - Threshold: >100 items in queue
+  - Fires after: 15 minutes
+
+- **KubeMirrorLeaderElectionLost**: Controller is not the leader
+  - Fires after: 2 minutes
+
+- **KubeMirrorHighFailureRate**: Overall operation failure rate high
+  - Threshold: >5% failure rate
+  - Fires after: 10 minutes
+
+- **KubeMirrorMemoryHigh**: High memory usage
+  - Threshold: >90% of memory limit
+  - Fires after: 5 minutes
+
+- **KubeMirrorCPUThrottling**: CPU throttling detected
+  - Fires after: 10 minutes
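+
+As an illustration of how one of these alerts can be expressed, the sketch below encodes the **KubeMirrorHighReconcileErrors** threshold (>10% for 10 minutes) as a ratio of the two controller-runtime counters. It is an assumption about the rule's shape, not the contents of `monitoring/prometheusrule.yaml`; the resource name, group name, `component` label, and annotation text are hypothetical.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: kubemirror-alerts-example          # hypothetical name
+  namespace: kubemirror-system
+spec:
+  groups:
+    - name: kubemirror.example             # hypothetical group name
+      rules:
+        - alert: KubeMirrorHighReconcileErrors
+          # Failed reconciliations divided by all reconciliations, per controller
+          expr: |
+            sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)
+              /
+            sum(rate(controller_runtime_reconcile_total[5m])) by (controller)
+              > 0.10
+          for: 10m
+          labels:
+            severity: warning
+            component: kubemirror          # assumption: label used for Alertmanager routing
+          annotations:
+            summary: "High reconcile error rate for {{ $labels.controller }}"
+```
+
+When a controller performs no reconciliations in the window, the division yields `NaN`, which does not satisfy the comparison, so the alert stays silent rather than firing on idle controllers.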
+
+## Recording Rules
+
+Recording rules pre-compute expensive queries for better dashboard performance:
+
+- `kubemirror:reconcile_duration_seconds:p99` - P99 reconciliation latency
+- `kubemirror:reconcile_duration_seconds:p95` - P95 reconciliation latency
+- `kubemirror:reconcile_duration_seconds:p50` - P50 reconciliation latency
+- `kubemirror:reconcile_rate:5m` - Reconciliation rate (5m window)
+- `kubemirror:reconcile_errors:rate5m` - Error rate (5m window)
+- `kubemirror:workqueue_depth:max` - Max workqueue depth
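+
+A recording rule of this kind might be defined as in the sketch below, which stores the P99 query from the histogram buckets under the first name in the list. The group name and the 5-minute rate window are assumptions; the fragment belongs under `spec` of a PrometheusRule.
+
+```yaml
+groups:
+  - name: kubemirror.recording.example     # hypothetical group name
+    rules:
+      - record: kubemirror:reconcile_duration_seconds:p99
+        # Assumption: 5m rate window, aggregated per controller
+        expr: |
+          histogram_quantile(0.99,
+            sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
+          )
+```
+
+Dashboards can then query the recorded series directly instead of re-evaluating the quantile on every refresh.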
+
+## Grafana Dashboard
+
+The dashboard includes the following panels:
+
+1. **Controller Status** - Up/down status
+2. **Reconciliation Rate** - Operations per second by type and result
+3. **Total Workqueue Depth** - Combined queue depth across controllers
+4. **Reconciliation Latency** - P99 and P95 latency trends
+5. **Workqueue Depth** - Per-controller queue depth
+6. **Memory Usage** - Working set vs limits
+7. **CPU Usage** - CPU utilization percentage
+8. **Error Rate** - Percentage of failed reconciliations
+9. **Process Stats** - Goroutines and file descriptors
+
+## Querying Metrics
+
+### Using Prometheus UI
+
+```promql
+# Total reconciliation rate
+sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)
+
+# Error rate (failed reconciliations per second, not a percentage)
+sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)
+
+# P99 latency
+histogram_quantile(0.99,
+  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
+)
+
+# Current workqueue depth
+workqueue_depth{name=~"secret|configmap"}
+```
+
+### Using kubectl
+
+```bash
+# Port-forward to metrics endpoint
+kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080
+
+# In a second terminal (port-forward blocks), fetch the raw metrics in Prometheus text format
+curl http://localhost:8080/metrics
+```
+
+## Troubleshooting
+
+### ServiceMonitor Not Scraping
+
+Check whether Prometheus is configured to discover ServiceMonitors in the `kubemirror-system` namespace, and whether the KubeMirror target appears:
+
+```bash
+# Check ServiceMonitor status
+kubectl get servicemonitor -n kubemirror-system
+
+# Check Prometheus targets
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090/targets
+```
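+
+If the ServiceMonitor exists but no target shows up, a common cause with the `kube-prometheus-stack` chart is that Prometheus only selects resources labeled with its own Helm release. Assuming the chart was installed as shown in Prerequisites, the Helm values below are one way to make it select ServiceMonitors and PrometheusRules regardless of that label. Treat this as a sketch and check it against the chart version you run.
+
+```yaml
+# values.yaml for kube-prometheus-stack (assumption: chart installed as in Prerequisites)
+prometheus:
+  prometheusSpec:
+    # Do not restrict ServiceMonitor discovery to the Helm release label
+    serviceMonitorSelectorNilUsesHelmValues: false
+    # Same for PrometheusRule discovery
+    ruleSelectorNilUsesHelmValues: false
+```
+
+The alternative is to leave these defaults alone and add the matching `release` label to the KubeMirror ServiceMonitor and PrometheusRule.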
+
+### Alerts Not Firing
+
+Verify PrometheusRule is loaded:
+
+```bash
+# Check PrometheusRule
+kubectl get prometheusrule -n kubemirror-system
+
+# Check Prometheus rules
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090/rules
+```
+
+### High Memory Usage
+
+If alerts fire for high memory:
+
+1. Check for memory leaks in controller logs
+2. Increase memory limits in Helm values:
+   ```yaml
+   resources:
+     limits:
+       memory: 1Gi
+   ```
+3. Reduce worker threads or max targets if necessary
+
+### High Reconciliation Latency
+
+If reconciliation is slow:
+
+1. Check API server latency: `kubectl get --raw /metrics | grep apiserver_request_duration`
+2. Increase worker threads in Helm values:
+   ```yaml
+   controller:
+     workerThreads: 10
+   ```
+3. Review rate limiting settings if hitting API limits
+
+## Integration with Alertmanager
+
+To route KubeMirror alerts to specific channels:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: alertmanager-config
+  namespace: monitoring
+data:
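+  alertmanager.yml: |
+    # Fragment: merge into your existing configuration. Alertmanager also
+    # requires a default `receiver` on the root route.
+    route:
+      routes: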
+        - match:
+            component: kubemirror
+          receiver: kubemirror-team
+          continue: true
+
+    receivers:
+      - name: kubemirror-team
+        slack_configs:
+          - channel: '#kubemirror-alerts'
+            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
+```
+
+## Best Practices
+
+1. **Set up alerts** - Deploy PrometheusRule to catch issues early
+2. **Monitor trends** - Use Grafana dashboard to spot degradation over time
+3. **Baseline metrics** - Understand normal behavior during low/high load
+4. **Tune resources** - Adjust CPU/memory based on actual usage patterns
+5. **Avoid alert fatigue** - Tune alert thresholds to reduce false positives
+6. **Retention** - Ensure Prometheus retains metrics for at least 7 days
+
+## Further Reading
+
+- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
+- [Grafana Dashboard Best Practices](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/)
+- [Controller Runtime Metrics](https://book.kubebuilder.io/reference/metrics.html)