# KubeMirror Monitoring

This directory contains observability resources for monitoring KubeMirror in production.

## Overview

KubeMirror exposes Prometheus metrics on port 8080 at `/metrics`. The monitoring stack includes:

- **ServiceMonitor**: Prometheus Operator resource for automatic metric scraping
- **PrometheusRule**: Alert rules for common operational issues
- **Grafana Dashboard**: Comprehensive visualization of controller metrics

## Prerequisites

- Prometheus Operator installed in your cluster
- Grafana (optional, for dashboards)

```bash
# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```

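Before applying the resources in this directory, it can help to confirm that the Prometheus Operator CRDs are present. The following is a minimal sketch, assuming `kubectl` already points at the target cluster. `check_monitoring_crds` is a hypothetical helper name; the two CRD names are the ones Prometheus Operator registers for ServiceMonitor and PrometheusRule.

```bash
# Hypothetical helper: report whether the CRDs needed by the ServiceMonitor
# and PrometheusRule in this directory exist in the current cluster.
check_monitoring_crds() {
  local crd missing=0
  for crd in servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "found: $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  # Non-zero exit status when at least one CRD is absent.
  return "$missing"
}

# Usage (not run here): check_monitoring_crds || echo "install Prometheus Operator first"
```

The function only reads cluster state, so it is safe to run repeatedly, for example as a pre-flight step in a deployment script.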
## Quick Start

### Deploy Monitoring Resources

```bash
# Apply ServiceMonitor and PrometheusRule
kubectl apply -f monitoring/servicemonitor.yaml
kubectl apply -f monitoring/prometheusrule.yaml
```

### Import Grafana Dashboard

1. **Via UI:**
   - Open Grafana
   - Go to Dashboards → Import
   - Upload `grafana-dashboard.json`
   - Select your Prometheus datasource

2. **Via ConfigMap (GitOps):**
   ```bash
   kubectl create configmap kubemirror-dashboard \
     --from-file=dashboard.json=monitoring/grafana-dashboard.json \
     -n monitoring \
     --dry-run=client -o yaml | kubectl apply -f -

   # Label for automatic discovery by Grafana
   kubectl label configmap kubemirror-dashboard \
     grafana_dashboard=1 \
     -n monitoring
   ```

## Available Metrics

### Controller Runtime Metrics

These metrics are provided by the controller-runtime framework:

- `controller_runtime_reconcile_total` - Total reconciliations (by controller, result)
- `controller_runtime_reconcile_errors_total` - Failed reconciliations
- `controller_runtime_reconcile_time_seconds` - Reconciliation duration histogram
- `workqueue_depth` - Current workqueue depth
- `workqueue_adds_total` - Total items added to workqueue
- `workqueue_retries_total` - Workqueue retry count

### Leader Election Metrics

- `leader_election_master_status` - Leader election status (1 = leader, 0 = follower)

### Go Runtime Metrics

- `go_goroutines` - Current goroutine count
- `go_memstats_alloc_bytes` - Allocated memory
- `process_open_fds` - Open file descriptors
- `process_cpu_seconds_total` - CPU time

## Alert Rules

The PrometheusRule defines the following alerts:

### Critical Alerts

- **KubeMirrorControllerDown**: Controller pod is not running
  - Severity: `critical`
  - Fires after: 5 minutes

### Warning Alerts

- **KubeMirrorHighReconcileErrors**: High error rate in reconciliation
  - Threshold: >10% error rate
  - Fires after: 10 minutes

- **KubeMirrorReconcileLatencyHigh**: Slow reconciliation loops
  - Threshold: p99 latency > 5 seconds
  - Fires after: 10 minutes

- **KubeMirrorWorkqueueDepthHigh**: Work items piling up
  - Threshold: >100 items in queue
  - Fires after: 15 minutes

- **KubeMirrorLeaderElectionLost**: Controller is not the leader
  - Fires after: 2 minutes

- **KubeMirrorHighFailureRate**: Overall operation failure rate high
  - Threshold: >5% failure rate
  - Fires after: 10 minutes

- **KubeMirrorMemoryHigh**: High memory usage
  - Threshold: >90% of memory limit
  - Fires after: 5 minutes

- **KubeMirrorCPUThrottling**: CPU throttling detected
  - Fires after: 10 minutes

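To make the thresholds above concrete, here is a sketch of how one of these alerts could be written as a plain Prometheus rule and checked locally. The expression is an assumption derived from the documented threshold (more than 10% errors for 10 minutes), and the `component: kubemirror` label is assumed from the Alertmanager routing example in this document. The rule shipped in `prometheusrule.yaml` may differ. `write_example_rule` and the group name `kubemirror-example` are hypothetical.

```bash
# Write an ASSUMED version of the KubeMirrorHighReconcileErrors rule to a file.
# This is an illustration, not a copy of the shipped prometheusrule.yaml.
write_example_rule() {
  local out="$1"
  cat > "$out" <<'EOF'
groups:
  - name: kubemirror-example
    rules:
      - alert: KubeMirrorHighReconcileErrors
        expr: |
          sum(rate(controller_runtime_reconcile_errors_total[5m]))
            /
          sum(rate(controller_runtime_reconcile_total[5m])) > 0.10
        for: 10m
        labels:
          severity: warning
          component: kubemirror
        annotations:
          summary: More than 10% of reconciliations are failing
EOF
}

rule_file="$(mktemp)"
write_example_rule "$rule_file"

# Validate only when promtool is installed; otherwise just report the path.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rule_file"
else
  echo "promtool not found; rule written to $rule_file"
fi
```

`promtool check rules` expects the plain `groups:` format used here, not the `PrometheusRule` custom resource wrapper, so the rule groups have to be extracted from the manifest's `spec` before checking the real file the same way.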
## Recording Rules

Recording rules pre-compute expensive queries for better dashboard performance:

- `kubemirror:reconcile_duration_seconds:p99` - P99 reconciliation latency
- `kubemirror:reconcile_duration_seconds:p95` - P95 reconciliation latency
- `kubemirror:reconcile_duration_seconds:p50` - P50 reconciliation latency
- `kubemirror:reconcile_rate:5m` - Reconciliation rate (5m window)
- `kubemirror:reconcile_errors:rate5m` - Error rate (5m window)
- `kubemirror:workqueue_depth:max` - Max workqueue depth

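Once the rules are loaded, the recorded series can be queried like any other metric. The sketch below uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). It assumes Prometheus is reachable at `http://localhost:9090`, for example through a port-forward to `svc/prometheus-operated`. `prom_query` and the `PROM_URL` variable are hypothetical names.

```bash
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# PROM_URL defaults to a local port-forward; override it for other setups.
prom_query() {
  local query="$1"
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=${query}"
}

# Usage (not run here):
#   prom_query 'kubemirror:reconcile_duration_seconds:p99'
#   prom_query 'kubemirror:reconcile_errors:rate5m'
```

The response is JSON. An empty `result` array for a recorded series usually means the rule has not been loaded or has not been evaluated yet.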
## Grafana Dashboard

The dashboard includes the following panels:

1. **Controller Status** - Up/down status
2. **Reconciliation Rate** - Operations per second by type and result
3. **Total Workqueue Depth** - Combined queue depth across controllers
4. **Reconciliation Latency** - P99 and P95 latency trends
5. **Workqueue Depth** - Per-controller queue depth
6. **Memory Usage** - Working set vs limits
7. **CPU Usage** - CPU utilization percentage
8. **Error Rate** - Percentage of failed reconciliations
9. **Process Stats** - Goroutines and file descriptors

## Querying Metrics

### Using Prometheus UI

```promql
# Reconciliation rate (per second) by controller and result
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)

# Reconciliation errors per second by controller
sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)

# P99 reconciliation latency by controller
histogram_quantile(0.99,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)

# Current workqueue depth
workqueue_depth{name=~"secret|configmap"}
```

### Using kubectl

```bash
# Port-forward to metrics endpoint (keeps running; use a second terminal for curl)
kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080

# Fetch metrics (raw Prometheus text format)
curl http://localhost:8080/metrics
```

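The raw output can also be summarized without Prometheus. The sketch below sums a counter across all of its label sets and derives a lifetime error percentage from the two reconcile counters listed under Available Metrics. `sum_metric` and `reconcile_error_percent` are hypothetical helper names, and the parsing assumes sample lines end with the value (no trailing timestamp), which is how the Go client library normally exposes them.

```bash
# Hypothetical helper: sum one metric across all label sets.
# Reads Prometheus text exposition format on stdin.
sum_metric() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    NF >= 2 {
      metric = $1
      brace = index(metric, "{")        # strip the label set, keep the bare name
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { printf "%.0f\n", total }
  '
}

# Hypothetical helper: percentage of reconciliations that ended in an error,
# computed from counters accumulated since the process started (not a rate).
reconcile_error_percent() {
  local file="$1" errors total
  errors=$(sum_metric controller_runtime_reconcile_errors_total < "$file")
  total=$(sum_metric controller_runtime_reconcile_total < "$file")
  awk -v e="$errors" -v t="$total" \
    'BEGIN { if (t == 0) print "0.0"; else printf "%.1f\n", 100 * e / t }'
}

# Usage (not run here):
#   curl -s http://localhost:8080/metrics > /tmp/kubemirror.prom
#   reconcile_error_percent /tmp/kubemirror.prom
```

Because the counters are cumulative, this gives a since-startup figure. The windowed error rate used by the alerts still requires `rate()` in Prometheus.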
## Troubleshooting

### ServiceMonitor Not Scraping

Check whether Prometheus Operator is configured to discover ServiceMonitors in the `kubemirror-system` namespace:

```bash
# Check ServiceMonitor status
kubectl get servicemonitor -n kubemirror-system

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
```

### Alerts Not Firing

Verify that the PrometheusRule is loaded:

```bash
# Check PrometheusRule
kubectl get prometheusrule -n kubemirror-system

# Check Prometheus rules
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/rules
```

### High Memory Usage

If alerts fire for high memory:

1. Check for memory leaks in controller logs
2. Increase memory limits in Helm values:
   ```yaml
   resources:
     limits:
       memory: 1Gi
   ```
3. Reduce worker threads or max targets if necessary

### High Reconciliation Latency

If reconciliation is slow:

1. Check API server latency: `kubectl get --raw /metrics | grep apiserver_request_duration`
2. Increase worker threads in Helm values:
   ```yaml
   controller:
     workerThreads: 10
   ```
3. Review rate limiting settings if hitting API limits

## Integration with Alertmanager

To route KubeMirror alerts to specific channels:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    # Fragment: merge into your existing configuration.
    # The root route also needs a default receiver.
    route:
      routes:
        - match:
            component: kubemirror
          receiver: kubemirror-team
          continue: true

    receivers:
      - name: kubemirror-team
        slack_configs:
          - channel: '#kubemirror-alerts'
            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```

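Routing can be checked offline before rolling out a change. The sketch below wraps `amtool config routes test`, which prints the receiver(s) a given label set would reach. It assumes `amtool` is installed and that the path points at a complete `alertmanager.yml` (the route above merged into a configuration with a default receiver). `test_kubemirror_route` is a hypothetical helper name.

```bash
# Hypothetical helper: show which receiver an alert labelled
# component=kubemirror would be routed to.
# $1 is the path to a complete alertmanager.yml file.
test_kubemirror_route() {
  local config_file="$1"
  amtool config routes test --config.file="$config_file" component=kubemirror
}

# Usage (not run here): test_kubemirror_route ./alertmanager.yml
```

With the route shown above, the output would be expected to include `kubemirror-team`. Because `continue: true` is set, matching continues with later sibling routes, so additional receivers may be listed as well.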
## Best Practices

1. **Set up alerts** - Deploy PrometheusRule to catch issues early
2. **Monitor trends** - Use Grafana dashboard to spot degradation over time
3. **Baseline metrics** - Understand normal behavior during low/high load
4. **Tune resources** - Adjust CPU/memory based on actual usage patterns
5. **Avoid alert fatigue** - Tune alert thresholds to reduce false positives
6. **Retention** - Ensure Prometheus retains metrics for at least 7 days

## Further Reading

- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
- [Grafana Dashboard Best Practices](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/)
- [Controller Runtime Metrics](https://book.kubebuilder.io/reference/metrics.html)