KubeMirror Monitoring
This directory contains observability resources for monitoring KubeMirror in production.
Overview
KubeMirror exposes Prometheus metrics on port 8080 at /metrics. The monitoring stack includes:
- ServiceMonitor: Prometheus Operator resource for automatic metric scraping
- PrometheusRule: Alert rules for common operational issues
- Grafana Dashboard: Comprehensive visualization of controller metrics
Prerequisites
- Prometheus Operator installed in your cluster
- Grafana (optional, for dashboards)
# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Quick Start
Deploy Monitoring Resources
# Apply ServiceMonitor and PrometheusRule
kubectl apply -f monitoring/servicemonitor.yaml
kubectl apply -f monitoring/prometheusrule.yaml
Import Grafana Dashboard
- Via UI:
  - Open Grafana
  - Go to Dashboards → Import
  - Upload grafana-dashboard.json
  - Select your Prometheus datasource
- Via ConfigMap (GitOps):
  kubectl create configmap kubemirror-dashboard \
    --from-file=dashboard.json=monitoring/grafana-dashboard.json \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Label for automatic discovery by Grafana
  kubectl label configmap kubemirror-dashboard \
    grafana_dashboard=1 \
    -n monitoring
Available Metrics
Controller Runtime Metrics
These metrics are provided by the controller-runtime framework:
- controller_runtime_reconcile_total - Total reconciliations (by controller, result)
- controller_runtime_reconcile_errors_total - Failed reconciliations
- controller_runtime_reconcile_time_seconds - Reconciliation duration histogram
- workqueue_depth - Current workqueue depth
- workqueue_adds_total - Total items added to workqueue
- workqueue_retries_total - Workqueue retry count
Leader Election Metrics
- leader_election_master_status - Leader election status (1 = leader, 0 = follower)
Go Runtime Metrics
- go_goroutines - Current goroutine count
- go_memstats_alloc_bytes - Allocated memory
- process_open_fds - Open file descriptors
- process_cpu_seconds_total - CPU time
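To confirm that a running controller actually exposes the metric families listed above, a scraped dump can be checked against that list. The following is a minimal sketch; the function name check_metric_families is hypothetical, and it assumes the dump is in the Prometheus text exposition format, as returned by the /metrics endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a Prometheus text-format dump on stdin and
# reports, for each metric family documented above, whether any sample
# line for it is present.
check_metric_families() {
  local dump name
  dump="$(cat)"
  for name in \
    controller_runtime_reconcile_total \
    controller_runtime_reconcile_errors_total \
    controller_runtime_reconcile_time_seconds \
    workqueue_depth \
    workqueue_adds_total \
    workqueue_retries_total \
    leader_election_master_status \
    go_goroutines \
    go_memstats_alloc_bytes \
    process_open_fds \
    process_cpu_seconds_total
  do
    # A sample line starts with the family name, optionally followed by a
    # histogram suffix, and then either a label set or the value.
    if printf '%s\n' "$dump" | grep -Eq "^${name}(_bucket|_sum|_count)?(\{| )"; then
      echo "present ${name}"
    else
      echo "missing ${name}"
    fi
  done
}
```

With a port-forward to the metrics endpoint in place, it could be used as: curl -s http://localhost:8080/metrics | check_metric_families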
Alert Rules
The PrometheusRule defines alerts for:
Critical Alerts
- KubeMirrorControllerDown: Controller pod is not running
  - Severity: critical
  - Fires after: 5 minutes
Warning Alerts
- KubeMirrorHighReconcileErrors: High error rate in reconciliation
  - Threshold: >10% error rate
  - Fires after: 10 minutes
- KubeMirrorReconcileLatencyHigh: Slow reconciliation loops
  - Threshold: p99 latency > 5 seconds
  - Fires after: 10 minutes
- KubeMirrorWorkqueueDepthHigh: Work items piling up
  - Threshold: >100 items in queue
  - Fires after: 15 minutes
- KubeMirrorLeaderElectionLost: Controller is not the leader
  - Fires after: 2 minutes
- KubeMirrorHighFailureRate: Overall operation failure rate high
  - Threshold: >5% failure rate
  - Fires after: 10 minutes
- KubeMirrorMemoryHigh: High memory usage
  - Threshold: >90% of memory limit
  - Fires after: 5 minutes
- KubeMirrorCPUThrottling: CPU throttling detected
  - Fires after: 10 minutes
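The percentage thresholds above compare a ratio of two counts against a fixed limit. The sketch below works through that arithmetic in shell so a threshold can be sanity-checked by hand. Both helper names are hypothetical, and the actual alert expressions live in prometheusrule.yaml, which may compute the ratios differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints ERRORS / TOTAL as a percentage with two
# decimals. Prints 0.00 when TOTAL is zero to avoid dividing by zero.
error_rate_percent() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00" } else { printf "%.2f\n", 100 * e / t }
  }'
}

# Hypothetical helper: exits 0 when VALUE is strictly greater than
# THRESHOLD, and 1 otherwise. Both arguments are compared numerically.
exceeds_threshold() {
  awk -v v="$1" -v th="$2" 'BEGIN { exit !(v > th) }'
}
```

For example, 12 failed reconciliations out of 100 give error_rate_percent 12 100, which is above the 10% limit of KubeMirrorHighReconcileErrors, so exceeds_threshold "$(error_rate_percent 12 100)" 10 succeeds. The alert itself would still only fire once the condition has held for the listed duration.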
Recording Rules
Recording rules pre-compute expensive queries for better dashboard performance:
- kubemirror:reconcile_duration_seconds:p99 - P99 reconciliation latency
- kubemirror:reconcile_duration_seconds:p95 - P95 reconciliation latency
- kubemirror:reconcile_duration_seconds:p50 - P50 reconciliation latency
- kubemirror:reconcile_rate:5m - Reconciliation rate (5m window)
- kubemirror:reconcile_errors:rate5m - Error rate (5m window)
- kubemirror:workqueue_depth:max - Max workqueue depth
Grafana Dashboard
The dashboard includes the following panels:
- Controller Status - Up/down status
- Reconciliation Rate - Operations per second by type and result
- Total Workqueue Depth - Combined queue depth across controllers
- Reconciliation Latency - P99 and P95 latency trends
- Workqueue Depth - Per-controller queue depth
- Memory Usage - Working set vs limits
- CPU Usage - CPU utilization percentage
- Error Rate - Percentage of failed reconciliations
- Process Stats - Goroutines and file descriptors
Querying Metrics
Using Prometheus UI
# Total reconciliation rate
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)
# Error rate (failed reconciliations per second)
sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)
# P99 latency
histogram_quantile(0.99,
sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)
# Current workqueue depth
workqueue_depth{name=~"secret|configmap"}
Using kubectl
# Port-forward to metrics endpoint
kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080
# Curl metrics (raw Prometheus format)
curl http://localhost:8080/metrics
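The raw output is long, so it helps to reduce it to a single number when checking a specific signal by hand. The sketch below sums every workqueue_depth sample in a dump. The function name total_workqueue_depth is hypothetical, and it assumes sample lines carry no trailing timestamp, so the value is the last field on the line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a Prometheus text-format dump on stdin and
# prints the sum of all workqueue_depth samples across queues.
total_workqueue_depth() {
  # Match only the workqueue_depth family itself (name followed by a label
  # set or a space), not other metrics that merely share the prefix.
  awk '/^workqueue_depth[{ ]/ { sum += $NF } END { print sum + 0 }'
}
```

With the port-forward above running, it could be used as: curl -s http://localhost:8080/metrics | total_workqueue_depth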
Troubleshooting
ServiceMonitor Not Scraping
Check whether the Prometheus instance managed by Prometheus Operator is configured to select ServiceMonitors in the kubemirror-system namespace:
# Check ServiceMonitor status
kubectl get servicemonitor -n kubemirror-system
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
Alerts Not Firing
Verify PrometheusRule is loaded:
# Check PrometheusRule
kubectl get prometheusrule -n kubemirror-system
# Check Prometheus rules
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/rules
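The same check can be made without a browser by reading the Prometheus HTTP API endpoint /api/v1/rules and listing the KubeMirror alert names it reports. This is a rough sketch: the function name list_kubemirror_alerts is hypothetical, and it assumes the API returns compact JSON with no whitespace between keys and values. A real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the JSON body of /api/v1/rules on stdin and
# prints the unique rule names that start with "KubeMirror", one per line.
list_kubemirror_alerts() {
  # Extract every "name":"KubeMirror..." pair, then keep only the value,
  # which is the fourth field when splitting on double quotes.
  grep -o '"name":"KubeMirror[A-Za-z]*"' | cut -d'"' -f4 | sort -u
}
```

With the port-forward above running, it could be used as: curl -s http://localhost:9090/api/v1/rules | list_kubemirror_alerts. An empty result suggests the PrometheusRule has not been loaded by Prometheus.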
High Memory Usage
If alerts fire for high memory:
- Check for memory leaks in controller logs
- Increase memory limits in Helm values:
  resources:
    limits:
      memory: 1Gi
- Reduce worker threads or max targets if necessary
High Reconciliation Latency
If reconciliation is slow:
- Check API server latency:
  kubectl get --raw /metrics | grep apiserver_request_duration
- Increase worker threads in Helm values:
  controller:
    workerThreads: 10
- Review rate limiting settings if hitting API limits
Integration with Alertmanager
To route KubeMirror alerts to specific channels:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    route:
      routes:
        - match:
            component: kubemirror
          receiver: kubemirror-team
          continue: true
    receivers:
      - name: kubemirror-team
        slack_configs:
          - channel: '#kubemirror-alerts'
            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
Best Practices
- Set up alerts - Deploy PrometheusRule to catch issues early
- Monitor trends - Use Grafana dashboard to spot degradation over time
- Baseline metrics - Understand normal behavior during low/high load
- Tune resources - Adjust CPU/memory based on actual usage patterns
- Avoid alert fatigue - Tune alert thresholds to reduce false positives
- Retention - Ensure Prometheus retains metrics for at least 7 days