KubeMirror Monitoring

This directory contains observability resources for monitoring KubeMirror in production.

Overview

KubeMirror exposes Prometheus metrics on port 8080 at /metrics. The monitoring stack includes:

  • ServiceMonitor: Prometheus Operator resource for automatic metric scraping
  • PrometheusRule: Alert rules for common operational issues
  • Grafana Dashboard: Comprehensive visualization of controller metrics

Prerequisites

  • Prometheus Operator installed in your cluster
  • Grafana (optional, for dashboards)

# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Quick Start

Deploy Monitoring Resources

# Apply ServiceMonitor and PrometheusRule
kubectl apply -f monitoring/servicemonitor.yaml
kubectl apply -f monitoring/prometheusrule.yaml

Import Grafana Dashboard

  1. Via UI:

    • Open Grafana
    • Go to Dashboards → Import
    • Upload grafana-dashboard.json
    • Select your Prometheus datasource
  2. Via ConfigMap (GitOps):

    kubectl create configmap kubemirror-dashboard \
      --from-file=dashboard.json=monitoring/grafana-dashboard.json \
      -n monitoring \
      --dry-run=client -o yaml | kubectl apply -f -
    
    # Label for automatic discovery by the Grafana dashboard sidecar
    kubectl label configmap kubemirror-dashboard \
      grafana_dashboard=1 \
      -n monitoring
    

Available Metrics

Controller Runtime Metrics

These metrics are provided by the controller-runtime framework:

  • controller_runtime_reconcile_total - Total reconciliations (by controller, result)
  • controller_runtime_reconcile_errors_total - Failed reconciliations
  • controller_runtime_reconcile_time_seconds - Reconciliation duration histogram
  • workqueue_depth - Current workqueue depth
  • workqueue_adds_total - Total items added to workqueue
  • workqueue_retries_total - Workqueue retry count

Leader Election Metrics

  • leader_election_master_status - Leader election status (1 = leader, 0 = follower)

Go Runtime Metrics

  • go_goroutines - Current goroutine count
  • go_memstats_alloc_bytes - Allocated memory
  • process_open_fds - Open file descriptors
  • process_cpu_seconds_total - CPU time

Alert Rules

The PrometheusRule defines alerts for:

Critical Alerts

  • KubeMirrorControllerDown: Controller pod is not running
    • Severity: critical
    • Fires after: 5 minutes

Warning Alerts

  • KubeMirrorHighReconcileErrors: High error rate in reconciliation

    • Threshold: >10% error rate
    • Fires after: 10 minutes
  • KubeMirrorReconcileLatencyHigh: Slow reconciliation loops

    • Threshold: p99 latency > 5 seconds
    • Fires after: 10 minutes
  • KubeMirrorWorkqueueDepthHigh: Work items piling up

    • Threshold: >100 items in queue
    • Fires after: 15 minutes
  • KubeMirrorLeaderElectionLost: Controller is not the leader

    • Fires after: 2 minutes
  • KubeMirrorHighFailureRate: Overall operation failure rate high

    • Threshold: >5% failure rate
    • Fires after: 10 minutes
  • KubeMirrorMemoryHigh: High memory usage

    • Threshold: >90% of memory limit
    • Fires after: 5 minutes
  • KubeMirrorCPUThrottling: CPU throttling detected

    • Fires after: 10 minutes

Recording Rules

Recording rules pre-compute expensive queries for better dashboard performance:

  • kubemirror:reconcile_duration_seconds:p99 - P99 reconciliation latency
  • kubemirror:reconcile_duration_seconds:p95 - P95 reconciliation latency
  • kubemirror:reconcile_duration_seconds:p50 - P50 reconciliation latency
  • kubemirror:reconcile_rate:5m - Reconciliation rate (5m window)
  • kubemirror:reconcile_errors:rate5m - Error rate (5m window)
  • kubemirror:workqueue_depth:max - Max workqueue depth

Grafana Dashboard

The dashboard includes the following panels:

  1. Controller Status - Up/down status
  2. Reconciliation Rate - Operations per second by type and result
  3. Total Workqueue Depth - Combined queue depth across controllers
  4. Reconciliation Latency - P99 and P95 latency trends
  5. Workqueue Depth - Per-controller queue depth
  6. Memory Usage - Working set vs limits
  7. CPU Usage - CPU utilization percentage
  8. Error Rate - Percentage of failed reconciliations
  9. Process Stats - Goroutines and file descriptors

Querying Metrics

Using Prometheus UI

# Total reconciliation rate
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)

# Error rate
sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)

# P99 latency
histogram_quantile(0.99,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)

# Current workqueue depth
workqueue_depth{name=~"secret|configmap"}
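
The same expressions can also be run from a terminal through the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through the port-forward to svc/prometheus-operated shown under Troubleshooting). The function name prom_query and the PROM_URL variable are illustrative and not part of KubeMirror:

```shell
# Base URL of the Prometheus server (assumption: a local port-forward on 9090).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant PromQL query and print the raw JSON response.
# -G sends the URL-encoded expression as a GET query parameter.
prom_query() {
  curl -sS -G --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Example:
#   prom_query 'sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)'
```

Passing the expression through --data-urlencode avoids having to escape brackets and quotes by hand.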

Using kubectl

# Port-forward to metrics endpoint
kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080

# Curl metrics (raw Prometheus format)
curl http://localhost:8080/metrics
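
The raw output is long, so it can help to reduce it to one number. The sketch below sums the two reconcile counters from the text format and prints errors divided by total. It assumes the samples carry no trailing timestamp, so the value is the last field on each line. Because counters accumulate from process start, the result is a lifetime ratio, not the windowed rate the alerts use. The function name reconcile_error_ratio is illustrative:

```shell
# reconcile_error_ratio: read Prometheus text-format metrics on stdin and
# print the ratio of failed reconciliations to all reconciliations,
# summed across controllers. Prints "n/a" if no reconciliations were seen.
reconcile_error_ratio() {
  awk '
    # Match the metric name followed by "{" (labels) or a space (no labels),
    # so that HELP/TYPE comment lines and longer metric names are skipped.
    /^controller_runtime_reconcile_errors_total[{ ]/ { errors += $NF }
    /^controller_runtime_reconcile_total[{ ]/        { total  += $NF }
    END {
      if (total > 0) printf "%.4f\n", errors / total
      else print "n/a"
    }
  '
}

# Example (with the port-forward above running):
#   curl -s http://localhost:8080/metrics | reconcile_error_ratio
```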

Troubleshooting

ServiceMonitor Not Scraping

Check whether the Prometheus instance managed by the Operator is configured to select ServiceMonitors in the kubemirror-system namespace:

# Check ServiceMonitor status
kubectl get servicemonitor -n kubemirror-system

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
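
If the target does not appear, a common cause is a selector mismatch: the Prometheus resource only picks up ServiceMonitors whose labels and namespace match its serviceMonitorSelector and serviceMonitorNamespaceSelector. With the kube-prometheus-stack chart, the default selector typically requires a release label equal to the Helm release name (prometheus in the install command above), though your values may differ. The helper names below are illustrative:

```shell
# show_servicemonitor_selectors: print each Prometheus resource with the
# label and namespace selectors it uses to pick up ServiceMonitors.
show_servicemonitor_selectors() {
  kubectl get prometheus --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n  serviceMonitorSelector: "}{.spec.serviceMonitorSelector}{"\n  serviceMonitorNamespaceSelector: "}{.spec.serviceMonitorNamespaceSelector}{"\n"}{end}'
}

# show_servicemonitor_labels [NAMESPACE]: print the labels on the
# ServiceMonitors in the namespace (default: kubemirror-system).
show_servicemonitor_labels() {
  kubectl get servicemonitor -n "${1:-kubemirror-system}" --show-labels
}
```

Compare the two outputs: every label required by the selector must be present on the KubeMirror ServiceMonitor.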

Alerts Not Firing

Verify PrometheusRule is loaded:

# Check PrometheusRule
kubectl get prometheusrule -n kubemirror-system

# Check Prometheus rules
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/rules

High Memory Usage

If alerts fire for high memory:

  1. Check controller logs and the go_goroutines and go_memstats_alloc_bytes trends for signs of a memory leak
  2. Increase memory limits in Helm values:
    resources:
      limits:
        memory: 1Gi
    
  3. Reduce worker threads or max targets if necessary

High Reconciliation Latency

If reconciliation is slow:

  1. Check API server latency: kubectl get --raw /metrics | grep apiserver_request_duration
  2. Increase worker threads in Helm values:
    controller:
      workerThreads: 10
    
  3. Review rate limiting settings if hitting API limits

Integration with Alertmanager

To route KubeMirror alerts to specific channels:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
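    route:
      # The root route requires a default receiver; "default" is a placeholder name.
      receiver: default
      routes:
        - match:
            component: kubemirror
          receiver: kubemirror-team
          continue: true

    receivers:
      - name: default
      - name: kubemirror-team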
        slack_configs:
          - channel: '#kubemirror-alerts'
            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
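
Before rolling out a routing change, it can be checked offline with amtool, the CLI that ships with Alertmanager. This is a sketch, assuming amtool is installed locally and that the alertmanager.yml content has been saved to a standalone file (not the ConfigMap wrapper). The function names are illustrative:

```shell
# validate_alertmanager_config FILE: syntax-check an Alertmanager config file.
validate_alertmanager_config() {
  amtool check-config "$1"
}

# test_kubemirror_route FILE: print the receiver(s) that an alert labelled
# component=kubemirror would be routed to under the given config.
test_kubemirror_route() {
  amtool config routes test --config.file="$1" component=kubemirror
}

# Example:
#   validate_alertmanager_config alertmanager.yml
#   test_kubemirror_route alertmanager.yml
```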

Best Practices

  1. Set up alerts - Deploy PrometheusRule to catch issues early
  2. Monitor trends - Use Grafana dashboard to spot degradation over time
  3. Baseline metrics - Understand normal behavior during low/high load
  4. Tune resources - Adjust CPU/memory based on actual usage patterns
  5. Avoid alert fatigue - Tune alert thresholds to reduce false positives
  6. Plan retention - Ensure Prometheus retains metrics for at least 7 days

Further Reading