kubemirror/monitoring/README.md

# KubeMirror Monitoring

This directory contains observability resources for monitoring KubeMirror in production.

## Overview

KubeMirror exposes Prometheus metrics on port 8080 at `/metrics`. The monitoring stack includes:

- **ServiceMonitor**: Prometheus Operator resource for automatic metric scraping
- **PrometheusRule**: Alert rules for common operational issues
- **Grafana Dashboard**: Comprehensive visualization of controller metrics

## Prerequisites

- Prometheus Operator installed in your cluster
- Grafana (optional, for dashboards)

```bash
# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```

## Quick Start

### Deploy Monitoring Resources

```bash
# Apply ServiceMonitor and PrometheusRule
kubectl apply -f monitoring/servicemonitor.yaml
kubectl apply -f monitoring/prometheusrule.yaml
```

### Import Grafana Dashboard

1. **Via UI:**
   - Open Grafana
   - Go to Dashboards → Import
   - Upload `grafana-dashboard.json`
   - Select your Prometheus datasource

2. **Via ConfigMap (GitOps):**
   ```bash
   kubectl create configmap kubemirror-dashboard \
     --from-file=dashboard.json=monitoring/grafana-dashboard.json \
     -n monitoring \
     --dry-run=client -o yaml | kubectl apply -f -

   # Label for automatic discovery by Grafana
   kubectl label configmap kubemirror-dashboard \
     grafana_dashboard=1 \
     -n monitoring
   ```

## Available Metrics

### Controller Runtime Metrics

These metrics are provided by the controller-runtime framework:

- `controller_runtime_reconcile_total` - Total reconciliations (by controller, result)
- `controller_runtime_reconcile_errors_total` - Failed reconciliations
- `controller_runtime_reconcile_time_seconds` - Reconciliation duration histogram
- `workqueue_depth` - Current workqueue depth
- `workqueue_adds_total` - Total items added to workqueue
- `workqueue_retries_total` - Workqueue retry count

### Leader Election Metrics

- `leader_election_master_status` - Leader election status (1 = leader, 0 = follower)

### Go Runtime Metrics

- `go_goroutines` - Current goroutine count
- `go_memstats_alloc_bytes` - Allocated memory
- `process_open_fds` - Open file descriptors
- `process_cpu_seconds_total` - CPU time

## Alert Rules

The PrometheusRule defines alerts for:

### Critical Alerts

- **KubeMirrorControllerDown**: Controller pod is not running
  - Severity: `critical`
  - Fires after: 5 minutes

### Warning Alerts

- **KubeMirrorHighReconcileErrors**: High error rate in reconciliation
  - Threshold: >10% error rate
  - Fires after: 10 minutes

- **KubeMirrorReconcileLatencyHigh**: Slow reconciliation loops
  - Threshold: p99 latency > 5 seconds
  - Fires after: 10 minutes

- **KubeMirrorWorkqueueDepthHigh**: Work items piling up
  - Threshold: >100 items in queue
  - Fires after: 15 minutes

- **KubeMirrorLeaderElectionLost**: Controller is not the leader
  - Fires after: 2 minutes

- **KubeMirrorHighFailureRate**: Overall operation failure rate high
  - Threshold: >5% failure rate
  - Fires after: 10 minutes

- **KubeMirrorMemoryHigh**: High memory usage
  - Threshold: >90% of memory limit
  - Fires after: 5 minutes

- **KubeMirrorCPUThrottling**: CPU throttling detected
  - Fires after: 10 minutes

## Recording Rules

Recording rules pre-compute expensive queries for better dashboard performance:

- `kubemirror:reconcile_duration_seconds:p99` - P99 reconciliation latency
- `kubemirror:reconcile_duration_seconds:p95` - P95 reconciliation latency
- `kubemirror:reconcile_duration_seconds:p50` - P50 reconciliation latency
- `kubemirror:reconcile_rate:5m` - Reconciliation rate (5m window)
- `kubemirror:reconcile_errors:rate5m` - Error rate (5m window)
- `kubemirror:workqueue_depth:max` - Max workqueue depth

## Grafana Dashboard

The dashboard includes the following panels:

1. **Controller Status** - Up/down status
2. **Reconciliation Rate** - Operations per second by type and result
3. **Total Workqueue Depth** - Combined queue depth across controllers
4. **Reconciliation Latency** - P99 and P95 latency trends
5. **Workqueue Depth** - Per-controller queue depth
6. **Memory Usage** - Working set vs limits
7. **CPU Usage** - CPU utilization percentage
8. **Error Rate** - Percentage of failed reconciliations
9. **Process Stats** - Goroutines and file descriptors

## Querying Metrics

### Using Prometheus UI

```promql
# Total reconciliation rate
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)

# Error rate
sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)

# P99 latency
histogram_quantile(0.99,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)

# Current workqueue depth
workqueue_depth{name=~"secret|configmap"}
```

### Using kubectl

```bash
# Port-forward to metrics endpoint
kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080

# Curl metrics (raw Prometheus format)
curl http://localhost:8080/metrics
```

## Troubleshooting

### ServiceMonitor Not Scraping

Check if Prometheus Operator is configured to discover ServiceMonitors in the kubemirror-system namespace:

```bash
# Check ServiceMonitor status
kubectl get servicemonitor -n kubemirror-system

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
```

### Alerts Not Firing

Verify PrometheusRule is loaded:

```bash
# Check PrometheusRule
kubectl get prometheusrule -n kubemirror-system

# Check Prometheus rules
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/rules
```

### High Memory Usage

If alerts fire for high memory:

1. Check for memory leaks in controller logs
2. Increase memory limits in Helm values:
   ```yaml
   resources:
     limits:
       memory: 1Gi
   ```
3. Reduce worker threads or max targets if necessary

### High Reconciliation Latency

If reconciliation is slow:

1. Check API server latency: `kubectl get --raw /metrics | grep apiserver_request_duration`
2. Increase worker threads in Helm values:
   ```yaml
   controller:
     workerThreads: 10
   ```
3. Review rate limiting settings if hitting API limits

## Integration with Alertmanager

To route KubeMirror alerts to specific channels:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    route:
      routes:
        - match:
            component: kubemirror
          receiver: kubemirror-team
          continue: true

    receivers:
      - name: kubemirror-team
        slack_configs:
          - channel: '#kubemirror-alerts'
            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```

## Best Practices

1. **Set up alerts** - Deploy PrometheusRule to catch issues early
2. **Monitor trends** - Use Grafana dashboard to spot degradation over time
3. **Baseline metrics** - Understand normal behavior during low/high load
4. **Tune resources** - Adjust CPU/memory based on actual usage patterns
5. **Alert fatigue** - Tune alert thresholds to reduce false positives
6. **Retention** - Ensure Prometheus retains metrics for at least 7 days

## Further Reading

- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
- [Grafana Dashboard Best Practices](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/)
- [Controller Runtime Metrics](https://book.kubebuilder.io/reference/metrics.html)