lukaszraczylo/kubemirror

mirror of https://github.com/lukaszraczylo/kubemirror.git synced 2026-07-22 02:42:14 +00:00

KubeMirror Monitoring

This directory contains observability resources for monitoring KubeMirror in production.

Overview

KubeMirror exposes Prometheus metrics on port 8080 at /metrics. The monitoring stack includes:

ServiceMonitor: Prometheus Operator resource for automatic metric scraping
PrometheusRule: Alert rules for common operational issues
Grafana Dashboard: Comprehensive visualization of controller metrics

Prerequisites

Prometheus Operator installed in your cluster
Grafana (optional, for dashboards)

# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
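
Before applying the manifests in this directory, it can help to confirm that the Prometheus Operator CRDs are registered. The following is a minimal sketch, assuming the operator uses its standard `monitoring.coreos.com` API group; the function name `check_monitoring_crds` is illustrative and not part of KubeMirror.

```shell
# Hypothetical helper: succeeds only if both CRDs used by this directory exist.
check_monitoring_crds() {
  kubectl get crd \
    servicemonitors.monitoring.coreos.com \
    prometheusrules.monitoring.coreos.com
}

# Usage:
#   check_monitoring_crds || echo "Prometheus Operator CRDs are missing"
```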

Quick Start

Deploy Monitoring Resources

# Apply ServiceMonitor and PrometheusRule
kubectl apply -f monitoring/servicemonitor.yaml
kubectl apply -f monitoring/prometheusrule.yaml

Import Grafana Dashboard

Via UI:
- Open Grafana
- Go to Dashboards → Import
- Upload grafana-dashboard.json
- Select your Prometheus datasource

Via ConfigMap (GitOps):

kubectl create configmap kubemirror-dashboard \
  --from-file=dashboard.json=monitoring/grafana-dashboard.json \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Label the ConfigMap so the Grafana dashboard sidecar discovers it automatically
kubectl label configmap kubemirror-dashboard \
  grafana_dashboard=1 \
  -n monitoring

Available Metrics

Controller Runtime Metrics

These metrics are provided by the controller-runtime framework:

controller_runtime_reconcile_total - Total reconciliations (by controller, result)
controller_runtime_reconcile_errors_total - Failed reconciliations
controller_runtime_reconcile_time_seconds - Reconciliation duration histogram
workqueue_depth - Current workqueue depth
workqueue_adds_total - Total items added to workqueue
workqueue_retries_total - Workqueue retry count

Leader Election Metrics

leader_election_master_status - Leader election status (1 = leader, 0 = follower)

Go Runtime Metrics

go_goroutines - Current goroutine count
go_memstats_alloc_bytes - Heap memory currently allocated and in use, in bytes
process_open_fds - Open file descriptors
process_cpu_seconds_total - Total user and system CPU time consumed, in seconds

Alert Rules

The PrometheusRule defines alerts for:

Critical Alerts

KubeMirrorControllerDown: Controller pod is not running
- Severity: critical
- Fires after: 5 minutes

Warning Alerts

KubeMirrorHighReconcileErrors: High error rate in reconciliation
- Threshold: >10% error rate
- Fires after: 10 minutes
KubeMirrorReconcileLatencyHigh: Slow reconciliation loops
- Threshold: p99 latency > 5 seconds
- Fires after: 10 minutes
KubeMirrorWorkqueueDepthHigh: Work items piling up
- Threshold: >100 items in queue
- Fires after: 15 minutes
KubeMirrorLeaderElectionLost: Controller is not the leader
- Fires after: 2 minutes
KubeMirrorHighFailureRate: Overall operation failure rate high
- Threshold: >5% failure rate
- Fires after: 10 minutes
KubeMirrorMemoryHigh: High memory usage
- Threshold: >90% of memory limit
- Fires after: 5 minutes
KubeMirrorCPUThrottling: CPU throttling detected
- Fires after: 10 minutes
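
To compare the "Fires after" values above with what is actually deployed, the alert names and their `for:` durations can be pulled straight from the manifest. This is a rough sketch that assumes the conventional PrometheusRule layout, where a `- alert: Name` line is followed by a `for: 5m` line. The helper name `list_alert_durations` is made up for illustration, and alerts without a `for:` field are not printed.

```shell
# Print "<alert name> <for duration>" for each alerting rule in a manifest.
# The default path matches the one used in Quick Start; pass another as $1.
list_alert_durations() {
  awk '
    $1 == "-" && $2 == "alert:" { name = $3 }                  # remember the alert name
    $1 == "for:" && name != ""  { print name, $2; name = "" }  # pair it with its duration
  ' "${1:-monitoring/prometheusrule.yaml}"
}

# Usage:
#   list_alert_durations monitoring/prometheusrule.yaml
```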

Recording Rules

Recording rules pre-compute expensive queries for better dashboard performance:

kubemirror:reconcile_duration_seconds:p99 - P99 reconciliation latency
kubemirror:reconcile_duration_seconds:p95 - P95 reconciliation latency
kubemirror:reconcile_duration_seconds:p50 - P50 reconciliation latency
kubemirror:reconcile_rate:5m - Reconciliation rate (5m window)
kubemirror:reconcile_errors:rate5m - Error rate (5m window)
kubemirror:workqueue_depth:max - Max workqueue depth

Grafana Dashboard

The dashboard includes the following panels:

Controller Status - Up/down status
Reconciliation Rate - Operations per second by type and result
Total Workqueue Depth - Combined queue depth across controllers
Reconciliation Latency - P99 and P95 latency trends
Workqueue Depth - Per-controller queue depth
Memory Usage - Working set vs limits
CPU Usage - CPU utilization percentage
Error Rate - Percentage of failed reconciliations
Process Stats - Goroutines and file descriptors

Querying Metrics

Using Prometheus UI

# Total reconciliation rate
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)

# Reconciliation errors per second, by controller
sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)

# P99 latency
histogram_quantile(0.99,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)

# Current workqueue depth
workqueue_depth{name=~"secret|configmap"}
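
The same queries can be run from a terminal through the Prometheus HTTP API instead of the UI. The sketch below assumes Prometheus is reachable on `http://localhost:9090` (for example through a port-forward); `PROM_URL` and `prom_query` are illustrative names, not part of KubeMirror.

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print the JSON response.
# PROM_URL is an assumed variable; it defaults to a local port-forward on 9090.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example calls (these need a reachable Prometheus):
#   prom_query 'kubemirror:reconcile_duration_seconds:p99'
#   prom_query 'sum(rate(controller_runtime_reconcile_errors_total[5m])) / sum(rate(controller_runtime_reconcile_total[5m]))'
```

The second example divides the error rate by the total reconciliation rate, which gives the error ratio as a fraction between 0 and 1.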

Using kubectl

# Port-forward to metrics endpoint
kubectl port-forward -n kubemirror-system svc/kubemirror-controller-metrics 8080:8080

# Curl metrics (raw Prometheus format)
curl http://localhost:8080/metrics
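
For a quick error ratio without Prometheus, the raw output can be summed with awk. This is a rough sketch: it uses the counters' cumulative values since the controller started, not the windowed rates the recording rules use, and the function name `reconcile_error_ratio` is made up for illustration.

```shell
# Read Prometheus text-format metrics on stdin and print
# (sum of reconcile errors) / (sum of reconciliations) across all controllers.
reconcile_error_ratio() {
  awk '
    /^controller_runtime_reconcile_errors_total[{ ]/ { errors += $NF }
    /^controller_runtime_reconcile_total[{ ]/        { total  += $NF }
    END {
      if (total > 0) printf "%.4f\n", errors / total
      else print "no reconciliations recorded"
    }
  '
}

# Usage (with the port-forward above running):
#   curl -s http://localhost:8080/metrics | reconcile_error_ratio
```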

Troubleshooting

ServiceMonitor Not Scraping

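Confirm that the ServiceMonitor exists and that your Prometheus instance is configured to select ServiceMonitors from the kubemirror-system namespace (through its serviceMonitorSelector and serviceMonitorNamespaceSelector), then check the target list: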

# Check ServiceMonitor status
kubectl get servicemonitor -n kubemirror-system

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
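
If the target never appears, a label mismatch is a common cause: kube-prometheus-stack, by default, only selects ServiceMonitors that carry its Helm release label. The sketch below prints both sides of that match so they can be compared by eye. The function name is illustrative, and the namespace default is taken from the commands above.

```shell
# Hypothetical helper: show ServiceMonitor labels next to the selectors Prometheus uses.
show_servicemonitor_selection() {
  ns="${1:-kubemirror-system}"
  # Labels on the ServiceMonitor that Prometheus would have to match.
  kubectl get servicemonitor -n "$ns" --show-labels
  # Selectors configured on every Prometheus custom resource in the cluster.
  kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\t"}{.spec.serviceMonitorNamespaceSelector}{"\n"}{end}'
}

# Usage:
#   show_servicemonitor_selection kubemirror-system
```

An empty selector (`{}`) matches everything, while a `matchLabels` entry means the ServiceMonitor needs that exact label.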

Alerts Not Firing

Verify that the PrometheusRule exists and that Prometheus has loaded its rules:

# Check PrometheusRule
kubectl get prometheusrule -n kubemirror-system

# Check Prometheus rules
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/rules

High Memory Usage

If alerts fire for high memory:

Check for memory leaks in controller logs
Increase memory limits in Helm values:
```
resources:
  limits:
    memory: 1Gi
```
Reduce worker threads or max targets if necessary

High Reconciliation Latency

If reconciliation is slow:

Check API server latency: kubectl get --raw /metrics | grep apiserver_request_duration
Increase worker threads in Helm values:
```
controller:
  workerThreads: 10
```
Review rate limiting settings if hitting API limits

Integration with Alertmanager

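To route KubeMirror alerts to a dedicated channel, add a child route that matches them and a receiver for that route. The snippet shows only the KubeMirror-specific parts and assumes the alerts carry a component: kubemirror label; a complete Alertmanager configuration also needs a default receiver on the root route: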

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    route:
      routes:
        - match:
            component: kubemirror
          receiver: kubemirror-team
          continue: true

    receivers:
      - name: kubemirror-team
        slack_configs:
          - channel: '#kubemirror-alerts'
            api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

Best Practices

Set up alerts - Deploy PrometheusRule to catch issues early
Monitor trends - Use Grafana dashboard to spot degradation over time
Baseline metrics - Understand normal behavior during low/high load
Tune resources - Adjust CPU/memory based on actual usage patterns
Alert fatigue - Tune alert thresholds to reduce false positives
Retention - Ensure Prometheus retains metrics for at least 7 days
