Metrics Collection & Monitoring

Comprehensive guide to metrics collection, visualization, and analysis in the OVES ecosystem.

Overview

Metrics provide quantitative measurements of system behavior, performance, and health. The OVES metrics infrastructure uses Prometheus for collection and storage, and Grafana for visualization and alerting.

Metrics Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Metric Sources                             │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ Kubernetes   │  │ Applications │  │ Databases    │          │
│  │ Components   │  │ /metrics     │  │ Exporters    │          │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
└─────────┼──────────────────┼──────────────────┼─────────────────┘
          │                  │                  │
          │                  │                  │
          ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Service Discovery                            │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │              Prometheus Server                           │  │
│  │  - Kubernetes SD (ServiceMonitor, PodMonitor)            │  │
│  │  - Static Configs                                        │  │
│  │  - Scrape Targets Every 30s                              │  │
│  └──────────────────────┬───────────────────────────────────┘  │
└────────────────────────┼──────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Time-Series Storage                          │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │              Prometheus TSDB                             │  │
│  │  - 15 days retention                                     │  │
│  │  - 100GB storage                                         │  │
│  │  - Compression enabled                                   │  │
│  └──────────────────────┬───────────────────────────────────┘  │
└────────────────────────┼──────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Visualization & Alerting                     │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │   Grafana    │  │ AlertManager │  │   Prometheus │          │
│  │  Dashboards  │  │  Routing     │  │   Alerts     │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└─────────────────────────────────────────────────────────────────┘

Prometheus Setup

Prometheus Configuration

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster: oves-prod
    environment: production

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes nodes
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Kubernetes pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Service endpoints (annotation-based discovery)
  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
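
For the annotation-driven pod and service jobs above to discover a target, the workload must advertise its metrics endpoint. A minimal sketch of the expected annotations (the image name, port, and path here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: account-microservice
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "3000"
spec:
  containers:
    - name: app
      image: oves/account-microservice:latest   # illustrative image
      ports:
        - containerPort: 3000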

ServiceMonitor Example

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: account-microservice
  namespace: monitoring
  labels:
    app: account-microservice
    release: prometheus
spec:
  selector:
    matchLabels:
      app: account-microservice
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
  namespaceSelector:
    matchNames:
      - production
      - staging
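
The ServiceMonitor selects Services by label and scrapes the endpoint port by name, so the backing Service must expose a port literally named metrics. A sketch of a compatible Service (the port numbers are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: account-microservice
  namespace: production
  labels:
    app: account-microservice   # matched by spec.selector.matchLabels above
spec:
  selector:
    app: account-microservice
  ports:
    - name: metrics             # must match spec.endpoints[].port in the ServiceMonitor
      port: 3000
      targetPort: 3000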

Metric Types

1. Counter

Purpose: Cumulative metric that only increases (or resets to zero)

Use Cases:
- Request counts
- Error counts
- Task completions
- Bytes transferred

Example:

// prom-client is the standard Prometheus client library for Node.js
const promClient = require('prom-client');

const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Increment counter
httpRequestsTotal.inc({ method: 'GET', route: '/api/accounts', status_code: '200' });

PromQL Queries:

# Rate of requests per second
rate(http_requests_total[5m])

# Total requests in last hour
increase(http_requests_total[1h])

# Requests per minute by status code
sum(rate(http_requests_total[1m])) by (status_code)

2. Gauge

Purpose: Metric that can go up or down

Use Cases:
- Current memory usage
- Active connections
- Queue size
- Temperature readings

Example:

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active database connections',
  labelNames: ['database']
});

// Set gauge value
activeConnections.set({ database: 'mongodb' }, 25);

// Increment/decrement
activeConnections.inc({ database: 'mongodb' });
activeConnections.dec({ database: 'mongodb' });

PromQL Queries:

# Current value
active_connections

# Average over time
avg_over_time(active_connections[5m])

# Max value in last hour
max_over_time(active_connections[1h])

3. Histogram

Purpose: Samples observations and counts them in configurable buckets

Use Cases:
- Request duration
- Response size
- Query execution time

Example:

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

// Observe value
const end = httpRequestDuration.startTimer();
// ... process request ...
end({ method: 'GET', route: '/api/accounts', status_code: '200' });

PromQL Queries:

# 95th percentile response time
histogram_quantile(0.95, 
  rate(http_request_duration_seconds_bucket[5m])
)

# Average response time
rate(http_request_duration_seconds_sum[5m]) / 
rate(http_request_duration_seconds_count[5m])

# Requests slower than 1 second (total minus those at or under the 1s bucket)
sum(rate(http_request_duration_seconds_count[5m])) -
sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))

4. Summary

Purpose: Similar to a histogram, but quantiles are calculated on the client side and therefore cannot be aggregated across instances

Use Cases:
- Request latencies
- Response times when you need exact quantiles

Example:

const requestLatency = new promClient.Summary({
  name: 'request_latency_seconds',
  help: 'Request latency in seconds',
  labelNames: ['service'],
  percentiles: [0.5, 0.9, 0.95, 0.99]
});

// Observe value
requestLatency.observe({ service: 'account' }, 0.234);
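
PromQL Queries:

Summaries need no histogram_quantile(): each configured percentile is exposed directly as a quantile-labeled series, alongside _sum and _count.

# Pre-computed 95th percentile (per instance; not aggregatable)
request_latency_seconds{service="account", quantile="0.95"}

# Average latency from sum and count
rate(request_latency_seconds_sum[5m]) /
rate(request_latency_seconds_count[5m])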

Key Metrics by Component

Application Metrics

HTTP Metrics

# Request rate
http_requests_total

# Request duration
http_request_duration_seconds

# Request size
http_request_size_bytes

# Response size
http_response_size_bytes

# Active requests
http_requests_in_flight

Database Metrics

# Connection pool
db_connections_active
db_connections_idle
db_connections_max

# Query performance
db_query_duration_seconds
db_queries_total
db_query_errors_total

# Operations
db_operations_total{operation="insert|update|delete|find"}

Business Metrics

# User activity
user_registrations_total
user_logins_total
user_sessions_active

# Transactions
transactions_total{status="success|failed"}
transaction_amount_total
transaction_duration_seconds

# API usage
api_calls_total{endpoint="/api/accounts"}
api_quota_remaining
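
Business metrics are instrumented the same way as the HTTP metrics above; a minimal sketch recording transaction outcomes, using the metric and label names from this list:

const transactionsTotal = new promClient.Counter({
  name: 'transactions_total',
  help: 'Total number of processed transactions',
  labelNames: ['status']
});

// Record one outcome per completed transaction
transactionsTotal.inc({ status: 'success' });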

Kubernetes Metrics

Node Metrics

# CPU usage
node_cpu_seconds_total

# Memory usage
node_memory_MemAvailable_bytes
node_memory_MemTotal_bytes

# Disk usage
node_filesystem_avail_bytes
node_filesystem_size_bytes

# Network I/O
node_network_receive_bytes_total
node_network_transmit_bytes_total
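
These raw counters are usually consumed as derived expressions; for example, per-node CPU utilization as a percentage:

# CPU utilization percentage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)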

Pod Metrics

# Pod status
kube_pod_status_phase{phase="Running|Pending|Failed"}

# Container restarts
kube_pod_container_status_restarts_total

# Resource requests/limits
kube_pod_container_resource_requests{resource="cpu|memory"}
kube_pod_container_resource_limits{resource="cpu|memory"}

# Actual usage
container_cpu_usage_seconds_total
container_memory_working_set_bytes

Deployment Metrics

# Replica status
kube_deployment_status_replicas
kube_deployment_status_replicas_available
kube_deployment_status_replicas_unavailable

# Deployment conditions
kube_deployment_status_condition{condition="Available|Progressing"}

Database Exporters

MongoDB Exporter

# Connections
mongodb_connections{state="current|available"}

# Operations
mongodb_op_counters_total{type="insert|query|update|delete"}

# Replication lag
mongodb_mongod_replset_member_replication_lag

# Memory
mongodb_memory{type="resident|virtual|mapped"}

Redis Exporter

# Connected clients
redis_connected_clients

# Memory usage
redis_memory_used_bytes
redis_memory_max_bytes

# Hit rate
redis_keyspace_hits_total
redis_keyspace_misses_total

# Commands processed
redis_commands_processed_total
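
The hit and miss counters are most useful combined into a cache hit ratio:

# Cache hit ratio over the last 5 minutes (0-1)
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))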

PostgreSQL Exporter

# Active connections
pg_stat_database_numbackends

# Transaction rate
rate(pg_stat_database_xact_commit[5m])
rate(pg_stat_database_xact_rollback[5m])

# Cache hit ratio
pg_stat_database_blks_hit / 
(pg_stat_database_blks_hit + pg_stat_database_blks_read)

# Slow queries
pg_stat_statements_mean_time_seconds > 1

PromQL Query Examples

Basic Queries

# Current value
http_requests_total

# Filter by labels
http_requests_total{method="GET", status_code="200"}

# Regex matching
http_requests_total{route=~"/api/.*"}

# Negative matching
http_requests_total{status_code!="200"}

# Multiple conditions
http_requests_total{method="POST", status_code=~"5.."}

Rate and Increase

# Requests per second
rate(http_requests_total[5m])

# Total increase over time
increase(http_requests_total[1h])

# Per-second rate with sum
sum(rate(http_requests_total[5m]))

# Rate by label
sum(rate(http_requests_total[5m])) by (status_code)

Aggregation

# Sum across all instances
sum(http_requests_total)

# Average
avg(http_request_duration_seconds)

# Min/Max
min(http_request_duration_seconds)
max(http_request_duration_seconds)

# Count
count(up == 1)

# Group by labels
sum(http_requests_total) by (method, status_code)

# Without specific labels
sum(http_requests_total) without (instance)

Mathematical Operations

# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) /
node_filesystem_size_bytes * 100

# Request rate change
rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1h)

Time Functions

# Average over time
avg_over_time(cpu_usage[5m])

# Max over time
max_over_time(memory_usage[1h])

# Min over time
min_over_time(response_time[30m])

# Rate of change
deriv(cpu_usage[5m])

# Predict future value
predict_linear(disk_usage[1h], 3600)
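
predict_linear() is most often used for capacity alerts, e.g. firing when a filesystem is projected to fill up within four hours:

# Filesystem projected to be full within 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0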

Advanced Queries

# Top 10 endpoints by request rate
topk(10, sum(rate(http_requests_total[5m])) by (route))

# Bottom 5 by response time
bottomk(5, avg(http_request_duration_seconds) by (route))

# Quantile across instances (for a gauge)
quantile(0.95, active_connections)

# Histogram quantile
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Absent metric (alerting)
absent(up{job="account-microservice"})

# Changes (count resets)
changes(process_start_time_seconds[1h])

Grafana Dashboards

Creating Dashboards

Panel Configuration:

{
  "title": "Request Rate",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (service)",
      "legendFormat": "{{service}}",
      "refId": "A"
    }
  ],
  "type": "graph",
  "yaxes": [
    {
      "format": "reqps",
      "label": "Requests/sec"
    }
  ]
}

Dashboard Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
        "multi": true
      }
    ]
  }
}

Common Dashboard Panels

1. Request Rate

sum(rate(http_requests_total{namespace=~"$namespace"}[5m])) by (service)

2. Error Rate

sum(rate(http_requests_total{namespace=~"$namespace", status_code=~"5.."}[5m])) /
sum(rate(http_requests_total{namespace=~"$namespace"}[5m])) * 100

3. Response Time (p95)

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace=~"$namespace"}[5m])) by (le, service)
)

4. CPU Usage

sum(rate(container_cpu_usage_seconds_total{namespace=~"$namespace", pod=~"$pod"}[5m])) by (pod)

5. Memory Usage

sum(container_memory_working_set_bytes{namespace=~"$namespace", pod=~"$pod"}) by (pod)

6. Pod Count

count(kube_pod_info{namespace=~"$namespace"}) by (namespace)

Recording Rules

Purpose: Pre-compute expensive queries

groups:
  - name: application_rules
    interval: 30s
    rules:
      # Request rate by service
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, service)

      # Error rate percentage
      - record: job:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job) /
          sum(rate(http_requests_total[5m])) by (job) * 100

      # Average response time
      - record: job:http_request_duration:avg5m
        expr: |
          rate(http_request_duration_seconds_sum[5m]) /
          rate(http_request_duration_seconds_count[5m])

      # CPU usage by pod
      - record: pod:cpu_usage:rate5m
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

      # Memory usage by pod
      - record: pod:memory_usage:bytes
        expr: |
          sum(container_memory_working_set_bytes) by (namespace, pod)

Best Practices

1. Metric Naming

Follow Prometheus naming conventions:

# Format: <namespace>_<name>_<unit>
http_requests_total
http_request_duration_seconds
database_connections_active
memory_usage_bytes

# Use base units
_seconds (not _milliseconds)
_bytes (not _megabytes)
_ratio (0-1, not percentage)

2. Label Usage

# ✅ Good - Low cardinality labels
http_requests_total{method="GET", route="/api/accounts", status_code="200"}

# ❌ Bad - High cardinality (user_id changes frequently)
http_requests_total{user_id="12345"}

# ❌ Bad - Unbounded labels
http_requests_total{request_id="550e8400-e29b-41d4-a716-446655440000"}

3. Instrumentation

// Instrument at application boundaries
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode
    });

    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode
    }, duration);
  });

  next();
});
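
The recorded values still have to be exposed for Prometheus to scrape. A minimal sketch using prom-client's default registry, assuming the Express app and promClient import from the examples above (the port is illustrative):

// Expose all registered metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

app.listen(3000);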

4. Alert on Symptoms

# ✅ Good - Alert on user-facing issues
- alert: HighErrorRate
  expr: job:http_errors:rate5m > 5
  annotations:
    summary: "High error rate affecting users"

# ❌ Bad - Alert on internal metrics
- alert: HighCPU
  expr: cpu_usage > 80
  annotations:
    summary: "CPU usage high"

Troubleshooting

High Cardinality

Problem: Prometheus using too much memory

Solutions:

# Find high cardinality metrics
promtool tsdb analyze /prometheus/data

# Check series count
curl http://prometheus:9090/api/v1/status/tsdb

# Drop problematic metrics (inside the affected scrape_config)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'high_cardinality_metric.*'
    action: drop
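
You can also rank metric names by series count directly in PromQL before deciding what to drop:

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))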

Missing Metrics

Problem: Metrics not appearing

Solutions:
1. Check target status: http://prometheus:9090/targets
2. Verify the ServiceMonitor: kubectl get servicemonitor
3. Check the application's /metrics endpoint
4. Review Prometheus logs

Slow Queries

Problem: Grafana dashboards loading slowly

Solutions:
1. Use recording rules for expensive queries
2. Reduce the time range
3. Add more specific label filters
4. Avoid long range selectors; prefer rate() over a short window to increase() over a very long one