Monitoring & Observability Documentation

Comprehensive guide to OVES monitoring, logging, and alerting infrastructure.

Overview

The OVES monitoring stack provides complete observability across all infrastructure and applications. We use industry-standard tools for metrics collection, log aggregation, alerting, and uptime monitoring, ensuring we can quickly detect, diagnose, and resolve issues.

Monitoring Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Data Sources                                 │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐             │
│  │ Kubernetes   │  │ Applications   │  │ AWS Services │             │
│  │ Clusters     │  │ (Microservices)│  │ (CloudWatch) │             │
│  └──────┬───────┘  └───────┬────────┘  └──────┬───────┘             │
└─────────┼──────────────────┼──────────────────┼─────────────────────┘
          │                  │                  │
          ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Collection Layer                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │ Prometheus   │  │    Loki      │  │  CloudWatch  │              │
│  │ (Metrics)    │  │   (Logs)     │  │  (AWS Logs)  │              │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘              │
│         │                 │                 │                     │
│         │                 │                 │                     │
│  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐              │
│  │  Logstash    │  │ Elasticsearch│  │   CloudWatch │              │
│  │ (Processing) │  │  (Indexing)  │  │   Insights   │              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
          │                  │                  │
          ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Visualization Layer                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │   Grafana    │  │    Kibana    │  │  CloudWatch  │              │
│  │ (Dashboards) │  │ (Log Search) │  │  (Dashboards)│              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Alerting Layer                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │ AlertManager │  │    Uptime    │  │   Checkly    │              │
│  │              │  │     Kuma     │  │              │              │
│  └──────┬───────┘  └──────┬───────┘  └───────┬──────┘              │
│         │                 │                  │                     │
│         └─────────────────┴──────────────────┘                     │
│                           │                                        │
│                           ▼                                        │
│                  ┌──────────────────┐                               │
│                  │ Microsoft Teams  │                               │
│                  │    (Alerts)      │                               │
│                  └──────────────────┘                               │
└─────────────────────────────────────────────────────────────────────┘

Monitoring Stack Components

1. Metrics Collection (Prometheus)

Purpose: Time-series metrics collection and storage

What It Monitors:

  • Kubernetes cluster metrics (nodes, pods, containers)
  • Application metrics (request rates, latency, errors)
  • Database metrics (connections, queries, performance)
  • Infrastructure metrics (CPU, memory, disk, network)
  • Custom business metrics

Architecture:

  • Prometheus Server: Central metrics collection and storage
  • Node Exporter: Host-level metrics (CPU, memory, disk)
  • kube-state-metrics: Kubernetes object state metrics
  • Service Monitors: Automatic service discovery and scraping
  • Pushgateway: For short-lived jobs and batch processes

Deployment:

# Deployed in dev cluster
namespace: monitoring
replicas: 2 (HA setup)
retention: 15 days
storage: 100GB EBS volume
scrape_interval: 30s

Key Metrics Collected:

  • Infrastructure: node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total
  • Kubernetes: kube_pod_status_phase, kube_deployment_status_replicas, kube_node_status_condition
  • Applications: http_requests_total, http_request_duration_seconds, http_requests_errors_total
  • Databases: mongodb_connections, redis_connected_clients, pg_up

Example ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: account-microservice
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: account-microservice
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
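For the ServiceMonitor above to scrape anything, the service has to serve the Prometheus text exposition format on /metrics. The following is a minimal sketch of that format using only the Python standard library. The `Counter` class is a hypothetical helper written for illustration (a real service would normally use an official Prometheus client library), the label names are assumptions, and label-value escaping is deliberately left out.

```python
from collections import defaultdict


class Counter:
    """Hypothetical in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Counters only go up, so negative increments are rejected.
        if amount < 0:
            raise ValueError("counter increments must be non-negative")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return this counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Note: label values are not escaped in this sketch.
            labels = ",".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return "\n".join(lines) + "\n"


requests_total = Counter(
    "http_requests_total", "Total HTTP requests.", ["method", "status"]
)
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus.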

2. Visualization (Grafana)

Purpose: Metrics visualization and dashboarding

Features:

  • Real-time dashboards
  • Custom visualizations
  • Alerting rules
  • Multi-data source support (Prometheus, Loki, Elasticsearch, CloudWatch)
  • Team collaboration
  • Dashboard templating

Access:

  • Dev: https://grafana-dev.omnivoltaic.com
  • Prod: https://grafana.omnivoltaic.com

Pre-built Dashboards:

  1. Cluster Overview
     • Node status and resource usage
     • Pod distribution
     • Network traffic
     • Storage utilization

  2. Application Performance
     • Request rate (RPS)
     • Response time (p50, p95, p99)
     • Error rate
     • Active connections

  3. Database Performance
     • Query performance
     • Connection pool status
     • Cache hit rates
     • Replication lag

  4. Infrastructure Health
     • CPU, Memory, Disk usage
     • Network I/O
     • Load averages
     • System errors

  5. Business Metrics
     • User registrations
     • Transaction volumes
     • API usage by endpoint
     • Payment processing

Dashboard Example:

{
  "dashboard": {
    "title": "Account Microservice",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service='account-microservice'}[5m])"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service='account-microservice'}[5m]))"
          }
        ]
      }
    ]
  }
}
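The p95 panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counters rather than from raw observations. The sketch below shows the core idea (linear interpolation inside the bucket that contains the target rank) in plain Python. It is a simplified illustration that assumes non-negative observations; Prometheus itself handles additional edge cases, and the bucket boundaries and counts here are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, mirroring *_bucket series
    with an `le` label.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since nothing more precise is known.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets in seconds: (le, cumulative count).
BUCKETS = [(0.1, 40), (0.25, 70), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies that matter for the service.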

3. Log Aggregation

We use multiple log aggregation systems for different purposes:

Loki (Primary for Kubernetes)

Purpose: Kubernetes log aggregation and querying

Features:

  • Label-based log indexing (like Prometheus for logs)
  • Efficient storage (only indexes metadata)
  • Native Grafana integration
  • LogQL query language
  • Multi-tenancy support

Deployment:

namespace: monitoring
replicas: 3 (HA setup)
retention: 30 days
storage: 200GB EBS volume

Log Collection:

  • Promtail: Deployed as a DaemonSet on all nodes
  • Automatically discovers and tails pod logs
  • Enriches logs with Kubernetes metadata

Example Query:

# Find errors in account microservice
{namespace="production", app="account-microservice"} |= "error" | json

# Count errors per minute
sum(count_over_time({namespace="production"} |= "error" [1m])) by (app)

# Find slow queries
{app="mongodb"} | json | duration > 1s

Elasticsearch + Logstash + Kibana (ELK Stack)

Purpose: Advanced log analysis and full-text search

Components:

  1. Logstash: Log processing and transformation
     • Parses structured logs (JSON)
     • Enriches with additional metadata
     • Filters and transforms data
     • Sends to Elasticsearch

  2. Elasticsearch: Log storage and indexing
     • Full-text search capabilities
     • Aggregations and analytics
     • Scalable distributed storage
     • 30-day retention

  3. Kibana: Log exploration and visualization
     • Interactive log search
     • Custom visualizations
     • Saved searches and dashboards
     • Alerting on log patterns

Deployment:

# All deployed in dev cluster
namespace: logging

elasticsearch:
  replicas: 3
  storage: 500GB EBS volume
  heap_size: 4GB

logstash:
  replicas: 2
  heap_size: 2GB

kibana:
  replicas: 2

Access:

  • Kibana: https://kibana-dev.omnivoltaic.com

Use Cases:

  • Complex log queries and aggregations
  • Historical log analysis
  • Compliance and audit logging
  • Security event investigation

CloudWatch Logs

Purpose: AWS service logs and CloudWatch integration

What It Collects:

  • EKS control plane logs
  • Lambda function logs
  • RDS database logs
  • VPC Flow Logs
  • CloudTrail audit logs
  • Application logs from EC2 instances

Features:

  • Native AWS integration
  • Log Insights for querying
  • Automatic retention management
  • Metric filters for alerting

Example Query:

# CloudWatch Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)

4. Alerting (AlertManager)

Purpose: Alert routing, grouping, and notification management

Features:

  • Alert deduplication
  • Alert grouping
  • Silencing and inhibition
  • Multi-channel notifications
  • Alert routing based on labels
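Grouping is what turns many related alerts into a single notification. As a rough sketch of the idea, the Python below buckets alerts by the same labels used for group_by in the Alert Configuration in this section (alertname, cluster, service). The sample alerts, including the cluster name and pod names, are hypothetical, and this models only the grouping key, not AlertManager's timing behavior.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts that share every group_by label value land in the same group.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


GROUP_BY = ["alertname", "cluster", "service"]

# Hypothetical alerts: two pods of one service plus an unrelated node alert.
ALERTS = [
    {"labels": {"alertname": "HighErrorRate", "cluster": "dev",
                "service": "account-microservice", "pod": "account-1"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "dev",
                "service": "account-microservice", "pod": "account-2"}},
    {"labels": {"alertname": "HighCPUUsage", "cluster": "dev",
                "instance": "node-a"}},
]

for key, members in group_alerts(ALERTS, GROUP_BY).items():
    print(key, len(members))
```

The two HighErrorRate alerts differ only in their pod label, which is not part of the grouping key, so they would be delivered together rather than as two separate messages.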

Deployment:

namespace: monitoring
replicas: 3 (HA setup)

Alert Channels:

  1. Microsoft Teams (Primary)
     • Critical alerts
     • Service degradation
     • Infrastructure issues

  2. Email
     • Non-critical alerts
     • Daily summaries
     • Weekly reports

  3. PagerDuty (On-call)
     • Production incidents
     • Service outages
     • Critical errors

Alert Configuration:

route:
  receiver: 'teams-default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - match:
      severity: critical
    receiver: 'teams-critical'
    continue: true
  - match:
      severity: critical
      environment: production
    receiver: 'pagerduty'

receivers:
- name: 'teams-default'
  webhook_configs:
  - url: 'https://outlook.office.com/webhook/...'
    send_resolved: true

- name: 'teams-critical'
  webhook_configs:
  - url: 'https://outlook.office.com/webhook/.../critical'
    send_resolved: true

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: '<pagerduty-key>'
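The `continue: true` flag in the first route is what lets a critical production alert reach both Teams and PagerDuty. The sketch below simulates how sibling routes are walked for the configuration above. It is a simplified model in plain Python (exact-match labels only, one level of routes), not AlertManager's implementation.

```python
def matches(route, labels):
    """A route matches when every label in its `match` block is equal."""
    return all(labels.get(k) == v for k, v in route["match"].items())


def receivers_for(labels, default_receiver, routes):
    """Walk sibling routes in order, honouring `continue`."""
    selected = []
    for route in routes:
        if matches(route, labels):
            selected.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless `continue: true`
    # No child route matched: the alert stays with the top-level receiver.
    return selected or [default_receiver]


# Mirrors the two child routes in the configuration above.
ROUTES = [
    {"match": {"severity": "critical"},
     "receiver": "teams-critical", "continue": True},
    {"match": {"severity": "critical", "environment": "production"},
     "receiver": "pagerduty"},
]

for labels in (
    {"severity": "critical", "environment": "production"},
    {"severity": "critical", "environment": "dev"},
    {"severity": "warning"},
):
    print(labels, "->", receivers_for(labels, "teams-default", ROUTES))
```

Without `continue: true` on the first route, evaluation would stop at teams-critical and the pagerduty route would never be reached.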

Common Alerts:

  1. High CPU Usage

    - alert: HighCPUUsage
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for 5 minutes"
    

  2. Pod CrashLooping

    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod has restarted {{ $value }} times in the last 15 minutes"
    

  3. High Error Rate

    - alert: HighErrorRate
      expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate in {{ $labels.service }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    

  4. Database Connection Issues

    - alert: DatabaseConnectionPoolExhausted
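      # Assumes mongodb_connections_available counts unused connections, as in MongoDB's serverStatus
      expr: mongodb_connections_current / (mongodb_connections_current + mongodb_connections_available) > 0.9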
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Database connection pool nearly exhausted"
        description: "{{ $labels.database }} is using {{ $value | humanizePercentage }} of available connections"
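Every rule above carries `for: 5m`, which means the expression must hold continuously for five minutes before the alert moves from pending to firing. The sketch below models that state machine in plain Python. The one-minute evaluation interval is an assumption (the rule evaluation interval is not stated here), and the function is an illustration, not Prometheus code.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    `condition_by_eval` holds one boolean per evaluation (True when the
    expression returned a result). `for_evals` is the `for` duration
    expressed in evaluation intervals, e.g. 5m at a 1m interval -> 5.
    """
    states = []
    active_since = None  # evaluation index at which the condition became true
    for i, condition in enumerate(condition_by_eval):
        if not condition:
            # Any gap resets the timer, so the alert starts over as pending.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        elapsed = i - active_since
        states.append("firing" if elapsed >= for_evals else "pending")
    return states


# Three breaching evaluations, one recovery, then a sustained breach.
history = [True, True, True, False] + [True] * 6
for minute, state in enumerate(alert_states(history, for_evals=5)):
    print(minute, state)
```

The brief recovery at minute 3 resets the timer, which is why a flapping condition can stay pending indefinitely and never notify anyone.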
    

5. Uptime Monitoring

Uptime Kuma

Purpose: Service availability monitoring and status pages

Features:

  • HTTP/HTTPS monitoring
  • TCP port monitoring
  • Ping monitoring
  • Keyword monitoring
  • SSL certificate expiry monitoring
  • Status pages for stakeholders

Deployment:

namespace: monitoring
replicas: 1
storage: 10GB EBS volume

Access: https://uptime.omnivoltaic.com

Monitored Services:

  • All public APIs
  • Web applications
  • Database endpoints
  • Third-party integrations
  • DNS resolution

Check Intervals:

  • Critical services: 1 minute
  • Standard services: 5 minutes
  • Internal services: 10 minutes

Checkly

Purpose: Synthetic monitoring and API testing

Features:

  • Multi-region checks
  • API endpoint monitoring
  • Browser-based checks
  • Performance monitoring
  • Alerting on failures

Monitored Endpoints:

  • GraphQL APIs
  • REST APIs
  • Authentication flows
  • Payment processing
  • Critical user journeys

Check Locations:

  • US East (Virginia)
  • US West (California)
  • EU (Frankfurt)
  • Asia (Singapore)

6. Cloud Monitoring (CloudWatch)

Purpose: AWS-native monitoring and alerting

What It Monitors:

  • EC2 instance metrics
  • EKS cluster metrics
  • RDS database metrics
  • Load balancer metrics
  • S3 bucket metrics
  • Lambda function metrics

Key Metrics:

  • CPU utilization
  • Network in/out
  • Disk read/write
  • Status checks
  • Request counts
  • Error rates

CloudWatch Alarms:

  • EC2 instance health
  • RDS storage space
  • Load balancer unhealthy targets
  • Lambda errors and throttling
  • Billing alerts

Monitoring Best Practices

1. Metrics

  • Use Labels Wisely: Don't create high-cardinality labels
  • Instrument Everything: Add metrics to all services
  • Follow Naming Conventions: Use standard Prometheus naming
  • Set Appropriate Retention: Balance storage cost vs. historical data needs
  • Use Recording Rules: Pre-compute expensive queries
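Naming conventions are easier to keep when they are checked automatically. The sketch below is a small, hypothetical linter in plain Python covering three widely used Prometheus conventions: names made of letters, digits, and underscores (colons are reserved for recording rules), counters ending in _total, and base units such as seconds or bytes. The list of discouraged unit suffixes is an assumption and not exhaustive.

```python
import re

# Colons are valid in Prometheus names but reserved for recording rules.
NAME_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
NON_BASE_UNITS = ("_milliseconds", "_ms", "_kilobytes", "_megabytes", "_minutes")


def naming_problems(name, metric_type):
    """Return a list of convention problems for one metric; empty means OK."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name may contain only letters, digits, and underscores")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    # Match the unit as a whole name segment to avoid false positives
    # such as "_ms" inside "_msg".
    if any(name.endswith(unit) or f"{unit}_" in name for unit in NON_BASE_UNITS):
        problems.append("prefer base units such as seconds or bytes")
    return problems


for metric, kind in [
    ("http_requests_total", "counter"),
    ("http_requests", "counter"),
    ("request_duration_ms", "histogram"),
]:
    print(metric, naming_problems(metric, kind))
```

A check like this could run in CI against a service's /metrics output so that non-conforming names are caught before dashboards and alerts start depending on them.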

2. Logging

  • Structured Logging: Use JSON format for logs
  • Include Context: Add request IDs, user IDs, trace IDs
  • Log Levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Avoid Sensitive Data: Never log passwords, tokens, PII
  • Sampling: Sample high-volume logs to reduce costs
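Several of these practices can be combined in one log formatter. The sketch below uses Python's standard logging module to emit one JSON object per line, merge request context into it, and mask sensitive fields. The `context` key passed through `extra`, the field names, and the SENSITIVE_KEYS list are assumptions for illustration; services in other languages would apply the same idea with their own logging library.

```python
import json
import logging

# Assumed list of fields that must never be written out; extend as needed.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via extra={"context": {...}} is merged in, with
        # sensitive fields masked rather than written out.
        for key, value in getattr(record, "context", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("account-microservice")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info(
    "login failed",
    extra={"context": {"request_id": "req-123", "user_id": "u-42", "token": "abc"}},
)
```

Because every line is valid JSON, the `| json` stage in the LogQL examples and the Logstash JSON parsing described earlier can extract fields such as request_id without custom patterns.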

3. Alerting

  • Alert on Symptoms, Not Causes: Alert on user-facing issues
  • Reduce Noise: Avoid alert fatigue with proper thresholds
  • Actionable Alerts: Every alert should require action
  • Runbooks: Link alerts to troubleshooting guides
  • Test Alerts: Regularly test alert delivery
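One way to test alert delivery end to end is to post a synthetic alert to AlertManager's v2 HTTP API and confirm that it arrives in the expected channel. The sketch below builds such a payload with the Python standard library. The alert name DeliveryTest is hypothetical, and the commented-out call assumes the dev AlertManager URL is reachable without extra authentication, which may not hold behind SSO.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(service, severity="warning", minutes=5):
    """Build a synthetic alert in the AlertManager v2 API format."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "DeliveryTest",  # hypothetical name for test alerts
            "service": service,
            "severity": severity,
        },
        "annotations": {
            "summary": "Synthetic alert to verify notification delivery",
        },
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alert(base_url, alerts):
    """POST alerts to AlertManager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    payload = build_test_alert("account-microservice")
    print(json.dumps(payload, indent=2))
    # Uncomment to send. Authentication is not handled in this sketch.
    # send_alert("https://alertmanager-dev.omnivoltaic.com", payload)
```

Using a warning severity keeps the test alert on the default Teams route; a critical test alert labeled for production would also page the on-call engineer.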

4. Dashboards

  • Start with Overview: High-level health dashboard
  • Drill-Down Capability: Link to detailed dashboards
  • Use Templates: Create reusable dashboard templates
  • Keep It Simple: Don't overcrowd dashboards
  • Document: Add descriptions to panels

Accessing Monitoring Tools

Development Environment

  • Grafana: https://grafana-dev.omnivoltaic.com
  • Prometheus: https://prometheus-dev.omnivoltaic.com
  • Kibana: https://kibana-dev.omnivoltaic.com
  • AlertManager: https://alertmanager-dev.omnivoltaic.com

Production Environment

  • Grafana: https://grafana.omnivoltaic.com
  • Uptime Kuma: https://uptime.omnivoltaic.com
  • CloudWatch: AWS Console → CloudWatch

Authentication

  • SSO: All tools integrated with company SSO
  • RBAC: Role-based access control
  • API Keys: Available for automation

Troubleshooting

High Cardinality Issues

Symptom: Prometheus running out of memory

Solutions:

  1. Identify high-cardinality metrics: promtool tsdb analyze /prometheus/data
  2. Remove or aggregate problematic labels
  3. Use recording rules to pre-aggregate
  4. Increase Prometheus memory allocation
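When a single service is the suspect, its /metrics output can be inspected directly before digging into the TSDB. The sketch below counts time series per metric name in Prometheus text exposition output, using only the Python standard library. The sample text is made up and shows a common cause of high cardinality: an identifier embedded in a label value.

```python
import re
from collections import Counter

# A sample line is: metric_name, optional {labels}, whitespace, value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+\S+")


def series_per_metric(exposition_text):
    """Count time series per metric name in text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            # Every sample line has a unique label set, so it is one series.
            counts[match.group(1)] += 1
    return counts


# Hypothetical scrape output with a user ID leaking into the path label.
SAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{path="/users/1",status="200"} 3
http_requests_total{path="/users/2",status="200"} 1
http_requests_total{path="/users/3",status="500"} 2
# TYPE process_open_fds gauge
process_open_fds 12
"""

for name, count in series_per_metric(SAMPLE).most_common():
    print(name, count)
```

In this sample the path label grows with the number of users; replacing it with a route template such as /users/:id would collapse those series into one per status code.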

Missing Metrics

Symptom: Metrics not appearing in Grafana

Solutions:

  1. Check ServiceMonitor configuration: kubectl get servicemonitor -n monitoring
  2. Verify Prometheus targets on the /targets page (dev: https://prometheus-dev.omnivoltaic.com/targets)
  3. Check the application metrics endpoint: curl http://<service>:<port>/metrics
  4. Review Prometheus logs for scrape errors
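The target check can also be scripted against the Prometheus HTTP API, whose /api/v1/targets endpoint reports the health and last scrape error of each target. The sketch below filters that response for targets that are not up, using only the Python standard library. The sample response is made up, and whether the API is reachable without authentication in a given environment is an assumption.

```python
import json
import urllib.request


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for each active target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url):
    """Fetch target status from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)


# Hypothetical response trimmed to the fields used above.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "account-microservice"},
             "scrapeUrl": "http://10.0.0.5:8080/metrics",
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "node-exporter"},
             "scrapeUrl": "http://10.0.0.6:9100/metrics",
             "health": "up", "lastError": ""},
        ]
    },
}

for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job}: {url} ({error})")
```

If the service does not appear in the target list at all, the problem is usually in discovery (the ServiceMonitor selector or port name) rather than in the scrape itself.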

Log Ingestion Issues

Symptom: Logs not appearing in Loki/Elasticsearch

Solutions:

  1. Check Promtail status: kubectl get pods -n monitoring
  2. Check Logstash status: kubectl get pods -n logging
  3. Verify log format is correct (JSON preferred)
  4. Check storage capacity
  5. Review ingestion rate limits

Alert Not Firing

Symptom: Expected alert not triggering

Solutions:

  1. Check alert rule syntax in Prometheus
  2. Verify AlertManager configuration
  3. Test alert expression in Prometheus UI
  4. Check AlertManager routing rules
  5. Verify webhook/notification channel configuration
