Monitoring & Observability Documentation¶
Comprehensive guide to OVES monitoring, logging, and alerting infrastructure.
Overview¶
The OVES monitoring stack provides complete observability across all infrastructure and applications. We use industry-standard tools for metrics collection, log aggregation, alerting, and uptime monitoring, ensuring we can quickly detect, diagnose, and resolve issues.
Monitoring Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│                            Data Sources                             │
│  ┌──────────────┐   ┌────────────────┐   ┌──────────────┐           │
│  │  Kubernetes  │   │  Applications  │   │ AWS Services │           │
│  │   Clusters   │   │(Microservices) │   │ (CloudWatch) │           │
│  └──────┬───────┘   └───────┬────────┘   └──────┬───────┘           │
└─────────┼───────────────────┼───────────────────┼───────────────────┘
          │                   │                   │
          ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Collection Layer                           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │  Prometheus  │    │     Loki     │    │  CloudWatch  │           │
│  │  (Metrics)   │    │    (Logs)    │    │  (AWS Logs)  │           │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘           │
│         │                   │                   │                   │
│  ┌──────▼───────┐    ┌──────▼───────┐    ┌──────▼───────┐           │
│  │   Logstash   │    │Elasticsearch │    │  CloudWatch  │           │
│  │ (Processing) │    │  (Indexing)  │    │   Insights   │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
└─────────────────────────────────────────────────────────────────────┘
          │                   │                   │
          ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         Visualization Layer                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   Grafana    │    │    Kibana    │    │  CloudWatch  │           │
│  │ (Dashboards) │    │ (Log Search) │    │ (Dashboards) │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           Alerting Layer                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │ AlertManager │    │    Uptime    │    │   Checkly    │           │
│  │              │    │     Kuma     │    │              │           │
│  └──────┬───────┘    └──────┬───────┘    └───────┬──────┘           │
│         │                   │                    │                  │
│         └───────────────────┴────────────────────┘                  │
│                             │                                       │
│                             ▼                                       │
│                   ┌──────────────────┐                              │
│                   │ Microsoft Teams  │                              │
│                   │     (Alerts)     │                              │
│                   └──────────────────┘                              │
└─────────────────────────────────────────────────────────────────────┘
Monitoring Stack Components¶
1. Metrics Collection (Prometheus)¶
Purpose: Time-series metrics collection and storage
What It Monitors:
- Kubernetes cluster metrics (nodes, pods, containers)
- Application metrics (request rates, latency, errors)
- Database metrics (connections, queries, performance)
- Infrastructure metrics (CPU, memory, disk, network)
- Custom business metrics
Architecture:
- Prometheus Server: Central metrics collection and storage
- Node Exporter: Host-level metrics (CPU, memory, disk)
- kube-state-metrics: Kubernetes object state metrics
- Service Monitors: Automatic service discovery and scraping
- Pushgateway: For short-lived jobs and batch processes
Deployment:
# Deployed in dev cluster
namespace: monitoring
replicas: 2  # HA setup
retention: 15 days
storage: 100GB EBS volume
scrape_interval: 30s
Key Metrics Collected:
- Infrastructure: node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total
- Kubernetes: kube_pod_status_phase, kube_deployment_status_replicas, kube_node_status_condition
- Applications: http_requests_total, http_request_duration_seconds, http_requests_errors_total
- Databases: mongodb_connections, redis_connected_clients, postgres_up
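To show how these raw series are turned into usable signals, here is a minimal recording-rules sketch that derives per-node CPU utilization and per-service request rate from two of the metrics above. The rule names are illustrative, not an existing OVES convention:
groups:
  - name: derived-signals                # illustrative group name
    rules:
      # CPU utilization (%) per node, derived from the idle-time counter
      - record: instance:cpu_utilization:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Request throughput per service, derived from the request counter
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))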
Example ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: account-microservice
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: account-microservice
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
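For the scrape to work, the ServiceMonitor's selector must match the labels on a Service that exposes a named metrics port. A minimal matching Service might look like this (the production namespace and port 9090 are assumptions for illustration):
apiVersion: v1
kind: Service
metadata:
  name: account-microservice
  namespace: production             # assumption: the workload namespace
  labels:
    app: account-microservice       # matched by spec.selector.matchLabels above
spec:
  selector:
    app: account-microservice
  ports:
    - name: metrics                 # matched by the named endpoint port above
      port: 9090
      targetPort: 9090
If the Service lives outside the ServiceMonitor's own namespace, the ServiceMonitor also needs a namespaceSelector, which the example above omits.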
2. Visualization (Grafana)¶
Purpose: Metrics visualization and dashboarding
Features:
- Real-time dashboards
- Custom visualizations
- Alerting rules
- Multi-data source support (Prometheus, Loki, Elasticsearch, CloudWatch)
- Team collaboration
- Dashboard templating
Access:
- Dev: https://grafana-dev.omnivoltaic.com
- Prod: https://grafana.omnivoltaic.com
Pre-built Dashboards:
- Cluster Overview
  - Node status and resource usage
  - Pod distribution
  - Network traffic
  - Storage utilization
- Application Performance
  - Request rate (RPS)
  - Response time (p50, p95, p99)
  - Error rate
  - Active connections
- Database Performance
  - Query performance
  - Connection pool status
  - Cache hit rates
  - Replication lag
- Infrastructure Health
  - CPU, memory, and disk usage
  - Network I/O
  - Load averages
  - System errors
- Business Metrics
  - User registrations
  - Transaction volumes
  - API usage by endpoint
  - Payment processing
Dashboard Example:
{
  "dashboard": {
    "title": "Account Microservice",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service='account-microservice'}[5m])"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service='account-microservice'}[5m]))"
          }
        ]
      }
    ]
  }
}
3. Log Aggregation¶
We use multiple log aggregation systems for different purposes:
Loki (Primary for Kubernetes)¶
Purpose: Kubernetes log aggregation and querying
Features:
- Label-based log indexing (like Prometheus, but for logs)
- Efficient storage (only indexes metadata)
- Native Grafana integration
- LogQL query language
- Multi-tenancy support
Deployment:
namespace: monitoring
replicas: 3  # HA setup
retention: 30 days
storage: 200GB EBS volume
Log Collection:
- Promtail: Deployed as a DaemonSet on all nodes
  - Automatically discovers and tails pod logs
  - Enriches logs with Kubernetes metadata (a sketch follows)
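As a rough sketch of how that discovery and enrichment is configured (illustrative labels, not the exact OVES Promtail config):
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                    # discover every pod on the node
    relabel_configs:
      # Copy Kubernetes metadata onto each log stream as Loki labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app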
Example Query:
# Find errors in account microservice
{namespace="production", app="account-microservice"} |= "error" | json
# Count errors per minute
sum(rate({namespace="production"} |= "error" [1m])) by (app)
# Find slow queries
{app="mongodb"} | json | duration > 1s
Elasticsearch + Logstash + Kibana (ELK Stack)¶
Purpose: Advanced log analysis and full-text search
Components:
- Logstash: Log processing and transformation
  - Parses structured logs (JSON)
  - Enriches with additional metadata
  - Filters and transforms data
  - Sends to Elasticsearch
- Elasticsearch: Log storage and indexing
  - Full-text search capabilities
  - Aggregations and analytics
  - Scalable distributed storage
  - 30-day retention
- Kibana: Log exploration and visualization
  - Interactive log search
  - Custom visualizations
  - Saved searches and dashboards
  - Alerting on log patterns
Deployment:
# All deployed in dev cluster
namespace: logging
elasticsearch:
  replicas: 3
  storage: 500GB EBS volume
  heap_size: 4GB
logstash:
  replicas: 2
  heap_size: 2GB
kibana:
  replicas: 2
Access:
- Kibana: https://kibana-dev.omnivoltaic.com
Use Cases:
- Complex log queries and aggregations
- Historical log analysis
- Compliance and audit logging
- Security event investigation
CloudWatch Logs¶
Purpose: AWS service logs and CloudWatch integration
What It Collects:
- EKS control plane logs
- Lambda function logs
- RDS database logs
- VPC Flow Logs
- CloudTrail audit logs
- Application logs from EC2 instances
Features:
- Native AWS integration
- Log Insights for querying
- Automatic retention management
- Metric filters for alerting (sketched after the example query below)
Example Query:
# CloudWatch Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
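Metric filters are what connect these logs to alerting. As a hedged sketch in CloudFormation syntax (the log group, metric name, and namespace are illustrative), a filter that counts ERROR lines so an alarm can fire on them might look like:
Resources:
  ErrorCountFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /oves/application         # illustrative log group
      FilterPattern: '"ERROR"'
      MetricTransformations:
        - MetricName: ApplicationErrors
          MetricNamespace: OVES/Logs          # illustrative namespace
          MetricValue: "1"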
4. Alerting (AlertManager)¶
Purpose: Alert routing, grouping, and notification management
Features:
- Alert deduplication
- Alert grouping
- Silencing and inhibition (a sketch follows the example configuration below)
- Multi-channel notifications
- Alert routing based on labels
Deployment:
namespace: monitoring
replicas: 3  # HA setup
Alert Channels:
1. Microsoft Teams (Primary)
   - Critical alerts
   - Service degradation
   - Infrastructure issues
   - Non-critical alerts
   - Daily summaries
   - Weekly reports
2. PagerDuty (On-call)
   - Production incidents
   - Service outages
   - Critical errors
Alert Configuration:
route:
  receiver: 'teams-default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'teams-critical'
      continue: true
    - match:
        severity: critical
        environment: production
      receiver: 'pagerduty'

receivers:
  - name: 'teams-default'
    webhook_configs:
      - url: 'https://outlook.office.com/webhook/...'
        send_resolved: true
  - name: 'teams-critical'
    webhook_configs:
      - url: 'https://outlook.office.com/webhook/.../critical'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
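The silencing and inhibition features noted above live in the same file. A minimal inhibition sketch (assuming Alertmanager v0.22+ matcher syntax) that suppresses warnings while a critical alert for the same service is already firing:
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'service']   # only inhibit within the same alert/service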
Common Alerts:
1. High CPU Usage
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is above 80% for 5 minutes"
2. Pod CrashLooping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} is crash looping"
    description: "Pod has restarted {{ $value }} times in the last 15 minutes"
3. High Error Rate
- alert: HighErrorRate
  expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate in {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }}"
4. Database Connection Issues
- alert: DatabaseConnectionPoolExhausted
  expr: mongodb_connections_current / mongodb_connections_available > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} is using {{ $value | humanizePercentage }} of available connections"
5. Uptime Monitoring¶
Uptime Kuma¶
Purpose: Service availability monitoring and status pages
Features:
- HTTP/HTTPS monitoring
- TCP port monitoring
- Ping monitoring
- Keyword monitoring
- SSL certificate expiry monitoring
- Status pages for stakeholders
Deployment:
namespace: monitoring
replicas: 1
storage: 10GB EBS volume
Access: https://uptime.omnivoltaic.com
Monitored Services:
- All public APIs
- Web applications
- Database endpoints
- Third-party integrations
- DNS resolution
Check Intervals:
- Critical services: 1 minute
- Standard services: 5 minutes
- Internal services: 10 minutes
Checkly¶
Purpose: Synthetic monitoring and API testing
Features:
- Multi-region checks
- API endpoint monitoring
- Browser-based checks
- Performance monitoring
- Alerting on failures
Monitored Endpoints:
- GraphQL APIs
- REST APIs
- Authentication flows
- Payment processing
- Critical user journeys
Check Locations:
- US East (Virginia)
- US West (California)
- EU (Frankfurt)
- Asia (Singapore)
6. Cloud Monitoring (CloudWatch)¶
Purpose: AWS-native monitoring and alerting
What It Monitors:
- EC2 instance metrics
- EKS cluster metrics
- RDS database metrics
- Load balancer metrics
- S3 bucket metrics
- Lambda function metrics
Key Metrics:
- CPU utilization
- Network in/out
- Disk read/write
- Status checks
- Request counts
- Error rates
CloudWatch Alarms:
- EC2 instance health
- RDS storage space
- Load balancer unhealthy targets
- Lambda errors and throttling
- Billing alerts
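As an illustration of how such an alarm is declared (a CloudFormation sketch; the instance ID and SNS topic are placeholders, not actual OVES resources):
Resources:
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Average EC2 CPU above 80% for two 5-minute periods
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0    # placeholder instance ID
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertsTopic              # assumption: an SNS topic defined elsewhere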
Monitoring Best Practices¶
1. Metrics¶
- Use Labels Wisely: Don't create high-cardinality labels
- Instrument Everything: Add metrics to all services
- Follow Naming Conventions: Use standard Prometheus naming
- Set Appropriate Retention: Balance storage cost vs. historical data needs
- Use Recording Rules: Pre-compute expensive queries
2. Logging¶
- Structured Logging: Use JSON format for logs (see the sketch after this list)
- Include Context: Add request IDs, user IDs, trace IDs
- Log Levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR)
- Avoid Sensitive Data: Never log passwords, tokens, PII
- Sampling: Sample high-volume logs to reduce costs
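A sketch of why structured JSON logs pay off in this stack, using a hypothetical Promtail pipeline (field names are illustrative): fields can be extracted at ingestion, while only low-cardinality ones are promoted to labels:
pipeline_stages:
  - json:
      expressions:
        level: level                 # extract fields from the JSON body
        request_id: request_id
  - labels:
      # Promote only the low-cardinality field; request_id stays a field
      level: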
3. Alerting¶
- Alert on Symptoms, Not Causes: Alert on user-facing issues
- Reduce Noise: Avoid alert fatigue with proper thresholds
- Actionable Alerts: Every alert should require action
- Runbooks: Link alerts to troubleshooting guides (see the sketch after this list)
- Test Alerts: Regularly test alert delivery
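To make the runbook practice concrete, here is a sketch of an alert rule carrying a runbook link (the alert name and URL are placeholders):
- alert: AccountServiceDown
  expr: up{job="account-microservice"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.job }} is unreachable"
    runbook_url: "https://example.com/runbooks/account-service-down"   # placeholder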
4. Dashboards¶
- Start with Overview: High-level health dashboard
- Drill-Down Capability: Link to detailed dashboards
- Use Templates: Create reusable dashboard templates (see the sketch after this list)
- Keep It Simple: Don't overcrowd dashboards
- Document: Add descriptions to panels
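One way to keep dashboards reusable is to provision them from files rather than editing them by hand. A hypothetical Grafana provisioning file (provider name and path are illustrative):
apiVersion: 1
providers:
  - name: oves-dashboards            # illustrative provider name
    folder: OVES
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards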
Accessing Monitoring Tools¶
Development Environment¶
- Grafana: https://grafana-dev.omnivoltaic.com
- Prometheus: https://prometheus-dev.omnivoltaic.com
- Kibana: https://kibana-dev.omnivoltaic.com
- AlertManager: https://alertmanager-dev.omnivoltaic.com
Production Environment¶
- Grafana: https://grafana.omnivoltaic.com
- Uptime Kuma: https://uptime.omnivoltaic.com
- CloudWatch: AWS Console → CloudWatch
Authentication¶
- SSO: All tools integrated with company SSO
- RBAC: Role-based access control
- API Keys: Available for automation
Troubleshooting¶
High Cardinality Issues¶
Symptom: Prometheus running out of memory
Solutions:
1. Identify high-cardinality metrics: promtool tsdb analyze /prometheus/data
2. Remove or aggregate problematic labels (see the sketch after this list)
3. Use recording rules to pre-aggregate
4. Increase Prometheus memory allocation
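A sketch of step 2 (the label name is illustrative): dropping a high-cardinality label at scrape time with metric_relabel_configs:
metric_relabel_configs:
  # Drop a per-request identifier that explodes series cardinality
  - regex: request_id                # illustrative label name
    action: labeldrop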
Missing Metrics¶
Symptom: Metrics not appearing in Grafana
Solutions:
1. Check ServiceMonitor configuration: kubectl get servicemonitor
2. Verify Prometheus targets on the /targets page (e.g. https://prometheus-dev.omnivoltaic.com/targets)
3. Check application metrics endpoint: curl http://service:port/metrics
4. Review Prometheus logs for scrape errors
Log Ingestion Issues¶
Symptom: Logs not appearing in Loki/Elasticsearch
Solutions:
1. Check Promtail status (kubectl get pods -n monitoring) and Logstash status (kubectl get pods -n logging)
2. Verify log format is correct (JSON preferred)
3. Check storage capacity
4. Review ingestion rate limits
Alert Not Firing¶
Symptom: Expected alert not triggering
Solutions:
1. Check alert rule syntax in Prometheus
2. Verify AlertManager configuration
3. Test alert expression in Prometheus UI
4. Check AlertManager routing rules
5. Verify webhook/notification channel configuration