Monitoring & Observability Documentation¶
Comprehensive guide to OVES monitoring, logging, and alerting infrastructure.
Overview¶
The OVES monitoring stack provides complete observability across all infrastructure and applications. We use industry-standard tools for metrics collection, log aggregation, alerting, and uptime monitoring, ensuring we can quickly detect, diagnose, and resolve issues.
Monitoring Architecture¶
┌──────────────────────────────────────────────────────────────┐
│                         Data Sources                         │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐      │
│  │  Kubernetes  │  │  Applications  │  │ AWS Services │      │
│  │   Clusters   │  │(Microservices) │  │ (CloudWatch) │      │
│  └──────┬───────┘  └───────┬────────┘  └──────┬───────┘      │
└─────────┼──────────────────┼──────────────────┼──────────────┘
          │                  │                  │
          ▼                  ▼                  ▼
┌──────────────────────────────────────────────────────────────┐
│                       Collection Layer                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │  Prometheus  │  │     Loki     │  │  CloudWatch  │        │
│  │  (Metrics)   │  │    (Logs)    │  │  (AWS Logs)  │        │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │
│         │                 │                 │                │
│  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐        │
│  │   Logstash   │  │Elasticsearch │  │  CloudWatch  │        │
│  │ (Processing) │  │  (Indexing)  │  │   Insights   │        │
│  └──────────────┘  └──────────────┘  └──────────────┘        │
└──────────────────────────────────────────────────────────────┘
          │                 │                 │
          ▼                 ▼                 ▼
┌──────────────────────────────────────────────────────────────┐
│                     Visualization Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │   Grafana    │  │    Kibana    │  │  CloudWatch  │        │
│  │ (Dashboards) │  │ (Log Search) │  │ (Dashboards) │        │
│  └──────────────┘  └──────────────┘  └──────────────┘        │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                        Alerting Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │ AlertManager │  │    Uptime    │  │   Checkly    │        │
│  │              │  │     Kuma     │  │              │        │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │
│         │                 │                 │                │
│         └─────────────────┴─────────────────┘                │
│                           │                                  │
│                           ▼                                  │
│                  ┌──────────────────┐                        │
│                  │ Microsoft Teams  │                        │
│                  │     (Alerts)     │                        │
│                  └──────────────────┘                        │
└──────────────────────────────────────────────────────────────┘
Alerting¶
Alerting is handled by AlertManager, supplemented by availability checks from Uptime Kuma and Checkly. Use this page as the anchor overview for the alert routing, notification channels, and escalation expectations referenced by the logs and metrics guides.
Monitoring Stack Components¶
1. Metrics Collection (Prometheus)¶
Purpose: Time-series metrics collection and storage
What It Monitors:
- Kubernetes cluster metrics (nodes, pods, containers)
- Application metrics (request rates, latency, errors)
- Database metrics (connections, queries, performance)
- Infrastructure metrics (CPU, memory, disk, network)
- Custom business metrics

Architecture:
- Prometheus Server: Central metrics collection and storage
- Node Exporter: Host-level metrics (CPU, memory, disk)
- kube-state-metrics: Kubernetes object state metrics
- Service Monitors: Automatic service discovery and scraping
- Pushgateway: For short-lived jobs and batch processes
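Because short-lived jobs exit before Prometheus can scrape them, they push their metrics to the Pushgateway instead. A minimal sketch; the service address and job name are illustrative, not the actual cluster values:

# Push a completion timestamp for a short-lived backup job
# (assumes the Pushgateway is reachable in-cluster at this address)
echo "backup_last_success_timestamp_seconds $(date +%s)" \
  | curl --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/nightly-backup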
Deployment:
# Deployed in dev cluster
namespace: monitoring
replicas: 2              # HA setup
retention: 15 days
storage: 100GB EBS volume
scrape_interval: 30s
Key Metrics Collected:
- Infrastructure: node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total
- Kubernetes: kube_pod_status_phase, kube_deployment_status_replicas, kube_node_status_condition
- Applications: http_requests_total, http_request_duration_seconds, http_requests_errors_total
- Databases: mongodb_connections, redis_connected_clients, postgres_up
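How these raw series are typically combined, as illustrative PromQL sketches (the instance, service, and le labels are assumed to follow standard node-exporter and HTTP-instrumentation conventions):

# Per-node CPU utilization (% busy over the last 5 minutes)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-service request rate
sum(rate(http_requests_total[5m])) by (service)

# Per-service p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))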
Example ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: account-microservice
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: account-microservice
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
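Note: by default a ServiceMonitor only selects Services in its own namespace; if account-microservice runs outside the monitoring namespace, the spec also needs a namespaceSelector (for example, namespaceSelector: { any: true }) so the Prometheus Operator can discover it.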
2. Visualization (Grafana)¶
Purpose: Metrics visualization and dashboarding
Features:
- Real-time dashboards
- Custom visualizations
- Alerting rules
- Multi-data source support (Prometheus, Loki, Elasticsearch, CloudWatch)
- Team collaboration
- Dashboard templating
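The data sources can be provisioned declaratively rather than clicked together in the UI. A minimal provisioning-file sketch; the in-cluster service URLs are assumptions and should be checked against the actual deployments:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090  # assumed in-cluster address
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100               # assumed in-cluster address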
Access:
- Dev: https://grafana-dev.omnivoltaic.com
- Prod: https://grafana.omnivoltaic.com
Pre-built Dashboards:
- Cluster Overview
  - Node status and resource usage
  - Pod distribution
  - Network traffic
  - Storage utilization
- Application Performance
  - Request rate (RPS)
  - Response time (p50, p95, p99)
  - Error rate
  - Active connections
- Database Performance
  - Query performance
  - Connection pool status
  - Cache hit rates
  - Replication lag
- Infrastructure Health
  - CPU, memory, and disk usage
  - Network I/O
  - Load averages
  - System errors
- Business Metrics
  - User registrations
  - Transaction volumes
  - API usage by endpoint
  - Payment processing
Dashboard Example:
{
  "dashboard": {
    "title": "Account Microservice",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service='account-microservice'}[5m])"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service='account-microservice'}[5m]))"
          }
        ]
      }
    ]
  }
}
3. Log Aggregation¶
We use multiple log aggregation systems for different purposes:
Loki (Primary for Kubernetes)¶
Purpose: Kubernetes log aggregation and querying
Features:
- Label-based log indexing (like Prometheus, but for logs)
- Efficient storage (only indexes metadata)
- Native Grafana integration
- LogQL query language
- Multi-tenancy support
Deployment:
namespace: monitoring
replicas: 3              # HA setup
retention: 30 days
storage: 200GB EBS volume
Log Collection:
- Promtail: deployed as a DaemonSet on all nodes
  - Automatically discovers and tails pod logs
  - Enriches logs with Kubernetes metadata (see the configuration sketch below)
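A minimal sketch of the Promtail scrape configuration that performs this enrichment; the label mappings are illustrative, not the exact production config:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Attach Kubernetes metadata to each log stream as Loki labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod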
Example Queries:
# Find errors in account microservice
{namespace="production", app="account-microservice"} |= "error" | json
# Count errors per minute
sum(rate({namespace="production"} |= "error" [1m])) by (app)
# Find slow queries
{app="mongodb"} | json | duration > 1s
Elasticsearch + Logstash + Kibana (ELK Stack)¶
Purpose: Advanced log analysis and full-text search
Components:

- Logstash: Log processing and transformation (see the pipeline sketch after this list)
  - Parses structured logs (JSON)
  - Enriches with additional metadata
  - Filters and transforms data
  - Sends to Elasticsearch
- Elasticsearch: Log storage and indexing
  - Full-text search capabilities
  - Aggregations and analytics
  - Scalable distributed storage
  - 30-day retention
- Kibana: Log exploration and visualization
  - Interactive log search
  - Custom visualizations
  - Saved searches and dashboards
  - Alerting on log patterns
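A minimal Logstash pipeline sketch covering the processing steps above; the input port, field names, and index pattern are assumptions, not the production pipeline:

input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON-structured application logs into searchable fields
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch.logging.svc:9200"]  # assumed in-cluster address
    index => "logs-%{+YYYY.MM.dd}"
  }
}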
Deployment:
# All deployed in dev cluster
namespace: logging

elasticsearch:
  replicas: 3
  storage: 500GB EBS volume
  heap_size: 4GB

logstash:
  replicas: 2
  heap_size: 2GB

kibana:
  replicas: 2
Access:
- Kibana: https://kibana-dev.omnivoltaic.com
Use Cases:
- Complex log queries and aggregations
- Historical log analysis
- Compliance and audit logging
- Security event investigation
CloudWatch Logs¶
Purpose: AWS service logs and CloudWatch integration
What It Collects:
- EKS control plane logs
- Lambda function logs
- RDS database logs
- VPC Flow Logs
- CloudTrail audit logs
- Application logs from EC2 instances

Features:
- Native AWS integration
- Log Insights for querying
- Automatic retention management
- Metric filters for alerting (see the sketch below)
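Metric filters turn log patterns into CloudWatch metrics that alarms can fire on. A hedged sketch using the AWS CLI; the log group name and metric namespace are illustrative:

# Count ERROR lines in a log group as a custom metric
aws logs put-metric-filter \
  --log-group-name /aws/eks/dev-cluster/application \
  --filter-name error-count \
  --filter-pattern "ERROR" \
  --metric-transformations \
      metricName=ApplicationErrors,metricNamespace=OVES/Logs,metricValue=1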
Example Query:
# CloudWatch Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
4. Alerting (AlertManager)¶
Purpose: Alert routing, grouping, and notification management
Features:
- Alert deduplication
- Alert grouping
- Silencing and inhibition
- Multi-channel notifications
- Alert routing based on labels
Deployment:
namespace: monitoring
replicas: 3              # HA setup
Alert Channels:

1. Microsoft Teams (Primary)
   - Critical channel: critical alerts, service degradation, infrastructure issues
   - Default channel: non-critical alerts, daily summaries, weekly reports
2. PagerDuty (On-call)
   - Production incidents
   - Service outages
   - Critical errors
Alert Configuration:
route:
  receiver: 'teams-default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'teams-critical'
      continue: true
    - match:
        severity: critical
        environment: production
      receiver: 'pagerduty'

receivers:
  - name: 'teams-default'
    webhook_configs:
      - url: 'https://outlook.office.com/webhook/...'
        send_resolved: true
  - name: 'teams-critical'
    webhook_configs:
      - url: 'https://outlook.office.com/webhook/.../critical'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
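Silences are typically managed with amtool against the AlertManager API. A sketch, assuming an in-cluster service address (adjust the URL to the actual deployment):

# Silence a noisy alert for two hours during planned maintenance
amtool silence add alertname=HighCPUUsage \
  --alertmanager.url=http://alertmanager.monitoring.svc:9093 \
  --duration 2h \
  --comment "Planned node maintenance" \
  --author "ops@omnivoltaic.com"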
Common Alerts:
High CPU Usage

- alert: HighCPUUsage
  expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is above 80% for 5 minutes"

Pod CrashLooping

- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} is crash looping"
    description: "Pod has restarted repeatedly in the last 15 minutes (restart rate: {{ $value }}/s)"

High Error Rate

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate in {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }}"

Database Connection Issues

- alert: DatabaseConnectionPoolExhausted
  expr: mongodb_connections_current / mongodb_connections_available > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} is using {{ $value | humanizePercentage }} of available connections"
5. Uptime Monitoring¶
Uptime Kuma¶
Purpose: Service availability monitoring and status pages
Features:
- HTTP/HTTPS monitoring
- TCP port monitoring
- Ping monitoring
- Keyword monitoring
- SSL certificate expiry monitoring
- Status pages for stakeholders
Deployment:
namespace: monitoring
replicas: 1
storage: 10GB EBS volume
Access: https://uptime.omnivoltaic.com
Monitored Services:
- All public APIs
- Web applications
- Database endpoints
- Third-party integrations
- DNS resolution

Check Intervals:
- Critical services: 1 minute
- Standard services: 5 minutes
- Internal services: 10 minutes
Checkly¶
Purpose: Synthetic monitoring and API testing
Features:
- Multi-region checks
- API endpoint monitoring
- Browser-based checks
- Performance monitoring
- Alerting on failures

Monitored Endpoints:
- GraphQL APIs
- REST APIs
- Authentication flows
- Payment processing
- Critical user journeys

Check Locations:
- US East (Virginia)
- US West (California)
- EU (Frankfurt)
- Asia (Singapore)
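Checkly checks can be defined as code with the Checkly CLI constructs. A minimal sketch; the endpoint URL and check name are hypothetical, not the actual monitored endpoints:

import { ApiCheck, AssertionBuilder } from 'checkly/constructs'

// Hypothetical health check for an API, run from two of the regions above
new ApiCheck('accounts-health-check', {
  name: 'Accounts API health',
  locations: ['us-east-1', 'eu-central-1'],
  request: {
    method: 'GET',
    url: 'https://api-dev.omnivoltaic.com/health',
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})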
6. Cloud Monitoring (CloudWatch)¶
Purpose: AWS-native monitoring and alerting
What It Monitors:
- EC2 instance metrics
- EKS cluster metrics
- RDS database metrics
- Load balancer metrics
- S3 bucket metrics
- Lambda function metrics

Key Metrics:
- CPU utilization
- Network in/out
- Disk read/write
- Status checks
- Request counts
- Error rates

CloudWatch Alarms:
- EC2 instance health
- RDS storage space
- Load balancer unhealthy targets
- Lambda errors and throttling
- Billing alerts
Monitoring Best Practices¶
1. Metrics¶
- Use Labels Wisely: Don't create high-cardinality labels
- Instrument Everything: Add metrics to all services
- Follow Naming Conventions: Use standard Prometheus naming
- Set Appropriate Retention: Balance storage cost vs. historical data needs
- Use Recording Rules: Pre-compute expensive queries (see the sketch below)
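A recording-rule sketch for that pre-computation; the group name is illustrative and the rule name follows the usual level:metric:operations convention:

groups:
  - name: service-aggregations
    interval: 30s
    rules:
      # Pre-compute the per-service request rate so dashboards and alerts
      # query the cheap recorded series instead of the raw counter
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)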
2. Logging¶
- Structured Logging: Use JSON format for logs
- Include Context: Add request IDs, user IDs, trace IDs
- Log Levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR)
- Avoid Sensitive Data: Never log passwords, tokens, PII
- Sampling: Sample high-volume logs to reduce costs
3. Alerting¶
- Alert on Symptoms, Not Causes: Alert on user-facing issues
- Reduce Noise: Avoid alert fatigue with proper thresholds
- Actionable Alerts: Every alert should require action
- Runbooks: Link alerts to troubleshooting guides
- Test Alerts: Regularly test alert delivery
4. Dashboards¶
- Start with Overview: High-level health dashboard
- Drill-Down Capability: Link to detailed dashboards
- Use Templates: Create reusable dashboard templates
- Keep It Simple: Don't overcrowd dashboards
- Document: Add descriptions to panels
Accessing Monitoring Tools¶
Development Environment¶
- Grafana: https://grafana-dev.omnivoltaic.com
- Prometheus: https://prometheus-dev.omnivoltaic.com
- Kibana: https://kibana-dev.omnivoltaic.com
- AlertManager: https://alertmanager-dev.omnivoltaic.com
Production Environment¶
- Grafana: https://grafana.omnivoltaic.com
- Uptime Kuma: https://uptime.omnivoltaic.com
- CloudWatch: AWS Console → CloudWatch
Authentication¶
- SSO: All tools integrated with company SSO
- RBAC: Role-based access control
- API Keys: Available for automation
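For example, an API key (or service-account token) can drive Grafana's HTTP API from automation; the token value is a placeholder:

# Search for a dashboard by name via the Grafana HTTP API
curl -H "Authorization: Bearer <grafana-api-token>" \
  "https://grafana-dev.omnivoltaic.com/api/search?query=Cluster%20Overview"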
Troubleshooting¶
High Cardinality Issues¶
Symptom: Prometheus running out of memory
Solutions:
1. Identify high-cardinality metrics: promtool tsdb analyze /prometheus/data
2. Remove or aggregate problematic labels
3. Use recording rules to pre-aggregate
4. Increase Prometheus memory allocation
Missing Metrics¶
Symptom: Metrics not appearing in Grafana
Solutions:
1. Check ServiceMonitor configuration: kubectl get servicemonitor
2. Verify Prometheus targets: open the /targets page in the Prometheus UI
3. Check application metrics endpoint: curl http://service:port/metrics
4. Review Prometheus logs for scrape errors
Log Ingestion Issues¶
Symptom: Logs not appearing in Loki/Elasticsearch
Solutions:
1. Check Promtail/Logstash status: kubectl get pods -n monitoring
2. Verify log format is correct (JSON preferred)
3. Check storage capacity
4. Review ingestion rate limits
Alert Not Firing¶
Symptom: Expected alert not triggering
Solutions:
1. Check alert rule syntax in Prometheus
2. Verify the AlertManager configuration
3. Test the alert expression in the Prometheus UI
4. Check AlertManager routing rules
5. Verify the webhook/notification channel configuration