Log Management Documentation¶
Comprehensive guide to logging infrastructure, best practices, and troubleshooting in the OVES ecosystem.
Overview¶
Effective logging is crucial for debugging, monitoring, and understanding system behavior. The OVES logging infrastructure uses multiple tools to collect, aggregate, analyze, and visualize logs from all applications and infrastructure components.
Logging Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Log Sources │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Kubernetes │ │ Applications│ │ AWS │ │
│ │ Pods │ │ (stdout) │ │ Services │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼─────────────────┼─────────────────┼──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Log Collection │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Promtail │ │ Fluentd │ │ CloudWatch │ │
│ │ (K8s logs) │ │ (App logs) │ │ Agent │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼─────────────────┼─────────────────┼──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Log Processing │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Loki │ │ Logstash │ │ CloudWatch │ │
│ │ (Storage) │ │ (Transform) │ │ Insights │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼─────────────────┼─────────────────┼──────────────────┘
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │Elasticsearch│ │
│ │ (Indexing) │ │
│ └──────┬──────┘ │
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Visualization │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Grafana │ │ Kibana │ │ CloudWatch │ │
│ │ (Loki) │ │ (ELK) │ │ Dashboard │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Log Collection Systems¶
1. Loki + Promtail (Primary for Kubernetes)¶
Purpose: Lightweight, cost-effective log aggregation for Kubernetes workloads
How It Works¶
- Promtail runs as a DaemonSet on every Kubernetes node
- Automatically discovers all pods and containers
- Tails log files from /var/log/pods
- Enriches logs with Kubernetes metadata (namespace, pod, container)
- Sends logs to Loki for storage
Promtail Configuration¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      # Kubernetes pod logs
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Add namespace label
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          # Add pod name label
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Add container name label
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
          # Add app label
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          # Add environment label
          - source_labels: [__meta_kubernetes_pod_label_environment]
            target_label: environment
Loki Configuration¶
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s
schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: s3
  aws:
    s3: s3://us-east-1/oves-loki-logs
    region: us-east-1
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
chunk_store_config:
  max_look_back_period: 720h  # 30 days
table_manager:
  retention_deletes_enabled: true
  retention_period: 720h  # 30 days
Querying Logs with LogQL¶
Basic Queries:
# View all logs from a specific app
{app="account-microservice"}
# Filter by namespace
{namespace="production"}
# Combine multiple labels
{namespace="production", app="account-microservice"}
# Search for specific text
{app="account-microservice"} |= "error"
# Case-insensitive search
{app="account-microservice"} |~ "(?i)error"
# Exclude specific text
{app="account-microservice"} != "health check"
Advanced Queries:
# Parse JSON logs
{app="account-microservice"} | json | level="error"
# Extract fields
{app="account-microservice"} | json | line_format "{{.timestamp}} {{.message}}"
# Count errors per minute
sum(rate({namespace="production"} |= "error" [1m])) by (app)
# Find slow queries (duration > 1 second)
{app="mongodb"} | json | duration > 1
# Top 10 error messages (assumes JSON logs with "level" and "message" fields)
topk(10, sum by (message) (rate({namespace="production"} | json | level="error" [5m])))
# Calculate error rate percentage
sum(rate({namespace="production"} |= "error" [5m])) / sum(rate({namespace="production"} [5m])) * 100
Time-based Queries:
# Range selectors ([5m], [1h]) are used inside aggregation functions; the absolute
# time window comes from the Grafana time picker or the API's start/end parameters,
# not from the LogQL expression itself.
# Rate of log entries over the last 5 minutes
rate({app="account-microservice"} [5m])
# Number of log lines over the last hour
count_over_time({app="account-microservice"} [1h])
2. ELK Stack (Elasticsearch, Logstash, Kibana)¶
Purpose: Advanced log analysis, full-text search, and complex aggregations
Logstash Pipeline¶
Input Configuration:
input {
# Beats input (from Filebeat)
beats {
port => 5044
}
# HTTP input (from applications)
http {
port => 8080
codec => json
}
# Kafka input (for high-volume logs)
kafka {
bootstrap_servers => "kafka:9092"
topics => ["application-logs"]
codec => json
}
}
Filter Configuration:
filter {
# Parse JSON logs
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Add timestamp
date {
match => ["timestamp", "ISO8601"]
target => "@timestamp"
}
# Parse log level
if [level] {
mutate {
uppercase => ["level"]
}
}
# Extract request ID
if [message] =~ /request_id/ {
grok {
match => { "message" => "request_id=%{UUID:request_id}" }
}
}
# GeoIP lookup for IP addresses
if [client_ip] {
geoip {
source => "client_ip"
target => "geoip"
}
}
# Add environment tag
mutate {
add_field => {
"environment" => "${ENVIRONMENT:development}"
}
}
# Remove sensitive fields
mutate {
remove_field => ["password", "token", "api_key"]
}
}
Output Configuration:
output {
# Send to Elasticsearch
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{[environment]}-%{+YYYY.MM.dd}"
document_type => "_doc"
}
# Send critical errors to dead letter queue
if [level] == "CRITICAL" {
file {
path => "/var/log/critical-errors.log"
codec => json_lines
}
}
# Debug output (optional)
# stdout { codec => rubydebug }
}
Elasticsearch Index Templates¶
{
"index_patterns": ["logs-*"],
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-policy",
"index.lifecycle.rollover_alias": "logs"
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text" },
"app": { "type": "keyword" },
"namespace": { "type": "keyword" },
"pod": { "type": "keyword" },
"container": { "type": "keyword" },
"request_id": { "type": "keyword" },
"user_id": { "type": "keyword" },
"duration": { "type": "float" },
"status_code": { "type": "integer" },
"client_ip": { "type": "ip" },
"geoip": {
"properties": {
"location": { "type": "geo_point" }
}
}
}
}
}
Kibana Queries¶
Basic Searches:
# Find all errors
level:ERROR
# Search in specific field
message:"database connection failed"
# Wildcard search
app:*-microservice
# Range query
status_code:[400 TO 599]
# Boolean operators
level:ERROR AND app:account-microservice
# Exclude
level:ERROR NOT message:"expected error"
Advanced Searches:
# Regex search
message:/error.*database/
# Exists query
_exists_:request_id
# Missing field
NOT _exists_:user_id
# Date range
@timestamp:[now-1h TO now]
# Aggregations (counts, averages, breakdowns by field) are not expressed in the
# query bar; build them with Kibana visualizations or Lens on top of a filter query.
# Complex query
(level:ERROR OR level:CRITICAL) AND
namespace:production AND
@timestamp:[now-1h TO now]
3. CloudWatch Logs¶
Purpose: AWS service logs and EC2 application logs
CloudWatch Agent Configuration¶
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/application/*.log",
"log_group_name": "/aws/ec2/application",
"log_stream_name": "{instance_id}/{hostname}",
"timestamp_format": "%Y-%m-%d %H:%M:%S",
"timezone": "UTC"
},
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "/aws/ec2/nginx",
"log_stream_name": "{instance_id}/access"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "/aws/ec2/nginx",
"log_stream_name": "{instance_id}/error"
}
]
}
},
"log_stream_name": "{instance_id}"
}
}
CloudWatch Insights Queries¶
-- Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
-- Count errors by application
fields @timestamp, application
| filter level = "ERROR"
| stats count() by application
-- Calculate average response time
fields @timestamp, duration
| filter ispresent(duration)
| stats avg(duration) as avg_duration by bin(5m)
-- Find slow queries
fields @timestamp, query, duration
| filter duration > 1000
| sort duration desc
-- Parse JSON logs
fields @timestamp, @message
| parse @message /(?<level>\w+)\s+(?<message>.*)/
| filter level = "ERROR"
-- Top 10 error messages
fields @message
| filter level = "ERROR"
| stats count() as error_count by @message
| sort error_count desc
| limit 10
-- Request rate per minute
fields @timestamp
| stats count() as request_count by bin(1m)
-- Error rate percentage
fields @timestamp, level
| stats count() as total,
sum(level = "ERROR") as errors
| fields errors / total * 100 as error_rate
Log Formats and Standards¶
Structured Logging (JSON)¶
Recommended Format:
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "ERROR",
"service": "account-microservice",
"version": "1.2.3",
"environment": "production",
"message": "Failed to connect to database",
"error": {
"type": "MongoError",
"message": "Connection timeout",
"stack": "Error: Connection timeout\n at..."
},
"context": {
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user_123",
"correlation_id": "abc-def-ghi",
"method": "POST",
"path": "/api/accounts",
"duration_ms": 5432
},
"metadata": {
"pod": "account-microservice-7d9f8b6c5-x4k2p",
"namespace": "production",
"node": "ip-10-0-1-45"
}
}
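Most logging libraries can emit this shape directly. The sketch below is one possible setup assuming winston; the library choice, environment variables, and default metadata are illustrative rather than a fixed OVES standard.
// Minimal sketch: a winston logger that emits structured JSON on stdout.
// Library choice and SERVICE_NAME/ENVIRONMENT variables are assumptions.
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),              // adds an ISO 8601 "timestamp" field
    winston.format.errors({ stack: true }),  // keeps stack traces when an Error is logged
    winston.format.json()                    // one JSON object per line
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'account-microservice',
    environment: process.env.ENVIRONMENT || 'development'
  },
  transports: [new winston.transports.Console()]
});

// Usage: context travels as structured fields, not string concatenation
logger.error('Failed to connect to database', {
  context: { request_id: '550e8400-e29b-41d4-a716-446655440000', path: '/api/accounts' }
});
Because each line is a single JSON object on stdout, Promtail or Fluentd can ship it without extra parsing, and the | json LogQL examples above work as written.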
Log Levels¶
Use consistent log levels across all applications:
| Level | Usage | Examples |
|---|---|---|
| DEBUG | Detailed diagnostic information | Variable values, function entry/exit |
| INFO | General informational messages | Service started, request completed |
| WARN | Warning messages for potentially harmful situations | Deprecated API usage, slow query |
| ERROR | Error events that might still allow the application to continue | Failed API call, validation error |
| CRITICAL | Severe error events that might cause the application to abort | Database unavailable, out of memory |
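As a concrete illustration (reusing the logger sketched above; the field values are made up), the table maps onto calls like these:
// Illustrative use of each level with a leveled logger such as the one sketched earlier.
logger.debug('Cache lookup', { key: 'user:123', hit: false });                      // DEBUG: diagnostic detail
logger.info('Service started', { port: 3000 });                                     // INFO: normal operation
logger.warn('Slow query detected', { collection: 'accounts', duration_ms: 2300 });  // WARN: potentially harmful
logger.error('Payment provider call failed', { status_code: 502 });                 // ERROR: recoverable failure
// CRITICAL typically maps to a custom level (e.g. winston's syslog "crit") plus alerting;
// reserve it for conditions that force the process to abort.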
Application Logging Best Practices¶
1. Include Context¶
// ✅ Good - Includes context
logger.info('User login successful', {
user_id: user.id,
email: user.email,
ip_address: req.ip,
user_agent: req.headers['user-agent'],
request_id: req.id,
duration_ms: Date.now() - startTime
});
// ❌ Bad - No context
logger.info('User logged in');
2. Use Request IDs¶
// Generate request ID at entry point
const { v4: uuidv4 } = require('uuid');

app.use((req, res, next) => {
  req.id = uuidv4();
  res.setHeader('X-Request-ID', req.id);
  next();
});
// Include in all logs
logger.info('Processing payment', {
request_id: req.id,
amount: payment.amount
});
3. Log Errors Properly¶
// ✅ Good - Full error details
try {
await processPayment(payment);
} catch (error) {
logger.error('Payment processing failed', {
error: {
name: error.name,
message: error.message,
stack: error.stack,
code: error.code
},
payment_id: payment.id,
amount: payment.amount,
request_id: req.id
});
throw error;
}
// ❌ Bad - Lost error details
catch (error) {
logger.error('Payment failed');
}
4. Avoid Logging Sensitive Data¶
// ✅ Good - Sensitive data masked
logger.info('Payment processed', {
card_last4: payment.card.slice(-4),
amount: payment.amount,
currency: payment.currency
});
// ❌ Bad - Sensitive data exposed
logger.info('Payment processed', {
card_number: payment.card, // ❌ Never log full card numbers
cvv: payment.cvv, // ❌ Never log CVV
password: user.password // ❌ Never log passwords
});
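To make masking systematic rather than relying on each call site, a small helper can strip known-sensitive keys before they reach the logger. This is a hedged sketch, mirroring the remove_field step in the Logstash filter above; the helper name and field list are assumptions.
// Illustrative helper: drop known-sensitive keys before logging.
const SENSITIVE_FIELDS = ['password', 'token', 'api_key', 'card_number', 'cvv'];

function redact(fields) {
  const clean = { ...fields };
  for (const field of SENSITIVE_FIELDS) {
    delete clean[field];
  }
  return clean;
}

// Usage
logger.info('Payment processed', redact({
  card_number: payment.card,   // removed before the log line is written
  amount: payment.amount,
  currency: payment.currency
}));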
Log Retention Policies¶
| System | Retention Period | Storage Location | Backup |
|---|---|---|---|
| Loki | 30 days | EBS + S3 | Daily to S3 |
| Elasticsearch | 30 days | EBS | Weekly to S3 |
| CloudWatch | 90 days | CloudWatch | N/A |
| S3 Archive | 1 year | S3 Glacier | N/A |
Log Analysis Examples¶
Finding Performance Issues¶
# Loki: Find requests taking > 5 seconds
{app="account-microservice"}
| json
| duration_ms > 5000
| line_format "{{.method}} {{.path}} took {{.duration_ms}}ms"
-- CloudWatch: Average response time by endpoint
fields @timestamp, path, duration
| stats avg(duration) as avg_duration by path
| sort avg_duration desc
Tracking User Activity¶
# Loki: Track specific user's actions
{namespace="production"}
| json
| user_id="user_123"
| line_format "{{.timestamp}} {{.action}} {{.resource}}"
Debugging Errors¶
# Loki: Find all errors with stack traces
{app="account-microservice", level="ERROR"}
| json
| line_format "{{.message}}\n{{.error.stack}}"
Monitoring API Usage¶
-- CloudWatch: API endpoint usage
fields @timestamp, method, path
| stats count() as requests by path
| sort requests desc
| limit 20
Troubleshooting¶
High Log Volume¶
Problem: Too many logs, high storage costs
Solutions:
1. Implement log sampling for high-volume endpoints
2. Reduce DEBUG logs in production
3. Use log levels appropriately
4. Implement log aggregation at the application level (see the sketch after the sampling example below)
// Example: Sample 10% of requests
if (Math.random() < 0.1 || req.path.includes('/critical/')) {
logger.debug('Request details', { ... });
}
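For point 4 above, one approach is to count repetitive events in memory and emit a periodic summary instead of one line per occurrence. The helper below is a hedged sketch, not an existing OVES utility; logger is the structured logger from the earlier examples.
// Illustrative helper: collapse high-frequency events into periodic summary logs.
const counters = new Map();

function countEvent(name) {
  counters.set(name, (counters.get(name) || 0) + 1);
}

// Flush a summary once per minute instead of logging every occurrence
setInterval(() => {
  for (const [name, count] of counters) {
    logger.info('Event summary', { event: name, count, window: '1m' });
  }
  counters.clear();
}, 60000);

// Usage: instead of logger.debug('Cache miss', {...}) on every request
countEvent('cache_miss');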
Missing Logs¶
Problem: Logs not appearing in aggregation system
Solutions:
1. Check Promtail/Fluentd is running: kubectl get pods -n monitoring
2. Verify log format is correct (JSON preferred)
3. Check application is writing to stdout/stderr
4. Verify network connectivity to log aggregator
5. Check storage capacity
Slow Log Queries¶
Problem: Queries taking too long
Solutions:
1. Add time range filters
2. Use indexed fields in queries
3. Avoid wildcard searches at the beginning of strings
4. Use aggregations instead of returning all results
5. Consider using Elasticsearch for complex queries