Monitoring Guide
This comprehensive guide covers monitoring strategies, tools, and best practices for applications and infrastructure built with Trae.
Overview
Effective monitoring is crucial for maintaining application performance, reliability, and user experience. This guide covers:
- Monitoring fundamentals
- Application performance monitoring (APM)
- Infrastructure monitoring
- Log management
- Alerting strategies
- Observability best practices
- Monitoring tools and platforms
Monitoring Fundamentals
Key Concepts
The Three Pillars of Observability
```
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│     Metrics     │  │      Logs       │  │     Traces      │
│                 │  │                 │  │                 │
│ • Counters      │  │ • Events        │  │ • Requests      │
│ • Gauges        │  │ • Errors        │  │ • Spans         │
│ • Histograms    │  │ • Debug Info    │  │ • Dependencies  │
│ • Summaries     │  │ • Audit Trail   │  │ • Performance   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```
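Each pillar is covered in depth later in this guide. As a quick orientation, the sketch below shows how a single operation can emit all three signals from one place; it assumes prom-client and @opentelemetry/api are installed, and the span stays a no-op until an OpenTelemetry SDK (configured in the tracing section) is registered.

```javascript
// pillars-example.js - illustrative only; these names are not used elsewhere in this guide
const { Counter } = require('prom-client');
const { trace } = require('@opentelemetry/api');

// Metric: a monotonically increasing counter in the default prom-client registry
const ordersProcessed = new Counter({
  name: 'orders_processed_total',
  help: 'Total number of processed orders'
});

function processOrder(orderId) {
  const tracer = trace.getTracer('pillars-example');
  // Trace: one span per operation (no-op until an SDK is registered)
  tracer.startActiveSpan('process-order', (span) => {
    ordersProcessed.inc(); // metric

    // Log: a structured event carrying the trace ID for correlation
    console.log(JSON.stringify({
      level: 'info',
      message: 'order processed',
      order_id: orderId,
      trace_id: span.spanContext().traceId
    }));

    span.end(); // trace
  });
}

processOrder('A-1001');
```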
Monitoring Types
- Synthetic Monitoring: Proactive testing with simulated transactions
- Real User Monitoring (RUM): Actual user experience tracking in the browser (see the sketch after this list)
- Infrastructure Monitoring: System resources and health
- Application Monitoring: Code-level performance and errors
- Business Monitoring: KPIs and business metrics
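Real user monitoring is the one type in this list that later sections do not revisit, so here is a minimal browser-side sketch. It relies on the standard Navigation Timing API and navigator.sendBeacon; the /rum collection endpoint is a hypothetical backend that would aggregate these timings.

```javascript
// rum-snippet.js - runs in the browser, not in Node
window.addEventListener('load', () => {
  // Defer one tick so loadEventEnd is populated on the navigation entry
  setTimeout(() => {
    const [nav] = performance.getEntriesByType('navigation');
    if (!nav) return;

    const payload = JSON.stringify({
      page: location.pathname,
      ttfb_ms: nav.responseStart - nav.requestStart,
      dom_content_loaded_ms: nav.domContentLoadedEventEnd - nav.startTime,
      load_ms: nav.loadEventEnd - nav.startTime
    });

    // sendBeacon is fire-and-forget and survives page navigation
    navigator.sendBeacon('/rum', payload); // hypothetical collection endpoint
  }, 0);
});
```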
Monitoring Strategy
SLI, SLO, and SLA Framework
```yaml
# Service Level Indicators (SLIs)
slis:
  availability:
    description: "Percentage of successful requests"
    measurement: "(successful_requests / total_requests) * 100"
    target: "> 99.9%"
  latency:
    description: "95th percentile response time"
    measurement: "response_time_95th_percentile"
    target: "< 200ms"
  error_rate:
    description: "Percentage of failed requests"
    measurement: "(failed_requests / total_requests) * 100"
    target: "< 0.1%"

# Service Level Objectives (SLOs)
slos:
  - name: "API Availability"
    sli: "availability"
    target: "99.9% over 30 days"
    error_budget: "0.1%"
  - name: "Response Time"
    sli: "latency"
    target: "95% of requests < 200ms"
    measurement_window: "5 minutes"

# Service Level Agreements (SLAs)
slas:
  - name: "Service Uptime"
    commitment: "99.5% monthly uptime"
    penalty: "Service credits for downtime"
    measurement: "External monitoring"
```

Application Performance Monitoring
APM Implementation
Node.js Application Monitoring
```javascript
// app.js - APM setup
const apm = require('elastic-apm-node').start({
serviceName: 'my-app',
secretToken: process.env.ELASTIC_APM_SECRET_TOKEN,
serverUrl: process.env.ELASTIC_APM_SERVER_URL,
environment: process.env.NODE_ENV,
captureBody: 'errors',
captureHeaders: true,
logLevel: 'info'
});
const express = require('express');
const prometheus = require('prom-client');
const app = express();
// Prometheus metrics
const register = new prometheus.Registry();
prometheus.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new prometheus.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
const httpRequestTotal = new prometheus.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status']
});
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
// Middleware for metrics collection
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode)
.inc();
});
next();
});
// Health check endpoint
app.get('/health', (req, res) => {
res.status(200).json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: process.uptime(),
memory: process.memoryUsage(),
version: process.env.npm_package_version
});
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Error handling with APM
app.use((err, req, res, next) => {
apm.captureError(err);
console.error('Error:', err);
res.status(500).json({ error: 'Internal Server Error' });
});
module.exports = app;
```

Python Application Monitoring
```python
# app.py - APM setup
import os
import time
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import elasticapm
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)

# Elastic APM configuration
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'my-python-app',
    'SECRET_TOKEN': os.environ.get('ELASTIC_APM_SECRET_TOKEN'),
    'SERVER_URL': os.environ.get('ELASTIC_APM_SERVER_URL'),
    'ENVIRONMENT': os.environ.get('FLASK_ENV', 'production'),
    'CAPTURE_BODY': 'errors',
    'CAPTURE_HEADERS': True
}
apm = ElasticAPM(app)

# Prometheus metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(request_latency)
    return response

@app.route('/health')
def health_check():
    return jsonify({
        'status': 'healthy',
        'timestamp': time.time(),
        'version': os.environ.get('APP_VERSION', 'unknown')
    })

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.errorhandler(Exception)
def handle_exception(e):
    apm.capture_exception()
    return jsonify({'error': 'Internal Server Error'}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

Custom Metrics Implementation
Business Metrics
```javascript
// business-metrics.js
const prometheus = require('prom-client');
class BusinessMetrics {
constructor() {
this.userRegistrations = new prometheus.Counter({
name: 'user_registrations_total',
help: 'Total number of user registrations',
labelNames: ['source', 'plan']
});
this.orderValue = new prometheus.Histogram({
name: 'order_value_dollars',
help: 'Order value in dollars',
labelNames: ['product_category'],
buckets: [10, 50, 100, 500, 1000, 5000]
});
this.activeUsers = new prometheus.Gauge({
name: 'active_users_current',
help: 'Current number of active users',
labelNames: ['session_type']
});
this.featureUsage = new prometheus.Counter({
name: 'feature_usage_total',
help: 'Total feature usage count',
labelNames: ['feature_name', 'user_tier']
});
}
recordUserRegistration(source, plan) {
this.userRegistrations.labels(source, plan).inc();
}
recordOrder(value, category) {
this.orderValue.labels(category).observe(value);
}
updateActiveUsers(count, sessionType) {
this.activeUsers.labels(sessionType).set(count);
}
recordFeatureUsage(featureName, userTier) {
this.featureUsage.labels(featureName, userTier).inc();
}
}
module.exports = new BusinessMetrics();
```

Performance Metrics
```javascript
// performance-metrics.js
const prometheus = require('prom-client');
class PerformanceMetrics {
constructor() {
this.databaseQueryDuration = new prometheus.Histogram({
name: 'database_query_duration_seconds',
help: 'Database query duration',
labelNames: ['query_type', 'table'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
});
this.cacheHitRate = new prometheus.Gauge({
name: 'cache_hit_rate',
help: 'Cache hit rate percentage',
labelNames: ['cache_type']
});
this.queueSize = new prometheus.Gauge({
name: 'queue_size_current',
help: 'Current queue size',
labelNames: ['queue_name']
});
this.externalApiCalls = new prometheus.Counter({
name: 'external_api_calls_total',
help: 'Total external API calls',
labelNames: ['service', 'endpoint', 'status']
});
}
recordDatabaseQuery(queryType, table, duration) {
this.databaseQueryDuration.labels(queryType, table).observe(duration);
}
updateCacheHitRate(cacheType, hitRate) {
this.cacheHitRate.labels(cacheType).set(hitRate);
}
updateQueueSize(queueName, size) {
this.queueSize.labels(queueName).set(size);
}
recordExternalApiCall(service, endpoint, status) {
this.externalApiCalls.labels(service, endpoint, status).inc();
}
}
module.exports = new PerformanceMetrics();
```

Infrastructure Monitoring
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Application metrics
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Database metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
```

Alert Rules
```yaml
# alert_rules.yml
groups:
  - name: application.rules
    rules:
      - alert: HighErrorRate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m])
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% for {{ $labels.instance }}"
          runbook_url: "https://runbooks.example.com/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.instance }}"

      - alert: DatabaseConnectionsHigh
        expr: |
          pg_stat_database_numbackends / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "Database connections high"
          description: "Database connections are at {{ $value }}% of maximum"

  - name: infrastructure.rules
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk space low"
          description: "Disk usage is {{ $value }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
```

Recording Rules
```yaml
# recording_rules.yml
groups:
  - name: application.recording
    interval: 30s
    rules:
      - record: app:http_request_rate
        expr: |
          rate(http_requests_total[5m])
      - record: app:http_error_rate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) /
          rate(http_requests_total[5m])
      - record: app:http_latency_p95
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      - record: app:http_latency_p99
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

  - name: business.recording
    interval: 60s
    rules:
      - record: business:revenue_per_hour
        expr: |
          increase(order_value_dollars_sum[1h])
      - record: business:user_growth_rate
        expr: |
          rate(user_registrations_total[1h])
      - record: business:feature_adoption_rate
        expr: |
          sum by (feature_name) (rate(feature_usage_total[1h]))
```

Log Management
Structured Logging
Node.js Logging
```javascript
// logger.js
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');
const esTransportOpts = {
level: 'info',
clientOpts: {
node: process.env.ELASTICSEARCH_URL || 'http://localhost:9200'
},
index: 'app-logs',
indexTemplate: {
name: 'app-logs-template',
pattern: 'app-logs-*',
settings: {
number_of_shards: 1,
number_of_replicas: 1
},
mappings: {
properties: {
'@timestamp': { type: 'date' },
level: { type: 'keyword' },
message: { type: 'text' },
service: { type: 'keyword' },
trace_id: { type: 'keyword' },
span_id: { type: 'keyword' },
user_id: { type: 'keyword' },
request_id: { type: 'keyword' }
}
}
}
};
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: process.env.SERVICE_NAME || 'app',
version: process.env.APP_VERSION || 'unknown',
environment: process.env.NODE_ENV || 'development'
},
transports: [
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
}),
new ElasticsearchTransport(esTransportOpts)
]
});
// Request logging middleware
const requestLogger = (req, res, next) => {
const start = Date.now();
const requestId = req.headers['x-request-id'] || generateRequestId();
req.logger = logger.child({
request_id: requestId,
user_id: req.user?.id,
ip: req.ip,
user_agent: req.get('User-Agent')
});
req.logger.info('Request started', {
method: req.method,
url: req.url,
headers: req.headers
});
res.on('finish', () => {
const duration = Date.now() - start;
req.logger.info('Request completed', {
method: req.method,
url: req.url,
status: res.statusCode,
duration,
content_length: res.get('Content-Length')
});
});
next();
};
function generateRequestId() {
return Math.random().toString(36).substring(2, 15) +
Math.random().toString(36).substring(2, 15);
}
module.exports = { logger, requestLogger };
```

Python Logging
```python
# logger.py
import json
import logging
import os
import time
import uuid
from datetime import datetime

from flask import request, g
from pythonjsonlogger import jsonlogger


class CustomJsonFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super(CustomJsonFormatter, self).add_fields(log_record, record, message_dict)
        if not log_record.get('timestamp'):
            log_record['timestamp'] = datetime.utcnow().isoformat()
        if hasattr(g, 'request_id'):
            log_record['request_id'] = g.request_id
        if hasattr(g, 'user_id'):
            log_record['user_id'] = g.user_id
        log_record['service'] = 'my-python-app'
        log_record['version'] = os.environ.get('APP_VERSION', 'unknown')
        log_record['environment'] = os.environ.get('FLASK_ENV', 'development')


def setup_logging():
    formatter = CustomJsonFormatter(
        '%(timestamp)s %(levelname)s %(name)s %(message)s'
    )
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def request_logging_middleware(app):
    """Register request/response logging hooks on the given Flask app."""
    @app.before_request
    def before_request():
        g.start_time = time.time()
        g.request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
        g.user_id = getattr(request, 'user_id', None)
        logger.info('Request started', extra={
            'method': request.method,
            'url': request.url,
            'remote_addr': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        })

    @app.after_request
    def after_request(response):
        duration = time.time() - g.start_time
        logger.info('Request completed', extra={
            'method': request.method,
            'url': request.url,
            'status': response.status_code,
            'duration': duration,
            'content_length': response.content_length
        })
        return response


logger = setup_logging()
```

Log Aggregation with ELK Stack
Logstash Configuration
```ruby
# logstash.conf
input {
beats {
port => 5044
}
http {
port => 8080
codec => json
}
}
filter {
if [fields][service] {
mutate {
add_field => { "service" => "%{[fields][service]}" }
}
}
# Parse JSON logs
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Parse timestamp
if [timestamp] {
date {
match => [ "timestamp", "ISO8601" ]
}
}
# Extract error information
if [level] == "error" {
mutate {
add_tag => [ "error" ]
}
}
# Grok patterns for unstructured logs
if ![level] {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}"
}
}
}
# GeoIP enrichment
if [ip] {
geoip {
source => "ip"
target => "geoip"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{service}-%{+YYYY.MM.dd}"
template_name => "logs"
template => "/usr/share/logstash/templates/logs.json"
template_overwrite => true
}
# Debug output
if [level] == "debug" {
stdout {
codec => rubydebug
}
}
}
```

Filebeat Configuration
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: my-app
      environment: production
    fields_under_root: true
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

  - type: docker
    enabled: true
    containers.ids:
      - '*'
    processors:
      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
        - logs_path:
            logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
```

Alerting Strategies
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m
    - match:
        team: database
      receiver: 'database-team'
    - match:
        team: frontend
      receiver: 'frontend-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'database-team'
    email_configs:
      - to: 'database-team@example.com'
        subject: '[DB Alert] {{ .GroupLabels.alertname }}'

  - name: 'frontend-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#frontend-alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

Alert Runbooks
```markdown
# Alert Runbooks
## High Error Rate
### Symptoms
- Error rate > 5% for 5 minutes
- Users experiencing failures
### Investigation Steps
1. Check application logs for error patterns
2. Verify database connectivity
3. Check external service dependencies
4. Review recent deployments
### Resolution
1. If deployment related: rollback
2. If database issue: check connections and queries
3. If external service: implement circuit breaker
### Prevention
- Implement proper error handling
- Add circuit breakers for external services
- Improve deployment testing
## High Latency
### Symptoms
- 95th percentile latency > 500ms
- Slow user experience
### Investigation Steps
1. Check database query performance
2. Review application performance metrics
3. Check system resources (CPU, memory)
4. Analyze slow query logs
### Resolution
1. Optimize slow queries
2. Scale application instances
3. Add caching layers
4. Review code performance
### Prevention
- Regular performance testing
- Query optimization
- Proper indexing
- Caching strategy
```
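The investigation steps in these runbooks usually start from the same few PromQL queries. Here is a small sketch that pulls the current error rate from the Prometheus HTTP API during an incident, assuming Node 18+ (global fetch) and a Prometheus server reachable at PROM_URL:

```javascript
// check-error-rate.js - ad-hoc investigation helper, not part of the app
const PROM_URL = process.env.PROM_URL || 'http://localhost:9090';
const QUERY =
  '(rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) * 100';

async function currentErrorRate() {
  const url = `${PROM_URL}/api/v1/query?query=${encodeURIComponent(QUERY)}`;
  const response = await fetch(url);
  const body = await response.json();

  for (const series of body.data.result) {
    const instance = series.metric.instance || 'aggregate';
    const value = Number(series.value[1]).toFixed(2);
    console.log(`${instance}: ${value}% errors over the last 5 minutes`);
  }
}

currentErrorRate().catch(console.error);
```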
Observability Best Practices
Distributed Tracing
OpenTelemetry Implementation
```javascript
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces'
});
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-app',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
}),
traceExporter: jaegerExporter,
instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();
// Custom tracing
const { trace, SpanKind } = require('@opentelemetry/api');
function createCustomSpan(name, operation) {
const tracer = trace.getTracer('my-app');
return tracer.startSpan(name, {
kind: SpanKind.INTERNAL,
attributes: {
'operation.name': operation
}
});
}
module.exports = { createCustomSpan };
```

Correlation IDs
```javascript
// correlation.js
const { AsyncLocalStorage } = require('async_hooks');
const asyncLocalStorage = new AsyncLocalStorage();
function correlationMiddleware(req, res, next) {
const correlationId = req.headers['x-correlation-id'] ||
req.headers['x-request-id'] ||
generateCorrelationId();
res.setHeader('x-correlation-id', correlationId);
asyncLocalStorage.run({ correlationId }, () => {
req.correlationId = correlationId;
next();
});
}
function getCorrelationId() {
const store = asyncLocalStorage.getStore();
return store?.correlationId;
}
function generateCorrelationId() {
return `${Date.now()}-${Math.random().toString(36).substring(2)}`;
}
module.exports = {
correlationMiddleware,
getCorrelationId
};
```
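Correlation IDs only pay off if they show up in the logs. Below is a minimal sketch of wiring getCorrelationId() into the winston logger from the structured logging section; the custom format is an assumption added for illustration, not part of either module above.

```javascript
// logger-correlation.js - combines logger.js and correlation.js
const winston = require('winston');
const { getCorrelationId } = require('./correlation');

// winston.format() builds a custom formatter; this one stamps every entry
// with the correlation ID stored in AsyncLocalStorage, when one exists.
const withCorrelationId = winston.format((info) => {
  const correlationId = getCorrelationId();
  if (correlationId) {
    info.correlation_id = correlationId;
  }
  return info;
});

const logger = winston.createLogger({
  format: winston.format.combine(
    withCorrelationId(),
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()]
});

module.exports = logger;
```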
Monitoring Tools and Platforms
Grafana Dashboards
```json
{
"dashboard": {
"title": "Application Overview",
"tags": ["application", "overview"],
"timezone": "browser",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{instance}} - {{method}}"
}
],
"yAxes": [
{
"label": "Requests/sec",
"min": 0
}
]
},
{
"title": "Error Rate",
"type": "singlestat",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100",
"legendFormat": "Error Rate %"
}
],
"thresholds": "1,5",
"colorBackground": true
},
{
"title": "Response Time",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "99th percentile"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
```

Synthetic Monitoring
```javascript
// synthetic-monitoring.js
const puppeteer = require('puppeteer');
const prometheus = require('prom-client');
const syntheticMetrics = {
availability: new prometheus.Gauge({
name: 'synthetic_check_availability',
help: 'Synthetic check availability',
labelNames: ['check_name', 'endpoint']
}),
responseTime: new prometheus.Histogram({
name: 'synthetic_check_duration_seconds',
help: 'Synthetic check duration',
labelNames: ['check_name', 'endpoint'],
buckets: [0.1, 0.5, 1, 2, 5, 10]
})
};
class SyntheticMonitor {
constructor() {
this.checks = [];
}
addCheck(name, url, checkFunction) {
this.checks.push({ name, url, checkFunction });
}
async runChecks() {
const browser = await puppeteer.launch({ headless: true });
for (const check of this.checks) {
await this.runCheck(browser, check);
}
await browser.close();
}
async runCheck(browser, { name, url, checkFunction }) {
const page = await browser.newPage();
const startTime = Date.now();
try {
await page.goto(url, { waitUntil: 'networkidle2' });
if (checkFunction) {
await checkFunction(page);
}
const duration = (Date.now() - startTime) / 1000;
syntheticMetrics.availability.labels(name, url).set(1);
syntheticMetrics.responseTime.labels(name, url).observe(duration);
console.log(`✓ Check ${name} passed in ${duration}s`);
} catch (error) {
const duration = (Date.now() - startTime) / 1000;
syntheticMetrics.availability.labels(name, url).set(0);
syntheticMetrics.responseTime.labels(name, url).observe(duration);
console.error(`✗ Check ${name} failed:`, error.message);
} finally {
await page.close();
}
}
startScheduler(intervalMs = 60000) {
setInterval(() => {
this.runChecks().catch(console.error);
}, intervalMs);
}
}
// Usage
const monitor = new SyntheticMonitor();
monitor.addCheck('homepage', 'https://example.com', async (page) => {
await page.waitForSelector('h1');
const title = await page.$eval('h1', el => el.textContent);
if (!title.includes('Welcome')) {
throw new Error('Homepage title incorrect');
}
});
monitor.addCheck('login', 'https://example.com/login', async (page) => {
await page.type('#username', 'test@example.com');
await page.type('#password', 'password');
await page.click('#login-button');
await page.waitForNavigation();
const url = page.url();
if (!url.includes('/dashboard')) {
throw new Error('Login failed');
}
});
monitor.startScheduler(30000); // Run every 30 seconds
module.exports = monitor;
```
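The gauges and histograms above are created without an explicit registry, so they land in prom-client's default register, but nothing exposes them yet. One option is a small standalone metrics endpoint alongside the scheduler; the port below is an arbitrary choice.

```javascript
// synthetic-metrics-server.js - exposes the synthetic check metrics for scraping
const http = require('http');
const { register } = require('prom-client');

http
  .createServer(async (req, res) => {
    if (req.url === '/metrics') {
      res.setHeader('Content-Type', register.contentType);
      res.end(await register.metrics());
    } else {
      res.statusCode = 404;
      res.end();
    }
  })
  .listen(9464); // arbitrary port; add a matching scrape_config in Prometheus
```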
Best Practices Summary
Monitoring Strategy
- Define Clear SLIs/SLOs: Establish measurable service level indicators and track the error budget they imply (see the sketch after this list)
- Implement Four Golden Signals: Latency, traffic, errors, saturation
- Use Structured Logging: Consistent, searchable log formats
- Correlation IDs: Track requests across services
- Distributed Tracing: Understand request flows
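To make the error-budget side of an SLO concrete, here is a small arithmetic sketch for the 99.9% availability objective used earlier in this guide; the request counts are illustrative.

```javascript
// error-budget.js - error-budget accounting for a 99.9% availability SLO
const SLO_TARGET = 0.999; // allowed success ratio over the SLO window

function errorBudget(totalRequests, failedRequests) {
  const allowedFailures = totalRequests * (1 - SLO_TARGET);
  const consumed = failedRequests / allowedFailures; // 1.0 means the budget is gone
  return {
    allowedFailures: Math.floor(allowedFailures),
    consumedPct: (consumed * 100).toFixed(1),
    remainingFailures: Math.max(0, Math.floor(allowedFailures - failedRequests))
  };
}

// 10M requests in the window, 4,200 of them failed
console.log(errorBudget(10_000_000, 4_200));
// => { allowedFailures: 10000, consumedPct: '42.0', remainingFailures: 5800 }
```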
Alert Management
- Alert on Symptoms: Focus on user-impacting issues
- Reduce Alert Fatigue: Tune thresholds and group related alerts
- Actionable Alerts: Every alert should require action
- Runbook Documentation: Clear resolution steps linked from every alert (see the lint sketch after this list)
- Escalation Procedures: Define escalation paths
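One lightweight way to enforce the runbook and actionability rules is to lint alert definitions in CI. A sketch, assuming js-yaml is installed and the rules file follows the alert_rules.yml layout shown earlier:

```javascript
// lint-alert-rules.js - fails the build when an alert lacks required annotations
const fs = require('fs');
const yaml = require('js-yaml');

const doc = yaml.load(fs.readFileSync('alert_rules.yml', 'utf8'));
const problems = [];

for (const group of doc.groups || []) {
  for (const rule of group.rules || []) {
    if (!rule.alert) continue; // skip recording rules
    const annotations = rule.annotations || {};
    if (!annotations.summary) problems.push(`${rule.alert}: missing summary`);
    if (!annotations.runbook_url) problems.push(`${rule.alert}: missing runbook_url`);
  }
}

if (problems.length > 0) {
  console.error(problems.join('\n'));
  process.exit(1);
}
console.log('All alert rules have a summary and a runbook_url');
```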
Performance Optimization
- Baseline Metrics: Establish performance baselines
- Continuous Monitoring: Monitor trends over time
- Capacity Planning: Proactive resource planning
- Performance Testing: Regular load and stress testing
- Optimization Cycles: Regular performance reviews
Tool Selection
- Standardization: Use consistent tools across teams
- Integration: Ensure tools work together
- Scalability: Choose tools that scale with growth
- Cost Management: Monitor tool costs and usage
- Training: Ensure team proficiency with tools