Monitoring Guide
This comprehensive guide covers monitoring strategies, tools, and best practices for applications and infrastructure built with Trae.
Overview
Effective monitoring is crucial for maintaining application performance, reliability, and user experience. This guide covers:
- Monitoring fundamentals
- Application performance monitoring (APM)
- Infrastructure monitoring
- Log management
- Alerting strategies
- Observability best practices
- Monitoring tools and platforms
Monitoring Fundamentals
Key Concepts
The Three Pillars of Observability
```
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│     Metrics     │  │      Logs       │  │     Traces      │
│                 │  │                 │  │                 │
│ • Counters      │  │ • Events        │  │ • Requests      │
│ • Gauges        │  │ • Errors        │  │ • Spans         │
│ • Histograms    │  │ • Debug Info    │  │ • Dependencies  │
│ • Summaries     │  │ • Audit Trail   │  │ • Performance   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```
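Each pillar is covered in depth later in this guide. As a quick orientation, the sketch below shows how a single operation can emit all three signals from one place; it assumes prom-client and @opentelemetry/api are installed, and the span stays a no-op until an OpenTelemetry SDK (configured in the tracing section) is registered.

```javascript
// pillars-example.js - illustrative only; these names are not used elsewhere in this guide
const { Counter } = require('prom-client');
const { trace } = require('@opentelemetry/api');

// Metric: a monotonically increasing counter in the default prom-client registry
const ordersProcessed = new Counter({
  name: 'orders_processed_total',
  help: 'Total number of processed orders'
});

function processOrder(orderId) {
  const tracer = trace.getTracer('pillars-example');
  // Trace: one span per operation (no-op until an SDK is registered)
  tracer.startActiveSpan('process-order', (span) => {
    ordersProcessed.inc(); // metric

    // Log: a structured event carrying the trace ID for correlation
    console.log(JSON.stringify({
      level: 'info',
      message: 'order processed',
      order_id: orderId,
      trace_id: span.spanContext().traceId
    }));

    span.end(); // trace
  });
}

processOrder('A-1001');
```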
Monitoring Types
- Synthetic Monitoring: Proactive testing with simulated transactions
- Real User Monitoring (RUM): Actual user experience tracking in the browser (see the sketch after this list)
- Infrastructure Monitoring: System resources and health
- Application Monitoring: Code-level performance and errors
- Business Monitoring: KPIs and business metrics
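Real user monitoring is the one type in this list that later sections do not revisit, so here is a minimal browser-side sketch. It relies on the standard Navigation Timing API and navigator.sendBeacon; the /rum collection endpoint is a hypothetical backend that would aggregate these timings.

```javascript
// rum-snippet.js - runs in the browser, not in Node
window.addEventListener('load', () => {
  // Defer one tick so loadEventEnd is populated on the navigation entry
  setTimeout(() => {
    const [nav] = performance.getEntriesByType('navigation');
    if (!nav) return;

    const payload = JSON.stringify({
      page: location.pathname,
      ttfb_ms: nav.responseStart - nav.requestStart,
      dom_content_loaded_ms: nav.domContentLoadedEventEnd - nav.startTime,
      load_ms: nav.loadEventEnd - nav.startTime
    });

    // sendBeacon is fire-and-forget and survives page navigation
    navigator.sendBeacon('/rum', payload); // hypothetical collection endpoint
  }, 0);
});
```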
Monitoring Strategy
SLI, SLO, and SLA Framework
```yaml
# Service Level Indicators (SLIs)
slis:
  availability:
    description: "Percentage of successful requests"
    measurement: "(successful_requests / total_requests) * 100"
    target: "> 99.9%"
  latency:
    description: "95th percentile response time"
    measurement: "response_time_95th_percentile"
    target: "< 200ms"
  error_rate:
    description: "Percentage of failed requests"
    measurement: "(failed_requests / total_requests) * 100"
    target: "< 0.1%"

# Service Level Objectives (SLOs)
slos:
  - name: "API Availability"
    sli: "availability"
    target: "99.9% over 30 days"
    error_budget: "0.1%"
  - name: "Response Time"
    sli: "latency"
    target: "95% of requests < 200ms"
    measurement_window: "5 minutes"

# Service Level Agreements (SLAs)
slas:
  - name: "Service Uptime"
    commitment: "99.5% monthly uptime"
    penalty: "Service credits for downtime"
    measurement: "External monitoring"
```

Application Performance Monitoring
APM Implementation
Node.js Application Monitoring
```javascript
// app.js - APM setup
const apm = require('elastic-apm-node').start({
serviceName: 'my-app',
secretToken: process.env.ELASTIC_APM_SECRET_TOKEN,
serverUrl: process.env.ELASTIC_APM_SERVER_URL,
environment: process.env.NODE_ENV,
captureBody: 'errors',
captureHeaders: true,
logLevel: 'info'
});
const express = require('express');
const prometheus = require('prom-client');
const app = express();
// Prometheus metrics
const register = new prometheus.Registry();
prometheus.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new prometheus.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
const httpRequestTotal = new prometheus.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status']
});
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
// Middleware for metrics collection
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode)
.inc();
});
next();
});
// Health check endpoint
app.get('/health', (req, res) => {
res.status(200).json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: process.uptime(),
memory: process.memoryUsage(),
version: process.env.npm_package_version
});
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Error handling with APM
app.use((err, req, res, next) => {
apm.captureError(err);
console.error('Error:', err);
res.status(500).json({ error: 'Internal Server Error' });
});
module.exports = app;
```

Python Application Monitoring
```python
# app.py - APM setup
import os
import time
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import elasticapm
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)

# Elastic APM configuration
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'my-python-app',
    'SECRET_TOKEN': os.environ.get('ELASTIC_APM_SECRET_TOKEN'),
    'SERVER_URL': os.environ.get('ELASTIC_APM_SERVER_URL'),
    'ENVIRONMENT': os.environ.get('FLASK_ENV', 'production'),
    'CAPTURE_BODY': 'errors',
    'CAPTURE_HEADERS': True
}
apm = ElasticAPM(app)

# Prometheus metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(request_latency)
    return response

@app.route('/health')
def health_check():
    return jsonify({
        'status': 'healthy',
        'timestamp': time.time(),
        'version': os.environ.get('APP_VERSION', 'unknown')
    })

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.errorhandler(Exception)
def handle_exception(e):
    apm.capture_exception()
    return jsonify({'error': 'Internal Server Error'}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

Custom Metrics Implementation
Business Metrics
```javascript
// business-metrics.js
const prometheus = require('prom-client');
class BusinessMetrics {
constructor() {
this.userRegistrations = new prometheus.Counter({
name: 'user_registrations_total',
help: 'Total number of user registrations',
labelNames: ['source', 'plan']
});
this.orderValue = new prometheus.Histogram({
name: 'order_value_dollars',
help: 'Order value in dollars',
labelNames: ['product_category'],
buckets: [10, 50, 100, 500, 1000, 5000]
});
this.activeUsers = new prometheus.Gauge({
name: 'active_users_current',
help: 'Current number of active users',
labelNames: ['session_type']
});
this.featureUsage = new prometheus.Counter({
name: 'feature_usage_total',
help: 'Total feature usage count',
labelNames: ['feature_name', 'user_tier']
});
}
recordUserRegistration(source, plan) {
this.userRegistrations.labels(source, plan).inc();
}
recordOrder(value, category) {
this.orderValue.labels(category).observe(value);
}
updateActiveUsers(count, sessionType) {
this.activeUsers.labels(sessionType).set(count);
}
recordFeatureUsage(featureName, userTier) {
this.featureUsage.labels(featureName, userTier).inc();
}
}
module.exports = new BusinessMetrics();
```

Performance Metrics
```javascript
// performance-metrics.js
const prometheus = require('prom-client');
class PerformanceMetrics {
constructor() {
this.databaseQueryDuration = new prometheus.Histogram({
name: 'database_query_duration_seconds',
help: 'Database query duration',
labelNames: ['query_type', 'table'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
});
this.cacheHitRate = new prometheus.Gauge({
name: 'cache_hit_rate',
help: 'Cache hit rate percentage',
labelNames: ['cache_type']
});
this.queueSize = new prometheus.Gauge({
name: 'queue_size_current',
help: 'Current queue size',
labelNames: ['queue_name']
});
this.externalApiCalls = new prometheus.Counter({
name: 'external_api_calls_total',
help: 'Total external API calls',
labelNames: ['service', 'endpoint', 'status']
});
}
recordDatabaseQuery(queryType, table, duration) {
this.databaseQueryDuration.labels(queryType, table).observe(duration);
}
updateCacheHitRate(cacheType, hitRate) {
this.cacheHitRate.labels(cacheType).set(hitRate);
}
updateQueueSize(queueName, size) {
this.queueSize.labels(queueName).set(size);
}
recordExternalApiCall(service, endpoint, status) {
this.externalApiCalls.labels(service, endpoint, status).inc();
}
}
module.exports = new PerformanceMetrics();
```

Infrastructure Monitoring
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Application metrics
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Database metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
```

Alert Rules
```yaml
# alert_rules.yml
groups:
  - name: application.rules
    rules:
      - alert: HighErrorRate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m])
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% for {{ $labels.instance }}"
          runbook_url: "https://runbooks.example.com/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.instance }}"

      - alert: DatabaseConnectionsHigh
        expr: |
          pg_stat_database_numbackends / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "Database connections high"
          description: "Database connections are at {{ $value }}% of maximum"

  - name: infrastructure.rules
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk space low"
          description: "Disk usage is {{ $value }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
```

Recording Rules
```yaml
# recording_rules.yml
groups:
  - name: application.recording
    interval: 30s
    rules:
      - record: app:http_request_rate
        expr: |
          rate(http_requests_total[5m])
      - record: app:http_error_rate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) /
          rate(http_requests_total[5m])
      - record: app:http_latency_p95
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      - record: app:http_latency_p99
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

  - name: business.recording
    interval: 60s
    rules:
      - record: business:revenue_per_hour
        expr: |
          increase(order_value_dollars_sum[1h])
      - record: business:user_growth_rate
        expr: |
          rate(user_registrations_total[1h])
      - record: business:feature_adoption_rate
        expr: |
          sum by (feature_name) (rate(feature_usage_total[1h]))
```

Log Management
Structured Logging
Node.js Logging
```javascript
// logger.js
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');
const esTransportOpts = {
level: 'info',
clientOpts: {
node: process.env.ELASTICSEARCH_URL || 'http://localhost:9200'
},
index: 'app-logs',
indexTemplate: {
name: 'app-logs-template',
pattern: 'app-logs-*',
settings: {
number_of_shards: 1,
number_of_replicas: 1
},
mappings: {
properties: {
'@timestamp': { type: 'date' },
level: { type: 'keyword' },
message: { type: 'text' },
service: { type: 'keyword' },
trace_id: { type: 'keyword' },
span_id: { type: 'keyword' },
user_id: { type: 'keyword' },
request_id: { type: 'keyword' }
}
}
}
};
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: process.env.SERVICE_NAME || 'app',
version: process.env.APP_VERSION || 'unknown',
environment: process.env.NODE_ENV || 'development'
},
transports: [
new winston.transports.Console({
format: winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
}),
new ElasticsearchTransport(esTransportOpts)
]
});
// Request logging middleware
const requestLogger = (req, res, next) => {
const start = Date.now();
const requestId = req.headers['x-request-id'] || generateRequestId();
req.logger = logger.child({
request_id: requestId,
user_id: req.user?.id,
ip: req.ip,
user_agent: req.get('User-Agent')
});
req.logger.info('Request started', {
method: req.method,
url: req.url,
headers: req.headers
});
res.on('finish', () => {
const duration = Date.now() - start;
req.logger.info('Request completed', {
method: req.method,
url: req.url,
status: res.statusCode,
duration,
content_length: res.get('Content-Length')
});
});
next();
};
function generateRequestId() {
return Math.random().toString(36).substring(2, 15) +
Math.random().toString(36).substring(2, 15);
}
module.exports = { logger, requestLogger };
```

Python Logging
```python
# logger.py
import json
import logging
import os
import time
import uuid
from datetime import datetime

from flask import request, g
from pythonjsonlogger import jsonlogger


class CustomJsonFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super(CustomJsonFormatter, self).add_fields(log_record, record, message_dict)
        if not log_record.get('timestamp'):
            log_record['timestamp'] = datetime.utcnow().isoformat()
        if hasattr(g, 'request_id'):
            log_record['request_id'] = g.request_id
        if hasattr(g, 'user_id'):
            log_record['user_id'] = g.user_id
        log_record['service'] = 'my-python-app'
        log_record['version'] = os.environ.get('APP_VERSION', 'unknown')
        log_record['environment'] = os.environ.get('FLASK_ENV', 'development')


def setup_logging():
    formatter = CustomJsonFormatter(
        '%(timestamp)s %(levelname)s %(name)s %(message)s'
    )
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def request_logging_middleware(app):
    """Register request/response logging hooks on the given Flask app."""
    @app.before_request
    def before_request():
        g.start_time = time.time()
        g.request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
        g.user_id = getattr(request, 'user_id', None)
        logger.info('Request started', extra={
            'method': request.method,
            'url': request.url,
            'remote_addr': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        })

    @app.after_request
    def after_request(response):
        duration = time.time() - g.start_time
        logger.info('Request completed', extra={
            'method': request.method,
            'url': request.url,
            'status': response.status_code,
            'duration': duration,
            'content_length': response.content_length
        })
        return response


logger = setup_logging()
```

Log Aggregation with ELK Stack
Logstash Configuration
```ruby
# logstash.conf
input {
beats {
port => 5044
}
http {
port => 8080
codec => json
}
}
filter {
if [fields][service] {
mutate {
add_field => { "service" => "%{[fields][service]}" }
}
}
# Parse JSON logs
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Parse timestamp
if [timestamp] {
date {
match => [ "timestamp", "ISO8601" ]
}
}
# Extract error information
if [level] == "error" {
mutate {
add_tag => [ "error" ]
}
}
# Grok patterns for unstructured logs
if ![level] {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}"
}
}
}
# GeoIP enrichment
if [ip] {
geoip {
source => "ip"
target => "geoip"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{service}-%{+YYYY.MM.dd}"
template_name => "logs"
template => "/usr/share/logstash/templates/logs.json"
template_overwrite => true
}
# Debug output
if [level] == "debug" {
stdout {
codec => rubydebug
}
}
}
```

Filebeat Configuration
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: my-app
      environment: production
    fields_under_root: true
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

  - type: docker
    enabled: true
    containers.ids:
      - '*'
    processors:
      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
        - logs_path:
            logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
```

Alerting Strategies
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m
    - match:
        team: database
      receiver: 'database-team'
    - match:
        team: frontend
      receiver: 'frontend-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'database-team'
    email_configs:
      - to: 'database-team@example.com'
        subject: '[DB Alert] {{ .GroupLabels.alertname }}'

  - name: 'frontend-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#frontend-alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

Alert Runbooks
```markdown
# Alert Runbooks
## High Error Rate
### Symptoms
- Error rate > 5% for 5 minutes
- Users experiencing failures
### Investigation Steps
1. Check application logs for error patterns
2. Verify database connectivity
3. Check external service dependencies
4. Review recent deployments
### Resolution
1. If deployment related: rollback
2. If database issue: check connections and queries
3. If external service: implement circuit breaker
### Prevention
- Implement proper error handling
- Add circuit breakers for external services
- Improve deployment testing
## High Latency
### Symptoms
- 95th percentile latency > 500ms
- Slow user experience
### Investigation Steps
1. Check database query performance
2. Review application performance metrics
3. Check system resources (CPU, memory)
4. Analyze slow query logs
### Resolution
1. Optimize slow queries
2. Scale application instances
3. Add caching layers
4. Review code performance
### Prevention
- Regular performance testing
- Query optimization
- Proper indexing
- Caching strategy
```
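The investigation steps in these runbooks usually start from the same few PromQL queries. Here is a small sketch that pulls the current error rate from the Prometheus HTTP API during an incident, assuming Node 18+ (global fetch) and a Prometheus server reachable at PROM_URL:

```javascript
// check-error-rate.js - ad-hoc investigation helper, not part of the app
const PROM_URL = process.env.PROM_URL || 'http://localhost:9090';
const QUERY =
  '(rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) * 100';

async function currentErrorRate() {
  const url = `${PROM_URL}/api/v1/query?query=${encodeURIComponent(QUERY)}`;
  const response = await fetch(url);
  const body = await response.json();

  for (const series of body.data.result) {
    const instance = series.metric.instance || 'aggregate';
    const value = Number(series.value[1]).toFixed(2);
    console.log(`${instance}: ${value}% errors over the last 5 minutes`);
  }
}

currentErrorRate().catch(console.error);
```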
Observability Best Practices
Distributed Tracing
OpenTelemetry Implementation
```javascript
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces'
});
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-app',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
}),
traceExporter: jaegerExporter,
instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();
// Custom tracing
const { trace, SpanKind } = require('@opentelemetry/api');
function createCustomSpan(name, operation) {
const tracer = trace.getTracer('my-app');
return tracer.startSpan(name, {
kind: SpanKind.INTERNAL,
attributes: {
'operation.name': operation
}
});
}
module.exports = { createCustomSpan };
```

Correlation IDs
```javascript
// correlation.js
const { AsyncLocalStorage } = require('async_hooks');
const asyncLocalStorage = new AsyncLocalStorage();
function correlationMiddleware(req, res, next) {
const correlationId = req.headers['x-correlation-id'] ||
req.headers['x-request-id'] ||
generateCorrelationId();
res.setHeader('x-correlation-id', correlationId);
asyncLocalStorage.run({ correlationId }, () => {
req.correlationId = correlationId;
next();
});
}
function getCorrelationId() {
const store = asyncLocalStorage.getStore();
return store?.correlationId;
}
function generateCorrelationId() {
return `${Date.now()}-${Math.random().toString(36).substring(2)}`;
}
module.exports = {
correlationMiddleware,
getCorrelationId
};
```
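Correlation IDs only pay off if they show up in the logs. Below is a minimal sketch of wiring getCorrelationId() into the winston logger from the structured logging section; the custom format is an assumption added for illustration, not part of either module above.

```javascript
// logger-correlation.js - combines logger.js and correlation.js
const winston = require('winston');
const { getCorrelationId } = require('./correlation');

// winston.format() builds a custom formatter; this one stamps every entry
// with the correlation ID stored in AsyncLocalStorage, when one exists.
const withCorrelationId = winston.format((info) => {
  const correlationId = getCorrelationId();
  if (correlationId) {
    info.correlation_id = correlationId;
  }
  return info;
});

const logger = winston.createLogger({
  format: winston.format.combine(
    withCorrelationId(),
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()]
});

module.exports = logger;
```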
Monitoring Tools and Platforms
Grafana Dashboards
```json
{
"dashboard": {
"title": "Application Overview",
"tags": ["application", "overview"],
"timezone": "browser",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{instance}} - {{method}}"
}
],
"yAxes": [
{
"label": "Requests/sec",
"min": 0
}
]
},
{
"title": "Error Rate",
"type": "singlestat",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100",
"legendFormat": "Error Rate %"
}
],
"thresholds": "1,5",
"colorBackground": true
},
{
"title": "Response Time",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "99th percentile"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
```

Synthetic Monitoring
```javascript
// synthetic-monitoring.js
const puppeteer = require('puppeteer');
const prometheus = require('prom-client');
const syntheticMetrics = {
availability: new prometheus.Gauge({
name: 'synthetic_check_availability',
help: 'Synthetic check availability',
labelNames: ['check_name', 'endpoint']
}),
responseTime: new prometheus.Histogram({
name: 'synthetic_check_duration_seconds',
help: 'Synthetic check duration',
labelNames: ['check_name', 'endpoint'],
buckets: [0.1, 0.5, 1, 2, 5, 10]
})
};
class SyntheticMonitor {
constructor() {
this.checks = [];
}
addCheck(name, url, checkFunction) {
this.checks.push({ name, url, checkFunction });
}
async runChecks() {
const browser = await puppeteer.launch({ headless: true });
for (const check of this.checks) {
await this.runCheck(browser, check);
}
await browser.close();
}
async runCheck(browser, { name, url, checkFunction }) {
const page = await browser.newPage();
const startTime = Date.now();
try {
await page.goto(url, { waitUntil: 'networkidle2' });
if (checkFunction) {
await checkFunction(page);
}
const duration = (Date.now() - startTime) / 1000;
syntheticMetrics.availability.labels(name, url).set(1);
syntheticMetrics.responseTime.labels(name, url).observe(duration);
console.log(`✓ Check ${name} passed in ${duration}s`);
} catch (error) {
const duration = (Date.now() - startTime) / 1000;
syntheticMetrics.availability.labels(name, url).set(0);
syntheticMetrics.responseTime.labels(name, url).observe(duration);
console.error(`✗ Check ${name} failed:`, error.message);
} finally {
await page.close();
}
}
startScheduler(intervalMs = 60000) {
setInterval(() => {
this.runChecks().catch(console.error);
}, intervalMs);
}
}
// Usage
const monitor = new SyntheticMonitor();
monitor.addCheck('homepage', 'https://example.com', async (page) => {
await page.waitForSelector('h1');
const title = await page.$eval('h1', el => el.textContent);
if (!title.includes('Welcome')) {
throw new Error('Homepage title incorrect');
}
});
monitor.addCheck('login', 'https://example.com/login', async (page) => {
await page.type('#username', 'test@example.com');
await page.type('#password', 'password');
await page.click('#login-button');
await page.waitForNavigation();
const url = page.url();
if (!url.includes('/dashboard')) {
throw new Error('Login failed');
}
});
monitor.startScheduler(30000); // Run every 30 seconds
module.exports = monitor;
```
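The gauges and histograms above are created without an explicit registry, so they land in prom-client's default register, but nothing exposes them yet. One option is a small standalone metrics endpoint alongside the scheduler; the port below is an arbitrary choice.

```javascript
// synthetic-metrics-server.js - exposes the synthetic check metrics for scraping
const http = require('http');
const { register } = require('prom-client');

http
  .createServer(async (req, res) => {
    if (req.url === '/metrics') {
      res.setHeader('Content-Type', register.contentType);
      res.end(await register.metrics());
    } else {
      res.statusCode = 404;
      res.end();
    }
  })
  .listen(9464); // arbitrary port; add a matching scrape_config in Prometheus
```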
Best Practices Summary
Monitoring Strategy
- Define Clear SLIs/SLOs: Establish measurable service level indicators and track the error budget they imply (see the sketch after this list)
- Implement Four Golden Signals: Latency, traffic, errors, saturation
- Use Structured Logging: Consistent, searchable log formats
- Correlation IDs: Track requests across services
- Distributed Tracing: Understand request flows
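To make the error-budget side of an SLO concrete, here is a small arithmetic sketch for the 99.9% availability objective used earlier in this guide; the request counts are illustrative.

```javascript
// error-budget.js - error-budget accounting for a 99.9% availability SLO
const SLO_TARGET = 0.999; // allowed success ratio over the SLO window

function errorBudget(totalRequests, failedRequests) {
  const allowedFailures = totalRequests * (1 - SLO_TARGET);
  const consumed = failedRequests / allowedFailures; // 1.0 means the budget is gone
  return {
    allowedFailures: Math.floor(allowedFailures),
    consumedPct: (consumed * 100).toFixed(1),
    remainingFailures: Math.max(0, Math.floor(allowedFailures - failedRequests))
  };
}

// 10M requests in the window, 4,200 of them failed
console.log(errorBudget(10_000_000, 4_200));
// => { allowedFailures: 10000, consumedPct: '42.0', remainingFailures: 5800 }
```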
Alert Management
- Alert on Symptoms: Focus on user-impacting issues
- Reduce Alert Fatigue: Tune thresholds and group related alerts
- Actionable Alerts: Every alert should require action
- Runbook Documentation: Clear resolution steps linked from every alert (see the lint sketch after this list)
- Escalation Procedures: Define escalation paths
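One lightweight way to enforce the runbook and actionability rules is to lint alert definitions in CI. A sketch, assuming js-yaml is installed and the rules file follows the alert_rules.yml layout shown earlier:

```javascript
// lint-alert-rules.js - fails the build when an alert lacks required annotations
const fs = require('fs');
const yaml = require('js-yaml');

const doc = yaml.load(fs.readFileSync('alert_rules.yml', 'utf8'));
const problems = [];

for (const group of doc.groups || []) {
  for (const rule of group.rules || []) {
    if (!rule.alert) continue; // skip recording rules
    const annotations = rule.annotations || {};
    if (!annotations.summary) problems.push(`${rule.alert}: missing summary`);
    if (!annotations.runbook_url) problems.push(`${rule.alert}: missing runbook_url`);
  }
}

if (problems.length > 0) {
  console.error(problems.join('\n'));
  process.exit(1);
}
console.log('All alert rules have a summary and a runbook_url');
```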
Performance Optimization
- Baseline Metrics: Establish performance baselines
- Continuous Monitoring: Monitor trends over time
- Capacity Planning: Proactive resource planning
- Performance Testing: Regular load and stress testing
- Optimization Cycles: Regular performance reviews
Tool Selection
- Standardization: Use consistent tools across teams
- Integration: Ensure tools work together
- Scalability: Choose tools that scale with growth
- Cost Management: Monitor tool costs and usage
- Training: Ensure team proficiency with tools