DevOps/Cloud Observability/Lesson 04

Cloud + Monitoring — AWS · Prometheus · Grafana

45 min·theory

🎯 After reading this lesson

After finishing this lesson, you will be able to confidently do the following three things.

  • ✅ Map the core services of AWS / GCP / Azure
  • ✅ Set up OpenTelemetry (Trace · Metrics · Logs)
  • ✅ Define SLO/SLI/SLA and monitor them with Prometheus + Grafana

Keep these learning objectives as a checklist, and close the lesson once you can answer all of them.

Key AWS Services + Choosing a Cloud Provider

AWS (market leader since 2006):

Category | Service
Compute | EC2 (VM) · Lambda (serverless) · ECS · EKS (containers)
Storage | S3 (object) · EBS (block) · EFS (file)
DB | RDS (managed RDBMS) · DynamoDB (NoSQL) · ElastiCache (Redis)
Network | VPC · CloudFront (CDN) · Route 53 (DNS) · API Gateway
DevOps | CodeDeploy · CodePipeline · CloudFormation
Security | IAM · KMS · Secrets Manager · WAF
Observability | CloudWatch · X-Ray (tracing)

Cloud Comparison:

Cloud | Strengths | Korean Market
AWS | Widest variety · learning resources | #1
GCP | Kubernetes · strong AI | Growing
Azure | Enterprise · Microsoft ecosystem | Enterprises
Naver Cloud | Korean data · certifications | Government · Finance

Cost-saving Tips:

  • Reserved Instance / Savings Plan — 1- or 3-year commitment (~60% discount is typical; AWS advertises up to 72%)
  • Spot Instance — interruptible spare capacity (up to ~90% discount) · suited to batch workloads
  • S3 Glacier — archival data (~$0.004/GB per month)
  • Auto Scaling — scales automatically with traffic
  • CloudWatch Billing Alarm — alerts when thresholds are exceeded
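
To get a feel for what the discounts above mean in practice, here is a rough sketch that compares the monthly cost of one instance under each pricing model. The hourly price is a made-up placeholder, and the discount fractions are the approximate figures from the tips above, not quoted AWS prices.

```javascript
// Hypothetical On-Demand price; real prices depend on instance type and region.
const ON_DEMAND_HOURLY_USD = 0.10;
const HOURS_PER_MONTH = 730; // average month: 8760 hours / 12

// Approximate discounts from the cost-saving tips (assumptions, not price quotes).
const DISCOUNTS = { onDemand: 0, reserved: 0.60, spot: 0.90 };

// Monthly cost of one always-on instance at a given discount.
function monthlyCost(hourlyUsd, discount, hours = HOURS_PER_MONTH) {
  return hourlyUsd * (1 - discount) * hours;
}

const costs = Object.fromEntries(
  Object.entries(DISCOUNTS).map(([plan, discount]) => [
    plan,
    monthlyCost(ON_DEMAND_HOURLY_USD, discount),
  ])
);

for (const [plan, usd] of Object.entries(costs)) {
  console.log(`${plan.padEnd(9)} $${usd.toFixed(2)}/month`);
}
```

Spot looks cheapest on paper, but the instance can be reclaimed at any time, so the comparison only holds for workloads that tolerate interruption.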

Observability — Metrics · Logs · Traces

The 3 Pillars of Observability:

Pillar | What it is | Tools
Metrics | Time-series numbers (CPU · request count · error rate) | Prometheus + Grafana / Datadog / CloudWatch
Logs | Text events | Loki / ELK (Elasticsearch + Logstash + Kibana) / Datadog
Traces | Distributed call tracing | Jaeger / Tempo / Datadog APM

Prometheus + Grafana (open-source standard):

  • Prometheus = metric collection and storage (pull-based)
  • Grafana = visualization (dashboards)
  • Alertmanager = alerting (Slack · PagerDuty)
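
"Pull-based" means the application only has to serve its current metric values as plain text over HTTP, and Prometheus fetches them on its own schedule. The sketch below renders one counter in the Prometheus text exposition format without any library, purely for illustration. `renderCounter`, `escapeLabelValue`, and the sample values are hypothetical; a real app would normally rely on a client library.

```javascript
// Escape backslash, double quote, and newline, as the text format requires for label values.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one counter family: a HELP line, a TYPE line, then one line per label set.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(labelText ? `${name}{${labelText}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Made-up sample data: what a scrape of /metrics might return.
const scrapeBody = renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', status: 200 }, value: 42 },
  { labels: { method: 'POST', status: 500 }, value: 3 },
]);

console.log(scrapeBody);
```

Each scrape returns the running totals. Prometheus stores them with a timestamp and derives rates at query time.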

4 Golden Signals (Google SRE):
1. Latency — response time (p50 · p95 · p99)
2. Traffic — request rate (QPS)
3. Errors — 5xx · exception rate
4. Saturation — CPU · memory · disk · DB connections
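
As a rough illustration of how the four signals come out of raw request data, the sketch below computes them for one small window. The request samples, the window length, and the connection-pool numbers are all invented for the example, and `percentile` uses the simple nearest-rank method, which is only one of several percentile definitions.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical 60-second window of requests.
const windowSeconds = 60;
const requests = [
  { ms: 12, status: 200 }, { ms: 15, status: 200 }, { ms: 18, status: 200 },
  { ms: 20, status: 200 }, { ms: 22, status: 200 }, { ms: 25, status: 200 },
  { ms: 30, status: 200 }, { ms: 45, status: 200 }, { ms: 120, status: 200 },
  { ms: 900, status: 500 },
];
// Hypothetical resource gauge for the saturation signal.
const dbPool = { inUse: 18, max: 20 };

const durations = requests.map((r) => r.ms);
const signals = {
  latency: {
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  },
  trafficQps: requests.length / windowSeconds,
  errorRate: requests.filter((r) => r.status >= 500).length / requests.length,
  saturation: dbPool.inUse / dbPool.max,
};

console.log(signals);
```

In this sample the median is 22 ms while p95 is 900 ms: a single slow request is invisible in the median but dominates the tail, which is why latency is tracked at several percentiles.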

SLO · SLI · SLA:

  • SLI (Indicator) — a measurable metric (e.g., availability %)
  • SLO (Objective) — a target (e.g., 99.9% availability)
  • SLA (Agreement) — a contract (compensation for violations)
  • Error Budget — 100% minus the SLO. A 99.9% SLO leaves a 0.1% budget, about 43 minutes of downtime per 30-day month (43,200 min × 0.001 ≈ 43.2 min)
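
The error-budget arithmetic can be sketched in a few lines. The function names and the request counts below are hypothetical, and the 30-day window is an assumption; teams pick their own measurement window.

```javascript
// Allowed downtime in minutes for an availability SLO over a window (default: 30 days).
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of a request-based error budget already used up.
function budgetConsumed(sloPercent, totalRequests, failedRequests) {
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  return failedRequests / allowedFailures;
}

for (const slo of [99, 99.9, 99.99]) {
  console.log(`SLO ${slo}% -> ${errorBudgetMinutes(slo).toFixed(1)} min per 30 days`);
}

// Made-up month: 1,000,000 requests, 400 of them failed.
const sli = (1 - 400 / 1_000_000) * 100; // measured availability in percent
const consumed = budgetConsumed(99.9, 1_000_000, 400);
console.log(`SLI ${sli.toFixed(2)}%, budget used ${(consumed * 100).toFixed(0)}%`);
```

Each extra "nine" cuts the budget by a factor of ten, so the jump from 99.9% to 99.99% leaves only a few minutes of allowed downtime per month.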

eBPF-based Tools (2024+ trend):

  • Cilium (networking), Falco (security), Pixie (app observability)
  • eBPF runs sandboxed programs inside the kernel, so the application itself needs no code changes

💻 📌 Prometheus + Grafana Example
# ============================================
# docker-compose.yml — Local Prometheus + Grafana
# ============================================
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # local demo only; change for any shared environment
    volumes:
      - grafana-data:/var/lib/grafana

  node-exporter:                # System metrics
    image: prom/node-exporter:latest
    ports: ["9100:9100"]

volumes:
  grafana-data:

# ============================================
# prometheus.yml — Metric collection settings
# ============================================
# global:
#   scrape_interval: 15s
# scrape_configs:
#   - job_name: 'myapp'
#     static_configs:
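#       # The app runs on the host; inside the Prometheus container 'localhost' is the container itself.
#       # host.docker.internal resolves on Docker Desktop; on Linux, add
#       # extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service.
#       - targets: ['host.docker.internal:3000']      # /metrics endpoint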
#   - job_name: 'node'
#     static_configs:
#       - targets: ['node-exporter:9100']

# ============================================
# Exposing metrics in Node.js app
# ============================================
# const express = require('express');
# const { register, Counter, Histogram } = require('prom-client');
# const app = express();
#
# const httpRequests = new Counter({
#   name: 'http_requests_total',
#   help: 'Total HTTP requests',
#   labelNames: ['method', 'route', 'status']
# });
#
# const httpDuration = new Histogram({
#   name: 'http_duration_seconds',
#   help: 'HTTP request duration',
#   labelNames: ['method', 'route'],
#   buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
# });
#
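# app.use((req, res, next) => {
#   const end = httpDuration.startTimer();
#   res.on('finish', () => {
#     // Label with the matched route pattern (e.g. /users/:id), not the raw path,
#     // so the number of distinct label values stays bounded.
#     const route = req.route ? req.route.path : 'unmatched';
#     httpRequests.inc({ method: req.method, route, status: res.statusCode });
#     end({ method: req.method, route });
#   });
#   next();
# });
#
# app.get('/metrics', async (req, res) => {
#   res.set('Content-Type', register.contentType);
#   res.end(await register.metrics());
# });
#
# app.listen(3000);   // port 3000 matches the 'myapp' scrape target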
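
The histogram above stores only per-bucket counts, not individual durations, so percentiles such as p95 have to be estimated from the buckets. The sketch below shows one way to do that with linear interpolation inside a bucket, similar in spirit to what PromQL's `histogram_quantile` does. The upper bounds mirror the `buckets` array in the example; the counts and the `estimateQuantile` helper are made up for illustration.

```javascript
// Cumulative counts per upper bound ("le" = less than or equal), the way a histogram is exposed.
const buckets = [
  { le: 0.01, count: 10 },
  { le: 0.05, count: 50 },
  { le: 0.1, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];

// Estimate the q-quantile (0..1) by interpolating linearly within the bucket that contains it.
function estimateQuantile(q, cumulativeBuckets) {
  const total = cumulativeBuckets[cumulativeBuckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = cumulativeBuckets.findIndex((b) => b.count >= rank);
  const upper = cumulativeBuckets[i].le;
  const lower = i === 0 ? 0 : cumulativeBuckets[i - 1].le;
  const below = i === 0 ? 0 : cumulativeBuckets[i - 1].count;
  // Values beyond the largest finite bound cannot be interpolated.
  if (upper === Infinity) return lower;
  const inBucket = cumulativeBuckets[i].count - below;
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

console.log(`p90 estimate: ${estimateQuantile(0.9, buckets).toFixed(3)} s`);
```

The estimate can never be more precise than the bucket it lands in, so bucket boundaries are best placed around the latencies that matter for the SLO.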

🤖 Try asking AI like this

Once you know the concepts in this lesson, you can give AI specific instructions. Instead of a vague "fix this," you can phrase requests in precise vocabulary, and that is where token savings begin.

  • "Set up OpenTelemetry trace + metrics + logs for this app"
  • "Configure monitoring for this SLO (99.9% availability) using Prometheus + Grafana"

Why does this reduce tokens?

Without the concepts, even after receiving an AI answer you still have to ask "What does that mean?" all over again. Those follow-up questions are what eat up tokens. Learn the concepts once, and the conversation ends in a single round.
