Cloud + Monitoring — AWS · Prometheus · Grafana

45 min·theory

🎯 After reading this lesson

After finishing this lesson, you will be able to confidently do the following three things.

▸✅ Map the core services of AWS / GCP / Azure
▸✅ Set up OpenTelemetry (Trace · Metrics · Logs)
▸✅ Define SLO/SLI/SLA and monitor them with Prometheus + Grafana

Keep these learning objectives as a checklist, and close the lesson once you can answer all of them.

Key AWS Services + Choosing a Cloud Provider

AWS (launched in 2006; the long-standing market leader):

Category	Service
Compute	EC2 (VM) · Lambda (serverless) · ECS·EKS (containers)
Storage	S3 (object) · EBS (block) · EFS (file)
DB	RDS (managed RDBMS) · DynamoDB (NoSQL) · ElastiCache (managed Redis/Memcached cache)
Network	VPC · CloudFront (CDN) · Route 53 (DNS) · API Gateway
DevOps	CodeDeploy · CodePipeline · CloudFormation
Security	IAM · KMS · Secrets Manager · WAF
Observability	CloudWatch · X-Ray (tracing)

Cloud Comparison:

Cloud	Strengths	Korean Market
AWS	Widest variety · learning resources	#1
GCP	Kubernetes · strong AI	Growing
Azure	Enterprise · Microsoft ecosystem	Enterprises
Naver Cloud	Korean data · certifications	Government · Finance

Cost-saving Tips:

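▸Reserved Instance / Savings Plan — 1- or 3-year commitment (up to ~72% off On-Demand pricing; the exact discount depends on term and payment option)
▸Spot Instance — spare capacity that AWS can reclaim on short notice (up to 90% off On-Demand) · suited to batch and other interruption-tolerant workloads
▸S3 Glacier — archival data (~$0.004/GB per month; varies by region and storage class)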
▸Auto Scaling — scales automatically with traffic
▸CloudWatch Billing Alarm — alerts when thresholds are exceeded
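
To get a feel for what these discounts mean over a month, here is a rough sketch. The $0.10/hour On-Demand rate is a made-up placeholder, not a real AWS price, and the discount fractions are illustrative values at or below the figures quoted above. Actual savings depend on instance type, region, and term.

```javascript
// Rough monthly cost comparison for one always-on instance.
// ASSUMPTION: onDemandHourly is a hypothetical price, not a real AWS rate,
// and the discount fractions are illustrative, not quoted prices.
const HOURS_PER_MONTH = 730; // average hours in a month (8760 / 12)

function monthlyCost(onDemandHourly, discountFraction) {
  return onDemandHourly * (1 - discountFraction) * HOURS_PER_MONTH;
}

const onDemandHourly = 0.10; // hypothetical USD per hour
const plans = {
  'On-Demand': 0,
  'Savings Plan (assumed 60% off)': 0.60,
  'Spot (assumed 90% off)': 0.90,
};

for (const [name, discount] of Object.entries(plans)) {
  console.log(`${name}: $${monthlyCost(onDemandHourly, discount).toFixed(2)}/month`);
}
```

The same function can be reused with your own billing data to check whether a commitment is likely to pay off before you buy it.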

Observability — Metrics · Logs · Traces

The 3 Pillars of Observability:

Pillar	What it is	Tools
Metrics	Time-series numbers (CPU · request count · error rate)	Prometheus + Grafana / Datadog / CloudWatch
Logs	Text events	Loki / ELK (Elasticsearch + Logstash + Kibana) / Datadog
Traces	Distributed call tracing	Jaeger / Tempo / Datadog APM

Prometheus + Grafana (open-source standard):

▸Prometheus = metric collection and storage (pull-based)
▸Grafana = visualization (dashboards)
▸Alertmanager = alerting (Slack · PagerDuty)
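
"Pull-based" means Prometheus periodically sends an HTTP GET to each target's /metrics endpoint and parses the plain-text response. The sketch below shows roughly what that response looks like for a counter, using no libraries. The helper name renderCounter and the sample values are hypothetical; a real app would normally use a client library such as prom-client, as in the example later in this lesson.

```javascript
// Renders one counter in the Prometheus text exposition format.
// ASSUMPTION: renderCounter and the sample data are hypothetical, for
// illustration only. Label values are not escaped here; a real client
// library escapes backslashes, double quotes, and newlines.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const body = renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', status: '200' }, value: 1027 },
  { labels: { method: 'POST', status: '500' }, value: 3 },
]);
console.log(body);
```

Each scrape returns the current cumulative value, so Prometheus derives rates by comparing successive scrapes rather than by receiving individual events.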

4 Golden Signals (Google SRE):
1. Latency — response time (p50 · p95 · p99)
2. Traffic — request rate (QPS)
3. Errors — 5xx · exception rate
4. Saturation — CPU · memory · disk · DB connections
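
As a rough illustration, the first three signals can be derived from a window of request records. The record shape ({ durationMs, status }), the 60-second window, and the synthetic data are assumptions made for this sketch. Saturation is left out because it comes from resource gauges (CPU, memory, connection pools), not from request logs.

```javascript
// Summarizes latency, traffic, and errors from a window of request records.
// ASSUMPTION: the record shape and the window length are hypothetical.
function percentile(sortedValues, p) {
  // Nearest-rank method: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

function summarize(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    latencyMs: {
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99),
    },
    trafficQps: requests.length / windowSeconds,
    errorRate: errors / requests.length,
  };
}

// 100 synthetic requests: durations 10..1000 ms, every 50th one fails with a 500.
const requests = Array.from({ length: 100 }, (_, i) => ({
  durationMs: (i + 1) * 10,
  status: (i + 1) % 50 === 0 ? 500 : 200,
}));
console.log(summarize(requests, 60));
```

In practice Prometheus estimates percentiles from histogram buckets instead of sorting raw samples, so the numbers it reports are approximations of what this exact calculation would give.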

SLO · SLI · SLA:

▸SLI (Indicator) — a measurable metric (e.g., availability %)
▸SLO (Objective) — a target (e.g., 99.9% availability)
▸SLA (Agreement) — a contract (compensation for violations)
▸Error Budget — 100% - SLO. With a 99.9% SLO the budget is 0.1%, which allows about 43 minutes of downtime per 30-day month (0.001 × 43,200 min ≈ 43.2 min)
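
The error budget arithmetic can be sketched in a few lines. This assumes a 30-day window and treats all downtime as full unavailability; the function names are hypothetical.

```javascript
// Converts an availability SLO into an allowed-downtime budget.
// ASSUMPTION: a 30-day window; function names are hypothetical.
const MINUTES_PER_30_DAYS = 30 * 24 * 60; // 43,200 minutes

function errorBudgetMinutes(sloPercent, windowMinutes = MINUTES_PER_30_DAYS) {
  return ((100 - sloPercent) / 100) * windowMinutes;
}

function budgetRemainingMinutes(sloPercent, downtimeMinutesSoFar) {
  return errorBudgetMinutes(sloPercent) - downtimeMinutesSoFar;
}

for (const slo of [99, 99.9, 99.99]) {
  console.log(`${slo}% SLO -> ${errorBudgetMinutes(slo).toFixed(1)} min per 30 days`);
}
console.log(
  `Remaining after a 30-minute outage: ${budgetRemainingMinutes(99.9, 30).toFixed(1)} min`
);
```

Each extra "nine" shrinks the budget by a factor of ten, which is why teams usually pick the lowest SLO that users will accept and not the highest one they can imagine.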

eBPF-based Tools (2024+ trend):

▸Cilium (networking), Falco (runtime security), Pixie (app observability)
▸Runs sandboxed, verifier-checked programs inside the Linux kernel → no application code changes required

💻 📌 Prometheus + Grafana Example

# ============================================
# docker-compose.yml — Local Prometheus + Grafana
# ============================================
version: '3'
services:
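  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
    extra_hosts:                # Lets Prometheus reach an app on the Docker host (needed on Linux)
      - "host.docker.internal:host-gateway"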

  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # local demo only; change it for any shared environment
    volumes:
      - grafana-data:/var/lib/grafana

  node-exporter:                # System metrics
    image: prom/node-exporter:latest
    ports: ["9100:9100"]

volumes:
  grafana-data:

# ============================================
# prometheus.yml — Metric collection settings
# ============================================
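# global:
#   scrape_interval: 15s
# scrape_configs:
#   - job_name: 'myapp'
#     static_configs:
#       # Assumes the app runs on the Docker host, port 3000, and serves /metrics.
#       # Inside the Prometheus container, 'localhost' is the container itself,
#       # so use host.docker.internal (on Linux this name resolves only if the
#       # prometheus service maps it to host-gateway via extra_hosts).
#       - targets: ['host.docker.internal:3000']
#   - job_name: 'node'
#     static_configs:
#       - targets: ['node-exporter:9100']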

# ============================================
# Exposing metrics in Node.js app
# ============================================
# const express = require('express');
# const { register, Counter, Histogram, collectDefaultMetrics } = require('prom-client');
# const app = express();
#
# collectDefaultMetrics();   // process-level metrics (CPU, memory, event loop lag)
#
# const httpRequests = new Counter({
#   name: 'http_requests_total',
#   help: 'Total HTTP requests',
#   labelNames: ['method', 'route', 'status']
# });
#
# const httpDuration = new Histogram({
#   name: 'http_duration_seconds',
#   help: 'HTTP request duration in seconds',
#   labelNames: ['method', 'route'],
#   buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
# });
#
# // Register this middleware before the routes so every request is measured.
# app.use((req, res, next) => {
#   const end = httpDuration.startTimer();
#   res.on('finish', () => {
#     // Label with the matched route pattern (e.g. /users/:id), not the raw
#     // path, so the number of label values stays bounded.
#     const route = req.route ? req.route.path : 'unmatched';
#     httpRequests.inc({ method: req.method, route, status: res.statusCode });
#     end({ method: req.method, route });
#   });
#   next();
# });
#
# app.get('/metrics', async (req, res) => {
#   res.set('Content-Type', register.contentType);
#   res.end(await register.metrics());
# });
#
# app.listen(3000);   // same port as the 'myapp' scrape target

🤖 Try asking AI like this

Once you know the concepts in this lesson, you can give AI specific instructions. Instead of a vague "fix this," you can make precise requests using the right vocabulary, and that is where token savings begin.

▸"Set up OpenTelemetry trace + metrics + logs for this app"
▸"Configure monitoring for this SLO (99.9% availability) using Prometheus + Grafana"

Why does this reduce tokens?

Without the concepts, you receive an AI answer and then have to ask "What does that mean?" about each unfamiliar term. Those follow-up questions are what eat up tokens. Learn the concepts once, and the conversation can often end in a single round.

Read this first: CI/CD + Kubernetes — GitHub Actions·Pod·Deployment

Up next: Collaboration & Git
