Cloud + Monitoring — AWS · Prometheus · Grafana
🎯 After reading this lesson
After finishing this lesson, you will be able to confidently do the following three things.
- ▸✅ Map the core services of AWS / GCP / Azure
- ▸✅ Set up OpenTelemetry (Traces · Metrics · Logs)
- ▸✅ Define SLO/SLI/SLA and monitor them with Prometheus + Grafana
Keep these learning objectives as a checklist, and close the lesson once you can answer all of them.
Key AWS Services + Choosing a Cloud Provider
AWS (market leader since 2006):
Cloud Comparison:
Cost-saving Tips:
- ▸Reserved Instance / Savings Plan — 1- or 3-year commitment (~60% discount)
- ▸Spot Instance — interruptible spare capacity (up to ~90% discount) · suited to batch workloads
- ▸S3 Glacier — archival data (~$0.004/GB per month)
- ▸Auto Scaling — scales automatically with traffic
- ▸CloudWatch Billing Alarm — alerts when estimated charges exceed a threshold you set
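To make the discount figures above concrete, here is a minimal sketch that compares monthly cost for one instance under each pricing model. The hourly price (`ON_DEMAND_HOURLY`) is a made-up value for illustration, not a real AWS quote, and the 60% and 90% discounts are the approximate figures from the list, not guaranteed rates.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

# Hypothetical on-demand price in USD per hour; NOT a real AWS quote.
ON_DEMAND_HOURLY = 0.10


def monthly_cost(hourly_rate: float, discount: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Return the monthly cost of one instance after a fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return hourly_rate * (1.0 - discount) * hours


def compare_pricing(hourly_rate: float) -> dict:
    """Compare pricing models using the approximate discounts from the lesson."""
    return {
        "on-demand": monthly_cost(hourly_rate),
        "reserved / savings plan": monthly_cost(hourly_rate, 0.60),
        "spot": monthly_cost(hourly_rate, 0.90),
    }


if __name__ == "__main__":
    for label, cost in compare_pricing(ON_DEMAND_HOURLY).items():
        print(f"{label:>24}: ${cost:.2f}/month")
```

Actual savings depend on instance type, region, term, and payment option, so treat the output as an order-of-magnitude comparison only.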
Observability — Metrics · Logs · Traces
The 3 Pillars of Observability: metrics, logs, and traces.
Prometheus + Grafana (open-source standard):
- ▸Prometheus = metric collection and storage (pull-based)
- ▸Grafana = visualization (dashboards)
- ▸Alertmanager = alerting (Slack · PagerDuty)
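"Pull-based" means the application exposes its current metric values over HTTP and Prometheus scrapes them on a schedule. The sketch below shows that idea using only the Python standard library: it renders counters in the Prometheus text exposition format and serves them at `/metrics`. The metric name `http_requests_total`, the `REQUEST_COUNTS` dictionary, and port 8000 are illustrative assumptions. A real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory application state: request counts by HTTP status.
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics(counts: dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever a scraper asks for them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given (arbitrary) port."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Prometheus would then be configured with this host and port as a scrape target, and Grafana would query Prometheus, not the application.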
4 Golden Signals (Google SRE):
1. Latency — response time (p50 · p95 · p99)
2. Traffic — request rate (QPS)
3. Errors — 5xx · exception rate
4. Saturation — CPU · memory · disk · DB connections
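As a rough illustration of the first three signals, the sketch below derives latency percentiles, traffic, and error rate from a list of request records. The `Request` type and `golden_signals` function are hypothetical names for this example. Saturation is left out because it comes from resource gauges (CPU, memory, connection pools), not from request records. In practice these values would usually be computed by Prometheus queries over histograms and counters.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_seconds: float) -> dict:
    """Summarize latency, traffic, and errors for one observation window."""
    latencies = [r.latency_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_qps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
    }
```

Percentiles are used for latency because an average hides the slow tail: a service can have a healthy mean while 1% of users wait several seconds.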
SLO · SLI · SLA:
- ▸SLI (Indicator) — a measurable metric (e.g., availability %)
- ▸SLO (Objective) — a target (e.g., 99.9% availability)
- ▸SLA (Agreement) — a contract (compensation for violations)
- ▸Error Budget — 100% − SLO. For a 99.9% SLO, the 0.1% budget allows ~43 min of downtime per 30-day month
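The error-budget arithmetic can be checked with a few lines of code. This is a minimal sketch assuming a 30-day window and a purely time-based availability SLI. The function names are illustrative, not part of any monitoring tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for a given availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is violated)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # 99.9% over 30 days: 43,200 min * 0.1% ≈ 43.2 min
    print(f"{error_budget_minutes(99.9):.1f} minutes allowed")
    print(f"{budget_remaining(99.9, downtime_minutes=21.6):.0%} of budget left")
```

Teams commonly use the remaining fraction as a release gate: while budget remains, shipping continues, and once it is spent, reliability work takes priority.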
eBPF-based Tools (2024+ trend):
- ▸Cilium (networking), Falco (security), Pixie (app observability)
- ▸Runs sandboxed programs safely inside the kernel → no application code changes required
🤖 Try asking AI like this
Once you know the concepts in this lesson, you can give AI specific instructions. Instead of a vague "fix this," you can make precise requests using the right vocabulary, and that is where token savings begin.
- ▸"Set up OpenTelemetry trace + metrics + logs for this app"
- ▸"Configure monitoring for this SLO (99.9% availability) using Prometheus + Grafana"
Why does this reduce tokens?
Without the concepts, even after receiving an AI answer you still have to ask "What does that mean?" all over again. Those follow-up questions are what eat up tokens. Learn the concepts once, and the conversation ends in a single round.