OS/Advanced/Lesson 05

Filesystem · Scheduling · Advanced — IPC · Signals · cgroups · Observability

60 min·theory


🎯 What you'll be able to do after this lesson

After completing this lesson, you will be confident in the following 3 areas:

  • ✅ Context switching overhead (thread < process)
  • ✅ epoll · kqueue · IOCP event-driven I/O
  • ✅ Copy-on-Write + fork() behavior

Keep the learning goals as a checklist and close the lesson once you can answer all of them.

Filesystem — inodes and directories

One-liner: A file = inode + data blocks, and a directory = name → inode mapping.

6 steps from open() to read():
1. open("/etc/hosts"): the kernel walks the path one component at a time
2. dentry cache lookup: each component is resolved through the directory entry cache (RAM) when possible
3. inode load: metadata such as permissions, size, block locations, and timestamps
4. permission check: user/group and rwx bits
5. fd allocation: returns the lowest free file descriptor number (0, 1, 2 = stdin/stdout/stderr, so usually 3 or higher)
6. read(fd, buf, n): disk block → page cache → user buffer
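
The steps above can be sketched from user space with Python's thin wrappers around the same system calls. This is a minimal illustration, not a picture of what the kernel does internally; the function name read_first_bytes is hypothetical.

```python
import os


def read_first_bytes(path, n=64):
    """Open a file, inspect its inode metadata, and read up to n bytes."""
    # Steps 1-5 happen inside this one call: path walk, inode load,
    # permission check, and allocation of a file descriptor.
    fd = os.open(path, os.O_RDONLY)
    try:
        meta = os.fstat(fd)  # inode metadata of the open file
        print(f"fd={fd} inode={meta.st_ino} size={meta.st_size} mode={oct(meta.st_mode)}")
        # Step 6: data is copied from the page cache into a user buffer.
        return os.read(fd, n)
    finally:
        os.close(fd)  # always give the descriptor back
```

Calling read_first_bytes("/etc/hosts") on a typical Linux machine prints an fd of 3 or higher, because 0, 1, and 2 are already taken.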

Filesystem types:

| FS | Use case | Notes |
|---|---|---|
| ext4 | Linux standard | Stable, general-purpose |
| XFS | Large-scale servers | High performance (RHEL default) |
| btrfs | Next-generation | Snapshots · checksums |
| ZFS | Data integrity | Originated on Solaris; also used on FreeBSD |
| APFS | macOS | SSD-optimized · snapshots |
| NTFS | Windows | Permissions · journaling |

Common pitfalls:

  • ❌ "too many open files": the process hit its fd limit (raise it with ulimit -n, and check for fd leaks)
  • ❌ Concurrent file writes → corruption (use an fcntl lock or O_APPEND)
  • ❌ Following symbolic links into an infinite loop (be careful with -L options such as find -L)
  • ✅ Call fsync() to guarantee disk sync (DB · write-ahead log)
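
Two of the items above, O_APPEND and fsync(), can be combined into a small durable-append helper of the kind a write-ahead log relies on. This is a sketch under simple assumptions (one record per call, a local filesystem); the name append_durably is hypothetical.

```python
import os


def append_durably(path, record):
    """Append one record and return only after fsync() has completed."""
    # O_APPEND makes every write() land at the current end of the file,
    # so concurrent appenders do not overwrite each other's data.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(record)
        while view:  # write() may accept fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the page cache for this file to the device
    finally:
        os.close(fd)
```

Without the fsync() call the data may sit in the page cache and be lost on power failure even though write() already returned.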

CPU Scheduling — who runs first

Linux CFS (Completely Fair Scheduler): the default scheduler from kernel 2.6.23 until EEVDF replaced it in 6.6.

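Principle: CFS tracks the virtual runtime (vruntime) of each task, and the task with the smallest vruntime runs next. A task's vruntime only grows while it runs, so whichever task has run least gets the CPU first → fairness.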
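
The "smallest virtual runtime first" rule can be illustrated with a toy simulation. This is a deliberately simplified sketch: the real scheduler keeps tasks in a red-black tree and has many more rules. The weights, the 10 ms slice, and the name simulate_fair_pick are assumptions for illustration; 1024 is used as the weight of a nice-0 task.

```python
import heapq

NICE_0_WEIGHT = 1024  # reference weight of a nice-0 task


def simulate_fair_pick(weights, ticks, slice_ms=10):
    """weights maps task name -> weight. Returns the order in which tasks run."""
    # Min-heap keyed on vruntime: the task that has run least is on top.
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    order = []
    for _ in range(ticks):
        vruntime, name = heapq.heappop(heap)
        order.append(name)
        # Heavier tasks accumulate vruntime more slowly, so they run more often.
        vruntime += slice_ms * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return order
```

With equal weights the tasks simply alternate; doubling one task's weight gives it roughly twice as many slices.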

Scheduling policies:

| Policy | Use case | Priority mechanism |
|---|---|---|
| SCHED_NORMAL | General processes | nice value (-20 to +19) |
| SCHED_FIFO | Real-time | Priority + FIFO |
| SCHED_RR | Real-time | Priority + Round Robin |
| SCHED_IDLE | Background | Only when nothing else is running |

Round Robin (RR) behavior:
1. Each process is given a time slice (e.g., 10 ms)
2. When the slice expires, the process moves to the back of the queue
3. The next process runs
Fair, but incurs context switching overhead
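
The trade-off between fairness and context switching can be seen in a small simulation of the three steps above. This is a sketch with made-up burst times; the name round_robin and the return format are hypothetical.

```python
from collections import deque


def round_robin(bursts, quantum):
    """bursts maps task name -> CPU time needed (ms).

    Returns (timeline, switches): the list of (name, ms run) slices and the
    number of times the CPU changed from one task to another.
    """
    queue = deque(bursts.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)  # run for one slice at most
        timeline.append((name, ran))
        if remaining > ran:
            queue.append((name, remaining - ran))  # back of the queue
    switches = sum(
        1 for prev, cur in zip(timeline, timeline[1:]) if prev[0] != cur[0]
    )
    return timeline, switches
```

Shrinking the quantum makes the schedule more responsive but increases the number of switches, which is the overhead mentioned above.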

Scheduling pitfalls:

  • ❌ Misusing nice values: a lower value means higher priority, which is counter-intuitive
  • ❌ A SCHED_FIFO task stuck in an infinite loop → starves every normal task on that CPU and can freeze the system
  • ❌ No CPU affinity set → the task migrates between cores and loses its warm cache
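
Both nice values and CPU affinity can be inspected and set from Python's os module. The sketch below assumes Linux (os.sched_setaffinity is not available on every platform, so the helper returns None there); the function names are hypothetical.

```python
import os


def current_niceness():
    """os.nice(0) adds 0 to the nice value and returns the result,
    which makes it a simple way to read the current value."""
    return os.nice(0)


def pin_to_one_cpu():
    """Pin the calling process to one CPU it is already allowed to use.

    Returns the chosen CPU number, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # 0 = the calling process
    cpu = min(allowed)
    os.sched_setaffinity(0, {cpu})
    return cpu
```

Raising the nice value (lowering priority) needs no privileges, but lowering it again usually does, so experiment in a throwaway process.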

Modern trends:

  • EEVDF (Earliest Eligible Virtual Deadline First): CFS successor in Linux 6.6+
  • Pluggable: custom BPF schedulers via sched_ext (merged in Linux 6.12, 2024)

IPC · Signals · Zombies — inter-process communication

IPC (Inter-Process Communication) types:

| Method | Speed | Use case |
|---|---|---|
| Pipe | Medium | Parent-child unidirectional (shell \| operator) |
| Named pipe (FIFO) | Medium | Unrelated processes (mkfifo) |
| Shared memory | Fastest | DB cache · games (zero-copy) |
| Message queue | Medium | System V · POSIX |
| Semaphore | Fast | Synchronization only (no data) |
| Socket | Medium | TCP/UDP/Unix domain (most versatile) |
| Signal | Fast | Simple notification (SIGTERM, etc.) |
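
The first row of the table, a unidirectional parent-child pipe, looks roughly like this in practice. This is a Unix-only sketch (it uses os.fork); the name pipe_roundtrip is hypothetical.

```python
import os


def pipe_roundtrip(message):
    """Fork a child that writes message into a pipe; the parent reads it back."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: writer side only.
        try:
            os.close(read_fd)
            os.write(write_fd, message)
            os.close(write_fd)
        finally:
            os._exit(0)  # leave without running the parent's cleanup code
    # Parent: reader side only. Closing the write end is required,
    # otherwise read() would never see end-of-file.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:  # empty read = all writers closed the pipe
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return b"".join(chunks)
```

Data flows one way only; a second pipe would be needed for the parent to answer the child.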

Signal usage:

| Signal | Meaning | Usage |
|---|---|---|
| SIGTERM (15) | Graceful termination request | Default for kill <PID> |
| SIGKILL (9) | Forced termination (no handler) | kill -9 |
| SIGINT (2) | Interrupt (Ctrl+C) | Interactive termination |
| SIGHUP (1) | Hangup / config reload | nginx reload |
| SIGCHLD | Child termination notification | Zombie reaping |
| SIGUSR1/2 | User-defined | Re-read app config |

Graceful Shutdown pattern:

Receive SIGTERM → reject new requests → wait for in-progress requests to finish → release resources → exit
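
One common way to implement this flow is to keep the signal handler tiny (it only flips a flag) and let the main loop do the draining. The class below is a toy stand-in for a real server, so every name in it (GracefulServer, submit, drain, install) is hypothetical.

```python
import signal


class GracefulServer:
    """Toy model: a list of in-progress jobs plus an 'accepting' flag."""

    def __init__(self, in_progress):
        self.in_progress = list(in_progress)
        self.accepting = True
        self.finished = []

    def handle_sigterm(self, signum, frame):
        # Step 1: the handler only records that shutdown was requested.
        self.accepting = False

    def submit(self, job):
        # Step 2: new requests are rejected once shutdown has started.
        if not self.accepting:
            return False
        self.in_progress.append(job)
        return True

    def drain(self):
        # Step 3: let in-progress requests finish.
        while self.in_progress:
            self.finished.append(self.in_progress.pop(0))
        # Step 4: releasing resources (DB connections, sockets) would go here.


def install(server):
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, server.handle_sigterm)
```

A real server would call install(server) at startup, check server.accepting in its accept loop, and call drain() before exiting.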

Zombie process:

  • The child has exited but the parent has not called wait() → only its process-table entry (PID and exit status) remains
  • Shown as Z state in ps
  • Prevention: the parent registers a SIGCHLD handler and calls waitpid()
  • Orphan process: if the parent dies first, init (PID 1) adopts the child → init calls wait → cleaned up
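
Reaping can be sketched in a few lines. The first helper waits for one specific child; the second is the non-blocking loop typically called from a SIGCHLD handler. This is Unix-only, and both function names are hypothetical.

```python
import os


def spawn_and_reap(exit_code):
    """Fork a child that exits at once, then collect its exit status."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: from here until waitpid() it is a zombie
    _, status = os.waitpid(pid, 0)  # parent: reaping frees the process-table entry
    return os.WEXITSTATUS(status)


def reap_exited_children():
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

The loop matters because several children can exit before a single SIGCHLD is delivered, so one waitpid() per signal is not enough.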

cgroups + Observability — the foundation of containers

cgroup (Control Group) — Linux mechanism to limit resources for process groups. The core of Docker and Kubernetes.

Examples:

| Resource | How to limit |
|---|---|
| CPU | cpu.cfs_quota_us (v1) / cpu.max (v2): 50 ms of quota per 100 ms period → 0.5 cores |
| Memory | memory.limit_in_bytes (v1) / memory.max (v2): OOM kill when exceeded |
| I/O | io.weight: relative share of disk bandwidth |
| Network | net_cls + tc: traffic control |
| Devices | devices.allow: restrict access to specific devices |

Docker example: docker run --memory=1g --cpus=0.5 nginx internally configures cgroups.

cgroup v2 (modern Linux): unified hierarchy (consolidates v1's separate controllers).
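
On cgroup v2 the CPU limit is stored in cpu.max as two fields, quota and period in microseconds, with the literal word max meaning "no limit". The sketch below turns that into a core count; the function names are hypothetical, and the default path assumes the code runs inside the cgroup it is inspecting.

```python
def parse_cpu_max(text):
    """Convert the contents of cpu.max into a number of cores, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)  # e.g. 50000 / 100000 -> 0.5 cores


def read_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Read the limit of the current cgroup; None if unlimited or not on cgroup v2."""
    try:
        with open(path) as f:
            return parse_cpu_max(f.read())
    except FileNotFoundError:
        return None
```

A container started with --cpus=0.5 would be expected to show "50000 100000" in this file, which parses to 0.5.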

Observability — 3 pillars:

| Pillar | Tools | Purpose |
|---|---|---|
| Metrics | Prometheus · Grafana | Time-series data (CPU · memory · request count) |
| Logs | Loki · ELK · Splunk | Text events |
| Traces | Jaeger · Tempo · Zipkin | Distributed call tracing |

eBPF (Extended Berkeley Packet Filter):

  • Runs sandboxed, verifier-checked programs inside the kernel (no kernel patch or module needed)
  • Can observe system call · network · disk events
  • Tools: bcc · bpftrace · pixie · falco (security)
  • Cilium (K8s networking) is built on eBPF, and Datadog's agent uses eBPF for parts of its network and service monitoring

Observability pitfalls:

  • ❌ Metrics only (no idea why) → add traces too
  • ❌ Log explosion (10 TB/day) → sampling · filtering
  • ❌ /var/log fills up → log rotation is mandatory (logrotate)
💻 📌 System observability through real-world scenarios
# ============================================================
# Scenario 1: "Alarm for low disk space"
# ============================================================
df -h                              # Usage by mount point
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   42G  6G   88% /
# tmpfs           3.9G  100M 3.8G   3% /run

# Which directory is large — Top 10
du -sh /var/* 2>/dev/null | sort -h | tail -10
# If /var/log is large → log rotation is not working
# If /var/lib/docker is large → clean up unused images and containers
docker system prune -a             # Docker cleanup (removes ALL unused images, stopped containers, networks)

# Also check for inode shortage (when there are many small files)
df -i

# Find large files of 1GB or more
sudo find / -type f -size +1G -exec ls -lh {} \; 2>/dev/null

# ============================================================
# Scenario 2: "Port conflict — 'Address already in use'"
# ============================================================
# Find process using port 3000
lsof -i :3000
# COMMAND   PID  USER  FD  TYPE  NODE NAME
# node    28391  app   25  IPv4  TCP *:3000 (LISTEN)

# Or ss (faster)
ss -tlnp | grep :3000

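# Stop the process: try SIGTERM first, SIGKILL only as a last resort
lsof -ti :3000 | xargs kill        # -t = output PID only
lsof -ti :3000 | xargs kill -9     # only if it ignored SIGTERM

# When there are too many TIME_WAIT sockets (the port pair cannot be reused yet)
ss -tan state time-wait | wc -l
# If 1000+ → consider adjusting kernel parameters:
#   net.ipv4.tcp_tw_reuse = 1      (affects outgoing connections only)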

# ============================================================
# Scenario 3: "Implementing Graceful shutdown"
# ============================================================
# Register signal handler (Python example)
cat > graceful.py <<'EOF'
import signal, time, sys

def shutdown(sig, frame):
    print(f"{signal.Signals(sig).name} received. Cleaning up...")
    # Reject new requests, complete in-progress requests, close DB connections
    time.sleep(2)
    print("Graceful shutdown complete")
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

print("Server started")
while True: time.sleep(1)
EOF

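python3 graceful.py &
PID=$!
sleep 1
kill -TERM $PID                    # Handler runs; "Graceful shutdown complete" prints after 2 seconds
wait $PID                          # Reap the background job and collect its exit status

# Docker and K8s follow the same pattern:
#   1. Send SIGTERM
#   2. Wait for the grace period (K8s terminationGracePeriodSeconds: default 30s; docker stop: default 10s)
#   3. If not finished, send SIGKILL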

# ============================================================
# Scenario 4: "Running service with systemd"
# ============================================================
# Service unit file: /etc/systemd/system/myapp.service
# [Unit]
# Description=My App
# After=network.target
#
# [Service]
# Type=simple
# User=app
# WorkingDirectory=/opt/myapp
# ExecStart=/usr/bin/node server.js
# Restart=always
# RestartSec=5
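# # Make the OOM killer less likely to pick this service
# # (unit files do not allow trailing comments, so comments get their own line)
# OOMScoreAdjust=-500
# # fd limit
# LimitNOFILE=65536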
#
# [Install]
# WantedBy=multi-user.target

# Usage
sudo systemctl daemon-reload       # Required after modifying unit
sudo systemctl enable myapp        # Auto-start on boot
sudo systemctl start myapp
sudo systemctl status myapp         # Status + recent logs
sudo systemctl restart myapp
journalctl -u myapp -f             # Real-time logs (tail -f)
journalctl -u myapp --since '1 hour ago'

# ============================================================
# Scenario 5: "Container resource limits (cgroup)"
# ============================================================
# Docker enforces these limits through cgroups
docker run --memory=512m --cpus=0.5 --pids-limit=100 myapp
# Internally (cgroup v2): written to memory.max · cpu.max · pids.max in the container's cgroup

# Check directly
cat /sys/fs/cgroup/memory.max                                        # from inside the container
cat /sys/fs/cgroup/system.slice/docker-<ID>.scope/memory.current     # from the host (systemd cgroup driver)

# Container resources in real time
docker stats                       # All containers
docker stats --no-stream myapp     # Only once

# ============================================================
# Scenario 6: "System call statistics with eBPF (kernel-mode observation)"
# ============================================================
# bcc·bpftrace: trace kernel events without code modification
# Which process calls open() most often (Ctrl+C prints the per-command counts)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'

# Histogram of block (disk) I/O latency; slow requests show up in the high buckets
sudo biolatency-bpfcc

# Sample a running Python process into a flame graph (py-spy is a sampling profiler, not eBPF)
sudo py-spy record --pid 28391 -o profile.svg

# Security — Detect suspicious activity with Falco (gaining root privileges, accessing critical files)

🤖 Try asking AI like this

Knowing the concepts in this lesson lets you give AI specific instructions. Instead of a vague "fix this," you can phrase requests in precise vocabulary, and that is where token savings begin.

  • "Diagnose whether this task incurs high context switching overhead"
  • "Explain what this task would gain from being converted to epoll-based asynchronous I/O on Linux"

Why this reduces tokens

Without the concepts, even after receiving an AI answer you have to follow up with "What is that?" That follow-up is what burns tokens. Learn the concept once and the conversation ends in a single round.
