# Performance Profiling Cheat Sheet

Commands and tools for measuring, profiling, and optimizing performance.
## Python Profiling

### cProfile

```bash
# Profile a script
python -m cProfile script.py

# Sort by cumulative time
python -m cProfile -s cumulative script.py

# Sort by total time spent in each function
python -m cProfile -s tottime script.py

# Save profile data for analysis
python -m cProfile -o profile.prof script.py

# Visualize with snakeviz (install: pip install snakeviz)
snakeviz profile.prof
```

```python
# Profile a specific section
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
do_work()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")
stats.print_stats(20)  # Top 20 functions
```

### line_profiler
Profile line-by-line within a function.

```python
# Decorate the function you want to profile
@profile
def process_data(items):
    result = []
    for item in items:
        transformed = expensive_transform(item)
        result.append(transformed)
    return result
```

```bash
# Run with kernprof (install: pip install line_profiler)
kernprof -l -v script.py
# -l  line-by-line mode
# -v  print results immediately
```

Output shows time per line, hits, and percentage — tells you exactly which line is slow.
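When installing line_profiler or editing the script is not an option, a rough per-function timer can be improvised in pure Python. A minimal sketch (the `timed` decorator and `square_all` function are hypothetical, and this gives per-call totals, not line_profiler's per-line view):

```python
import time
from functools import wraps

def timed(fn):
    """Accumulate wall time and call count on the wrapped function."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.total += time.perf_counter() - start
            wrapper.calls += 1
    wrapper.total = 0.0
    wrapper.calls = 0
    return wrapper

@timed
def square_all(items):
    return [x * x for x in items]

square_all(range(1000))
square_all(range(1000))
print(square_all.calls)  # 2
```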
### memory_profiler

```python
from memory_profiler import profile

@profile
def load_data():
    data = [x ** 2 for x in range(10_000_000)]
    filtered = [x for x in data if x % 2 == 0]
    return filtered
```

```bash
# Install
pip install memory_profiler

# Run and see line-by-line memory usage
python -m memory_profiler script.py

# Track memory over time
mprof run script.py
mprof plot   # Opens matplotlib graph
mprof clean  # Remove data files
```

### py-spy
Sampling profiler — attaches without modifying code or restarting.

```bash
# Install
pip install py-spy

# Live top-like view of a running process
py-spy top --pid 12345

# Record a flame graph
py-spy record -o profile.svg -- python script.py

# Attach to running process and record
py-spy record -o profile.svg --pid 12345

# Include subprocesses
py-spy record --subprocesses -o profile.svg -- python script.py

# Record in speedscope format (interactive viewer)
py-spy record -f speedscope -o profile.json -- python script.py

# Sample at higher frequency (default 100 Hz)
py-spy record --rate 250 -o profile.svg -- python script.py
```

### timeit
```bash
# Command line
python -m timeit "sum(range(1000))"
python -m timeit -n 10000 -r 5 "'-'.join(str(i) for i in range(100))"
# -n  number of executions per run
# -r  number of runs (best of r is reported)

# Setup code
python -m timeit -s "import json; d={'a':1}" "json.dumps(d)"
```

```python
import timeit

# Time a statement
elapsed = timeit.timeit("sum(range(1000))", number=10000)

# Time with setup
elapsed = timeit.timeit(
    "json.dumps(d)",
    setup="import json; d={'a': 1, 'b': [1,2,3]}",
    number=100000,
)

# In IPython / Jupyter
# %timeit sum(range(1000))
# %%timeit  (cell magic for multi-line)
```

## Database Profiling (PostgreSQL)
Section titled “Database Profiling (PostgreSQL)”EXPLAIN ANALYZE
Section titled “EXPLAIN ANALYZE”-- Show actual execution plan with timingEXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Include buffer/IO statisticsEXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = 'alice@example.com';
-- JSON format for programmatic analysisEXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)SELECT * FROM orders WHERE user_id = 42;
-- Verbose mode shows column output detailsEXPLAIN (ANALYZE, VERBOSE) SELECT * FROM users JOIN orders ON users.id = orders.user_id;Reading Query Plans
| Node type | Meaning |
|---|---|
| Seq Scan | Full table scan — may need an index |
| Index Scan | Uses index to find rows, then fetches from heap |
| Index Only Scan | Satisfied entirely from index (best case) |
| Bitmap Scan | Index builds bitmap, then fetches in bulk |
| Nested Loop | Join via loop — fast for small outer set |
| Hash Join | Builds hash of one table — good for equality |
| Merge Join | Presorted merge — good for large sorted sets |
| Sort | In-memory or on-disk sort |
| Materialize | Caches subquery results |
### Key Metrics

```
Seq Scan on users  (cost=0.00..431.00 rows=1 width=72) (actual time=3.214..3.216 rows=1 loops=1)
  Buffers: shared hit=217 read=14
Planning Time: 0.085 ms
Execution Time: 3.271 ms
```

- `cost=0.00..431.00`: planner's estimated startup and total cost
- `actual time`: first row..last row in milliseconds
- `rows`: actual rows returned (compare to estimated rows)
- `loops`: how many times this node executed
- `Buffers: shared hit`: pages found in cache
- `Buffers: shared read`: pages read from disk
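The same metrics can be pulled out programmatically from `EXPLAIN (ANALYZE, FORMAT JSON)` output. A minimal sketch that walks a plan tree and flags sequential scans, using a hand-written sample plan rather than real PostgreSQL output:

```python
# Walk an EXPLAIN (FORMAT JSON) plan tree and collect nodes of one type.
def find_nodes(plan, node_type):
    found = []
    if plan.get("Node Type") == node_type:
        found.append(plan)
    for child in plan.get("Plans", []):
        found.extend(find_nodes(child, node_type))
    return found

# Illustrative plan fragment (shape follows PostgreSQL's JSON keys)
sample = {
    "Node Type": "Nested Loop",
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "users", "Actual Rows": 1},
        {"Node Type": "Index Scan", "Relation Name": "orders", "Actual Rows": 12},
    ],
}

seq_scans = find_nodes(sample, "Seq Scan")
print([n["Relation Name"] for n in seq_scans])  # ['users']
```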
### pg_stat_statements

```sql
-- Enable (add to postgresql.conf or ALTER SYSTEM)
-- shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Top queries by mean time (slowest on average)
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Reset stats
SELECT pg_stat_statements_reset();
```

### Common Slow Patterns
| Pattern | Symptom | Fix |
|---|---|---|
| Missing index | Seq Scan on large table | CREATE INDEX on filter/join columns |
| N+1 queries | Thousands of identical simple queries | Use JOIN or batch fetch |
| Large sort on disk | Sort Method: external merge | Add index matching ORDER BY, increase work_mem |
| Bloated table | Seq Scan reads far more pages than rows | VACUUM FULL or pg_repack |
| Correlated subquery | Nested Loop with high loops count | Rewrite as JOIN or lateral |
| Missing statistics | Row estimates wildly wrong | ANALYZE tablename |
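The N+1 fix in the table above can be illustrated with a toy in-memory store that counts the queries it receives (`FakeDB` and its methods are hypothetical stand-ins for real database calls):

```python
class FakeDB:
    """In-memory stand-in for a database that counts queries."""
    def __init__(self):
        self.queries = 0
        self.orders = {1: ["book"], 2: ["pen"], 3: ["mug"]}

    def orders_for(self, user_id):
        # One query per call -- the N+1 pattern when called in a loop
        self.queries += 1
        return self.orders.get(user_id, [])

    def orders_for_many(self, user_ids):
        # Single batched query, e.g. WHERE user_id = ANY(...)
        self.queries += 1
        return {uid: self.orders.get(uid, []) for uid in user_ids}

db = FakeDB()
for uid in [1, 2, 3]:          # N+1 style: one query per user
    db.orders_for(uid)
n_plus_1 = db.queries

db.orders_for_many([1, 2, 3])  # batched: one query total
print(n_plus_1, db.queries - n_plus_1)  # 3 1
```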
## System Profiling

```bash
# Bash builtin -- wall/user/sys
time python script.py

# /usr/bin/time with memory and detail (macOS)
/usr/bin/time -l python script.py
# real/user/sys + max RSS, page faults, context switches

# /usr/bin/time with memory and detail (Linux)
/usr/bin/time -v python script.py
```

| Metric | Meaning |
|---|---|
| real (wall) | Elapsed clock time |
| user | CPU time in user space |
| sys | CPU time in kernel space |
| user + sys | Total CPU time (> real means parallelism) |
| real >> user+sys | Process is I/O bound or waiting on locks |
| max RSS | Peak memory usage |
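The real-vs-CPU comparison in the table can also be made in-process: `time.process_time` counts CPU seconds while `time.perf_counter` counts wall-clock seconds. A sketch (the 0.5 ratio threshold is an arbitrary illustration, not a standard cutoff):

```python
import time

def classify(work):
    """Run `work` and label it by the CPU-to-wall-time ratio."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return "cpu-bound" if cpu / wall > 0.5 else "io-bound (or waiting)"

print(classify(lambda: sum(i * i for i in range(2_000_000))))  # cpu-bound
print(classify(lambda: time.sleep(0.2)))  # io-bound (or waiting)
```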
### Process Monitoring

```bash
# top -- built-in, press 'o' to sort by cpu/mem/pid
top

# htop -- interactive, tree view, filter by user/process
htop
htop -p 12345  # Monitor specific PID

# macOS Activity Monitor from CLI
open -a "Activity Monitor"

# ps snapshots
ps aux --sort=-%mem | head -20  # Top memory consumers (Linux)
ps aux -m | head -20            # Top memory consumers (macOS)
ps aux -r | head -20            # Top CPU consumers (macOS)
```

### macOS Instruments
```bash
# List available templates
instruments -s templates

# Time Profiler -- CPU sampling
xcrun xctrace record --template "Time Profiler" --launch -- ./myprogram

# Allocations -- memory tracking
xcrun xctrace record --template "Allocations" --launch -- ./myprogram

# Attach to running process
xcrun xctrace record --template "Time Profiler" --attach 12345

# Open result in Instruments.app
open recording.trace
```

### perf (Linux)
```bash
# Count hardware events (cycles, instructions, cache misses)
perf stat ./myprogram

# Record samples for analysis
perf record -g ./myprogram  # -g captures call graphs
perf report                 # Interactive TUI

# Record at specific frequency
perf record -F 99 -g ./myprogram

# Record a running process
perf record -g -p 12345 -- sleep 30

# Flame graph pipeline
perf record -F 99 -g ./myprogram
perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg
```

### System-Level Bottleneck Identification
```bash
# vmstat -- CPU, memory, swap, I/O overview (Linux)
vmstat 1 10  # 1-second interval, 10 samples

# iostat -- disk I/O statistics
iostat -x 1 5  # Extended stats, 1-sec interval (Linux)
iostat -d 1 5  # Disk stats (macOS)

# macOS equivalents
vm_stat                 # Memory page statistics
fs_usage -w -f filesys  # Real-time filesystem activity (needs sudo)

# Network
nettop  # macOS -- live network usage per process
iftop   # Linux -- live bandwidth per connection
ss -s   # Linux -- socket statistics summary
```

## Benchmarking
Section titled “Benchmarking”hyperfine
Section titled “hyperfine”# Installbrew install hyperfine # macOScargo install hyperfine # From source
# Basic comparisonhyperfine 'fd . /tmp' 'find /tmp'
# With warmup runs (prime caches)hyperfine --warmup 3 'command_a' 'command_b'
# Parameter scanhyperfine --parameter-scan threads 1 8 'myprogram --threads {threads}'
# Export results as markdown tablehyperfine --export-markdown bench.md 'command_a' 'command_b'
# Export as JSON for further analysishyperfine --export-json bench.json 'command_a' 'command_b'
# Set minimum number of runshyperfine --min-runs 20 'mycommand'
# Preparation command (runs before each timing run)hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'cat largefile'Python Benchmarking
```bash
# pytest-benchmark (install: pip install pytest-benchmark)
pytest --benchmark-only
pytest --benchmark-compare
pytest --benchmark-save=baseline
```

```python
# pytest-benchmark fixture
def test_sort_performance(benchmark):
    data = list(range(10000, 0, -1))
    result = benchmark(sorted, data)
    assert result == sorted(data)

# With setup
def test_with_setup(benchmark):
    def setup():
        return (list(range(10000)),), {}
    benchmark.pedantic(sorted, setup=setup, rounds=100)
```

### Go Benchmarking
```go
// In _test.go file
func BenchmarkFib(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Fib(20)
    }
}

// Benchmark with allocation tracking
func BenchmarkParse(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        Parse(input)
    }
}
```

```bash
# Run benchmarks
go test -bench=.
go test -bench=BenchmarkFib
go test -bench=. -benchmem      # Include allocation stats
go test -bench=. -count=5       # Run 5 times for statistical significance
go test -bench=. -benchtime=5s  # Run for 5 seconds

# Compare results with benchstat
go test -bench=. -count=10 > old.txt
# (make changes)
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt
```

### Rust Benchmarking
```bash
# Built-in (nightly only)
cargo +nightly bench

# Criterion (stable Rust, statistical rigor)
cargo bench  # With criterion in Cargo.toml
```

```rust
// Criterion benchmark (benches/my_benchmark.rs)
use criterion::{criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => n,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn bench_fib(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);
```

## Flame Graphs
Section titled “Flame Graphs”How to Read
Section titled “How to Read” +---------+ | child_b | Width = proportion of total time +----+---------+----+ | parent_fn | Y-axis = stack depth (bottom = entry point)+----+-------------------+----+| main | X-axis = alphabetical (NOT time order)+-----------------------------+- Wide frames are where time is spent — the hot path
- Tall stacks show deep call chains
- Look for wide frames near the top — those are leaf functions consuming CPU
- Narrow frames at the bottom are just call chain overhead
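Under the hood, flame graph generators first collapse raw stack samples into the "folded" format (semicolon-joined frames plus a sample count) that flamegraph.pl consumes. A sketch of that aggregation step, using made-up samples:

```python
from collections import Counter

# Each sample is one captured call stack, outermost frame first
samples = [
    ["main", "parse", "tokenize"],
    ["main", "parse", "tokenize"],
    ["main", "render"],
]

# Fold identical stacks into "frame;frame;frame count" lines
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
# main;parse;tokenize 2
# main;render 1
```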
### Generating Flame Graphs

```bash
# Python (py-spy)
py-spy record --format flamegraph -o profile.svg -- python script.py

# Linux (perf + Brendan Gregg's tools)
perf record -F 99 -g -- ./myprogram
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Go (built-in pprof)
go tool pprof -http=:8080 cpu.prof  # Opens interactive web UI with flame graph

# Node.js
node --prof script.js
node --prof-process isolate-*.log > processed.txt
# Or use 0x: npx 0x script.js
```

### Differential Flame Graphs
Compare before/after to see what changed.

```bash
# Generate two profiles
perf record -F 99 -g -o before.data -- ./program_v1
perf record -F 99 -g -o after.data -- ./program_v2

# Create differential flame graph
perf script -i before.data | stackcollapse-perf.pl > before.folded
perf script -i after.data | stackcollapse-perf.pl > after.folded
difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
```

Red = regression (more time), blue = improvement (less time).
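The core of the differential step is a per-stack subtraction of sample counts. A sketch of that arithmetic (the folded profiles here are made-up dicts, not real difffolded.pl input):

```python
def diff_folded(before, after):
    """Per-stack sample delta: positive = regression, negative = improvement."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}

before = {"main;parse": 90, "main;render": 10}
after = {"main;parse": 40, "main;render": 10, "main;cache_fill": 5}

delta = diff_folded(before, after)
print(delta["main;parse"])       # -50 (improvement)
print(delta["main;cache_fill"])  # 5 (new cost)
```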
## Load Testing

### wrk

```bash
# Install
brew install wrk

# Basic load test (10 threads, 200 connections, 30 seconds)
wrk -t10 -c200 -d30s http://localhost:8080/api/users

# With custom headers
wrk -t4 -c100 -d30s -H "Authorization: Bearer TOKEN" http://localhost:8080/api

# With Lua script for POST requests
wrk -t4 -c100 -d30s -s post.lua http://localhost:8080/api
```

```lua
-- post.lua
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"name": "test"}'
```

Reading wrk output: focus on Req/Sec (throughput), Latency (avg and stdev), and the percentile distribution (p50, p99).
### hey

```bash
# Install
brew install hey

# 200 requests, 50 concurrent
hey -n 200 -c 50 http://localhost:8080/

# 30 seconds of load
hey -z 30s -c 50 http://localhost:8080/

# POST with body
hey -m POST -d '{"key":"value"}' -T "application/json" http://localhost:8080/api
```

### ab (Apache Bench)
```bash
# 1000 requests, 10 concurrent
ab -n 1000 -c 10 http://localhost:8080/

# POST with file body
ab -n 1000 -c 10 -p payload.json -T "application/json" http://localhost:8080/api

# Keep-alive connections
ab -n 1000 -c 10 -k http://localhost:8080/
```

### k6

```bash
# Install
brew install k6
```

```javascript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "30s", target: 20 },  // Ramp up to 20 users
    { duration: "1m", target: 20 },   // Hold at 20
    { duration: "10s", target: 0 },   // Ramp down
  ],
};

export default function () {
  const res = http.get("http://localhost:8080/api/users");
  check(res, {
    "status is 200": (r) => r.status === 200,
    "latency < 500ms": (r) => r.timings.duration < 500,
  });
  sleep(1);
}
```

```bash
k6 run load-test.js
k6 run --vus 50 --duration 60s load-test.js  # Override from CLI
```

## Optimization Heuristics
Section titled “Optimization Heuristics”First Principles
Section titled “First Principles”- Profile before optimizing — measure, do not guess
- Amdahl’s law — if 5% of runtime is in the hot path, a 10x speedup there yields only 1.05x overall
- The 80/20 rule — find the hot path first; most time is spent in a small fraction of code
- Premature optimization is the root of all evil — but mature optimization requires data
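The Amdahl's law figure above falls out of a one-line formula: overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime being optimized and s is the local speedup. A quick sketch:

```python
def amdahl(p, s):
    """Overall speedup when fraction p of runtime gets a local speedup s."""
    return 1 / ((1 - p) + p / s)

# 5% of runtime sped up 10x -> ~1.05x overall, as stated above
print(round(amdahl(0.05, 10), 2))  # 1.05

# 80% of runtime sped up 2x -> 1.67x overall
print(round(amdahl(0.80, 2), 2))  # 1.67
```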
### Bottleneck Diagnosis

| Bottleneck type | How to identify | Common fix |
|---|---|---|
| CPU-bound | user time >> real time, high CPU % | Algorithm change, caching, parallelism |
| I/O-bound | real >> user+sys, low CPU % | Async I/O, batching, caching, connection pool |
| Memory-bound | High RSS, swapping, GC pauses | Reduce allocations, streaming, object pools |
| Lock contention | High sys time, threads waiting | Reduce critical section, lock-free structures |
| Network-bound | High latency, low throughput | Connection reuse, compression, CDN |
| Query-bound | Slow SQL, high database wait time | Index, query rewrite, denormalization |
### Optimization Checklist

1. Establish a baseline (measure current performance)
2. Set a target (what does "fast enough" mean?)
3. Profile to find the bottleneck
4. Fix the bottleneck (one change at a time)
5. Measure again (did it help? by how much?)
6. Repeat until target is met

## Quick Reference
| I need to measure… | Tool | Command |
|---|---|---|
| Python function time | cProfile | python -m cProfile -s cumulative script.py |
| Python line-by-line time | line_profiler | kernprof -l -v script.py |
| Python memory per line | memory_profiler | python -m memory_profiler script.py |
| Python process (no code change) | py-spy | py-spy top --pid PID |
| Wall/user/sys time | time | time command or /usr/bin/time -l command |
| SQL query performance | EXPLAIN ANALYZE | EXPLAIN (ANALYZE, BUFFERS) SELECT ... |
| Slowest SQL queries | pg_stat_statements | SELECT query, mean_exec_time FROM pg_stat_statements |
| CLI command comparison | hyperfine | hyperfine 'cmd_a' 'cmd_b' |
| Go function benchmark | go test | go test -bench=. -benchmem |
| Rust function benchmark | criterion | cargo bench |
| HTTP endpoint throughput | wrk | wrk -t4 -c100 -d30s URL |
| HTTP endpoint latency | hey | hey -n 1000 -c 50 URL |
| CPU flame graph (Python) | py-spy | py-spy record -o flame.svg -- python script.py |
| CPU flame graph (Linux) | perf | perf record -g ./prog && perf script \| stackcollapse-perf.pl \| flamegraph.pl |
| System CPU/memory overview | htop | htop |
| Disk I/O bottleneck | iostat | iostat -x 1 5 |
| macOS filesystem activity | fs_usage | sudo fs_usage -w -f filesys |
| Load test with scenarios | k6 | k6 run load-test.js |
## See Also

- Debugging Tools — pdb, lldb, system call tracing
- Debugging Principles — Scientific method for performance issues
- PostgreSQL — Database-specific commands
- Python — Python-specific patterns
- Testing — Benchmark test patterns
- Unix Filesystem (Advanced)
- Unix Filesystem