Systems Performance
“Systems Performance by Brendan Gregg”
60-second Linux performance troubleshooting
uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Performance tuning
Most to least effective
- Don’t do it
- Do it, but don’t do it again
- Do it less
- Do it later
- Do it when they’re not looking
- Do it concurrently
- Do it more cheaply
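One way to read "Do it, but don’t do it again" is caching. The sketch below assumes a pure function whose result can safely be reused; `expensive` and the `calls` counter are hypothetical names used only for illustration.

```python
from functools import lru_cache

calls = 0  # counts how often the real work actually runs


@lru_cache(maxsize=None)
def expensive(n):
    """Hypothetical stand-in for costly work; results are cached per argument."""
    global calls
    calls += 1
    return sum(i * i for i in range(n))


first = expensive(100_000)   # does the work
second = expensive(100_000)  # served from the cache, no recomputation
print(calls, first == second)
```

This only helps when the inputs repeat and the function has no side effects that callers rely on.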
sysbench
benchmark CPU
sysbench --test=cpu --cpu-max-prime=20000 run
benchmark I/O
sysbench --test=fileio --file-total-size=10G prepare
sysbench --test=fileio --file-total-size=10G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench --test=fileio --file-total-size=10G cleanup
ftrace
trace-cmd
# list available plugins/events
trace-cmd list
perf
https://perf.wiki.kernel.org/index.php/Main_Page
benchmarking
- Don’t run off battery power (use mains)
- Disable things like TurboBoost (which temporarily increases CPU speed)
- Disable background processes (like backups)
- Run many times to get a stable measurement
- It might not hurt to reboot and try again
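The "run many times" advice can be sketched as a small harness that repeats a measurement and reports its spread. This is a minimal sketch, not a replacement for a real benchmarking tool; `workload` is a hypothetical stand-in for the code under test.

```python
import statistics
import time


def benchmark(fn, runs=10):
    """Call fn() `runs` times and return the wall-clock duration of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (min, median, stdev) of the samples."""
    return min(samples), statistics.median(samples), statistics.stdev(samples)


def workload():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(50_000))


samples = benchmark(workload, runs=10)
best, median, spread = summarize(samples)
print(f"min={best:.6f}s median={median:.6f}s stdev={spread:.6f}s")
```

A standard deviation that is large relative to the median suggests the environment is noisy (background processes, frequency scaling) and the numbers should not be trusted yet.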
Be aware of subtle floating-point rounding differences that can appear when the code path changes (e.g. an intermediate value kept in a CPU register versus one written out to main memory).
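Python cannot show the register-versus-memory case directly. The sketch below shows a related effect, assuming only IEEE 754 double precision: the same three numbers summed in a different order round differently, so a change in code path can change a result without any bug.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a + b, then again after adding c
right = a + (b + c)  # rounds after b + c, then again after adding a

print(left, right)
print(left == right)              # exact comparison is fragile
print(math.isclose(left, right))  # tolerance-based comparison is safer
```

When comparing benchmark outputs for correctness, a tolerance-based check such as `math.isclose` avoids flagging these differences as regressions.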
eBPF
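- kprobe - a dynamic probe on a kernel function; fires on function entry (a kretprobe fires on return)
- uprobe - a dynamic probe on a user-level function; fires on function entry (a uretprobe fires on return)
- USDT (user-level statically defined tracing) - a static trace point that developers place in user-level code; it gives a stable name that survives function renames and inlining
- tracepoint - the kernel-level equivalent of USDT: a static trace point compiled into the kernel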
bpftrace
# list all syscall tracepoints
bpftrace -l 'tracepoint:syscalls:*'
# run a bpftrace program
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {printf "%s\n", comm}'
# get BPF instructions
bpftrace -v program.bt
probe /filter/ { action }
builtins:
var | desc |
---|---|
pid | process id |
tid | thread id |
uid | user id |
username | username |
comm | process or command name |
curtask | current task_struct as u64 |
nsecs | current time in nanoseconds |
elapsed | time in nanoseconds since bpftrace start |
kstack | kernel stack trace |
ustack | user-level stack trace |
arg0…argn | function arguments |
args | tracepoint arguments |
retval | function return value |
func | function name |
probe | full probe name |
types:
var | desc |
---|---|
@name | global |
@name[key] | hash (map) |
@name[tid] | thread-local |
$name | scratch |
bpftool
# show loaded bpf programs
bpftool prog show
# dump BPF instructions of a program (here 123)
bpftool prog dump xlated id 123
USE methodology
- Utilization - The percentage of time a resource was busy servicing work (or, for capacity resources such as memory, the proportion in use)
- Saturation - The degree to which a resource has more work than it can service, usually visible as queued or waiting work
- Errors - The count of error events
100% utilization isn’t a problem if there’s no saturation/errors. When looking for performance bottlenecks, look for saturation/errors.
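The rule above (high utilization alone is fine, saturation and errors are the signal) can be sketched as a small check. `ResourceSample`, `use_findings`, and the sample values are hypothetical; real numbers would come from tools such as `vmstat`, `iostat`, or `sar`.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: int     # amount of queued or waiting work
    errors: int         # count of error events


def use_findings(sample):
    """Return a list of problems; utilization by itself is never flagged."""
    findings = []
    if sample.saturation > 0:
        findings.append(f"{sample.name}: saturated ({sample.saturation} queued)")
    if sample.errors > 0:
        findings.append(f"{sample.name}: {sample.errors} errors")
    return findings


samples = [
    ResourceSample("cpu", utilization=1.00, saturation=0, errors=0),
    ResourceSample("disk", utilization=0.60, saturation=4, errors=1),
]
for s in samples:
    print(s.name, use_findings(s) or "ok")
```

Here the fully utilized CPU passes, while the 60% utilized disk is flagged because work is queuing and errors are occurring.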
Don’t make changes until you’ve profiled
Assuming code performance follows a power law, a small percentage of LOC accounts for most of the overall runtime of the program. If you aren’t profiling your code, you have only a small chance of changing the lines that actually affect runtime performance.
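A minimal sketch of profiling before changing anything, using Python's standard `cProfile` and `pstats` modules. The functions `cheap`, `hot`, and `main` are hypothetical and exist only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def cheap():
    return sum(range(100))


def hot():
    # Deliberately dominates the runtime.
    return sum(i * i for i in range(200_000))


def main():
    for _ in range(5):
        cheap()
    hot()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Render the five most expensive entries by cumulative time into a string.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

The report should show `hot` near the top, which is where optimization effort would pay off; tuning `cheap` would barely move the total.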
Using time
desc | field |
---|---|
time spent in kernel | sys |
time spent in userland | user |
stopwatch time | real |
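Note that sys and user combined don’t necessarily equal real: real can be larger (the process was blocked on I/O or sleeping, or the CPU was busy with other processes) or smaller (a multithreaded program accumulates CPU time on several cores at once).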
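The same three fields can be observed from inside a process. This sketch uses `os.times()` for user and sys CPU time and `time.perf_counter()` for real time; `measure` is a hypothetical helper, and `os.times()` has coarse resolution on most systems, so short measurements may report zero CPU time.

```python
import os
import time


def measure(fn):
    """Run fn() and return its real, user, and sys time in seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    fn()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return {
        "real": wall_after - wall_before,
        "user": cpu_after.user - cpu_before.user,
        "sys": cpu_after.system - cpu_before.system,
    }


# Sleeping uses almost no CPU, so real is much larger than user + sys.
sleepy = measure(lambda: time.sleep(0.2))
# A CPU-bound loop spends its time in userland, so user is close to real.
busy = measure(lambda: sum(i * i for i in range(1_000_000)))

print("sleep:", sleepy)
print("busy: ", busy)
```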
Latency numbers
Latency Comparison Numbers (Jeff Dean ~2012)
what | ns | us | ms | notes |
---|---|---|---|---|
L1 cache reference | 0.5 | | | |
Branch mispredict | 5 | | | |
L2 cache reference | 7 | | | 14x L1 cache |
Mutex lock/unlock | 25 | | | |
Main memory reference | 100 | | | 20x L2 cache, 200x L1 cache |
Compress 1K bytes with Zippy | 3,000 | 3 | | |
Send 1K bytes over 1 Gbps network | 10,000 | 10 | | |
Read 4K randomly from SSD* | 150,000 | 150 | | ~1GB/sec SSD |
Read 1 MB sequentially from memory | 250,000 | 250 | | |
Round trip within same datacenter | 500,000 | 500 | | |
Read 1 MB sequentially from SSD* | 1,000,000 | 1,000 | 1 | ~1GB/sec SSD, 4X memory |
Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter roundtrip |
Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 20 | 80x memory, 20X SSD |
Send packet CA->Netherlands->CA | 150,000,000 | 150,000 | 150 | |
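The ratios in the notes column follow directly from the ns column. The sketch below copies a few rows from the table and recomputes them; `LATENCY_NS`, `ratio`, and `to_units` are hypothetical helper names.

```python
# Selected rows from the table above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}


def ratio(slow, fast):
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def to_units(ns):
    """Convert nanoseconds to a (ns, us, ms) tuple."""
    return ns, ns / 1_000, ns / 1_000_000


print(ratio("Main memory reference", "L1 cache reference"))
print(ratio("Disk seek", "Round trip within same datacenter"))
print(to_units(LATENCY_NS["Send packet CA->Netherlands->CA"]))
```

The first two results reproduce the "200x L1 cache" and "20x datacenter roundtrip" notes in the table.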