Snippets for yer computer needs

Systems Performance

“Systems Performance by Brendan Gregg”

60-second Linux performance troubleshooting

dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
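
The interval-based tools above keep printing until interrupted. A minimal sketch of a wrapper that adds a sample count so the whole checklist finishes on its own. The function names checklist and run_checklist and the default of 5 samples are assumptions for illustration, not part of the original list.

```shell
# Print the checklist with a sample count appended to every interval-based tool.
# $1 = number of samples per tool (assumed default: 5).
checklist() {
  count=${1:-5}
  cat <<EOF
dmesg | tail
vmstat 1 $count
mpstat -P ALL 1 $count
pidstat 1 $count
iostat -xz 1 $count
free -m
sar -n DEV 1 $count
sar -n TCP,ETCP 1 $count
EOF
}

# Run each checklist entry in order, skipping tools that are not installed.
run_checklist() {
  checklist "${1:-5}" | while IFS= read -r cmd; do
    tool=${cmd%% *}    # first word of the command line
    if command -v "$tool" >/dev/null 2>&1; then
      printf '\n== %s ==\n' "$cmd"
      sh -c "$cmd" </dev/null
    else
      printf '\n== %s == (skipped: %s not found)\n' "$cmd" "$tool"
    fi
  done
}

# Usage:
#   run_checklist      # 5 samples per tool
#   run_checklist 10   # 10 samples per tool
```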

Performance tuning

Most to least effective


benchmark CPU

# sysbench 1.0+ syntax (0.4/0.5 releases use --test=cpu)
sysbench cpu --cpu-max-prime=20000 run

benchmark I/O

# sysbench 1.0+ syntax (0.4/0.5 releases use --test=fileio, --init-rng=on, --max-time, --max-requests)
sysbench fileio --file-total-size=10G prepare
sysbench fileio --file-total-size=10G --file-test-mode=rndrw --time=300 --events=0 run
sysbench fileio --file-total-size=10G cleanup



# list available plugins/events
trace-cmd list



Be aware of subtle floating-point rounding differences that can appear when a change alters the code path (e.g. an intermediate value staying in a CPU register vs being written to main memory).



# list all syscall tracepoints
bpftrace -l 'tracepoint:syscalls:*'

# run a bpftrace program
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {printf "%s\n", comm}'

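# get BPF instructions (add -v to a bpftrace invocation)
bpftrace -v -e 'tracepoint:syscalls:sys_enter_openat {printf "%s\n", comm}'

# program syntax
probe /filter/ { action }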


var         desc
pid         process id
tid         thread id
uid         user id
username    username
comm        process or command name
curtask     current task_struct as u64
nsecs       current time in nanoseconds
elapsed     time in nanoseconds since bpftrace start
kstack      kernel stack trace
ustack      user-level stack trace
arg0…argn   function arguments
args        tracepoint arguments
retval      function return value
func        function name
probe       full probe name


var          desc
@name        global
@name[key]   hash (map)
@name[tid]   thread-local
$name        scratch
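
A sketch of how the probe /filter/ { action } form, the builtins and the variable types fit together. The program is a hypothetical example (timing read syscalls issued by processes named bash), and the function name read_latency_prog and the map names @start and @read_ns are made up for illustration.

```shell
# Print a bpftrace program that records read(2) latency as a histogram.
read_latency_prog() {
  cat <<'EOF'
tracepoint:syscalls:sys_enter_read /comm == "bash"/ {
  @start[tid] = nsecs;            // thread-local map: entry timestamp
}
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
  $dur = nsecs - @start[tid];     // scratch variable: duration in ns
  @read_ns = hist($dur);          // global map: power-of-2 histogram
  delete(@start[tid]);            // drop the per-thread entry
}
EOF
}

# Usage (needs root and bpftrace installed):
#   bpftrace -e "$(read_latency_prog)"
```

The filter on the exit probe only fires for threads that have a recorded start time, so reads that began before tracing started are ignored.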


# show loaded bpf programs
bpftool prog show

# dump BPF instructions of a program (here 123)
bpftool prog dump xlated id 123

USE methodology

For every resource, check utilization, saturation and errors. 100% utilization isn’t a problem by itself if there’s no saturation and no errors. When looking for performance bottlenecks, look for saturation and errors.
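
A rough sketch of a saturation check for one resource, the CPU. It assumes the 1-minute load average is an acceptable proxy for run-queue length (on Linux it also counts tasks in uninterruptible sleep, so the r column of vmstat is the more direct signal). The function name cpu_saturated is hypothetical.

```shell
# Succeed (exit 0) when the load average is higher than the number of CPUs,
# i.e. more runnable work than CPUs to run it on.
# $1 = load average, $2 = CPU count
cpu_saturated() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN { exit !((load + 0) > (ncpu + 0)) }'
}

# Usage on Linux:
#   cpu_saturated "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "CPU looks saturated"
```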

Don’t make changes until you’ve profiled

If runtime follows a power law across the code, a small percentage of the lines accounts for most of the overall runtime. If you change code without profiling first, you have only a small chance of touching the lines that actually matter.
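
One way to put numbers on this is Amdahl's law, which these notes do not mention by name: if a fraction f of the runtime is made s times faster, the overall speedup is 1 / ((1 - f) + f / s). A small sketch, with the function name overall_speedup chosen for illustration:

```shell
# Overall speedup when a fraction of the runtime is sped up.
# $1 = fraction of total runtime spent in the optimized code (0..1)
# $2 = speedup factor applied to that code
overall_speedup() {
  awk -v f="$1" -v s="$2" 'BEGIN { printf "%.3f\n", 1 / ((1 - f) + f / s) }'
}

# Usage:
#   overall_speedup 0.05 10   # cold code
#   overall_speedup 0.9 10    # hot code
```

Making code that accounts for 5% of the runtime ten times faster yields about 1.047x overall, while the same effort on code that accounts for 90% yields about 5.263x.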

Using time

field   desc
sys     time spent in kernel
user    time spent in userland
real    stopwatch time

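Note that sys and user combined don’t necessarily equal real: the process may be waiting on I/O or for a CPU that is busy with other processes, and a multi-threaded process can accumulate more CPU time than stopwatch time.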
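
A quick way to read the three fields together is the ratio (user + sys) / real. A minimal sketch, with the function name cpu_ratio made up for illustration:

```shell
# Fraction of stopwatch time the process spent on a CPU.
# $1 = user seconds, $2 = sys seconds, $3 = real seconds
cpu_ratio() {
  awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.2f\n", (u + s) / r }'
}

# Usage:
#   cpu_ratio 0.8 0.2 4.0   # mostly waiting
#   cpu_ratio 3.0 1.0 2.0   # running on more than one CPU
```

A ratio well below 1 means the process mostly waited (I/O, sleep, other processes holding the CPU). A ratio above 1 means it ran on more than one CPU at once. For example, time sleep 1 reports a real of about one second with user and sys near zero.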

Latency numbers

Latency Comparison Numbers (Jeff Dean ~2012)

what                                          ns       us   ms  notes
L1 cache reference                           0.5
Branch mispredict                              5
L2 cache reference                             7                14x L1 cache
Mutex lock/unlock                             25
Main memory reference                        100                20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy               3,000        3
Send 1K bytes over 1 Gbps network         10,000       10
Read 4K randomly from SSD*               150,000      150       ~1GB/sec SSD
Read 1 MB sequentially from memory       250,000      250
Round trip within same datacenter        500,000      500
Read 1 MB sequentially from SSD*       1,000,000    1,000    1  ~1GB/sec SSD, 4x memory
Disk seek                             10,000,000   10,000   10  20x datacenter roundtrip
Read 1 MB sequentially from disk      20,000,000   20,000   20  80x memory, 20x SSD
Send packet CA->Netherlands->CA      150,000,000  150,000  150
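
The "Read 1 MB sequentially" rows make back-of-the-envelope estimates easy. A rough sketch, assuming read time scales linearly with size (real devices will differ); the function name read_time_ms is made up for illustration.

```shell
# Estimated time in milliseconds to read a given amount of data sequentially.
# $1 = microseconds per MB (from the table), $2 = size in MB
read_time_ms() {
  echo $(( $1 * $2 / 1000 ))
}

# Usage: reading 1 GB (1024 MB)
#   read_time_ms 250 1024     # memory
#   read_time_ms 1000 1024    # SSD
#   read_time_ms 20000 1024   # disk
```

With the table's figures, 1 GB takes roughly 256 ms from memory, about 1 second from SSD and about 20 seconds from disk.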