Performance measurement in this decade

Nowadays, CPU bound performance measurement is ... complicated

2019-07-11 golang perf

Yamaguchi-san asked for help to better reproduce a CPU performance measurement on Twitter. Here’s the long reply.

Performance measurement is a pick-your-poison kind of situation.

The host’s physical characteristics matter. A high end Xeon may have different multi-minute throttling behavior than a low end Celeron. Low end mobile CPUs can only burst for a time measurable in seconds. Active versus passive cooling also greatly influences sustained peak performance.

What matters is simple: heat dissipation 🔥 and how the OS manages the CPU cores’ frequency.

Evolution

Back in the day, things were simple. A good old Intel 8088 would run at full speed, continuously, with no heatsink, happily dissipating the same amount of power all the time. Running a performance test would lead to consistent, reproducible results, if you didn’t mind running at 4.77 MHz. 🐢

That hasn’t been true for a while.

As transistors became smaller, processors started to pack more and more of them, now billions in a single die. We saw a big shift in the late 90s: a die can pack more transistors than can be powered at once, so not all of them can be active simultaneously!

With higher transistor density, thermal throttling becomes inevitable even on desktop, just like on mobile CPUs. Two techniques are used together: dynamic frequency scaling and dynamic voltage scaling. While they were initially meant as a way to save power on laptops, they eventually became necessary even for desktop class CPUs.

Desktop

On a recent high end desktop CPU, single threaded performance will likely give you the turbo frequency for a while, but only if you carefully keep all other background threads to a minimum. Also, there’s a delay, which can be relatively long, before one CPU core scales all the way up to its maximum frequency.

Consumer CPUs are now great at bursting, not so much at broad sustained workloads. Get a Xeon for that. Googlers are spoiled here, yet the original question was likely asked while trying to reproduce a benchmark on a high end CPU.

Windows 10 has a Game Mode and I’m pretty sure it’s designed explicitly for this but the documentation is sadly vague. See GetExpandedResourceExclusiveCpuCount() to get exclusive access to one or more cores. The OS decides which one you get, not the app. It’s the OS that has the power to tell the CPU to lock a specific core into turbo boost mode. I am not aware of such functionality on other OSes.

Linux has a different knob, the CPU governor, which is configurable as root. Not all governors are available on all systems; the list of supported CPU governors is especially spotty on mobile CPUs.
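As a minimal sketch, assuming a Linux host that exposes the cpufreq sysfs interface, the active governor and the list of supported ones can be read without root. The helper names (readGovernor, parseGovernors) are made up for this illustration, and only core 0 is inspected. Changing the governor requires root and is not shown.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cpufreqDir is where Linux exposes frequency scaling settings for core 0.
// Assumption: a cpufreq driver is loaded. In VMs and containers this
// directory is often missing.
const cpufreqDir = "/sys/devices/system/cpu/cpu0/cpufreq"

// parseGovernors splits the whitespace separated list found in
// scaling_available_governors.
func parseGovernors(s string) []string {
	return strings.Fields(s)
}

// readGovernor returns the active governor and the supported governors for
// the core described by dir.
func readGovernor(dir string) (string, []string, error) {
	cur, err := os.ReadFile(filepath.Join(dir, "scaling_governor"))
	if err != nil {
		return "", nil, err
	}
	avail, err := os.ReadFile(filepath.Join(dir, "scaling_available_governors"))
	if err != nil {
		return "", nil, err
	}
	return strings.TrimSpace(string(cur)), parseGovernors(string(avail)), nil
}

func main() {
	cur, avail, err := readGovernor(cpufreqDir)
	if err != nil {
		fmt.Println("cpufreq not available:", err)
		return
	}
	fmt.Printf("current: %s, available: %v\n", cur, avail)
}
```

Logging this next to benchmark results makes it easier to tell later whether two runs were taken under the same governor.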

Heat 🔥

But before diving more into performance, first, I lied.

CPUs are not throttled by a heat dissipation limit. They are throttled by heat accumulation. Think of the CPU as a heat machine. The system’s input is heat production in Watts. The system’s output is cooling, also in Watts, negative. The capacity is the amount of accumulated heat, in Joules (W·s). In this case, the energy accumulated is measurable as temperature, in Kelvin. The CPU can measure its own temperature, and uses it to control thermal throttling.

So in the end, you have an energy accumulation budget, where you can get the CPU die to ramp up and accumulate more energy than it dissipates, but that’s temporary by design. And changing each core’s frequency implies delays, so there’s a trade-off the operating system must make.
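To make the budget idea concrete, here is a toy simulation, not a model of any real CPU. It assumes a single lumped heat capacity and cooling proportional to the temperature above ambient, and integrates dT/dt = (production - cooling) / capacity until a throttling limit is hit. Every number and name in it is invented for illustration.

```go
package main

import "fmt"

// model is a toy lumped thermal model. All values are invented
// illustrations, not measurements of any real CPU.
type model struct {
	capacity float64 // heat capacity, in J/K
	cooling  float64 // cooling strength, in W per K above ambient
	ambient  float64 // ambient temperature, in °C
	limit    float64 // temperature where throttling kicks in, in °C
}

// burstSeconds returns how long the die can absorb `watts` of heat
// production, starting at temperature `start`, before reaching the limit.
// It integrates dT/dt = (production - cooling) / capacity with a fixed step
// and returns maxSeconds if the limit is never reached, meaning the load is
// sustainable in this model.
func (m model) burstSeconds(watts, start, maxSeconds float64) float64 {
	const dt = 0.01 // integration step, in seconds
	temp := start
	for t := 0.0; t < maxSeconds; t += dt {
		if temp >= m.limit {
			return t
		}
		out := m.cooling * (temp - m.ambient)
		temp += (watts - out) / m.capacity * dt
	}
	return maxSeconds
}

func main() {
	m := model{capacity: 20, cooling: 1, ambient: 25, limit: 95}
	for _, start := range []float64{35, 60, 85} {
		fmt.Printf("start at %2.0f°C: %.1fs of burst at 120W\n",
			start, m.burstSeconds(120, start, 600))
	}
}
```

In this sketch a cooler starting temperature buys a longer burst before throttling, which is why the prior state of the CPU shows up in benchmark results.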

Benchmarking

The current state of CPU designs means that in practice performance measurement is affected by:

Background activity, because the CPU TDP is nowadays always lower than what would be needed if all of its functionality was used simultaneously.
Prior state, both the core frequency right before the measurement starts and the current CPU temperature.
CPU scaling variation delays, which mean that parts of the benchmark run at different frequencies. This is directly affected by the CPU governor.

You need to address all three variables to improve performance measurements.

The first one can only be addressed by running fewer processes on your system. It’s not so much about the priority of background processes; it’s about not having them scheduled at all. While running a single Go binary in user mode can be an extreme solution, typical workstations with many cores and no anti-virus installed will ‘generally’ be fine for CPU bound benchmarks.

The latter two points can be worked around in two ways:

Prewarm the CPU with a single threaded loop so that the CPU has already scaled up when the measurement starts. One or two seconds of prewarming will likely be enough.
- Note that doing something like a build to prewarm is an anti-pattern here: build systems will try to run on as many cores as possible.
- This only matters for single threaded benchmarks. If your benchmark runs concurrently (like go test -bench does by default), then you may as well prewarm with a build, so that all CPU cores are warmed up.
Set the CPU core frequency to a fixed value. This has the advantage of ensuring the best reproducibility, at the cost of peak performance.
- On Linux this can be done with the userspace CPU governor; cpupower frequency-set is a wrapper tool to help with that.

For Go users, the most practical way to improve the reproducibility remains to tell the go test tool to use a single thread:

go test -bench=. -cpu 1

and run it a few times in a row. Yes, this sounds silly, but this is the closest you’ll get without messing around with your CPU settings on a turbo boost enabled CPU.

Mobile

Mobile is worth its own post because it is very touchy. Here’s a short summary of what we learned on the Chrome infrastructure.

In practice mobile devices are limited by heat dissipation, so having active cooling on the physical device, without a case, will help. That’s where university thermodynamics courses come in handy.

The best way to increase the heat accumulation budget is to only start benchmarks when the CPU is under 35°C or so, so that the CPU can exceed the budget for a moment as the CPU temperature increases.

If the device is rooted, change its CPU governor to powersave to accelerate cool down between benchmarks.

Then you should be able to reliably measure peak performance.

An alternative is to heat the device first and only start your benchmark once it’s near breakdown, so you measure the worst case performance. This is not something I have personally tried.

Intel 8088 picture by Konstantin Lanzet