Stream: general

Topic: benchmarking on a thermally-constrained laptop


Jake Goulding (Jun 23 2019 at 16:11, on Zulip):

I’ve a MacBook with a 6-core i9 at 2.9 GHz. I’m trying to do some CPU-related benchmarking but I’m seeing some pretty big swings (e.g. ~14%) when I haven’t changed code. I’m guessing that things like Turboboost and thermal load are affecting me.

Does anyone know of a way I can construct a more stable benchmark on this machine?
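(Not part of the original thread.) Before changing anything on the machine, it can help to put a number on the run-to-run swing. The following is a minimal sketch, assuming a plain wall-clock measurement; `busy_loop` is a hypothetical stand-in for the code under test, and `relative_spread` and `time_runs` are illustrative names, not part of any benchmarking library.

```python
import statistics
import time


def relative_spread(samples):
    """Return (max - min) / min for a list of positive timings."""
    lo, hi = min(samples), max(samples)
    return (hi - lo) / lo


def time_runs(workload, runs=10):
    """Call `workload()` `runs` times; return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def busy_loop(n=200_000):
    # Hypothetical stand-in for the code being benchmarked.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    timings = time_runs(busy_loop)
    print(f"median: {statistics.median(timings):.6f} s")
    print(f"spread: {relative_spread(timings):.1%}")
```

A spread that stays large with unchanged code points at the environment (frequency scaling, thermal state, background load) rather than at the code.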

nagisa (Jun 23 2019 at 18:07, on Zulip):

If it has anything resembling firmware configuration, you could go there and underclock your CPU, which by extension would disable boosting and any sort of frequency adjustment.

nagisa (Jun 23 2019 at 18:10, on Zulip):

your best bet is to put your macbook into a freezer and load the null cpufreq kmod (don’t remember what it is called, sadly)

nagisa (Jun 23 2019 at 18:15, on Zulip):

there is a kext to disable turboboost too, so coupled with null cpufreq kext you should be in a pretty stable situation https://github.com/nanoant/DisableTurboBoost.kext

nagisa (Jun 23 2019 at 18:16, on Zulip):

There’s still thermal throttling, and you won’t be able to deal with it without a higher difference in temperature between your rads and air.

RalfJ (Jun 23 2019 at 19:14, on Zulip):

@Jake Goulding have you tried other metrics, such as instruction count? it's usually much more stable.

RalfJ (Jun 23 2019 at 19:14, on Zulip):

on Linux perf can measure that with very little overhead, no idea about macOS though.
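(Not part of the original thread.) A rough sketch of what measuring instruction counts with perf could look like from a script, assuming a Linux machine with `perf` installed and permission to read the counters. It relies on `perf stat -x,`, which prints one CSV row per event to stderr with the counter value first and the event name third. The function names are illustrative only.

```python
import subprocess


def parse_perf_csv(stderr_text):
    """Map event name -> integer count from `perf stat -x,` CSV output."""
    counts = {}
    for line in stderr_text.splitlines():
        fields = line.split(",")
        # Row layout: value, unit, event name, ... ; skip everything else,
        # including rows whose value is "<not supported>" or "<not counted>".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        counts[fields[2]] = int(fields[0])
    return counts


def count_events(cmd, events=("instructions:u", "cycles:u")):
    """Run `cmd` under `perf stat` and return its event counts (Linux only)."""
    perf_cmd = ["perf", "stat", "-x,", "-e", ",".join(events), "--", *cmd]
    # perf writes its statistics to stderr; the command's own stderr lands
    # there too, so this assumes the command does not print CSV-like lines.
    result = subprocess.run(
        perf_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    return parse_perf_csv(result.stderr)
```

Calling `count_events(["./my-benchmark"])` (a hypothetical binary) several times and comparing the `instructions:u` values against the `cycles:u` values is one way to see the difference in variance for yourself.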

Jake Goulding (Jun 23 2019 at 19:51, on Zulip):

@RalfJ that’s a good point. I could even run in a VM for that, since the actual wallclock wouldn’t matter as much

Jake Goulding (Jun 23 2019 at 19:51, on Zulip):

I’m not super excited about putting my laptop in a freezer :upside_down:

nagisa (Jun 23 2019 at 20:34, on Zulip):

Instruction count is only useful if you know that instructions have approximately the same throughput and latency, which they don’t.

nagisa (Jun 23 2019 at 20:35, on Zulip):

However, what is very useful is the number of cycles :slight_smile:

nagisa (Jun 23 2019 at 20:35, on Zulip):

(also measured by perf)

RalfJ (Jun 23 2019 at 21:11, on Zulip):

however number of cycles has almost as much variance as time

RalfJ (Jun 23 2019 at 21:11, on Zulip):

so while instruction count is a worse proxy for performance than cycles, it can be measured much better, making it overall more useful

RalfJ (Jun 23 2019 at 21:11, on Zulip):

that's why instructions is also the default view for rust-perf

RalfJ (Jun 23 2019 at 21:12, on Zulip):

(there was a long post where someone from the team explained that all in more detail but I cannot find it now)

RalfJ (Jun 23 2019 at 21:14, on Zulip):

if you want to get a visual idea for the variance, some pictures from perf measurement of a research project I am involved in:
instructions: two nice clear lines with basically constant separation: https://coq-speed.mpi-sws.org/d/Ne7jkX6kk/coq-speed?orgId=1&from=1556168879192&to=1558362701773&var-metric=instructions&var-project=iris&var-branch=master&var-config=All&var-group=().*
time: just a mess: https://coq-speed.mpi-sws.org/d/Ne7jkX6kk/coq-speed?orgId=1&from=1556168879192&to=1558362701773&var-metric=time&var-project=iris&var-branch=master&var-config=All&var-group=().*
cycles are less messy but still don't give a useful signal here: https://coq-speed.mpi-sws.org/d/Ne7jkX6kk/coq-speed?orgId=1&from=1556168879192&to=1558362701773&var-metric=cycles&var-project=iris&var-branch=master&var-config=All&var-group=().*
(beware, non-zeroed y axis)

RalfJ (Jun 23 2019 at 21:15, on Zulip):

this is on a system where we did what we could, with our limited knowledge, to isolate things in terms of performance (it's a 2-socket system with one socket entirely reserved to the thing being benchmarked)

RalfJ (Jun 23 2019 at 21:16, on Zulip):

ah here's the post by nnethercote I was mentioning earlier: https://internals.rust-lang.org/t/what-is-perf-rust-lang-org-measuring-and-why-is-instructions-u-the-default/9815/5?u=ralfjung

Jake Goulding (Jun 23 2019 at 21:44, on Zulip):

Now, can I trick criterion into gathering those numbers and making a pretty graph

RalfJ (Jun 24 2019 at 08:24, on Zulip):

if you find out how, please tell us!

nagisa (Jun 24 2019 at 15:35, on Zulip):

@RalfJ is that branchy code?

nagisa (Jun 24 2019 at 15:35, on Zulip):

Dedicating a single core is not sufficient when cache is shared.

Jake Goulding (Jun 24 2019 at 19:29, on Zulip):

@RalfJ looks like it’s getting closer to possible: https://github.com/bheisler/criterion.rs/issues/130

RalfJ (Jun 24 2019 at 19:40, on Zulip):

@nagisa

> Dedicating a single core is not sufficient when cache is shared.

I said single socket

RalfJ (Jun 24 2019 at 19:40, on Zulip):

as in, entire physical CPU. dedicated L1, L2, L3 caches and even dedicated memory controller, if the OS doesn't screw up and allocate memory attached to the other socket.

RalfJ (Jun 24 2019 at 19:40, on Zulip):

that machine has two of them, each with 10 cores and HT (so 40 virtual cores total)
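(Not part of the original thread.) A minimal sketch of pinning a process to all logical CPUs of one socket on Linux, assuming the usual sysfs topology files are present. This is not necessarily how the setup described here was built, and the function names are illustrative. It only restricts where the process runs: it does not keep other processes off that socket, and it does not bind memory to the socket's memory controller, which needs a separate tool such as numactl.

```python
import glob
import os


def read_package_ids(sysfs_root="/sys/devices/system/cpu"):
    """Map logical CPU number -> physical package (socket) id via Linux sysfs."""
    ids = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "topology", "physical_package_id")
    for path in glob.glob(pattern):
        cpu_dir = path.split(os.sep)[-3]  # e.g. "cpu12"
        with open(path) as f:
            ids[int(cpu_dir[3:])] = int(f.read())
    return ids


def cpus_on_socket(package_ids, socket):
    """Return the set of logical CPUs that belong to the given socket."""
    return {cpu for cpu, pkg in package_ids.items() if pkg == socket}


def pin_to_socket(socket):
    """Restrict the calling process to one socket's CPUs (Linux only)."""
    cpus = cpus_on_socket(read_package_ids(), socket)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Child processes inherit the affinity mask, so calling `pin_to_socket(1)` before spawning the benchmark keeps it on that socket's cores.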

RalfJ (Jun 24 2019 at 19:40, on Zulip):

> @RalfJ is that branchy code?

no idea. this is running Coq to check proofs we wrote. Coq is written in OCaml.

Jake Goulding (Jun 24 2019 at 19:42, on Zulip):

Anyone know how to read the instruction count? Is that a magic x86 register thing?

RalfJ (Jun 24 2019 at 19:44, on Zulip):

I know how to do it with perf^^

RalfJ (Jun 24 2019 at 19:44, on Zulip):

which uses special Linux kernel APIs

Jake Goulding (Jun 24 2019 at 19:46, on Zulip):

pffft. I like writing that low-level assembly

RalfJ (Jun 24 2019 at 19:46, on Zulip):

well perf does way more than this magic instruction that also exists

Jake Goulding (Jun 24 2019 at 19:46, on Zulip):

There’s http://gz.github.io/rust-perfcnt/perfcnt/, but it’s Linux-only for that reason

Jake Goulding (Jun 24 2019 at 19:47, on Zulip):

sure, but how much of that is needed in the context of e.g. criterion?

RalfJ (Jun 24 2019 at 19:47, on Zulip):

it's some kind of statistical thing where the CPU takes snapshots of stuff and runs a bit of privileged code to compress it enough to allow streaming the rest to disk for post-mortem analysis or whatever... you end up being told in which function all the instructions are spent etc. a bit like callgrind, but without the overhead. it's pure magic.

RalfJ (Jun 24 2019 at 19:47, on Zulip):

> sure, but how much of that is needed in the context of e.g. criterion?

I have no idea. ;) Just know the user side of this.

nagisa (Jun 24 2019 at 20:07, on Zulip):

Perfcounters are most likely not directly accessible by user-space programs.

nagisa (Jun 24 2019 at 20:08, on Zulip):

So your options are to a) figure out what (possibly undocumented) syscall things like Instruments use to measure that or b) write a kernel module which exposes that data to you

simulacrum (Jun 24 2019 at 20:15, on Zulip):

dtrace should be similar enough to perf on macOS I think

simulacrum (Jun 24 2019 at 20:15, on Zulip):

I know very little about it though

Wesley Wiser (Jun 24 2019 at 20:29, on Zulip):

If you have SIP enabled on macOS, you're going to have a bad time trying to use dtrace. For example: https://www.reddit.com/r/macsysadmin/comments/ahrn10/has_apple_broken_dtruss_in_mojave/

RalfJ (Jun 24 2019 at 20:46, on Zulip):

I gather SIP here is not the VoIP protocol...?

simulacrum (Jun 24 2019 at 20:51, on Zulip):

system integrity protection

RalfJ (Jun 24 2019 at 20:52, on Zulip):

ah, that makes so much more sense :D

nagisa (Jun 24 2019 at 21:35, on Zulip):

~Hackers are strong tonight~
~They will profile your code~
~Figure out instructions~
~And their costs~
~For much improved security~
~We disabled perfcounters~

Last update: Nov 20 2019 at 11:35UTC