Stream: t-compiler

Topic: Perf reproducibility

marmeladema (Apr 20 2020 at 13:22, on Zulip):

For a work-related project, we want to achieve something very similar to where we can see performance improvements / regressions over time.
But I wonder how you managed to get a setup that gives reproducible results?

marmeladema (Apr 20 2020 at 13:23, on Zulip):

like overcoming frequency scaling properly, etc.
because from my own experience, it's very hard to get something that you can compare from run to run

simulacrum (Apr 20 2020 at 13:23, on Zulip):

perf is not very reproducible :)

It's been an ongoing project to reduce variability -- currently, we have ASLR (both kernel and user space) disabled and that's pretty much it

I have core pinning, frequency scaling, etc. planned as things to look at
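For reference, a minimal sketch of checking the user-space ASLR setting on Linux before a benchmark run. The sysctl file `/proc/sys/kernel/randomize_va_space` is the standard kernel interface; the helper names `aslr_mode` and `describe_aslr` are made up for illustration, and nothing here is claimed to be how the perf machine is actually configured. Kernel ASLR is a boot-time setting (the `nokaslr` kernel parameter) and is not covered by this check.

```python
from pathlib import Path

# Standard Linux sysctl controlling user-space address randomization.
ASLR_SYSCTL = "/proc/sys/kernel/randomize_va_space"

# Meaning of the values, per the kernel sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}

def aslr_mode(path=ASLR_SYSCTL):
    """Return the integer ASLR mode read from the sysctl file."""
    return int(Path(path).read_text().strip())

def describe_aslr(mode):
    """Human-readable description of an ASLR mode (hypothetical helper)."""
    return ASLR_MODES.get(mode, "unknown mode")
```

Calling `describe_aslr(aslr_mode())` before a run makes it easy to log whether randomization was on. Writing `0` to the file as root turns it off system-wide, and `setarch -R <command>` turns it off for a single command.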

simulacrum (Apr 20 2020 at 13:23, on Zulip):

but for the most part we just go for instruction counts, which are mostly reliable
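One common way to get instruction counts on Linux is `perf stat`. The sketch below wraps it and parses its CSV output (`-x ,`). It assumes `perf` is installed and that the user is allowed to read the counter; the function names are hypothetical, and the thread does not say which tooling the perf machine itself uses.

```python
import subprocess

def parse_perf_csv(text, event="instructions"):
    """Extract one counter value from `perf stat -x ,` output.

    Each CSV line starts with: value, unit, event name, ...
    Returns None if the event is missing or was not counted.
    """
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue  # not a counter line (e.g. output of the command itself)
        value, _unit, name = fields[:3]
        # Strip modifiers such as ":u" before comparing event names.
        if name.split(":")[0] == event and value.isdigit():
            return int(value)
    return None

def count_instructions(cmd):
    """Run `cmd` (a list of arguments) under perf and return user-space instructions."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions:u", "--", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # perf stat reports on stderr by default
        text=True,
    )
    return parse_perf_csv(proc.stderr)
```

For example, `count_instructions(["rustc", "--version"])` would return an integer count, or `None` if the counter is unavailable (perf prints `<not supported>` or `<not counted>` in that case).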

marmeladema (Apr 20 2020 at 13:49, on Zulip):

Ok! Locally (i.e. on my laptop) I've isolated 1 physical core (2 threads) and I've removed as many IRQs as I can, etc.
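Isolating a core only helps if the benchmark actually runs on it. A minimal, Linux-only sketch of pinning the current process to one logical CPU is shown here; `pin_to_cpu` and `restore_affinity` are hypothetical helper names. Pinning alone does not keep other tasks off that CPU, which is what kernel-level isolation (for example the `isolcpus` boot parameter) is for.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Pin a process (0 = the caller) to one logical CPU; return the old CPU set."""
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, {cpu})
    return previous

def restore_affinity(previous, pid=0):
    """Give the process back the CPU set it had before pinning."""
    os.sched_setaffinity(pid, previous)
```

Child processes started after `pin_to_cpu` inherit the affinity, so a benchmark launched from the pinned process stays on the chosen CPU.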

marmeladema (Apr 20 2020 at 13:49, on Zulip):

But even with that, instruction cache can make things very noisy

marmeladema (Apr 20 2020 at 13:50, on Zulip):

Also laptop CPUs are totally unpredictable

marmeladema (Apr 20 2020 at 13:50, on Zulip):

Can I ask what kind of machine you run those tests on? And in which environment? Like containers etc?

simulacrum (Apr 20 2020 at 13:55, on Zulip):

no containers, this is on a cloud-provided machine (AMD Ryzen 5 3600 6-Core Processor)

simulacrum (Apr 20 2020 at 13:55, on Zulip):

in theory the machine is not shared though

marmeladema (Apr 20 2020 at 13:56, on Zulip):

Yeah ok that was my next question. So in theory, this machine is meant to run only rustc perf tests

simulacrum (Apr 20 2020 at 13:57, on Zulip):

that's the only thing we use it for (and IIRC, the cloud provider claims we have dedicated hardware, but I forget)

Pietro Albini (Apr 20 2020 at 13:59, on Zulip):

yep, hetzner claims it's a dedicated server

bjorn3 (Apr 20 2020 at 14:00, on Zulip):

I run echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to ensure that the cpu frequency changes as little as possible. You may want to set it to powersave afterwards.
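The same idea as a small sketch that remembers the previous governor of each CPU so it can be put back afterwards instead of hard-coding powersave. The sysfs path is the one from the command above; the function names and the `pattern` parameter are hypothetical, and writing to the real files needs root.

```python
import glob
from pathlib import Path

# Same sysfs files the shell one-liner writes to.
GOVERNOR_PATTERN = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"

def set_governor(governor, pattern=GOVERNOR_PATTERN):
    """Write `governor` to every matching file; return {path: previous value}."""
    previous = {}
    for path in sorted(glob.glob(pattern)):
        file = Path(path)
        previous[path] = file.read_text().strip()
        file.write_text(governor)
    return previous

def restore_governors(previous):
    """Write back the values returned by set_governor."""
    for path, governor in previous.items():
        Path(path).write_text(governor)
```

Typical use would be `saved = set_governor("performance")`, then the benchmark, then `restore_governors(saved)` in a `finally` block so the machine is not left at full frequency if the run fails.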

marmeladema (Apr 20 2020 at 14:01, on Zulip):

Did disabling ASLR really make a difference for reproducibility?

marmeladema (Apr 20 2020 at 14:02, on Zulip):

Also is there some kind of normalization step of your metrics / numbers? Or, if you change hardware, then all the data becomes irrelevant?

bjorn3 (Apr 20 2020 at 14:02, on Zulip):

I believe all data was removed when the hardware changed the last time.

Hanna Kruppe (Apr 20 2020 at 14:04, on Zulip):

Note that due to numerous microarchitectural features in modern CPUs, wall clock time (and other metrics correlated with it) can vary greatly depending on details of the overall code and data layout (e.g. sizes as well as relative and absolute addresses of individual code sections and heap/stack allocations) that are almost guaranteed to be perturbed as a side effect of essentially any code change. So it's questionable if "stable" performance measurements across code changes that should be performance-neutral are even possible to achieve.

simulacrum (Apr 20 2020 at 14:05, on Zulip):

yeah, we're mostly aiming for stable instruction counts (or as much as possible)

simulacrum (Apr 20 2020 at 14:05, on Zulip):

ah, yes, we also set scaling governor to performance, forgot about that

simulacrum (Apr 20 2020 at 14:06, on Zulip):

yeah I don't think anything we've done on the machine itself has had an appreciable impact on reproducibility -- there have been some changes to the compiler itself or how we build it, but not beyond that

marmeladema (Apr 20 2020 at 14:07, on Zulip):

@Hanna Kruppe yep I understand, but still I need to monitor perf for critical projects, and I wondered how people do it

marmeladema (Apr 20 2020 at 14:08, on Zulip):

And ideally I'd like to understand if a change has an impact or not

Hanna Kruppe (Apr 20 2020 at 14:12, on Zulip):

Well, the answer is they mostly don't address this problem :) Either side-stepping it by looking at more stable but limited measures like instruction count or something higher-level, or trying to remove as much variance as possible (some techniques discussed above) and then trying to account for the remaining "noise" with rules of thumb, often with questionable results IMO. The only rigorous approach I know of is Stabilizer, but as far as I can tell it hasn't achieved much adoption.
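As a generic illustration of treating the remaining noise statistically rather than with a rule of thumb, here is a sketch of Welch's t statistic for two sets of repeated measurements. It is not Stabilizer and does nothing about the layout effects described above; it only puts a number on how large a difference is relative to the run-to-run spread. The function name and the sample data in the usage note are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(before, after):
    """Welch's t statistic for two samples with possibly unequal variance.

    Positive values mean `after` is larger on average than `before`.
    """
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two measurements per sample")
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    if se == 0:
        raise ValueError("no variance in either sample; t is undefined")
    return (mean(after) - mean(before)) / se
```

For example, `welch_t([10.1, 10.3, 9.9], [10.2, 10.0, 10.4])` gives a small value because the two sets overlap almost entirely. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; SciPy's `scipy.stats.ttest_ind(..., equal_var=False)` does both steps.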

marmeladema (Apr 20 2020 at 14:15, on Zulip):

Thank you for the input and the pointer to the paper, I'll read and share it :)

marmeladema (Apr 20 2020 at 17:32, on Zulip):

@Hanna Kruppe I gave the paper a quick read, it's interesting. I lack the knowledge in statistics though ^^ It's sad that it is unmaintained :(

Last update: May 29 2020 at 17:20UTC