For a work related project, we want to achieve something very similar to perf.rust-lang.org where we can see performance improvements / regressions over time.
But I wonder, how did you manage to have a setup that gives reproducible results?
Like overcoming frequency scaling properly, etc.
Because from my own experience, it's very hard to have something you can compare from run to run
perf is not very reproducible :)
It's been an ongoing project to reduce variability -- currently, we have ASLR (both kernel and user space) disabled and that's pretty much it
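For reference, disabling ASLR on Linux usually looks something like this (a sketch of the standard knobs; I don't know if this is exactly how the rustc-perf machine is configured):

```shell
# Disable address-space randomization system-wide
# (0 = off, 2 = full randomization; requires root)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Or disable it only for one process tree, without touching the system:
setarch "$(uname -m)" --addr-no-randomize ./my-benchmark
```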
I have core pinning, frequency scaling, etc. planned as things to look at
but for the most part we just go for instruction counts, which are mostly reliable
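E.g. with `perf` you can count retired user-space instructions directly, which tends to be far more stable than wall time (illustrative invocation; `./my-benchmark` is a placeholder):

```shell
# Count user-space instructions over 5 runs;
# perf reports the mean and the run-to-run spread (+- %)
perf stat -e instructions:u -r 5 ./my-benchmark
```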
OK! Locally (i.e. on my laptop) I've isolated 1 physical core (2 threads) and I've removed as many IRQs as I can, etc.
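For anyone else trying this, the rough recipe is something like the following (CPU numbers and masks are just examples for a hypothetical machine, not what I ran verbatim):

```shell
# Kernel command line: keep the scheduler and timer ticks off CPUs 2-3,
# i.e. one physical core plus its hyperthread sibling:
#   isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3

# Steer IRQs away from the isolated CPUs (mask 0x3 = CPUs 0-1 only);
# some IRQs refuse to move, so errors here can be ignored
for irq in /proc/irq/*/smp_affinity; do
  echo 3 | sudo tee "$irq" > /dev/null 2>&1 || true
done

# Pin the benchmark to the isolated core
taskset -c 2 ./my-benchmark
```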
But even with that, instruction cache can make things very noisy
Also laptop cpus are totally unpredictable
Can I ask what kind of machine you run those tests on? And in which environment? Containers, etc.?
no containers, this is on a cloud provided machine (AMD Ryzen 5 3600 6-Core Processor)
in theory the machine is not shared though
Yeah, OK, that was my next question. So in theory, this machine is meant to run only rust perf tests
that's the only thing we use it for (and IIRC, the cloud provider claims we have dedicated hardware, but I forget)
yep, hetzner claims it's a dedicated server
You may want to set the scaling governor to performance to ensure that the CPU frequency changes as little as possible: echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Did disabling ASLR really make a difference for reproducibility?
Also is there some kind of normalization step of your metrics / numbers? Or, if you change hardware, then all the data becomes irrelevant?
I believe all data was removed when the hardware changed the last time.
Note that, due to numerous microarchitectural features in modern CPUs, wall clock time (and other metrics correlated with it) can vary greatly depending on details of the overall code and data layout (e.g. sizes as well as relative and absolute addresses of individual code sections and heap/stack allocations) that are almost guaranteed to be perturbed as a side effect of essentially any code change. So it's questionable whether "stable" performance measurements across code changes that should be performance-neutral are even possible to achieve.
yeah, we're mostly aiming for stable instruction counts (or as much as possible)
ah, yes, we also set scaling governor to performance, forgot about that
yeah I don't think anything we've done on the machine itself has had appreciable impact on reproducibility -- there's been some changes to the compiler itself or how we build it, but not beyond that
@Hanna Kruppe yep, I understand, but I still need to monitor performance for critical projects, and I wondered how people do it
And ideally i'd like to understand if a change has an impact or not
Well, the answer is they mostly don't address this problem :) Either side-stepping it by looking at more stable but limited measures like instruction count or something higher-level, or trying to remove as much variance as possible (some techniques discussed above) and then trying to account for the remaining "noise" with rules of thumb, often with questionable results IMO. The only rigorous approach I know of is Stabilizer (<https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf>), but as far as I can tell it hasn't achieved much adoption.
Thank you for the input and for the paper, I'll read it and share that :)
@Hanna Kruppe I gave the paper a quick read, it's interesting. I lack the knowledge in statistics though ^^ It's sad that https://github.com/ccurtsinger/stabilizer is unmaintained :(