Stream: t-compiler/shrinkmem-sprint

Topic: Memory profiling tooling


view this post on Zulip rylev (Feb 15 2021 at 12:45):

@pnkfelix Are you hoping to find a way to profile the rustc compilation process explicitly or to collect system wide profiling analytics which allow you to dive into the specifics of the rustc process? On Windows, I'm hoping to be able to gather analytics directly related to rustc, but I'll mostly be going through system wide analytics (through Windows Performance Recorder [WPR] and Windows Performance Analyzer [WPA]).

view this post on Zulip nagisa (Feb 15 2021 at 12:55):

there might be interesting correlations system wide too. like for instance, paging file behaviour is closely but sometimes not obviously related to the memory use.

view this post on Zulip pnkfelix (Feb 15 2021 at 14:48):

I'm planning to focus on DHAT for now, which IIUC will only focus on a single process (or process tree rooted at that process? Not sure.) System-wide stuff might be interesting, and I want to spend time better understanding my options there, but I don't think I can guarantee have something up and running on that front this week.

view this post on Zulip pnkfelix (Feb 15 2021 at 14:50):

(I am skimming over parts of Brendan Gregg's book on "Systems Performance", which has been on my bookshelf for probably 4 or 5 years now, to get a better feel of what tools are available here.)

view this post on Zulip pnkfelix (Feb 18 2021 at 16:03):

so one obvious issue with us using only DHAT is that its not really appropriate for gather data on arbitrary runs

view this post on Zulip pnkfelix (Feb 18 2021 at 16:03):

due to the overhead it imposes

view this post on Zulip pnkfelix (Feb 18 2021 at 16:05):

it will probably be more appropriate to settle on distinct tools: a minimal overhead tool that we can apply aggressively (i.e. to every rustc run during x.py build), and a more invasive one like DHAT for when we identify a specific rustc invocation that we want more information out of

view this post on Zulip pnkfelix (Feb 18 2021 at 20:43):

pnkfelix said:

so one obvious issue with us using only DHAT is that its not really appropriate for gather data on arbitrary runs

oh, actually, this might have been … false?

view this post on Zulip pnkfelix (Feb 18 2021 at 20:43):

(I misinterpreted data from a run that I was doing overnight that got interrupted by my putting the computer to sleep without checking if it was still running the build!)

view this post on Zulip pnkfelix (Feb 18 2021 at 20:45):

At least, I just did a bootstrap where it took 6m30s to bootstrap without dhat, and a subsequent clean+build with RUSTC_WRAPPER=dhat took 7m01s.

view this post on Zulip Tyson Nottingham (Feb 18 2021 at 20:55):

The rustc-perf collector readme says DHAT and Massif both incur a 5-20x slowdown, and that seems right from what I've seen. I've tried running DHAT on rustc_middle compilation before, and gave up when it it was taking too long. I can't remember how long that was though. I've run Massif successfully on rustc_middle, FWIW.

view this post on Zulip pnkfelix (Feb 18 2021 at 21:05):

I must be doing something wrong with DHAT

view this post on Zulip pnkfelix (Feb 18 2021 at 21:06):

In hindsight, the data it is spitting out seems incomplete. I wasn't using --trace-children=true, but I don't think I should have to do that just to see the memory profile of the single rustc process..

view this post on Zulip simulacrum (Feb 18 2021 at 21:08):

are you running rustc via the rustup shim?

view this post on Zulip pnkfelix (Feb 18 2021 at 21:08):

oh ... oh dear

view this post on Zulip pnkfelix (Feb 18 2021 at 21:09):

that must be what's happening

view this post on Zulip pnkfelix (Feb 18 2021 at 21:10):

but ... I thought the whole point is that x.py won't do that?

view this post on Zulip simulacrum (Feb 18 2021 at 21:10):

wait, how are you running it?

view this post on Zulip pnkfelix (Feb 18 2021 at 21:10):

/me turns on --verbose mode

view this post on Zulip simulacrum (Feb 18 2021 at 21:10):

x.py won't go through the rustup shim

view this post on Zulip simulacrum (Feb 18 2021 at 21:10):

but it has its own shim

view this post on Zulip pnkfelix (Feb 18 2021 at 21:11):

ah

view this post on Zulip simulacrum (Feb 18 2021 at 21:11):

if you want dhat or similar, I'd strongly recommend directly running build/x86_.../stage1/bin/rustc

view this post on Zulip pnkfelix (Feb 18 2021 at 21:11):

yeah okay

view this post on Zulip pnkfelix (Feb 18 2021 at 21:11):

I want to start with developing some recipe for gathering data per-crate

view this post on Zulip pnkfelix (Feb 18 2021 at 21:11):

with low-overhead

view this post on Zulip simulacrum (Feb 18 2021 at 21:12):

within x.py?

view this post on Zulip pnkfelix (Feb 18 2021 at 21:13):

yes

view this post on Zulip pnkfelix (Feb 18 2021 at 21:13):

or

view this post on Zulip pnkfelix (Feb 18 2021 at 21:13):

at least by setting RUSTC_WRAPPER

view this post on Zulip pnkfelix (Feb 18 2021 at 21:13):

i.e. try to keep the current build workflow as unchanged as I can

view this post on Zulip simulacrum (Feb 18 2021 at 21:14):

hm

view this post on Zulip pnkfelix (Feb 18 2021 at 21:14):

/usr/bin/time might be fine

view this post on Zulip simulacrum (Feb 18 2021 at 21:15):

we already have a mode to print time (via std::time::instant) in rustbuild

view this post on Zulip simulacrum (Feb 18 2021 at 21:15):

er

view this post on Zulip pnkfelix (Feb 18 2021 at 21:15):

does it print RSS too?

view this post on Zulip pnkfelix (Feb 18 2021 at 21:15):

/usr/bin/time -v will show memory info

view this post on Zulip simulacrum (Feb 18 2021 at 21:15):

not currently, but we could add it

view this post on Zulip simulacrum (Feb 18 2021 at 21:15):

/usr/bin/time is platform specific, I guess anything in rustbuild would be too

view this post on Zulip simulacrum (Feb 18 2021 at 21:16):

not sure what's better

view this post on Zulip pnkfelix (Feb 18 2021 at 21:16):

tell people to install the GNU time?

view this post on Zulip simulacrum (Feb 18 2021 at 21:16):

probably easier to give instructions if it's integrated and we support whatever is needed for windows, macos, and linux to get rss data

view this post on Zulip pnkfelix (Feb 18 2021 at 21:16):

but yeah, I think you're right: Add it to rustbuild

view this post on Zulip simulacrum (Feb 18 2021 at 21:17):

I would add as a separate flag from the current time one (but potentially also something we enable on perf.rlo etc); the time output is parsed in theory by some scripts, and certainly by perf.rlo, so I wouldn't want to change it ad-hoc too much

view this post on Zulip pnkfelix (Feb 18 2021 at 21:18):

right, I won't perturb current output format.

view this post on Zulip pnkfelix (Feb 18 2021 at 21:41):

@simulacrum is RUSTC_PRINT_STEP_TIMINGS the thing you're referencing above?

view this post on Zulip simulacrum (Feb 18 2021 at 21:41):

yes

view this post on Zulip simulacrum (Feb 18 2021 at 21:41):

I think there's a config.toml option for it

view this post on Zulip simulacrum (Feb 18 2021 at 21:41):

we just pass state to the shim scripts via environment

view this post on Zulip pnkfelix (Feb 18 2021 at 21:42):

oh, you're right, there's a print_step_timings key mentioned in bootstrap/config.rs

view this post on Zulip pnkfelix (Feb 18 2021 at 22:01):

hmm, I had forgotten that -Ztime-passes already spits out rss info

view this post on Zulip pnkfelix (Feb 18 2021 at 22:01):

(at a finer-grain, but still)

view this post on Zulip simulacrum (Feb 18 2021 at 22:04):

I think only on linux?

view this post on Zulip pnkfelix (Feb 18 2021 at 22:04):

that might explain why I've never seen it (since I've tended to use -Ztime-passes on my mac alone), I think

view this post on Zulip pnkfelix (Feb 18 2021 at 22:09):

rustc_data_structures::profiling::get_resident_set_size looks like its trying to do something on Windows

view this post on Zulip pnkfelix (Feb 18 2021 at 22:10):

ah, I see: that function won't work for "Mac OS X", because its leveraging the /proc file system

view this post on Zulip pnkfelix (Feb 18 2021 at 22:11):

but I could replace it with a getrusage call

view this post on Zulip pnkfelix (Feb 18 2021 at 22:11):

there must already be a crate for this

view this post on Zulip simulacrum (Feb 18 2021 at 22:11):

rustc-perf handrolls it

view this post on Zulip simulacrum (Feb 18 2021 at 22:11):

with getrusage

view this post on Zulip pnkfelix (Feb 18 2021 at 22:12):

oh, I'm having flashbacks to my PhD days

view this post on Zulip pnkfelix (Feb 18 2021 at 22:12):

"probably misreports RSS on macOS"

view this post on Zulip pnkfelix (Feb 18 2021 at 22:12):

(from description for https://crates.io/crates/getr )

view this post on Zulip pnkfelix (Feb 18 2021 at 22:13):

oh dear

view this post on Zulip pnkfelix (Feb 18 2021 at 22:13):

they didn't even try to make the struct OS speciifc

view this post on Zulip simulacrum (Feb 18 2021 at 22:14):

https://docs.rs/sysinfo/0.16.3/sysinfo/trait.ProcessExt.html#tymethod.memory is very heavy as a dependency I imagine, but I guess is cross platform :)

view this post on Zulip pnkfelix (Feb 18 2021 at 22:15):

I have to go, but I'll look more at this later.

view this post on Zulip The 8472 (Feb 19 2021 at 11:34):

To record large allocations one could also lower the size at which the malloc implementation does a direct mmap allocation and then perf record mmap syscalls.

view this post on Zulip tm (Feb 19 2021 at 12:11):

rustc compiles jemalloc with stats enabled, so that would be another potential source of information (there is also memory profiling support, but disabled by default).

view this post on Zulip pnkfelix (Feb 19 2021 at 15:53):

pnkfelix said:

they didn't even try to make the struct OS speciifc

Is this struct actually defined with a common type (or at least prefix) on Linux and Mac? That does not match my recollection from over a decade ago, but maybe I'm thinking of a different struct

view this post on Zulip pnkfelix (Feb 19 2021 at 15:55):

seems like its at least worth taking a shot here

view this post on Zulip pnkfelix (Feb 19 2021 at 20:59):

how it started: “worth taking a shot”. how its going: "[RUSTC-TIMING] core test:false 32.757 max rss (kb): 121542246"

view this post on Zulip pnkfelix (Feb 19 2021 at 21:00):

To be clear, maxrss report seems okay on Linux. (Above is from my Mac, after dividing by 1024 to correct for the bytes/kb discrepancy between Linux+Mac. So, something is still wrong.)

view this post on Zulip simulacrum (Feb 19 2021 at 22:15):

@pnkfelix imo we should report in bytes and do that multiplication in the code

view this post on Zulip simulacrum (Feb 19 2021 at 22:15):

it's been a long-standing annoyance that perf.rlo doesn't

view this post on Zulip pnkfelix (Feb 20 2021 at 00:16):

I don’t have a preference about whether the metric is reported in kb or in bytes. Just needs to be consistent (or, if its inconsistent, then the unit needs to be attached to the value as a suffix)

view this post on Zulip pnkfelix (Mar 01 2021 at 13:13):

(for people following along: my PR #82532 "Add build.print_step_rusage to config.toml" landed an hour ago. I suspect it only works reliably on Linux, as noted above.)

view this post on Zulip pnkfelix (Mar 04 2021 at 03:58):

pnkfelix said:

so one obvious issue with us using only DHAT is that its not really appropriate for gather data on arbitrary runs

some concrete data here: Running dhat on the compile of rustc_query_impl took a little less than ... 6 hours on my machine.

view this post on Zulip pnkfelix (Mar 04 2021 at 03:59):

(it normally takes about 80 seconds.)

view this post on Zulip pnkfelix (Mar 04 2021 at 04:00):

so if my math is right, that's a 270x overhead. I wonder if something's wrong, or if the overhead is not O(1).

view this post on Zulip pnkfelix (Mar 04 2021 at 14:30):

Decided to try a similar experiment of running massif on the compile of rustc_query_impl. The good news is that this "finished" in a little over 2 hours. The bad news is I put "finished" in quotes because it didn't complete successfully: Valgrind signaled an Internal Error:

view this post on Zulip pnkfelix (Mar 04 2021 at 14:31):

--776900-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--776900-- si_code=2;  Faulting address: 0x104E5DCFB8;  sp: 0x104bcced30

valgrind: the 'impossible' happened:
   Killed by fatal signal

view this post on Zulip Joshua Nelson (Mar 04 2021 at 14:33):

pnkfelix said:

so if my math is right, that's a 270x overhead. I wonder if something's wrong, or if the overhead is not O(1).

DHAT has definitely taken outrageous amounts of time to run on rustdoc before

view this post on Zulip Joshua Nelson (Mar 04 2021 at 14:34):

https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/DHAT.20and.20rustdoc/near/219669766


Last updated: Oct 21 2021 at 21:32 UTC