Stream: t-compiler/shrinkmem-sprint

Topic: sprint day 1 chat


view this post on Zulip pnkfelix (Mar 01 2021 at 12:52):

Hi everyone in @T-compiler/meeting ! It's the first day of the sprint. I really wish I had started this topic at midnight last night. :)

view this post on Zulip rylev (Mar 01 2021 at 12:53):

/me totally forgot the sprint was this week and thought it was next week.

view this post on Zulip rylev (Mar 01 2021 at 12:53):

/me furiously tries to clear schedule in a last-ditch effort.

view this post on Zulip pnkfelix (Mar 01 2021 at 12:53):

One of my hopes for the sprint is to encourage some extra socialization in a semi-casual manner. It's something we haven't really been able to do so well: it's one of the few weaknesses I'll admit of Zulip over Discord. (And of course it also doesn't help that we cannot meet in person right now.)

view this post on Zulip rylev (Mar 01 2021 at 12:55):

@pnkfelix shall we do a standup-style progress check every day in the following style:

view this post on Zulip pnkfelix (Mar 01 2021 at 12:56):

Yeah, that's exactly the sort of thing I'd like to see. I'm not sure if it's best off in this topic or in a different one. I guess having it here is good for now.

view this post on Zulip rylev (Mar 01 2021 at 12:56):

Maybe a dedicated "standup" topic?

view this post on Zulip pnkfelix (Mar 01 2021 at 12:56):

That is another option I was considering.

view this post on Zulip pnkfelix (Mar 01 2021 at 12:57):

or checkins

view this post on Zulip pnkfelix (Mar 01 2021 at 12:59):

in parallel, I've created the #t-compiler > shrinkmem sprint highlights topic. My hope is to post the most important bits of info there, so that people who are just watching from the sidelines in an infrequent manner can still see what's up.

view this post on Zulip rylev (Mar 01 2021 at 13:12):

By the way, in our "investigations" so far, @Wesley Wiser and I have discovered the not-so-surprising fact that LLVM dominates memory consumption (at least when compiling small crates like regex).

view this post on Zulip pnkfelix (Mar 01 2021 at 13:14):

yep, that's true

view this post on Zulip pnkfelix (Mar 01 2021 at 13:14):

I'd like to know if we can correlate the size of the codegen units we make with LLVM's memory usage

view this post on Zulip bjorn3 (Mar 01 2021 at 13:15):

When using cg_clif, the memory consumption of Cranelift should be tiny, given that only a single function is compiled at a time; once a function is done, only the compiled bytes and relocation info are stored until the object file corresponding to the codegen unit is written.

view this post on Zulip pnkfelix (Mar 01 2021 at 13:15):

(I.e. if there's a predictable multiplicative factor there.)

view this post on Zulip rylev (Mar 01 2021 at 13:15):

pnkfelix said:

I'd like to know if we can correlate the size of the codegen units we make with LLVM's memory usage

This would be pretty easy to do (as far as I can tell) on Windows by adding ETW annotations when creating codegen units. But obviously we don't want to come up with a solution that is Windows-only.

view this post on Zulip pnkfelix (Mar 01 2021 at 13:16):

@bjorn3 How is cg_clif doing with respect to other aspects of performance?

view this post on Zulip pnkfelix (Mar 01 2021 at 13:16):

@rylev do you or @Wesley Wiser know if there is some equivalent to getrusage you can call from within a Windows process?

view this post on Zulip pnkfelix (Mar 01 2021 at 13:16):

In particular, I'd like to know if we could generalize my PR #82532 to work on Windows as well as Linux.

view this post on Zulip pnkfelix (Mar 01 2021 at 13:17):

(in terms of making the process report the max-rss among its child processes.)
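For reference, the mechanism that kind of reporting builds on can be sketched like this on Linux. This is a minimal sketch, not the PR's actual code: the `struct rusage` layout is assumed to be glibc's on x86-64, the hand-written `extern` declaration stands in for the `libc` crate, and `children_max_rss_kb` is a hypothetical helper name.

```rust
use std::mem::MaybeUninit;
use std::process::Command;

#[repr(C)]
struct Timeval {
    tv_sec: i64,
    tv_usec: i64,
}

// Assumed layout of glibc's `struct rusage` on x86-64: two timevals
// followed by 14 longs, the first of which is `ru_maxrss` (kilobytes).
#[repr(C)]
struct Rusage {
    ru_utime: Timeval,
    ru_stime: Timeval,
    ru_maxrss: i64,
    _rest: [i64; 13],
}

const RUSAGE_CHILDREN: i32 = -1;

extern "C" {
    fn getrusage(who: i32, usage: *mut Rusage) -> i32;
}

/// Max resident set size, in KB, over all children that have terminated
/// and been waited for. Zero until the first child is reaped.
fn children_max_rss_kb() -> i64 {
    let mut ru = MaybeUninit::<Rusage>::uninit();
    let rc = unsafe { getrusage(RUSAGE_CHILDREN, ru.as_mut_ptr()) };
    assert_eq!(rc, 0, "getrusage failed");
    unsafe { ru.assume_init() }.ru_maxrss
}

fn main() {
    let before = children_max_rss_kb();
    // Spawn and reap a child; its peak RSS now folds into the counter.
    Command::new("true").status().expect("failed to run `true`");
    let after = children_max_rss_kb();
    println!("children max-rss: {} KB -> {} KB", before, after);
    assert!(after >= before);
}
```

Note that the counter only reflects children that have already terminated and been waited for, which is exactly the property discussed below.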

view this post on Zulip bjorn3 (Mar 01 2021 at 13:18):

The new Cranelift backend regressed performance when value debuginfo is used. (I should probably disable it for now, as no DWARF value debuginfo is emitted anyway.) But otherwise it's a ~30% perf improvement. That is with only a single thread doing compilation for cg_clif and a normal amount for cg_llvm.

view this post on Zulip pnkfelix (Mar 01 2021 at 13:19):

I assume that's a ~30% perf improvement in terms of compile times. What is the codegen quality at that point, compared to, say, llvm --opt-level=0?

view this post on Zulip rylev (Mar 01 2021 at 13:19):

pnkfelix said:

rylev do you or Wesley Wiser know if there is some equivalent to getrusage you can call from within a Windows process?

I don't believe there is an exact equivalent (though I'm not a win32 expert), but we should be able to find the same info. For example, GetProcessTimes offers some of the same info, albeit without max-rss.

view this post on Zulip bjorn3 (Mar 01 2021 at 13:20):

Codegen quality is roughly on par with cg_llvm + -Copt-level=0.

view this post on Zulip rylev (Mar 01 2021 at 13:20):

I think GetProcessMemoryInfo can help with maxrss

view this post on Zulip rylev (Mar 01 2021 at 13:21):

@Wesley Wiser should we coordinate on seeing if we can get https://github.com/rust-lang/rust/issues/82532 working on Windows?

view this post on Zulip pnkfelix (Mar 01 2021 at 13:22):

Cool; I'd just want to double-check whether GetProcessMemoryInfo handles only oneself, or if it will also handle child processes that have terminated (and that your process waited for, analogous to Linux).

view this post on Zulip rylev (Mar 01 2021 at 13:34):

So GetProcessMemoryInfo doesn't enumerate children like getrusage with RUSAGE_CHILDREN does. Perhaps we can enumerate the child processes ourselves and get that info, but I've been looking at other people's translations of getrusage for Windows and most don't support RUSAGE_CHILDREN, which is not a good sign.

view this post on Zulip oli (Mar 01 2021 at 13:36):

I added some ideas to the brainstorm hackmd about perf and mem tests that can be done on mir optimizations

view this post on Zulip pnkfelix (Mar 01 2021 at 14:19):

rylev said:

So GetProcessMemoryInfo doesn't enumerate children like getrusage with RUSAGE_CHILDREN does. Perhaps we can enumerate the child processes ourselves and get that info, but I've been looking at other people's translations of getrusage for Windows and most don't support RUSAGE_CHILDREN, which is not a good sign.

Part of the goal is to get the peak memory usage over the entirety of each process's lifetime; that's why it's part of the interface that you get the info for RUSAGE_CHILDREN only for terminated processes (and it's accumulated via the wait). Still, I agree that we might be able to hack something in ourselves; e.g., feed the peak memory value back up to the parent process manually. (But then again, maybe we would be just as well off focusing on reporting RUSAGE_SELF in some consistent way across the platforms.)
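The "feed it back up manually" idea could look roughly like this on Linux: each child reads its own peak RSS and reports it for the parent to aggregate. A sketch only; `VmHWM` in /proc/self/status is Linux-specific, the reporting channel here is just stdout, and `self_peak_rss_kb` is a hypothetical helper name.

```rust
use std::fs;

/// Peak resident set size of the current process, in kB, read from the
/// `VmHWM` line of /proc/self/status (Linux-specific).
fn self_peak_rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    for line in status.lines() {
        // The line looks like "VmHWM:      5678 kB".
        if let Some(rest) = line.strip_prefix("VmHWM:") {
            return rest.trim().trim_end_matches("kB").trim().parse().ok();
        }
    }
    None
}

fn main() {
    // A wrapper driving rustc could have each child print this at exit
    // and take the max across children in the parent process.
    match self_peak_rss_kb() {
        Some(kb) => println!("peak-rss: {} kB", kb),
        None => eprintln!("could not read VmHWM from /proc/self/status"),
    }
}
```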

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:11):

Should we be profiling with incremental compilation enabled or without? I guess more generally, what is the "scenario" we want to optimize for?

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:13):

personally the scenario I care about most is incr-full, since I have incremental turned on for rust-lang/rust

view this post on Zulip simulacrum (Mar 01 2021 at 15:17):

though it's a bit worrisome that you care about that scenario, as it sort of implies you care about "incremental, but on a clean build" which is odd :)

view this post on Zulip simulacrum (Mar 01 2021 at 15:17):

but we might not have anything better.

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:24):

I use it a lot when rebasing, which means libstd has been modified and the whole compiler needs to be rebuilt

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:25):

Also I use incremental a ton for patched builds, but it works well there :) so I don't think we need to focus on improving that

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:26):

I guess this isn't terribly surprising, but a clean incremental compilation has a significantly higher peak memory use on my machine than a clean non-incremental compilation.

Command  | Peak Memory Usage (MB)
build    | 3,161
build -i | 4,213

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:28):

Wesley Wiser said:

I guess this isn't terribly surprising, but a clean incremental compilation has a significantly higher peak memory use on my machine than a clean non-incremental compilation.

Command  | Peak Memory Usage (MB)
build    | 3,161
build -i | 4,213

to me that is surprising because it means incremental uses significantly more memory than LLVM

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:28):

which seems ... concerning

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:29):

incremental uses significantly more memory than LLVM

Wouldn't that mean peak would be 2x higher?

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:29):

Incremental uses a lot but I don't think it's more than LLVM.

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:29):

I don't follow? -i doesn't affect memory usage for LLVM, only for incremental

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:29):

Right but it's only 25% more memory

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:29):

If incremental used more than LLVM, wouldn't we see memory usage spike by at least 2x?

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:30):

maybe my assumptions are wrong - is the incremental graph held in memory at the same time LLVM is running?

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:30):

I think it has to be

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:30):

ok right because CGUs are cached

view this post on Zulip simulacrum (Mar 01 2021 at 15:30):

At least part of it, sure

view this post on Zulip Joshua Nelson (Mar 01 2021 at 15:30):

ignore me then

view this post on Zulip Wesley Wiser (Mar 01 2021 at 15:31):

Well at least until all the CGUs are ready to be optimized

view this post on Zulip Tyson Nottingham (Mar 01 2021 at 19:14):

Joshua Nelson said:

I don't follow? -i doesn't affect memory usage for LLVM, only for incremental

You might have meant something else, but it does affect it, in that it changes the number of codegen units. That means smaller units are in memory simultaneously during codegen and pre-LTO optimization, but more serialized units are in memory during thin local LTO.

view this post on Zulip Joshua Nelson (Mar 01 2021 at 19:18):

oh good point :laughing: I was wrong in many ways then

view this post on Zulip Tyson Nottingham (Mar 01 2021 at 19:30):

pnkfelix said:

I'd like to know if we can correlate the size of the codegen units we make with LLVM's memory usage

By the way, I have a PR out for adding time-passes events for each CGU codegen -- #81538. The output includes CGU size estimates (number of MIR statements IIRC):

        time:   2.682; rss: 1926MB -> 2060MB ( +134MB)  codegen_module(3k1mmmip9jmhdnm3, 740742)
        time:   3.829; rss: 2061MB -> 2176MB ( +115MB)  codegen_module(35g2r0huhfuoe5v3, 715671)
        time:   4.005; rss: 2279MB -> 2335MB (  +57MB)  codegen_module(3shjsbojvk8p48io, 513640)
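As a rough way to correlate those size estimates with memory growth, one could scrape the RSS deltas out of lines like the above. A sketch that assumes the sample's exact formatting (which is not a stable interface); `parse_codegen_line` is a hypothetical helper.

```rust
/// Parse a time-passes codegen line into (CGU size estimate, RSS delta in MB).
/// Assumes the format shown in the sample output above.
fn parse_codegen_line(line: &str) -> Option<(u64, i64)> {
    let line = line.trim();
    // RSS delta: the "( +134MB)" group is the first "( " in the line.
    let open = line.find("( ")?;
    let close = open + line[open..].find("MB)")?;
    let delta: i64 = line[open + 2..close].trim().parse().ok()?;
    // CGU size estimate: second argument of codegen_module(<hash>, <size>).
    let args = line.rfind("codegen_module(").map(|i| &line[i + 15..])?;
    let size: u64 = args.trim_end_matches(')').split(", ").nth(1)?.parse().ok()?;
    Some((size, delta))
}

fn main() {
    // The three sample lines from the PR's output, quoted above.
    let sample = [
        "time:   2.682; rss: 1926MB -> 2060MB ( +134MB)  codegen_module(3k1mmmip9jmhdnm3, 740742)",
        "time:   3.829; rss: 2061MB -> 2176MB ( +115MB)  codegen_module(35g2r0huhfuoe5v3, 715671)",
        "time:   4.005; rss: 2279MB -> 2335MB (  +57MB)  codegen_module(3shjsbojvk8p48io, 513640)",
    ];
    for line in sample {
        if let Some((size, delta)) = parse_codegen_line(line) {
            // Crude ratio: estimated MIR statements per MB of RSS growth.
            println!("size {:>7} stmts, rss {:+}MB, {:.0} stmts/MB",
                     size, delta, size as f64 / delta as f64);
        }
    }
}
```

With enough such lines one could check whether the size estimate is a usable predictor of per-CGU memory growth (subject to the parallelism caveat below).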

view this post on Zulip Tyson Nottingham (Mar 01 2021 at 19:34):

(Though you have to be careful of the RSS readings during codegen, because things are happening in parallel unless you control for it with -j1 or -Z no-parallel-llvm.)

view this post on Zulip tm (Mar 01 2021 at 20:18):

There is #65281 (unmerged), which included inlined items in CGU size estimates before merging them. This reduced the size dispersion of CGUs quite a bit (using no-opt.bc file sizes as the true value) when I was testing it.

view this post on Zulip pnkfelix (Mar 01 2021 at 20:22):

There is discussion at the tail end of #65281 suggesting that maybe we should retry landing it. That may be a relatively simple task for someone to look into.

view this post on Zulip pnkfelix (Mar 01 2021 at 20:22):

As an aside, it is sort of a sad observation about how much the current perf results “build on luck”, as andjo403 put it...

view this post on Zulip pnkfelix (Mar 01 2021 at 20:23):

Or maybe there are strictly smaller ways we could improve the estimation outlined in #69382 ...

view this post on Zulip cjgillot (Mar 01 2021 at 20:24):

Should we try to define a smarter partition based on the full MIR control-flow graph?

view this post on Zulip cjgillot (Mar 01 2021 at 20:26):

The query system can be extended to recover on cycles. With that modification, the full crate CFG can be computed efficiently using queries. Is this a relevant information to build codegen units?

view this post on Zulip pnkfelix (Mar 01 2021 at 20:26):

I was thinking of something smaller. E.g., trying to estimate the cost of drop terminators. The last comment on #69382 says that a drop shim receives a cost estimate of zero today...

view this post on Zulip tm (Mar 01 2021 at 20:29):

I looked into improving the estimate itself, but without the inclusion of inlined items in the estimate, the results in terms of CGU size balance were rather inconclusive, so I would suggest doing something like #65281 first. That said, size estimates are also used for scheduling (and at that point they do include inlined items), so maybe it would be worth it from that perspective.


Last updated: Oct 21 2021 at 20:33 UTC