Stream: t-compiler/shrinkmem-sprint

Topic: brainstorming hackmd


view this post on Zulip pnkfelix (Feb 12 2021 at 16:03):

Created hackmd for people to collaboratively post ideas/notes on, as discussed in #t-compiler/meetings > [planning meeting] 2021-02-12

view this post on Zulip Joshua Nelson (Feb 12 2021 at 16:11):

I can start by adding things from the meeting :)

view this post on Zulip Joshua Nelson (Feb 12 2021 at 16:19):

I also want to mention that "send less IR to LLVM" should help with memory usage, and I've had a vague idea for a while of how to do it: https://github.com/rust-lang/rust/issues/77960

view this post on Zulip cjgillot (Feb 12 2021 at 19:47):

I added a few bullet points to the HackMD.
@Tyson Nottingham: do you have ideas from your investigations of the DepGraph and query system?

view this post on Zulip pnkfelix (Feb 12 2021 at 20:12):

wow great start everyone!

view this post on Zulip Tyson Nottingham (Feb 12 2021 at 20:15):

cjgillot said:

Tyson Nottingham: do you have ideas from your investigations of the DepGraph and query system?

Sure. I'll add some notes to the HackMD when I get a chance. I've been looking into the memory usage for a couple months, so I've got a few ideas.

view this post on Zulip Tyson Nottingham (Feb 12 2021 at 20:16):

Mostly outside the DepGraph though tbh

view this post on Zulip Tyson Nottingham (Feb 12 2021 at 21:09):

I'm somewhat actively working on codegen scheduling. There's one piece of low-hanging fruit there that I'm profiling a solution for right now (I'll link the PR when it's ready). But there are more gains to be had there, for sure.

I was planning to take on the general scheduling issue, but perhaps someone else should pick it up if there's demand to get it done more quickly. I'm having to scale back working on rustc, as it's become hard to justify it while I'm out of work. I'll create a GH issue / write-up for improving the scheduling more generally and link it here. That should be useful either way.

view this post on Zulip The 8472 (Feb 14 2021 at 14:28):

Find a way to swap in a custom allocator? (#[global_allocator] won’t work)

At the risk of asking the obvious, does LD_PRELOAD=/path/to/custommalloc.so not work?

view this post on Zulip Tyson Nottingham (Feb 15 2021 at 02:51):

Here's the PR for the low-hanging codegen scheduling memory reduction I mentioned -- #82127.

view this post on Zulip pnkfelix (Feb 15 2021 at 15:54):

The 8472 said:

Find a way to swap in a custom allocator? (#[global_allocator] won’t work)

At the risk of asking the obvious, does LD_PRELOAD=/path/to/custommalloc.so not work?

It's worth investigating. (The #[global_allocator] won't work comment was in reference to that not intercepting memory requests from LLVM, if I recall correctly. LD_PRELOAD might handle that context fine; I need to look into it.)
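A minimal sketch of that experiment, assuming a distro-provided libjemalloc.so.2 (the path varies); jemalloc's MALLOC_CONF=stats_print:true option dumps allocator statistics at exit, which would indicate whether allocations (including LLVM's) actually went through the preloaded allocator:

# Hedged sketch: preload a system jemalloc for a single rustc invocation.
# The library path is an assumption; adjust it for your distro.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 MALLOC_CONF=stats_print:true \
  rustc +nightly --edition 2018 hello.rs
# If jemalloc intercepted the allocations, a large stats dump is printed at exit.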

view this post on Zulip pnkfelix (Feb 15 2021 at 16:19):

Tyson Nottingham said:

Here's the PR for the low-hanging codegen scheduling memory reduction I mentioned -- #82127.

Silly question: what tools did you use to make those graphs that you posted in the PR? I would like for us all to coalesce around a standard workflow, and it's possible your workflow is a good candidate.

view this post on Zulip pnkfelix (Feb 15 2021 at 16:22):

(Oh, I see you wrote "Stats were gathered by polling system memory usage once per second via the free command on Linux. The machine was otherwise left alone during benchmarking. I'm sure there's a better way, but this did a passable job." I'd agree: there's probably a better way, but this is passable, especially since I think everyone will have free installed. Did you just hack together a perl script or something to turn the free outputs into data to feed into gnuplot?)

view this post on Zulip The 8472 (Feb 15 2021 at 18:03):

There's also src/ci/cpu-usage-over-time.py, but that doesn't seem to run on PR builds

view this post on Zulip lqd (Feb 15 2021 at 18:57):

there are also fun Rust system probes like https://github.com/kali/readings, which does its graphs using the plotters crate IIRC

view this post on Zulip Tyson Nottingham (Feb 15 2021 at 19:40):

pnkfelix said:

Did you just hack together a perl script or something to turn the free outputs into data to feed into gnuplot?)

Yeah, just used bash and imported into LibreOffice Calc.

time=0; while [ -f ../rust/timer.txt ]; do mem=$(free -m | awk '/Mem/ {print $3}'); echo "$time,$mem" >> memory_usage.csv; sleep 1; time=$((time + 1)); done
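For anyone who would rather feed that CSV into gnuplot than a spreadsheet, a minimal sketch (the output filename and labels are assumptions):

# Hedged sketch: turn memory_usage.csv (seconds, MiB used) into a PNG with gnuplot.
gnuplot -e "
  set datafile separator ',';
  set terminal png size 1000,500;
  set output 'memory_usage.png';
  set xlabel 'seconds'; set ylabel 'MiB used';
  plot 'memory_usage.csv' using 1:2 with lines title 'used memory'"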

view this post on Zulip Tyson Nottingham (Feb 16 2021 at 01:08):

Added a bunch of ideas related to CGU stuff.

view this post on Zulip Joshua Nelson (Mar 01 2021 at 20:07):

"Might not play well with rustup but should work with rustc?"

rustup is a red herring; it's jemalloc that breaks

view this post on Zulip pnkfelix (Mar 01 2021 at 20:10):

Anyone want to review the hackmd with me now and try to identify the ideas that are worthy of publicizing in a dedicated zulip topic in this stream?

view this post on Zulip cjgillot (Mar 01 2021 at 20:11):

Yep.

view this post on Zulip pnkfelix (Mar 01 2021 at 20:14):

let's see.

tools for measuring memory usage

view this post on Zulip pnkfelix (Mar 01 2021 at 20:15):

the section on -Z time-passes leads me to wonder: is there any way to make the output reflect the nesting of passes?
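For reference, a minimal sketch of gathering that output; -Z time-passes is unstable, so this assumes a nightly toolchain (or a dev build with unstable options enabled):

# Hedged sketch: capture -Z time-passes output for one compilation.
# Each pass currently prints a flat "time: ..." line; the exact format varies by rustc version.
rustc +nightly -Z time-passes --edition 2018 hello.rs 2>&1 | grep '^time'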

view this post on Zulip pnkfelix (Mar 01 2021 at 20:15):

not sure. Probably not quickly

view this post on Zulip pnkfelix (Mar 01 2021 at 20:17):

@Joshua Nelson do you have a recommended strategy for getting around the heaptrack breakage when linking jemalloc statically? E.g., do you turn off jemalloc support in rustc for such builds?

view this post on Zulip Joshua Nelson (Mar 01 2021 at 20:17):

@pnkfelix I've never gotten it working. I mainly profile rustdoc so it hasn't bothered me too much

view this post on Zulip Joshua Nelson (Mar 01 2021 at 20:17):

I would expect turning off jemalloc to fix it though
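A possible workflow sketch, assuming a local build where config.toml leaves jemalloc at its default (off) and heaptrack is installed; the paths, stage, and test crate are assumptions:

# Hedged sketch: profile a locally built (no-jemalloc) rustc under heaptrack.
./x.py build --stage 1 library/std
heaptrack ./build/x86_64-unknown-linux-gnu/stage1/bin/rustc --edition 2018 test_crate.rs
# Inspect the recorded profile (output filename pattern may differ by heaptrack version):
heaptrack_gui heaptrack.rustc.*.gz   # or: heaptrack --analyze heaptrack.rustc.*.gz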

view this post on Zulip pnkfelix (Mar 01 2021 at 20:18):

I’ll admit that I’m still confused about our default status with respect to jemalloc. The config.toml.example file leads me to think that we have it off by default, in which case I would not worry about this.

view this post on Zulip pnkfelix (Mar 01 2021 at 20:19):

/me goes to see if the jemalloc symbols are in their rustc binary
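One rough way to do that check; the symbol search is an assumption, and a stripped binary can give a false negative:

# Hedged sketch: look for jemalloc symbols in the rustup-installed rustc binary.
nm -a "$(rustc +nightly --print sysroot)/bin/rustc" 2>/dev/null | grep -ci jemalloc
# A nonzero count suggests jemalloc is linked in; 0 is inconclusive if the binary is stripped.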

view this post on Zulip Tyson Nottingham (Mar 01 2021 at 20:24):

It's off by default for local x.py builds, but I'm pretty sure it's enabled for rustc binaries distributed by rustup.

view this post on Zulip Joshua Nelson (Mar 01 2021 at 20:35):

looks like it's set on Mac and Linux at least:

$ rg jemalloc src/ci
src/ci/github-actions/ci.yml
435:              RUST_CONFIGURE_ARGS: --host=x86_64-apple-darwin --target=x86_64-apple-darwin,aarch64-apple-ios,x86_64-apple-ios --enable-full-tools --enable-sanitizers --enable-profiler --set rust.jemalloc --set llvm.ninja=false
446:              RUST_CONFIGURE_ARGS: --enable-extended --enable-profiler --set rust.jemalloc --set llvm.ninja=false
456:              RUST_CONFIGURE_ARGS: --build=x86_64-apple-darwin --enable-sanitizers --enable-profiler --set rust.jemalloc --set llvm.ninja=false

src/ci/docker/host-x86_64/dist-i686-linux/Dockerfile
97:      --set rust.jemalloc

src/ci/docker/host-x86_64/dist-x86_64-linux/Dockerfile
102:      --set rust.jemalloc

view this post on Zulip pnkfelix (Mar 01 2021 at 20:37):

Right, okay; so this is one of those instances where the CI differs from the default developer experience ...

view this post on Zulip pnkfelix (Mar 01 2021 at 21:50):

(one of these days I'll put a note about this in the config.toml.example)

view this post on Zulip Tyson Nottingham (Mar 02 2021 at 03:11):

I braindumped a bit about codegen scheduling in #82685. I've moved on to some other areas of interest, so I'm kind of throwing that over the wall. :)

view this post on Zulip Joshua Nelson (Mar 02 2021 at 03:12):

I pinged a friend who works on scheduling for his PhD thesis, I'll let you know if he's interested :)

view this post on Zulip pnkfelix (Mar 04 2021 at 00:25):

Is there any reason why we shouldn't put a default (but user-overridable) upper-bound on the CGU size?

view this post on Zulip Wesley Wiser (Mar 04 2021 at 00:26):

Nothing immediately comes to mind.

view this post on Zulip pnkfelix (Mar 04 2021 at 00:26):

I was just musing to myself that the process of combining them unconditionally until we hit the desired number relative to processor cores is obviously problematic for memory usage

view this post on Zulip pnkfelix (Mar 04 2021 at 00:26):

a point @Tyson Nottingham made in their write-up on #82685.

view this post on Zulip pnkfelix (Mar 04 2021 at 00:26):

but I figured a bound like the one I outlined should be trivial to implement, and if it's large enough, it need not have a noticeable impact on most crates...

view this post on Zulip Wesley Wiser (Mar 04 2021 at 00:27):

I was investigating the CGU merging algorithm today, and after testing out a few different things, I found no difference in the biggest CGUs in rustc_mir, because the largest CGUs don't appear to be the result of merging smaller ones together. They just start out huge.

view this post on Zulip Wesley Wiser (Mar 04 2021 at 00:32):

So breaking up large CGUs before we start the merging process seems worthwhile to me.

view this post on Zulip simulacrum (Mar 04 2021 at 00:53):

I'd be interested in some data on why they're big: are we including a bunch of generic code? Are they still large with -Zshare-generics (not sure if that precisely does what I want)?

Mostly I'm thinking that if the problem is, loosely, modules that are "too big", that's not great, and I'd like to tackle it in the compiler rather than via education; but that feels like a very different problem from us adding a bunch of code which potentially all gets inlined by LLVM, where splitting may result in worse performance.

My recollection is that our algorithm today for creating modules isn't based on the "usage" graph of function calls etc., but rather is module-based, and then we add a bunch atop that for the used generic code. Maybe that's wrong though?
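One way to gather that kind of data, sketched under the assumption that -Z print-mono-items output still looks roughly like "MONO_ITEM <item> @@ <cgu>[linkage] ...":

# Hedged sketch: count mono items per CGU for a crate, to see which CGUs start out huge.
# -Z print-mono-items is unstable and verbose; the field layout here is an assumption.
rustc +nightly -Z print-mono-items=lazy --crate-type lib --edition 2018 src/lib.rs 2>/dev/null \
  | awk '/^MONO_ITEM/ { m = 0; for (i = 1; i <= NF; i++) { if (m) print $i; if ($i == "@@") m = 1 } }' \
  | sed 's/\[.*//' | sort | uniq -c | sort -rn | head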

view this post on Zulip tm (Mar 04 2021 at 09:03):

I think that is generally accurate. Modules are the primary driving factor behind partitioning, so the generated CGUs tend to have, say, all HashMaps together, all Vecs together, all Zip iterators together. The call graph is significant for inline functions.

