Stream: t-compiler/shrinkmem-sprint

Topic: cgu sizes


Wesley Wiser (Mar 04 2021 at 16:48):

So I spent yesterday looking at cgu sizes for rustc_mir on the theory that if we can get a more even partitioning, we can probably improve peak memory usage, and hopefully that would be true of other large crates as well. What's interesting to me is that we have a huge disparity between the largest CGUs and the smallest CGUs after merging.

To give an idea:

What's even more interesting is that the largest CGU is not the result of the merging algorithm adding some large CGUs together; it's actually that way before the merging step even runs! So the question in my mind was then "What's in that cgu?" and the answer appears to be hashbrown monomorphizations for the various HashMaps we use in rustc_mir. (gist with details)

simulacrum (Mar 04 2021 at 16:53):

@Wesley Wiser am I right to assume that in theory all of those functions are independent - since they are all generic with different arguments - and so there's pretty much no chance they'd get inlined into each other? the only win might be llvm merging them (i.e. seeing they're identical) which seems not high priority to me?

Wesley Wiser (Mar 04 2021 at 16:57):

As for what to do with this, I see two potential avenues worth exploring:

  1. Implement CGU splitting. When we have a "large" CGU (for some definition of large) before merging, we should break it up into smaller CGUs so the merge step can do a better job of generating more evenly sized CGUs. One complication here is that after splitting a CGU, we need to make sure it still has all of the necessary inline functions inside it (basically we need to re-run the inlining CGU partitioning step).

  2. Change the initial partitioning so we don't put all this code in the same CGU to begin with. Perhaps for things like Vec<T> instead of putting every Vec<_> instantiation together, we can split them across multiple CGUs in some way? For example we could do something similar to what we do for local items and group all the Vec<T> instances where T is in the same module together.
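A toy sketch of what option 2 could look like. The types and keys here are made up for illustration (`MonoItem` and the string-based CGU keys are not rustc's real data structures): keying each instantiation by the module of its type argument spreads e.g. the `Vec<_>` instantiations over several CGUs instead of piling them into one.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a monomorphized item; not rustc's real type.
struct MonoItem {
    generic: &'static str,    // e.g. "Vec"
    arg_module: &'static str, // module where the type argument is defined
    size: usize,              // estimated size in "estimation units"
}

/// Current scheme (roughly): all instantiations of a generic share one CGU key.
fn key_by_generic(item: &MonoItem) -> String {
    item.generic.to_string()
}

/// Proposed scheme: also group by the type argument's module, so
/// `Vec<T>` instantiations with `T` from different modules land in
/// different CGUs.
fn key_by_generic_and_module(item: &MonoItem) -> String {
    format!("{}-{}", item.generic, item.arg_module)
}

/// Sum up estimated CGU sizes under a given keying scheme.
fn partition(items: &[MonoItem], key: impl Fn(&MonoItem) -> String) -> HashMap<String, usize> {
    let mut cgus: HashMap<String, usize> = HashMap::new();
    for item in items {
        *cgus.entry(key(item)).or_insert(0) += item.size;
    }
    cgus
}

fn main() {
    let items = [
        MonoItem { generic: "Vec", arg_module: "rustc_middle::ty", size: 100 },
        MonoItem { generic: "Vec", arg_module: "rustc_middle::mir", size: 80 },
        MonoItem { generic: "Vec", arg_module: "rustc_mir::dataflow", size: 60 },
    ];
    // One big CGU under the current scheme (sizes sum to 240)...
    println!("{:?}", partition(&items, key_by_generic));
    // ...versus three smaller CGUs under the module-based scheme.
    println!("{:?}", partition(&items, key_by_generic_and_module));
}
```

The trade-off discussed below still applies: finer keys mean a change to the generic's definition can invalidate more CGUs in incremental builds.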

Wesley Wiser (Mar 04 2021 at 16:59):

@simulacrum No, I'm pretty sure a lot of these are calling each other and going to be inlined. Just as an example, Vec<Foo>::get(usize) calls Vec<Foo>::len() and we want that to be inlined.

simulacrum (Mar 04 2021 at 16:59):

right, but Vec<bar> and Vec<foo> are pretty much independent, since we don't have polymorphization

Wesley Wiser (Mar 04 2021 at 16:59):

Yes, that's true.

simulacrum (Mar 04 2021 at 17:00):

i.e., point 2 in what you said

Wesley Wiser (Mar 04 2021 at 17:00):

Which is why I'm thinking option 2 might be the first thing to try here.

simulacrum (Mar 04 2021 at 17:00):

it's definitely my instinct

simulacrum (Mar 04 2021 at 17:00):

I'm a bit surprised we group them

simulacrum (Mar 04 2021 at 17:01):

seems odd to group by the self type w/o generics? or maybe we're grouping by the module of definition

simulacrum (Mar 04 2021 at 17:01):

I think it'll be annoying to get inlining still nice, but it might not be too bad

Wesley Wiser (Mar 04 2021 at 17:03):

I'm pretty sure it's a result of using characteristic_def_id_of_mono_item to decide what CGU to initially put the item in.

Tyson Nottingham (Mar 04 2021 at 21:15):

Based on the big comment at the top of rustc_mir/src/monomorphize/partitioning/mod.rs, it looks like reducing CGU invalidation in incremental builds was a primary motivator for the current partitioning scheme.

IIUC, grouping instantiations of a generic together has the advantage that only one CGU will be invalidated if the generic definition changes.

So, when having all instantiations of a generic in one CGU isn't actually a problem, splitting them into different CGUs might be all downside.

Tyson Nottingham (Mar 04 2021 at 21:16):

Though I don't have a full understanding of all that goes into partitioning, so I could be off

Wesley Wiser (Mar 04 2021 at 21:22):

Tyson Nottingham said:

IIUC, grouping instantiations of a generic together has the advantage that only one CGU will be invalidated if the generic definition changes.

So, when having all instantiations of a generic in one CGU isn't actually a problem, splitting them into different CGUs might be all downside.

Yeah, that's a great point! I think the tweak I want to pursue is doing something differently when the type is not local to the current crate.

simulacrum (Mar 04 2021 at 21:37):

If the smaller cgus are faster to process and less long-tail, then invalidating more cgus seems fine to me. Presuming that the new scheme doesn't result in too much overhead (functions we "add" to more than one cgu for inlining), it may still be worthwhile. And in non-incremental it's just a win.

Tyson Nottingham (Mar 04 2021 at 21:40):

When you modify a dependency crate, are all of the work products for downstream crates thrown out (or invalidated)?

Wesley Wiser (Mar 04 2021 at 21:45):

I'm ... not entirely sure.

Tyson Nottingham (Mar 04 2021 at 21:48):

If they are, then it seems like there might be little downside to separating the generic instantiations into separate CGUs for generics from upstream crates. Since any modification to the upstream crate would invalidate all the downstream CGUs anyway... if that's how it works. :)

Tyson Nottingham (Mar 04 2021 at 21:49):

simulacrum said:

If the smaller cgus are faster to process and less long-tail, then invalidating more cgus seems fine to me. Presuming that the new scheme doesn't result in too much overhead (functions we "add" to more than one cgu for inlining), it may still be worthwhile. And in non-incremental it's just a win.

True, it might not be a problem in practice.

simulacrum (Mar 04 2021 at 21:53):

Tyson Nottingham said:

When you modify a dependency crate, are all of the work products for downstream crates thrown out (or invalidated)?

I don't know, but I'd guess the answer is no. It may depend on the modification, though.

Tyson Nottingham (Mar 04 2021 at 22:00):

I can also imagine this creating a bunch of smaller CGUs that end up having to get merged into larger CGUs, which might be a bad thing. A modification to the generic would invalidate every CGU that the instantiations got merged into, which might be a lot if there were many instantiations. But this is all speculation...

Tyson Nottingham (Mar 04 2021 at 22:03):

Anyway, not trying to naysay. Just thinking we could factor some of these considerations into the design. :)

simulacrum (Mar 04 2021 at 22:10):

Yeah, I'd love to get a collection of traces or something like that so we could write quick simulators to evaluate the cache invalidation and expected runtime of various strategies. Not sure how to do that though.

Tyson Nottingham (Mar 05 2021 at 02:35):

Just to corroborate Wesley's findings, rustc_middle's and rustc_query_impl's largest CGUs are also just instantiations from hashbrown::raw.

Estimated sizes and contents of some significant CGUs in rustc_query_impl (built with debug assertions) at each stage of partitioning:

Initial partitioning:

584685: hashbrown::raw generics
412743: rustc_query_system::query generics
324035: Vec generics
214190: RawVec generics
61992:  rustc_query_system::query::caches generics

Post merging:

Same.

Post inlining:

744205: hashbrown::raw generics + inlines
684480: rustc_query_system::query generics + inlines
529637: Vec generics + inlines
314849: RawVec generics + inlines
513496: rustc_query_system::query::caches + inlines

Interesting how much rustc_query_system::query::caches ballooned from inlining. While it didn't matter here, it shows how it could be important to consider the effects of inlining before merging. If a couple CGUs like that had been merged, it might have created a monster CGU after inlining.

FWIW, a large number of the inlined items in that CGU were related to iterators.

simulacrum (Mar 05 2021 at 02:57):

One other question I've had is the extent to which these sizes are "real", i.e. do we have an estimate for how accurate our size counts are? I think they're based on MIR statements; it'd be interesting to compare to pre- and post-opt LLVM IR for that cgu
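For context on "size counts": the estimates are roughly statement counts over MIR bodies. A toy model of that kind of estimator follows; the types and the one-unit-per-terminator weighting are simplified stand-ins, not rustc's actual code. It also illustrates the concern above: the count says nothing about how much LLVM IR each statement expands into.

```rust
// Hypothetical, simplified model of a MIR body for illustration only.
struct BasicBlockData {
    statement_count: usize, // number of statements in this block
}

struct Body {
    basic_blocks: Vec<BasicBlockData>,
}

/// Estimate a function's size as its statements plus one unit per block
/// for the terminator. Two statements that lower to wildly different
/// amounts of LLVM IR still count the same here, which is why the
/// estimates can diverge from real codegen cost.
fn estimate_size(body: &Body) -> usize {
    body.basic_blocks
        .iter()
        .map(|bb| bb.statement_count + 1)
        .sum()
}

fn main() {
    let body = Body {
        basic_blocks: vec![
            BasicBlockData { statement_count: 3 },
            BasicBlockData { statement_count: 5 },
        ],
    };
    println!("estimated size: {}", estimate_size(&body)); // (3+1) + (5+1) = 10
}
```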

Tyson Nottingham (Mar 05 2021 at 03:00):

Relevant -- #69382

Tyson Nottingham (Mar 05 2021 at 08:03):

Seeing if I can reduce the explosion of that rustc_query_system::query::caches CGU. Looks like it might be overly generic.

Wesley Wiser (Mar 05 2021 at 21:23):

Well, I tried the "change the initial partitioning" idea and it didn't make much of a difference. Most of the hashmap instantiations in rustc_mir are for types defined in rustc_middle.

Wesley Wiser (Mar 05 2021 at 21:24):

However, I do have a patch set that implements the cgu splitting idea. It does a much better job of creating evenly sized cgus and I think we're seeing lower peak memory usage during bootstrap.

Wesley Wiser (Mar 05 2021 at 21:24):

I need to profile more.

tm (Mar 05 2021 at 21:26):

What was the new idea behind the initial partitioning?

Wesley Wiser (Mar 05 2021 at 21:27):

Wesley Wiser said:

As for what to do with this, I see two potential avenues worth exploring:

  1. Implement CGU splitting. When we have a "large" CGU (for some definition of large) before merging, we should break it up into smaller CGUs so the merge step can do a better job of generating more evenly sized CGUs. One complication here is that after splitting a CGU, we need to make sure it still has all of the necessary inline functions inside it (basically we need to re-run the inlining CGU partitioning step).

  2. Change the initial partitioning so we don't put all this code in the same CGU to begin with. Perhaps for things like Vec<T> instead of putting every Vec<_> instantiation together, we can split them across multiple CGUs in some way? For example we could do something similar to what we do for local items and group all the Vec<T> instances where T is in the same module together.

Wesley Wiser (Mar 05 2021 at 21:27):

  2. in that list

Wesley Wiser (Mar 05 2021 at 21:27):

and then

Wesley Wiser (Mar 05 2021 at 21:27):

Wesley Wiser said:

Tyson Nottingham said:

IIUC, grouping instantiations of a generic together has the advantage that only one CGU will be invalidated if the generic definition changes.

So, when having all instantiations of a generic in one CGU isn't actually a problem, splitting them into different CGUs might be all downside.

Yeah, that's a great point! I think the tweak I want to pursue is doing something differently when the type is not local to the current crate.

cjgillot (Mar 05 2021 at 21:29):

Tyson Nottingham said:

Seeing if I can reduce the explosion of that rustc_query_system::query::caches CGU. Looks like it might be overly generic.

Indeed, QueryCache::lookup is instantiated 3 times for each query. That's... a lot.

cjgillot (Mar 05 2021 at 21:34):

Wesley Wiser said:

Wesley Wiser said:

Tyson Nottingham said:

IIUC, grouping instantiations of a generic together has the advantage that only one CGU will be invalidated if the generic definition changes.

So, when having all instantiations of a generic in one CGU isn't actually a problem, splitting them into different CGUs might be all downside.

Yeah, that's a great point! I think the tweak I want to pursue is doing something differently when the type is not local to the current crate.

Can we consider that a function from non-local types is either inlined at the MIR level or optimized through LTO? This would avoid having to include it in another function's CGU.

cjgillot (Mar 05 2021 at 21:46):

Actually, all instantiations of the same function with non-local types should be pretty much independent. Could we be dumber and put each of them in a different CGU?

Wesley Wiser (Mar 09 2021 at 22:41):

So I have data about the split large CGUs idea. The idea is basically that after the initial partitioning, we check if any CGUs are large and if they are, we split them in half and then repeat the check again on each part recursively. A large CGU is defined as a CGU that is 10x larger than the median CGU's size and is at least 50,000 estimation units in size.

The net effect on bootstrap is that it doesn't change memory usage much at all, and it regresses compilation time by about 2%. (Note: an initial version did not have the "must be at least 50,000 estimation units in size" restriction and the regression was much more significant there.) The issue is that without more intelligent CGU splitting and merging algorithms, we end up generating many more copies of inlined functions than we did previously, which causes the regressions (the extra copies increased the total estimated size by ~30%).

While that's unfortunate, I think the good news here is that if we can be more intelligent to avoid having to duplicate the inlined functions so much, we could probably see some significant wins here.

The data I collected:

Stat                                baseline       with splitting
Largest CGU                         253,976        37,598
Smallest CGU                        1,209          8,706
Delta between largest & smallest    252,767        28,892
CGUs smaller than average size      202            148   (out of 256 CGUs total)
Stddev                              26,705.86214   6,312.718697
Total size of all CGUs              3,502,838.00   4,333,445.00
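The splitting heuristic described above can be sketched as follows, modeling a CGU as a list of item size estimates. This is an illustrative reconstruction from the description, not the actual patch: it ignores the re-inlining step a real implementation needs, and it splits by item count rather than by size.

```rust
// Thresholds from the description above: a "large" CGU is more than 10x the
// median CGU size and at least 50,000 estimation units.
const MIN_SPLIT_SIZE: usize = 50_000;
const MEDIAN_FACTOR: usize = 10;

fn cgu_size(cgu: &[usize]) -> usize {
    cgu.iter().sum()
}

fn is_large(cgu: &[usize], median: usize) -> bool {
    let size = cgu_size(cgu);
    size >= MIN_SPLIT_SIZE && size > MEDIAN_FACTOR * median
}

/// Split every large CGU in half and re-check each half recursively
/// (via a worklist), leaving small CGUs untouched.
fn split_large_cgus(cgus: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    // Median (upper median for even counts) of the initial partitioning.
    let mut sizes: Vec<usize> = cgus.iter().map(|c| cgu_size(c)).collect();
    sizes.sort_unstable();
    let median = sizes[sizes.len() / 2];

    let mut out = Vec::new();
    let mut work = cgus;
    while let Some(cgu) = work.pop() {
        if cgu.len() > 1 && is_large(&cgu, median) {
            // Split in half and re-check both parts.
            let mid = cgu.len() / 2;
            let (a, b) = cgu.split_at(mid);
            work.push(a.to_vec());
            work.push(b.to_vec());
        } else {
            out.push(cgu);
        }
    }
    out
}

fn main() {
    // One oversized CGU (120,000 units) among three small ones.
    let cgus = vec![vec![60_000, 60_000], vec![5_000], vec![4_000], vec![3_000]];
    for cgu in split_large_cgus(cgus) {
        println!("cgu size: {}", cgu_size(&cgu));
    }
}
```

The data above matches what you would expect from this shape of heuristic: a much smaller largest CGU and lower stddev, at the cost of total size growth once duplicated inline functions are counted.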

lqd (Mar 10 2021 at 12:09):

is the 2% longer build time accounting for the time taken to do this partitioning, or purely downstream from that (I assume it's included, and that it's the overall bootstrap time)?

I should know this, so I'm sorry in advance if this is wrong/stupid: I assume the sizes are still the "instance def size estimate"s, but do we have an idea whether in this test there were many MIR statements that are not really used in codegen (say, fake reads) that heavily skew the numbers w.r.t. the "total estimated size" increasing by 30% (maybe we filter those out before partitioning, so ignore this if that's the case)? That is, was the 30% cgu size increase also seen in the codegen?

"if we can be more intelligent": do you think we could have a way to test this theory, à la "we analyze which are the costly duplicated inlined functions, manually figure out what a good partitioning would look like in this case, hardcode that ideal result for this test and measure again", to see if there are indeed the probable significant wins we hope for with this technique? (maybe we can then work backwards from those results to infer the tweaks that strategy needs, and of course see how well it generalizes to many different tests)

Wesley Wiser (Mar 16 2021 at 20:31):

is the 2% longer build time accounting for the time taken to do this partitioning, or purely downstream from that (I assume it's included, and that it's the overall bootstrap time)?

The 2% time regression is in terms of the overall bootstrap process.

I assume the sizes are still the "instance def size estimate"s

Yeah, it's still that same code.

do we have an idea whether in this test there were many MIR statements that are not really used in codegen (say, fake reads) that heavily skew the numbers w.r.t. the "total estimated size" increasing by 30% (maybe we filter those out before partitioning, so ignore this if that's the case)? That is, was the 30% cgu size increase also seen in the codegen?

I'm not sure. I thought we had a pass that removed those statements at the beginning of the MIR opt pipeline. In terms of your overall question, yeah, it's very possible the actual work is not 30% larger; it's just our bad estimates that are making it look that way.

do you think we could have a way to test this theory, à la "we analyze which are the costly duplicated inlined functions, manually figure out what a good partitioning would look like in this case, hardcode that ideal result for this test and measure again", to see if there are indeed the probable significant wins we hope for with this technique?

That's an interesting approach! I think we could do that if we use a much smaller crate. There are ~70,000 instances that get codegen'd from rustc_mir, so doing any kind of manual analysis on it felt intractable to me :smile:


Last updated: Oct 21 2021 at 21:32 UTC