Stream: t-compiler/help

Topic: debugging rustc OOM


Izzy Swart (Jul 16 2020 at 18:22, on Zulip):

I'm new to zulip, so apologies in advance if I violate any implicit rules of etiquette or am using this space for the wrong thing. I'm trying to debug an issue that causes exponential growth in rustc's memory usage for seemingly trivial code. I've used -Zself-profile etc to disambiguate the issue down to evaluate_obligation/codegen_fulfill_obligation but I'm not sure how to gather more information to determine exactly what bit of code is adversarial and exercising the issue etc. The RUST_LOG output for those queries comes in the form of UUIDs that I'm unfamiliar with and don't know how to convert to a human-readable form, and even so there's no timing or memory info in RUST_LOG so I don't know which evaluations are causing the problem. Could someone point me to the usual methodology for investigating things like this? I'm blocked on this issue across essentially all of my active research as it occurs in essentially any source that uses any but the most trivial generated code from my protocol crate (https://github.com/noocene/protocol) and I'd really appreciate some guidance so I can contribute a fix or at least better understand the problem. My familiarity with the trait solver doesn't go much beyond the cursory description in the compiler internals book.

Wesley Wiser (Jul 16 2020 at 18:26, on Zulip):

Hi @Izzy Swart! Have you tried adding the -Zself-profile-events flag to your trace? https://doc.rust-lang.org/unstable-book/compiler-flags/self-profile-events.html

Wesley Wiser (Jul 16 2020 at 18:27, on Zulip):

If you include -Zself-profile-events=default,args, it will make additional information available at the cost of making tracing more expensive.

Wesley Wiser (Jul 16 2020 at 18:27, on Zulip):

Just out of curiosity, how did you narrow the issue down to evaluate_obligation/codegen_fulfill_obligation?

Izzy Swart (Jul 16 2020 at 18:28, on Zulip):

Thanks, that looks useful. Will that include query arguments in a human-readable format?

Wesley Wiser (Jul 16 2020 at 18:28, on Zulip):

Yes

Wesley Wiser (Jul 16 2020 at 18:28, on Zulip):

Which tool(s) are you using to look at the self-profile data?

Izzy Swart (Jul 16 2020 at 18:29, on Zulip):

Wesley Wiser said:

Just out of curiosity, how did you narrow the issue down to evaluate_obligation/codegen_fulfill_obligation?

I used -Zself-profile and crox/summarize to identify the query that was consuming the most CPU time. It could be a flawed assumption that doing so is also correctly identifying which query is resulting in the extreme memory usage.

Wesley Wiser (Jul 16 2020 at 18:29, on Zulip):

Gotcha!

Wesley Wiser (Jul 16 2020 at 18:30, on Zulip):

I know @njn has used massif before when profiling rustc memory usage but I don't have any experience using that.

Wesley Wiser (Jul 16 2020 at 18:31, on Zulip):

When you add -Zself-profile-events=default,args, you'll stuff like this in crox image.png

Wesley Wiser (Jul 16 2020 at 18:31, on Zulip):

Note the arg0 at the bottom of the image

Izzy Swart (Jul 16 2020 at 18:32, on Zulip):

Wesley Wiser said:

Note the arg0 at the bottom of the image

Yep, thanks. Running a build with those arguments right now to see if it provides me with enough information to localize the issue and create an MCVE.

Wesley Wiser (Jul 16 2020 at 18:33, on Zulip):

Awesome!

Izzy Swart (Jul 16 2020 at 18:36, on Zulip):

Wesley Wiser said:

If you include -Zself-profile-events=default,args, it will make additional information available at the cost of making tracing more expensive.

I encounter this issue: https://github.com/rust-lang/rust/issues/73593 that I filed previously while working on something unrelated when trying to run a build with those flags

Wesley Wiser (Jul 16 2020 at 18:48, on Zulip):

Hmm... that's unfortunate. I don't think we anticipated rustc needing to record more than 1 billion strings during a profile session.

Wesley Wiser (Jul 16 2020 at 18:49, on Zulip):

Is your code open source by any chance?

Izzy Swart (Jul 16 2020 at 18:51, on Zulip):

Wesley Wiser said:

Is your code open source by any chance?

yeah, that issue has a codebase linked (with deps cloned in-tree for reproducibility) that's very similar to the one I was testing on

https://github.com/syntacticsugarglider/self-profile-ice-repro

Wesley Wiser (Jul 16 2020 at 18:52, on Zulip):

Oh duh. Sorry I missed that.

Wesley Wiser (Jul 16 2020 at 18:53, on Zulip):

I will try to take a look tonight and see if I can figure out a simple way to reduce how many strings we're recording to fix that issue.

Izzy Swart (Jul 16 2020 at 18:54, on Zulip):

Wesley Wiser said:

I will try to take a look tonight and see if I can figure out a simple way to reduce how many strings we're recording to fix that issue.

Thanks, I appreciate it.

njn (Jul 16 2020 at 21:58, on Zulip):

Yeah, Massif and DHAT, both Valgrind tools and so work well on Linux and might work on Mac. There's a bit of info about both tools at https://github.com/rust-lang/rustc-perf/blob/master/collector/README.md, which might be a useful starting point.

eddyb (Jul 17 2020 at 09:52, on Zulip):

@Wesley Wiser @simulacrum I wonder if we should use #t-compiler/performance for stuff like this, or have something for self-profile or hwat

Wesley Wiser (Jul 17 2020 at 10:51, on Zulip):

@eddyb There is a #t-compiler/wg-self-profile stream if we want to move this there.

eddyb (Jul 17 2020 at 11:33, on Zulip):

oh right we have both, hmpf. no idea

Izzy Swart (Jul 17 2020 at 16:47, on Zulip):

This feels related to https://github.com/rust-lang/rust/issues/38528

I got Massif working on the linked example code and while I still don't know what arguments are being passed there are absolutely huge allocations (for a constructed example that uses a peak of like 4GB a substantial 1.1GB part of that is a single allocation) in trait selection here:
https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6c4d44f3286ccc28d149/src/librustc_trait_selection/traits/select/mod.rs#L2053

maybe deduplication is failing? would be nice to know what's actually being processed here. any way I can get that def_id and associate it with an actual bit of source without -Zself-profile-events working properly?

this all appears to be part of associated type normalization fwiw, here is an image of the profile for the heaviest snapshot and here is the massif output file for that specific run

Wesley Wiser (Jul 17 2020 at 17:13, on Zulip):

I had hoped to find some obvious duplicated strings but the repro I did last night seemed to have similar but distinct strings. Most of the event args look like stuff from the trait engine so that profile seems accurate to me.

Wesley Wiser (Jul 17 2020 at 17:14, on Zulip):

-Zself-profile does dump the data to disk as it works so all the data prior to the ICE is available in those files but I'm not sure how well some of the tooling will handle the files missing the ends.

Wesley Wiser (Jul 17 2020 at 17:15, on Zulip):

For example, if you vim ./profiles/protocol-*.string_data and then go to the end of the file, you can see the last event args right before the ICE happened.

Wesley Wiser (Jul 17 2020 at 17:17, on Zulip):

I'm going to look into either making StringId either 64 bit instead of 32-bit or perhaps see how terrible it would be to support failable allocation from that api. That would at least allow the trace to complete at the expense of having some of the event arg data.

Wesley Wiser (Jul 17 2020 at 17:17, on Zulip):

But I doubt I'll have anything that can upstreamed for a least a few days.

Izzy Swart (Jul 17 2020 at 17:32, on Zulip):

Wesley Wiser said:

For example, if you vim ./profiles/protocol-*.string_data and then go to the end of the file, you can see the last event args right before the ICE happened.

unfortunately that crash is happening while building protocol and not the actual dependent crate so that information is essentially useless.

Izzy Swart (Jul 17 2020 at 17:36, on Zulip):

is there some way I can build all the dependencies (incl. protocol) without profiling and then only build the final compilation unit with self-profile enabled? it seems to require a clean build

Wesley Wiser (Jul 17 2020 at 17:47, on Zulip):

Oh I see

Wesley Wiser (Jul 17 2020 at 17:47, on Zulip):

Hmm

Wesley Wiser (Jul 17 2020 at 17:47, on Zulip):

Well, I don't think there's an out the box way to do that right now. But landing a flag that lets you control which crate the profiler runs on would probably be pretty easy.

simulacrum (Jul 17 2020 at 17:50, on Zulip):

you can likely sort of emulate it with cargo rustc -- -Zself-profile instead of setting it through rustflags?

Wesley Wiser (Jul 17 2020 at 17:50, on Zulip):

Oh, brilliant! Yeah that should do exactly what you want @Izzy Swart

Izzy Swart (Jul 17 2020 at 18:18, on Zulip):

Wesley Wiser said:

Oh, brilliant! Yeah that should do exactly what you want Izzy Swart

I got it working but this doesn't seem very granular? most of the time is spent as self time in a few calls that look similar to this where, as far as I understand it, it's just evaluating whether things implement Unravel which is kinda like... the root trait predicate. I assumed it was struggling with a huge number of predicates generated based on that constraint or something but anything further down the call tree is practically a rounding error in terms of CPU time.

Canonical { max_universe: U0, variables: [CanonicalVarInfo { kind: Region(U0) }], value: ParamEnvAnd { param_env: ParamEnv { caller_bounds: [], reveal: UserFacing, def_id: None }, value: Binder(TraitPredicate(<Item as __DERIVE_PROTOCOL_Id::__protocol::Unravel<protocol_mve_transport::Transport<std::boxed::Box<dyn CloneSpawn>, std::convert::Infallible, std::convert::Infallible, Item>>>)) } }

Izzy Swart (Jul 17 2020 at 18:19, on Zulip):

looks like this

Wesley Wiser (Jul 17 2020 at 18:24, on Zulip):

If I'm reading the Massif screenshot correctly, it looks like there's a ~1gb vector being resized which is probably what's dominating the time of the evaluate_obligation query.

Izzy Swart (Jul 17 2020 at 18:25, on Zulip):

that would make sense I guess. let me try using a more granular like call-level profiler from rustc-perf to clarify that.

Izzy Swart (Jul 17 2020 at 19:14, on Zulip):

Wesley Wiser said:

If I'm reading the Massif screenshot correctly, it looks like there's a ~1gb vector being resized which is probably what's dominating the time of the evaluate_obligation query.

https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6c4d44f3286ccc28d149/src/librustc_trait_selection/traits/select/mod.rs#L2090

this right here is super costly in terms of CPU time, that retain call takes up over 68% of the time elapsed in a given build of the test crate I linked earlier

Interestingly, it doesn't seem like most of that time is spent allocating. it appears that instead it's spent evaluating the Hash of ObligationCauseCode for those insert calls. This makes sense (given the amount of memory that's being used to store that huge obligations Vec) I would presume the Hash implementation isn't all that slow, it's just a ridiculously hot path here because that structure is so large.

This really just confirms prior concerns though. The question isn't whether that obligation list is big, it's why it's so big. I don't know whether it's a deduplication failure or a huge growth before deduplication ever happens because the perf output just uses sampling so I have no idea how many times a given function was called or how long it took on a given call etc.

Wesley Wiser (Jul 17 2020 at 19:30, on Zulip):

Unfortunately, this is not a part of the compiler I'm very familiar with so I'm not very sure where to go from here.

Wesley Wiser (Jul 17 2020 at 19:30, on Zulip):

@eddyb Do you have any ideas?

Wesley Wiser (Jul 17 2020 at 19:47, on Zulip):

@Izzy Swart FWIW I think that at this point, you've done way more than necessary to open a bug report. That might be the fastest way to get some more eyes on this issue.

Wesley Wiser (Jul 17 2020 at 19:48, on Zulip):

There's a few folks on the issue tracker that seem to enjoy finding minimal repros for bugs so they might be able to narrow this down further.

Izzy Swart (Jul 17 2020 at 19:51, on Zulip):

Wesley Wiser said:

There's a few folks on the issue tracker that seem to enjoy finding minimal repros for bugs so they might be able to narrow this down further.

I didn't want to file something given the complete lack of MCVE but I'll do so anyhow if you think that's appropriate. Thanks.

Wesley Wiser (Jul 17 2020 at 19:51, on Zulip):

We've had far worse lol :slight_smile:

Wesley Wiser (Jul 17 2020 at 19:52, on Zulip):

I think as long as you've shown some effort to finding that, people are happy to help out.

Wesley Wiser (Jul 17 2020 at 19:52, on Zulip):

And you've certainly done that

Izzy Swart (Jul 17 2020 at 20:05, on Zulip):

https://github.com/rust-lang/rust/issues/74456

alright I threw together an issue and filed it

diverges from the usual format a bit and if there's anything I should tweak to make it more readable/informative/etc. let me know

Wesley Wiser (Jul 17 2020 at 20:11, on Zulip):

I just added a few tags but that looks great to me!

Wesley Wiser (Jul 17 2020 at 20:13, on Zulip):

Thank you

Izzy Swart (Jul 17 2020 at 20:27, on Zulip):

Thanks so much for all your help!

Wesley Wiser (Jul 17 2020 at 20:30, on Zulip):

You're very welcome!

Izzy Swart (Jul 25 2020 at 20:22, on Zulip):

@Wesley Wiser sorry to bug you again but there hasn't been any movement on the issue since you pinged the cleanup team and I haven't received any response to a thread I made in #wg-traits. I'd like to find someone familiar with the relevant components of the trait solver who might be able to give me some advice on how I can personally continue investigating this. Is there anyone who might be willing and able to help who I should consider getting in touch with?

Wesley Wiser (Jul 26 2020 at 15:38, on Zulip):

@Izzy Swart Let me message a few people and see if I can find the right person to help with this.

Izzy Swart (Jul 31 2020 at 16:44, on Zulip):

Wesley Wiser said:

Izzy Swart Let me message a few people and see if I can find the right person to help with this.

Hey, sorry to bother you yet again but have you found anyone who might be of assistance? I'd like to make some progress on this issue before I start uni in a month or so, at that point I'll be more focused on my research and I'd prefer not to be blocked on a compiler bug. I hate to be impatient but I just wish I were able to resolve this independently. Unfortunately, I don't have the necessary expertise and I'm unable to find any information through written sources.

Last update: Sep 28 2020 at 16:15UTC