Stream: wg-async-foundations

Topic: perf callers


tmandry (Oct 08 2019 at 00:14, on Zulip):

I'm profiling a really long compile with perf record -F99 --call-graph dwarf, like it says in the guide

tmandry (Oct 08 2019 at 00:15, on Zulip):

the top entry from perf report is <smallvec::SmallVec<A> as core::iter::traits::collect::FromIterator<<A as smallvec::Array>::Item>>::from_iter

tmandry (Oct 08 2019 at 00:15, on Zulip):

it seems we spend a ton of time building SmallVecs, but I can't figure out why

tmandry (Oct 08 2019 at 00:15, on Zulip):

there are no callers listed in the call graph for some reason

tmandry (Oct 08 2019 at 00:17, on Zulip):

I tried using perf focus --tree-callers (cc @nikomatsakis) and it showed the same thing

tmandry (Oct 08 2019 at 00:17, on Zulip):

also, main isn't listed anywhere, and almost no query-related frames seem to be showing up, even though -Z self-profile says we spend almost all our time inside one kind of query

tmandry (Oct 08 2019 at 00:19, on Zulip):

(this is in order to debug #65147 since so far I haven't been able to minimize an example.. it seems the compile time correlates with the complexity of the underlying async/await call graph)

simulacrum (Oct 08 2019 at 00:36, on Zulip):

@tmandry hm, so there's no parent information (i.e., callchain) in the perf output?

tmandry (Oct 08 2019 at 00:38, on Zulip):

I can see what that function is calling, but not what's calling it

simulacrum (Oct 08 2019 at 00:38, on Zulip):

hm, that's odd -- one thing to try might be --call-graph dwarf,16000 or some higher number to increase the stack depth

tmandry (Oct 08 2019 at 00:39, on Zulip):

yes, I tried that, all the way up to the max, no change :(

tmandry (Oct 08 2019 at 00:39, on Zulip):

it's possible I'm misreading something ofc

tmandry (Oct 08 2019 at 00:39, on Zulip):
# Children      Self  Command  Shared Object                               Symbol
#
    43.60%    43.60%  rustc    librustc_driver-fd1d337d61819709.so         [.] <smallvec::SmallVec<A> as core::iter::traits::collect::FromIterator<<A as smallvec::Array>::Item>>::from_iter
            |
            ---<smallvec::SmallVec<A> as core::iter::traits::collect::FromIterator<<A as smallvec::Array>::Item>>::from_iter
               |
                --26.36%--<smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (inlined)
                          |
                          |--3.49%--smallvec::SmallVec<A>::triple_mut (inlined)
                          |          |
                          |           --0.92%--smallvec::SmallVec<A>::spilled (inlined)
                          |
                          |--2.86%--<core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::size_hint (inlined)
                          |          <core::slice::Iter<T> as core::iter::traits::iterator::Iterator>::size_hint (inlined)
                          |
                          |--2.18%--<core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next (inlined)
                          |          |
                          |          |--1.60%--<core::slice::Iter<T> as core::iter::traits::iterator::Iterator>::next (inlined)
                          |          |
                          |           --0.57%--core::option::Option<T>::map (inlined)
                          |
                          |--1.95%--smallvec::SmallVec<A>::reserve (inlined)
                          |
                           --0.75%--core::ptr::write (inlined)
tmandry (Oct 08 2019 at 00:39, on Zulip):

that's the very top entry in perf report

simulacrum (Oct 08 2019 at 00:40, on Zulip):

yeah, so we're clearly spending a lot of time in collecting SmallVecs from iterators

simulacrum (Oct 08 2019 at 00:41, on Zulip):

I would try maybe looking at flamegraph output via e.g. inferno https://github.com/jonhoo/inferno if you don't have a pre-existing setup with the perl scripts, and as a starter maybe just perf script output

tmandry (Oct 08 2019 at 00:41, on Zulip):

I would kind of expect the top entry to be something like main, though

simulacrum (Oct 08 2019 at 00:42, on Zulip):

well, perf can't get the whole callchain usually

simulacrum (Oct 08 2019 at 00:43, on Zulip):

at some point dwarf is not good enough to go up through the stack frames afaik, though I've never tried to find out why that is

simulacrum (Oct 08 2019 at 00:44, on Zulip):

is this output from a local compiler?

tmandry (Oct 08 2019 at 00:44, on Zulip):

yes

tmandry (Oct 08 2019 at 00:50, on Zulip):

here's a flamegraph... flamegraph.svg

simulacrum (Oct 08 2019 at 00:51, on Zulip):

that looks.. very odd

simulacrum (Oct 08 2019 at 00:51, on Zulip):

maybe try removing -F99 from collection?

simulacrum (Oct 08 2019 at 00:51, on Zulip):

it might be that you happen to only be hitting inside the smallvec code I guess...

simulacrum (Oct 08 2019 at 00:52, on Zulip):

(the default is way higher, I believe)

tmandry (Oct 08 2019 at 00:52, on Zulip):

there's substantial time spent in TypeFoldable::fold_with, which makes sense because all time is being spent in check_mod_item_types according to -Z self-profile

tmandry (Oct 08 2019 at 00:52, on Zulip):

I've done at least several profiles at this point, but I'll try

simulacrum (Oct 08 2019 at 00:55, on Zulip):

It might be the InternalSubsts::identity_for_item call because that's AFAICT recollecting all the generic parameters each time it's called

simulacrum (Oct 08 2019 at 00:55, on Zulip):

not sure why _two_ awaits would cause a problem whereas one doesn't, but perhaps it has something to do with the temporaries we're forcing to live longer?

tmandry (Oct 08 2019 at 00:56, on Zulip):

it's gotten more complicated since I wrote that description

tmandry (Oct 08 2019 at 00:57, on Zulip):

as it turns out, both of those awaits are in a single "critical chain" of awaits that causes the slowdown

tmandry (Oct 08 2019 at 00:57, on Zulip):

if you take one of them out, you break the chain

tmandry (Oct 08 2019 at 00:58, on Zulip):

the chain leads to one function in particular that awaits many other async fns in a match

tmandry (Oct 08 2019 at 00:59, on Zulip):

removing awaits from that function speeds up compiles, and the more you remove the faster the compile is

simulacrum (Oct 08 2019 at 00:59, on Zulip):

yeah, so my experience with await has been that you create a ton of generics (lifetimes, mostly, but that doesn't matter here)

simulacrum (Oct 08 2019 at 00:59, on Zulip):

since ~every temporary gets uplifted to be a lifetime param

simulacrum (Oct 08 2019 at 01:01, on Zulip):

and I believe this code is creating a fresh SmallVec from the list whenever it encounters an opaque ty (e.g. impl trait) which is again common in async land

simulacrum (Oct 08 2019 at 01:01, on Zulip):

though how you'd be able to confirm this theory I'm not sure

simulacrum (Oct 08 2019 at 01:02, on Zulip):

I guess you can try with a compiler that does not have these three lines by just commenting them out

tmandry (Oct 08 2019 at 01:03, on Zulip):

okay, I can try that

simulacrum (Oct 08 2019 at 01:04, on Zulip):

one unfortunate bit is that it might be that this is just "bad defaults"

simulacrum (Oct 08 2019 at 01:05, on Zulip):

previously, before async/await, we sort of always expected lists of generics etc in the compiler to be short, so all queries assume 8 elements as the default length before switching from a SmallVec to a Vec

simulacrum (Oct 08 2019 at 01:05, on Zulip):

if we're going over that frequently that can be a performance cliff I imagine

tmandry (Oct 08 2019 at 01:08, on Zulip):

yeah, but this slowdown didn't happen in a compiler from a month ago

simulacrum (Oct 08 2019 at 01:09, on Zulip):

true, yeah

tmandry (Oct 08 2019 at 01:09, on Zulip):

I think something went from O(N) to O(N^2) or similar

simulacrum (Oct 08 2019 at 01:10, on Zulip):

possibly -- this is a pretty weird query to be seeing the slow down in, since it's not really responsible for much

tmandry (Oct 08 2019 at 01:17, on Zulip):

flamegraph with many more samples: flamegraph.1.svg

simulacrum (Oct 08 2019 at 01:17, on Zulip):

hm that's really weird

simulacrum (Oct 08 2019 at 01:18, on Zulip):

perf script doesn't show anything either I presume? i.e., if you look for the smallvec bits

simulacrum (Oct 08 2019 at 01:19, on Zulip):

I guess technically one thing you could try beyond just commenting the 3 lines is just sticking a return; at the top of the function - I believe this is strictly checking not inference etc so it should be a working (if buggy, in the sense that errors are no longer caught) compiler

simulacrum (Oct 08 2019 at 01:20, on Zulip):

I am surprised at the lack of callchains though

simulacrum (Oct 08 2019 at 01:20, on Zulip):

maybe incremental is to blame? that's the only thing I can think of

tmandry (Oct 08 2019 at 01:21, on Zulip):

one thing from perf script that looks interesting:

<rustc::ty::util::<impl rustc::ty::context::TyCtxt>::try_expand_impl_trait_type::OpaqueTypeExpander as rustc::ty::fold::TypeFolder>::fold_ty+0x113 (inlined)
simulacrum (Oct 08 2019 at 01:22, on Zulip):

that's https://github.com/rust-lang/rust/blob/f3c9cece7b6829e6fd7854a1aee6a1619a81a38c/src/librustc/ty/util.rs#L694

simulacrum (Oct 08 2019 at 01:22, on Zulip):

which could be slow, I guess

tmandry (Oct 08 2019 at 01:23, on Zulip):

I mean I'm just cherry picking

simulacrum (Oct 08 2019 at 01:23, on Zulip):

that's from the 3 lines I suggested to comment out though

simulacrum (Oct 08 2019 at 01:23, on Zulip):

(did you try that yet?)

tmandry (Oct 08 2019 at 01:23, on Zulip):

yeah, I'm building it now

simulacrum (Oct 08 2019 at 01:23, on Zulip):

ah okay :)

tmandry (Oct 08 2019 at 01:24, on Zulip):

it's slow because I have to build a stage2 compiler (cross compiling)

tmandry (Oct 08 2019 at 01:24, on Zulip):

not sure how hard it would be to build this for my linux machine

tmandry (Oct 08 2019 at 01:25, on Zulip):

this crate that is

tmandry (Oct 08 2019 at 01:25, on Zulip):

I do have to head out though.. I'll check back later and post anything I learn in here and on the issue

simulacrum (Oct 08 2019 at 01:25, on Zulip):

unless there's e.g. build.rs involved a cargo check --target would likely work w/o stage 2 build, I imagine?

tmandry (Oct 08 2019 at 01:25, on Zulip):

mm maybe, but we don't use cargo :P

simulacrum (Oct 08 2019 at 01:26, on Zulip):

e.g. x.py build --stage 1 --target <target> src/libtest and then cargo +stage1 check --target <target>

simulacrum (Oct 08 2019 at 01:26, on Zulip):

well, the rustc binary would just work for that target if you did the x.py build first

simulacrum (Oct 08 2019 at 01:26, on Zulip):

in theory

tmandry (Oct 08 2019 at 01:26, on Zulip):

good to know. I don't think we have the ability to do check in our build system but I could hack it in

simulacrum (Oct 08 2019 at 01:27, on Zulip):

to be clear you could plausibly even do build, but that has a linker dependency and you might not be able to satisfy that for said target

simulacrum (Oct 08 2019 at 01:27, on Zulip):

I guess you'd need the linker to get a std for said target in stage 2, so it should "just work" I would think

simulacrum (Oct 08 2019 at 01:28, on Zulip):

if not feel free to bug me :)

tmandry (Oct 08 2019 at 01:31, on Zulip):

okay, I'll start building it that way, this should save me some time :)

tmandry (Oct 08 2019 at 01:31, on Zulip):

I have to go now though

tmandry (Oct 08 2019 at 01:31, on Zulip):

thanks for all your help!!

nagisa (Oct 08 2019 at 01:36, on Zulip):

There was a PR somewhere around related to collecting into SmallVec

nagisa (Oct 08 2019 at 01:36, on Zulip):

https://github.com/rust-lang/rust/pull/64949

nagisa (Oct 08 2019 at 01:37, on Zulip):

(from_iter == collect pretty much)

nagisa (Oct 08 2019 at 01:38, on Zulip):

@tmandry ^ pinging just so you notice before you spend too much time on it

tmandry (Oct 08 2019 at 04:50, on Zulip):

thanks. unfortunately I lost access to my workstation so I'll have to wait until tomorrow to make more progress.

nikomatsakis (Oct 08 2019 at 14:21, on Zulip):

there are some problems with perf and deep stacks

nikomatsakis (Oct 08 2019 at 14:21, on Zulip):

there's a hard coded limit of 2^16 bytes of stack data being collected or something

nikomatsakis (Oct 08 2019 at 14:21, on Zulip):

which is rather hard to circumvent

nikomatsakis (Oct 08 2019 at 14:21, on Zulip):

that leads to the "bottom" of a stack trace being cut off

nikomatsakis (Oct 08 2019 at 14:21, on Zulip):

in general, I've found perf has become quite unuseful lately

Last update: Nov 18 2019 at 01:35UTC