Stream: t-compiler/wg-parallel-rustc

Topic: Profiling Cargo (again)


Alex Crichton (Oct 28 2019 at 20:55, on Zulip):

Ok so I'm starting to collect some data and write some things to https://hackmd.io/oUdvUU2lTk2ZxfOWQBT9xQ

Alex Crichton (Oct 28 2019 at 20:56, on Zulip):

I'm still in the process of doing so, but @cuviper I'm curious, do you know of good tools on Linux to profile contention?

simulacrum (Oct 28 2019 at 21:00, on Zulip):

(that ping did not appear to work, @cuviper)

simulacrum (Oct 28 2019 at 21:00, on Zulip):

@Alex Crichton note that an alt build will have slower codegen (llvm asserts + IR verification)

Alex Crichton (Oct 28 2019 at 21:01, on Zulip):

yep

Alex Crichton (Oct 28 2019 at 21:01, on Zulip):

I'm just comparing it to itself

Alex Crichton (Oct 28 2019 at 21:01, on Zulip):

not comparing it to nightly

simulacrum (Oct 28 2019 at 21:01, on Zulip):

makes sense, just wanted to be sure :)

cuviper (Oct 28 2019 at 21:42, on Zulip):

@Alex Crichton what sort of profiling do you want? even those strace counts of futex are a pretty good indicator, which will be some balance of FUTEX_WAIT and FUTEX_WAKE calls
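
For reference, a minimal sketch of collecting those counts with strace's summary mode (the cargo invocation here is illustrative):

```
# -f follows forked/cloned children (all rustc processes and threads),
# -c prints a per-syscall count/time summary on exit,
# -e limits tracing to futex.
strace -f -c -e trace=futex cargo build
```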

Alex Crichton (Oct 28 2019 at 21:43, on Zulip):

I suppose another way to put this is that a -Zthreads=28 build for me spends 870ms in the futex syscall, but -Zthreads=1 spends 46ms in there

cuviper (Oct 28 2019 at 21:44, on Zulip):

SystemTap has some locking examples, especially mutex-contention.stp
https://sourceware.org/systemtap/examples/keyword-index.html#LOCKING
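
A rough usage sketch, assuming the examples are installed locally (the script path varies by distribution):

```
# stap's -c runs the given command and stops the probe session when it
# exits; the mutex-contention.stp script reports which pthread mutexes
# see the most contention.
stap mutex-contention.stp -c "cargo build"
```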

Alex Crichton (Oct 28 2019 at 21:44, on Zulip):

more worrisome though is that -Zthreads=1 spends 4us in sched_yield while -Zthreads=28 spends 4 seconds

Alex Crichton (Oct 28 2019 at 21:44, on Zulip):

ok cool I might poke around at that

cuviper (Oct 28 2019 at 21:45, on Zulip):

sched_yield is probably due to rayon threads spinning down

cuviper (Oct 28 2019 at 21:45, on Zulip):

niko's scheduling changes will hopefully help there

cuviper (Oct 28 2019 at 21:46, on Zulip):

(less spinning, more precise wakeups, etc.)
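
One way to check that attribution (a sketch; this needs tracepoint access and debuginfo for useful stacks):

```
# Record a call stack on every sched_yield entry across the build,
# then look at whether rayon's idle/spin loop dominates the report.
perf record -g -e syscalls:sys_enter_sched_yield -- cargo build
perf report
```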

Alex Crichton (Oct 28 2019 at 21:47, on Zulip):

for context you may want to see https://hackmd.io/oUdvUU2lTk2ZxfOWQBT9xQ?view#Crates-in-the-first-3-seconds as well

Alex Crichton (Oct 28 2019 at 21:47, on Zulip):

where basically I'm seeing that a huge amount of system time is spent in the first few seconds of the build

Alex Crichton (Oct 28 2019 at 21:47, on Zulip):

so when there are tons of rustc processes running in parallel, rustc's internal parallelism is stressing the system quite a lot

Alex Crichton (Oct 28 2019 at 21:47, on Zulip):

which if rayon has 28 threads (for myself locally) all spinning

Alex Crichton (Oct 28 2019 at 21:48, on Zulip):

that's 28 * 28 = 784 threads all spinning

Alex Crichton (Oct 28 2019 at 21:48, on Zulip):

which would explain that

cuviper (Oct 28 2019 at 21:48, on Zulip):

FWIW, I think those strace times are _wall_ times, not CPU usage

Alex Crichton (Oct 28 2019 at 21:48, on Zulip):

oh this is perf trace if that makes a difference
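
That summary mode looks roughly like this (a sketch):

```
# perf trace -s prints a per-task syscall summary: call counts,
# errors, and total/min/avg/max times, accumulated per thread.
perf trace -s cargo build
```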

Alex Crichton (Oct 28 2019 at 21:48, on Zulip):

and I think it's wall time per syscall, right?

Alex Crichton (Oct 28 2019 at 21:48, on Zulip):

or rather I don't think it can be overall wall time

cuviper (Oct 28 2019 at 21:49, on Zulip):

the total of time from enter->exit of the syscall

Alex Crichton (Oct 28 2019 at 21:49, on Zulip):

b/c the whole build was 3 seconds and it says 4s was spent in sched_yield

Alex Crichton (Oct 28 2019 at 21:49, on Zulip):

yeah, summed up across all threads

cuviper (Oct 28 2019 at 21:49, on Zulip):

right, they can overlap

cuviper (Oct 28 2019 at 21:51, on Zulip):

I assume msec is milliseconds, no? so 4015660 is 4015.66 seconds

cuviper (Oct 28 2019 at 21:51, on Zulip):

lots of overlap, surely

Alex Crichton (Oct 28 2019 at 21:53, on Zulip):

so I thought so, but I think "msec" there is actually "microseconds"

Alex Crichton (Oct 28 2019 at 21:53, on Zulip):

I find perf output pretty bad from time to time...

Alex Crichton (Oct 28 2019 at 21:53, on Zulip):

I definitely didn't wait for an hour

Alex Crichton (Oct 28 2019 at 21:53, on Zulip):

so from this I see two things:

cuviper (Oct 28 2019 at 21:54, on Zulip):

no, but spread across 3998 clone tasks (processes and/or threads)

Alex Crichton (Oct 28 2019 at 21:54, on Zulip):

I wish I could drill more into why futex is taking so much longer

Alex Crichton (Oct 28 2019 at 21:54, on Zulip):

but that seems like it's possibly just bland everyday contention which would be harder to optimize

Alex Crichton (Oct 28 2019 at 21:55, on Zulip):

but jobserver at least seems entirely unrelated to the slowness that I'm seeing

Alex Crichton (Oct 28 2019 at 21:55, on Zulip):

or rather, it's hidden by the massive costs of sched_yield

Alex Crichton (Oct 28 2019 at 21:56, on Zulip):

there are about 2x the read/write syscalls, and that is likely related to the jobserver
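
That tracks with the jobserver protocol, where acquiring a token is a one-byte read from the shared pipe and releasing it is a one-byte write back, so token traffic shows up as paired read/write syscalls. A sketch of counting them (this includes ordinary file I/O too, so it's only an upper bound):

```
# Count read/write syscalls across cargo and every rustc it spawns;
# each jobserver token is one read (acquire) + one write (release).
strace -f -c -e trace=read,write cargo build
```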

Alex Crichton (Oct 28 2019 at 21:56, on Zulip):

but the main costs seem to be elsewhere

cuviper (Oct 28 2019 at 21:57, on Zulip):

can you cargo-patch to niko's rayon branch to see its effects?

cuviper (Oct 28 2019 at 21:57, on Zulip):

er, well, I guess you need rustc-rayon with those changes, not as simple

Alex Crichton (Oct 28 2019 at 21:57, on Zulip):

yeah :(

Alex Crichton (Oct 28 2019 at 21:57, on Zulip):

if we could get that up to a try build though that'd be great

simulacrum (Oct 28 2019 at 23:27, on Zulip):

@Alex Crichton here is a try build with no jobserver support (efc4e1038fd187f9817854c353e56a74d6f49709, see https://github.com/rust-lang/rust/pull/65908) which also disables llvm asserts etc., so alt and non-alt are fine for comparing parallel vs. non-parallel w/o jobserver. I've also kicked off a perf.rlo run to confirm that the jobserver is not harming the current case

it should be compared against 8d78bf6b273848d17da8f5c92162c6a6b9b10dfd if you want

simulacrum (Oct 29 2019 at 02:02, on Zulip):

er, if you want to compare against "real" master

simulacrum (Oct 29 2019 at 02:02, on Zulip):

i.e., without jobserver changes
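
For anyone reproducing this comparison, one way to install both try builds locally (a sketch, assuming the rustup-toolchain-install-master helper):

```
# Installs each commit's CI artifacts as a rustup toolchain named
# after the hash (pass --alt for the alt artifacts).
cargo install rustup-toolchain-install-master
rustup-toolchain-install-master efc4e1038fd187f9817854c353e56a74d6f49709
rustup-toolchain-install-master 8d78bf6b273848d17da8f5c92162c6a6b9b10dfd
cargo +efc4e1038fd187f9817854c353e56a74d6f49709 build
cargo +8d78bf6b273848d17da8f5c92162c6a6b9b10dfd build
```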

Alex Crichton (Oct 29 2019 at 14:52, on Zulip):

Thanks @simulacrum! Updated the doc - https://hackmd.io/oUdvUU2lTk2ZxfOWQBT9xQ?view - with that information

Alex Crichton (Oct 29 2019 at 14:52, on Zulip):

my conclusion is that removing the jobserver does help, but not as much as you might think

Alex Crichton (Oct 29 2019 at 14:53, on Zulip):

I think that we still have the same two conclusions to start with to get the lion's share of improvements

Alex Crichton (Oct 29 2019 at 14:53, on Zulip):

the first is to test out the new scheduler, which is more eager to shut threads down

Alex Crichton (Oct 29 2019 at 14:54, on Zulip):

and the second is a separate change to spin up threads lazily instead of all at once

Alex Crichton (Oct 29 2019 at 14:55, on Zulip):

For those following this, I think that's all the investigation that needs to be done before next week's mtg; I'm pretty confident that this is enough data to dig in and start a round of improvements. If others think there's something else that should be investigated, though, let me know!

Last update: Nov 17 2019 at 07:00 UTC