Stream: t-compiler/wg-parallel-rustc

Topic: Bug on highly parallel builds


Josh Triplett (Dec 18 2019 at 19:26, on Zulip):

@Alex Crichton What's the nature of the bug with highly parallel builds?

simulacrum (Dec 18 2019 at 19:34, on Zulip):

@Josh Triplett I think Alex was referring to the fact that at least on Linux we've measured fairly high contention inside the kernel due to our current strategy for limiting parallelism (i.e., the GNU make jobserver protocol, which on Linux is a pipe shared by every process in the build), which can lead to abysmal performance with lots of threads

simulacrum (Dec 18 2019 at 19:35, on Zulip):

we have a few ideas for mitigation and are currently working on the likely solution which is making cargo be the source of truth for jobserver communication (though rustc will support, and always default to, normal jobserver operation)

Alex Crichton (Dec 18 2019 at 19:51, on Zulip):

yes this has to do with how we're limiting parallelism

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

the bug we've seen is that when you write to a pipe on linux, it wakes up everyone waiting on it

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

so, in the worst case, when you do -Zthreads=72 you have 72^2 threads

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

72 rustc's, each with 72 threads

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

that's a lot of people to wake up all the time

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

and a lot of wasted time churning around

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

we plan to fix this by not actually having 72^2 threads waiting

Alex Crichton (Dec 18 2019 at 19:52, on Zulip):

but only 72 :)
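
A minimal sketch, not rustc's actual code, of what each waiting worker thread does with a pipe-based jobserver (the fd is assumed to be the read end inherited from make/cargo):

use std::io::Read;
use std::mem::ManuallyDrop;
use std::os::unix::io::FromRawFd;

fn acquire_token(jobserver_read_fd: i32) -> std::io::Result<u8> {
    // SAFETY: assumes `jobserver_read_fd` is the read end of the shared jobserver pipe.
    // ManuallyDrop keeps us from closing the shared fd when `pipe` goes out of scope.
    let mut pipe = ManuallyDrop::new(unsafe { std::fs::File::from_raw_fd(jobserver_read_fd) });
    let mut token = [0u8; 1];
    // Every waiting thread blocks here on the same pipe; writing one token back
    // wakes all of them, one wins the read, and the rest go back to sleep.
    pipe.read_exact(&mut token)?;
    Ok(token[0])
}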

Josh Triplett (Dec 18 2019 at 20:03, on Zulip):

@Alex Crichton This is actually a known bug that's being fixed in Linux right now.

Josh Triplett (Dec 18 2019 at 20:03, on Zulip):

There was a large pipe rework in upstream Linux, and in the course of evaluating that, Linus noticed that both the old and the new pipe behavior had the "thundering herd" problem.

Josh Triplett (Dec 18 2019 at 20:03, on Zulip):

So that's getting fixed now, which should substantially speed up make -j72 as well.

Josh Triplett (Dec 18 2019 at 20:04, on Zulip):

The only reason it isn't already fixed is that there's a bug in make that fixing it uncovered. :)

simulacrum (Dec 18 2019 at 20:04, on Zulip):

sounds promising!

Do we need to do anything on our side to get that behavior?

I expect we'll still want our changes since macOS etc probably won't get that fix for a while

Josh Triplett (Dec 18 2019 at 20:05, on Zulip):

@simulacrum You wouldn't need to do anything. The main thing would be that you could just drop the limit to 4 threads on recent Linux.

simulacrum (Dec 18 2019 at 20:05, on Zulip):

where recent is "unreleased"?

Josh Triplett (Dec 18 2019 at 20:05, on Zulip):

Also, does macOS actually have the thundering herd problem?

Josh Triplett (Dec 18 2019 at 20:05, on Zulip):

@simulacrum Where recent right now is "change not yet in git master, being evaluated".

simulacrum (Dec 18 2019 at 20:06, on Zulip):

We've not done benchmarking on macOS, but we sort of assumed so -- it's also hard to test, since macOS machines with more than 4-6 cores are pretty hard to get (modulo the just-released Mac Pros)

Josh Triplett (Dec 18 2019 at 20:06, on Zulip):

Yeah, that's fair. :)

Josh Triplett (Dec 18 2019 at 20:07, on Zulip):

/me really wants someone to build a scalable macOS cloud.

simulacrum (Dec 18 2019 at 20:09, on Zulip):

I imagine we'll not really get any benefits from these changes for at least a year or two, right? That's my impression of the approximate timeline for getting kernel changes into end users' hands

simulacrum (Dec 18 2019 at 20:09, on Zulip):

i.e., there's no point in waiting or anything like that

simulacrum (Dec 18 2019 at 20:10, on Zulip):

so we probably want to move ahead with our fixes regardless

simulacrum (Dec 18 2019 at 20:10, on Zulip):

even if they eventually get improved further

simulacrum (Dec 18 2019 at 20:11, on Zulip):

Also @Alex Crichton -- I think you're not quite right that we'd have 72 threads waiting with our proposed fix, since at least I was expecting that we'd loosely have 1 thread per pipe

simulacrum (Dec 18 2019 at 20:11, on Zulip):

(just 72 threads total but that's not too interesting)

Josh Triplett (Dec 18 2019 at 20:11, on Zulip):

The changes should get into one of the next two or so kernels, I'd guess.

Josh Triplett (Dec 18 2019 at 20:12, on Zulip):

It won't take years.

Josh Triplett (Dec 18 2019 at 20:12, on Zulip):

Also, there's still value in interoperating with the standard jobserver.

Alex Crichton (Dec 18 2019 at 20:13, on Zulip):

@Josh Triplett whoa that's awesome! (that linux is fixing the root of the issue)

Josh Triplett (Dec 18 2019 at 20:13, on Zulip):

Yeah. I'm currently trying to grab the WIP patches and test them, to see how that does.

simulacrum (Dec 18 2019 at 20:13, on Zulip):

Well to be clear our proposal is entirely interoperable, rustc won't lose compat, just be more limited with default jobserver

Alex Crichton (Dec 18 2019 at 20:13, on Zulip):

@simulacrum oh true, it's more of a moral equivalent

Alex Crichton (Dec 18 2019 at 20:13, on Zulip):

@Josh Triplett so far we tested out switching to literal POSIX semaphores, and that simple change effectively fixed the scaling issues on my 28-core machine

Alex Crichton (Dec 18 2019 at 20:14, on Zulip):

it's still not perfect because rustc immediately spawns ncores threads, which is quite a lot

Alex Crichton (Dec 18 2019 at 20:14, on Zulip):

so we need to fix that a bit, but the scaling was much better with posix semaphores

Alex Crichton (Dec 18 2019 at 20:14, on Zulip):

(where semaphores presumably don't have the thundering herd issue)
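
A rough sketch of that semaphore experiment, assuming a libc dependency; the semaphore name and token count here are made up:

use std::ffi::CString;

fn main() {
    let name = CString::new("/jobserver-demo").unwrap(); // hypothetical name
    unsafe {
        // Create (or open) a named POSIX semaphore preloaded with the job tokens.
        let sem = libc::sem_open(name.as_ptr(), libc::O_CREAT, 0o600 as libc::c_uint, 72 as libc::c_uint);
        assert_ne!(sem, libc::SEM_FAILED);
        libc::sem_wait(sem); // acquire a token; a sem_post wakes only one waiter
        // ... do one job's worth of work ...
        libc::sem_post(sem); // release the token
        libc::sem_close(sem);
    }
}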

Alex Crichton (Dec 18 2019 at 20:14, on Zulip):

but yeah we're unlikely to stick with vanilla jobservers for macOS/Windows, which are likely to have similar problems

Josh Triplett (Dec 18 2019 at 20:28, on Zulip):

Windows has pipe-equivalents that guarantee to only wake up one waiter.

Josh Triplett (Dec 18 2019 at 20:28, on Zulip):

I don't know macOS internals well enough to know if it does.

Josh Triplett (Dec 18 2019 at 20:28, on Zulip):

I think I'm mostly trying to figure out if you can avoid doing that work by instead making sure you're using the right OS primitive.

Josh Triplett (Dec 18 2019 at 20:29, on Zulip):

Hypothetically, if Windows and macOS could work around this problem by using something pipe-like that isn't a POSIX pipe, and Linux worked fine with recent kernels and we could limit to 4 threads on older kernels, would you still want to do the extra architectural work for a workaround?

simulacrum (Dec 18 2019 at 20:32, on Zulip):

I at least expect that there are other advantages to letting Cargo know this information -- for example, I've had some thoughts about handing out jobserver tokens in a more intelligent way than any system primitive would give us

Josh Triplett (Dec 18 2019 at 20:33, on Zulip):

How so?

Josh Triplett (Dec 18 2019 at 20:33, on Zulip):

/me is interested in ways to improve parallel builds.

simulacrum (Dec 18 2019 at 20:34, on Zulip):

e.g. right now if you're building lots of crates it might make sense to make sure each rustc has 2 threads (presuming you have the cores) rather than have one rustc with 10 threads and the rest stuck at 1 or something like that

Josh Triplett (Dec 18 2019 at 20:35, on Zulip):

I mean, as long as you stay 100% CPU bound, you're doing work that will need to happen regardless. The problem comes in if you drop below 100% CPU.

simulacrum (Dec 18 2019 at 20:35, on Zulip):

well, yes, but it's not necessarily true that you want to e.g. run codegen immediately for some crate if you won't need it for a while

Josh Triplett (Dec 18 2019 at 20:35, on Zulip):

Yeah, scheduling at a crate level may make sense.

Josh Triplett (Dec 18 2019 at 20:36, on Zulip):

On that note, though...

Josh Triplett (Dec 18 2019 at 20:36, on Zulip):

At that point it might make sense to adopt a "pull" model where we try to build later crates and let those drive what we build next.

Josh Triplett (Dec 18 2019 at 20:36, on Zulip):

Simulating the idea of earliest-deadline-first.

simulacrum (Dec 18 2019 at 20:37, on Zulip):

yes, or something like that -- certainly right now rustc isn't really capable of communicating that information up in an on demand fashion

Josh Triplett (Dec 18 2019 at 20:37, on Zulip):

For instance, almost any project I build that has syn in its dependency graph tends to wind up at a bottleneck where it's building only syn at some point.

simulacrum (Dec 18 2019 at 20:37, on Zulip):

but e.g. rust analyzer I think has a model like this

Josh Triplett (Dec 18 2019 at 20:37, on Zulip):

So it'd be nice to start syn as soon as possible.

simulacrum (Dec 18 2019 at 20:39, on Zulip):

I think that's kind of what I'm getting at -- we would potentially want to give syn as many threads as we can

simulacrum (Dec 18 2019 at 20:40, on Zulip):

and no system primitive would allow that level of prioritization I expect

Josh Triplett (Dec 18 2019 at 20:42, on Zulip):

True.

Josh Triplett (Dec 18 2019 at 20:42, on Zulip):

I think the ideal model effectively looks like "run everything at once, at two priority levels, higher for pull and lower for things we know we'll need eventually but we don't need yet".

Josh Triplett (Dec 18 2019 at 20:44, on Zulip):

You know the "readahead" trick on Linux, of "figure out everything you need to read from disk, and next time start reading it in the order you'll need it"? We could get a decent approximation of the same thing if we have an estimate for how long (in CPU time) each crate takes to build, assume that future builds will take similar amounts of time, and prioritize accordingly.
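
A toy sketch of that idea (all names and data sources here are hypothetical): estimate each crate's remaining critical path from CPU times measured on a previous build, and start whatever has the longest path first.

use std::collections::HashMap;

fn remaining_path_secs(
    krate: &str,
    dependents: &HashMap<&str, Vec<&str>>, // crate -> crates that depend on it
    last_build_secs: &HashMap<&str, f64>,  // crate -> CPU seconds measured last build
) -> f64 {
    // Cost of this crate plus the most expensive chain of crates waiting on it.
    let own = last_build_secs.get(krate).copied().unwrap_or(1.0);
    let downstream = dependents
        .get(krate)
        .into_iter()
        .flatten()
        .map(|d| remaining_path_secs(d, dependents, last_build_secs))
        .fold(0.0, f64::max);
    own + downstream
}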

Josh Triplett (Dec 18 2019 at 20:45, on Zulip):

Anyway, I think I've found the Linux patch I need to test.

Alex Crichton (Dec 18 2019 at 20:50, on Zulip):

I would agree that we probably want our own scheduling system even if linux/windows are fixed

Alex Crichton (Dec 18 2019 at 20:50, on Zulip):

I think at the bare minimum I agree that this linux kernel change may take a very long time to propagate, and we want to ramp up default parallelism sooner

Alex Crichton (Dec 18 2019 at 20:51, on Zulip):

but I also agree that we can probably more cleverly prioritize crates

Alex Crichton (Dec 18 2019 at 20:51, on Zulip):

e.g. cargo already has some degree of scheduling heuristics

Alex Crichton (Dec 18 2019 at 20:51, on Zulip):

which can help keep everything saturated in theory

Alex Crichton (Dec 18 2019 at 20:51, on Zulip):

and we could apply similar heuristics to "we have a token, and N rustc instances want a token, who gets it?"
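
Purely for illustration, one toy form such a heuristic could take (this is not what Cargo does today): hand a freed token to the running rustc whose crate has the most downstream work blocked on it, breaking ties toward whoever currently holds the fewest tokens.

struct RunningCrate {
    name: String,
    blocked_dependents: usize, // crates that can't start until this one finishes
    tokens_held: usize,
}

fn pick_token_recipient(running: &[RunningCrate]) -> Option<usize> {
    running
        .iter()
        .enumerate()
        .max_by_key(|(_, c)| (c.blocked_dependents, std::cmp::Reverse(c.tokens_held)))
        .map(|(i, _)| i)
}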

Josh Triplett (Dec 18 2019 at 20:55, on Zulip):

And we could always simplify the logic in the future, if we ever get to the point that we can start relying on fixed versions.

Alex Crichton (Dec 18 2019 at 21:03, on Zulip):

@Josh Triplett oh another thing you may want to do for more parallelism on a 72-core machine is to set -Ccodegen-units=100000000

Alex Crichton (Dec 18 2019 at 21:03, on Zulip):

or just set CARGO_INCREMENTAL=1 in release mode

Alex Crichton (Dec 18 2019 at 21:03, on Zulip):

which may drop perf a bit, but not a huge amount thanks to ThinLTO
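
For example (the exact values here are just illustrative, not a recommendation):

CARGO_INCREMENTAL=1 RUSTFLAGS="-Zthreads=72 -Ccodegen-units=256" cargo +nightly build --release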

Josh Triplett (Dec 18 2019 at 21:15, on Zulip):

I'll give those a try too, once I have numbers for baseline and -Zthreads=72 with the patch I'm testing.

Alex Crichton (Dec 18 2019 at 21:16, on Zulip):

man that's a whole new world for me, compiling a new kernel and then running it locally :ghost:

Josh Triplett (Dec 18 2019 at 21:26, on Zulip):

I'm also using the compile of the kernel I'm testing to cross-check the results. :)

Josh Triplett (Dec 18 2019 at 21:26, on Zulip):

Kernel compile, on the kernel without the patch:

real    0m41.377s
user    10m1.718s
sys 2m25.655s

Josh Triplett (Dec 18 2019 at 21:48, on Zulip):

Kernel compile on the kernel with the patch:

real    0m40.463s
user    9m55.603s
sys 2m25.449s

Looks like it doesn't make that much difference to kernel compiles, but considering that half that time is serialized compressing/linking at the end anyway, that's still decent.

Josh Triplett (Dec 18 2019 at 21:48, on Zulip):

Now to try wasmtime...

Josh Triplett (Dec 18 2019 at 21:52, on Zulip):

Didn't help the wasmtime build with the default threads=4:

real    1m10.096s
user    17m38.136s
sys 0m24.415s

simulacrum (Dec 18 2019 at 21:52, on Zulip):

that looks pretty suspiciously high

simulacrum (Dec 18 2019 at 21:53, on Zulip):

I would expect system time to be far lower

simulacrum (Dec 18 2019 at 21:53, on Zulip):

maybe that's with cold disk cache?

Josh Triplett (Dec 18 2019 at 21:54, on Zulip):

Nope, that's the second build.

Josh Triplett (Dec 18 2019 at 21:54, on Zulip):

For reference, the wasmtime build with parallel rustc (and default threads) was:

real    1m9.191s
user    16m55.383s
sys 0m23.964s

Josh Triplett (Dec 18 2019 at 21:54, on Zulip):

Which is basically the same wall-clock (slightly better) and noticeably better user time.

Josh Triplett (Dec 18 2019 at 21:58, on Zulip):

The kernel patch did slightly improve the results with RUSTFLAGS=-Zthreads=72, but that's still much worse than 4:

real    1m27.163s
user    40m44.138s
sys 3m4.341s

simulacrum (Dec 18 2019 at 21:58, on Zulip):

hm, I think I was remembering when we were benchmarking the first 3 seconds of a build, which might explain my expectation of lower system times

simulacrum (Dec 18 2019 at 21:59, on Zulip):

could you try on both with timeout 3s cargo ...?

Josh Triplett (Dec 18 2019 at 21:59, on Zulip):

For comparison, -Zthreads=72 without this kernel patch gave 2m1s real-time and 50m user.

Josh Triplett (Dec 18 2019 at 21:59, on Zulip):

So the kernel patch to fix thundering herd wakeups for pipes is a huge relative improvement for -Zthreads=72, but there's still a problem there.

Josh Triplett (Dec 18 2019 at 22:00, on Zulip):

@simulacrum This system takes a while to reboot, so I'd like to get all the results I can with the kernel patch before I switch back to without.

simulacrum (Dec 18 2019 at 22:00, on Zulip):

ah, okay, wasn't sure if you were working on a separate machine or something

Josh Triplett (Dec 18 2019 at 22:01, on Zulip):

I'm doing these tests on a separate system than the one I'm chatting with, but that separate system takes minutes to reboot.

Josh Triplett (Dec 18 2019 at 22:01, on Zulip):

(server BIOSes, sigh)

simulacrum (Dec 18 2019 at 22:01, on Zulip):

fwiw, one thing that might be helpful is to get some loose syscall counts/timing
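
e.g. something along these lines (assuming strace is available; -f follows child processes, -c prints per-syscall counts and times):

strace -f -c -o syscall-summary.txt cargo build --release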

Josh Triplett (Dec 18 2019 at 22:01, on Zulip):

@simulacrum But purely subjectively, without the kernel patch, cargo spent a few seconds showing 0 crates compiled, while with the patch, that "slow startup" problem doesn't seem to happen anymore.

simulacrum (Dec 18 2019 at 22:02, on Zulip):

that's the primary behavior we're expecting to fix, so that's good to hear

Josh Triplett (Dec 18 2019 at 22:06, on Zulip):

I can easily get some -Ztimings data from this patched kernel.

Josh Triplett (Dec 18 2019 at 22:06, on Zulip):

Also, I'm trying some experiments with different -Zthreads values.

Josh Triplett (Dec 18 2019 at 22:07, on Zulip):

For instance, -Zthreads=16 gives:

real    1m7.279s
user    18m13.828s
sys 0m38.424s

Josh Triplett (Dec 18 2019 at 22:07, on Zulip):

That's a bit more user time, but a few seconds wall-clock improvement over 4 threads, and the improvement definitely isn't noise.

Josh Triplett (Dec 18 2019 at 22:08, on Zulip):

Subjectively, this feels like there's some critical non-linear scaling issue that dominates with high numbers of threads but not with low numbers of threads.

simulacrum (Dec 18 2019 at 22:11, on Zulip):

I tend to agree -- I think we've not yet tracked down what that is. It might be that we just don't have enough data, or maybe we're not yet parallel enough (i.e., the compiler itself is not sufficiently able to utilize that parallelism)

Josh Triplett (Dec 18 2019 at 22:25, on Zulip):

/me would be happy to help get the data you need.

simulacrum (Dec 18 2019 at 22:26, on Zulip):

Do you have experience or time to dig in? One of the problems we've had is that none of us really knows how to figure out what the problem is (or really what the right tools are, either)

Josh Triplett (Dec 18 2019 at 22:26, on Zulip):

This is entirely subjective, but having worked on parallel programming for many years, this doesn't feel like "we don't have enough parallelism opportunities in the compiler". This feels like either "something is scaling quadratically or worse" or "something is blocking/spinning/waking-up unnecessarily".

Josh Triplett (Dec 18 2019 at 22:27, on Zulip):

I do have the experience, and some of the time.

Josh Triplett (Dec 18 2019 at 22:27, on Zulip):

(Parallel programming was my dissertation topic. ;) )

simulacrum (Dec 18 2019 at 22:27, on Zulip):

We'd be happy to receive help, and if you can even write up some quick "things to look at" or so that would be amazing

simulacrum (Dec 18 2019 at 22:28, on Zulip):

I can try and help out with how to get builds and such with the parallelism enabled

Josh Triplett (Dec 18 2019 at 22:30, on Zulip):

It depends on the nature of the problem you're debugging. But to a first approximation, if the problem is "too much user time", perf is great for this.

Josh Triplett (Dec 18 2019 at 22:30, on Zulip):

("too much blocking" is harder to debug, but possible.)

simulacrum (Dec 18 2019 at 22:31, on Zulip):

I think we've not had any major success with basic use of perf -- i.e., parallel compiler does not differ from non-parallel

simulacrum (Dec 18 2019 at 22:31, on Zulip):

(if you mean just perf record and friends)

Josh Triplett (Dec 18 2019 at 22:32, on Zulip):

That'd be really surprising. If user time is going from (in my case) 16 minutes to 40-50 minutes, perf really should point to something there...

Josh Triplett (Dec 18 2019 at 22:32, on Zulip):

A thought crosses my mind...

simulacrum (Dec 18 2019 at 22:32, on Zulip):

It's possible we've just not spent enough time on machines with high enough core counts

Josh Triplett (Dec 18 2019 at 22:32, on Zulip):

Sometimes, these scaling issues show up much better on bigger systems, precisely because there's more contention. They also can show up more on multi-socket systems, because unnecessary communication is slower between sockets.

Josh Triplett (Dec 18 2019 at 22:33, on Zulip):

O(n^2) gets much more noticeable at 72 than 16. :)

simulacrum (Dec 18 2019 at 22:33, on Zulip):

(I also don't recall specifically looking at this in recent time, though I do recall doing something like this a few months ago, so things may have also changed since then)

Josh Triplett (Dec 18 2019 at 22:33, on Zulip):

So, in addition to me personally working on this: is there someone on the parallel rustc team who has room to host an 88-way box somewhere? I'll mail them one. :)

Josh Triplett (Dec 18 2019 at 22:34, on Zulip):

It's rackmount hardware, and produces enough noise that you don't want it under or atop your desk, but if someone has somewhere to rack it and make it available for everyone on the parallel rustc team to use it...

simulacrum (Dec 18 2019 at 22:35, on Zulip):

I am uncertain. I suspect the answer might be no -- and we might have more luck with just one-off running something on EC2, though I'd need to take a look at pricing there.

Josh Triplett (Dec 18 2019 at 22:35, on Zulip):

Pricing there is about $5/hour for a comparable box.

Josh Triplett (Dec 18 2019 at 22:36, on Zulip):

That's not counting storage and bandwidth.

simulacrum (Dec 18 2019 at 22:37, on Zulip):

(We do have some budget from AWS so we might be able to afford it, I'm just not sure off the top of my head :)

Josh Triplett (Dec 18 2019 at 22:40, on Zulip):

Just confirmed, c5.metal is what you want, which is $4.08/hour.

Josh Triplett (Dec 18 2019 at 22:40, on Zulip):

That's a 96-way parallel system.

Josh Triplett (Dec 18 2019 at 22:41, on Zulip):

(That said, I would love to give this hardware to Rust, where it'll be free for anyone to do this kind of testing on an ongoing basis.)

simulacrum (Dec 18 2019 at 22:42, on Zulip):

It's one of my desires to find a way for the infra team to be able to accept this sort of hardware donation :)

Josh Triplett (Dec 18 2019 at 22:43, on Zulip):

It doesn't have to be the infra team. I'd accept "someone in the project can arrange to rack it, one-off". So if anyone has an office near a lab...

Josh Triplett (Dec 18 2019 at 22:43, on Zulip):

(It can always become infra later when there's a path for that.)

simulacrum (Dec 18 2019 at 22:44, on Zulip):

sure, yeah, I agree with that

Josh Triplett (Dec 18 2019 at 22:58, on Zulip):

Are there debug symbols available that would allow directly running perf on the nightly-2019-12-18 build, or am I going to need to build from source?

simulacrum (Dec 18 2019 at 23:00, on Zulip):

you're going to need to build from source unfortunately

simulacrum (Dec 18 2019 at 23:01, on Zulip):

you can either directly check out the SHA of that nightly or toggle parallel-compiler = true in config.toml (not sure how much experience you have with rustc compiler dev)

Josh Triplett (Dec 18 2019 at 23:02, on Zulip):

Fairly little. Mind pointing me to the "build rustc and cargo from source and rustup link them 101"? :)

Josh Triplett (Dec 18 2019 at 23:03, on Zulip):

Even without source, though, I'm seeing some obvious perf issues:

  13.07%  rustc            librustc_driver-0d78d9a30be443c5.so          [.] std::thread::local::LocalKey<T>::try_with
  10.95%  rustc            librustc_driver-0d78d9a30be443c5.so          [.] crossbeam_epoch::internal::Global::try_advance
   6.93%  rustc            [unknown]                                    [k] 0xffffffff91a00163
   5.86%  rustc            librustc_driver-0d78d9a30be443c5.so          [.] crossbeam_deque::Stealer<T>::steal
   4.14%  rustc            librustc_driver-0d78d9a30be443c5.so          [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold

Josh Triplett (Dec 18 2019 at 23:03, on Zulip):

I'm going to grab a similar profile without -Zthreads=72 and see what things stand out as differences.

simulacrum (Dec 18 2019 at 23:05, on Zulip):

you shouldn't need to build cargo (at least yet) -- it doesn't change for this

Josh Triplett (Dec 18 2019 at 23:05, on Zulip):

Good to know.

simulacrum (Dec 18 2019 at 23:05, on Zulip):

otherwise the doc to look at is https://rust-lang.github.io/rustc-guide/building/how-to-build-and-run.html

simulacrum (Dec 18 2019 at 23:06, on Zulip):

which essentially boils down to cloning the repo, and then this sequence of commands or so:

cp config.toml.example config.toml
$EDITOR config.toml
# edit parallel-compiler to be "true" and uncomment it
./x.py build --stage 1 src/libtest
rustup toolchain link stage1 build/<host-triple>/stage1
# cd elsewhere
cargo +stage1 build ...

simulacrum (Dec 18 2019 at 23:07, on Zulip):

and you'll only need to run the link step once

simulacrum (Dec 18 2019 at 23:07, on Zulip):

ah, you might actually need to enable debug symbols as well

simulacrum (Dec 18 2019 at 23:07, on Zulip):

as https://rust-lang.github.io/rustc-guide/building/how-to-build-and-run.html#create-a-configtoml notes

simulacrum (Dec 18 2019 at 23:07, on Zulip):

@Josh Triplett ^

Josh Triplett (Dec 18 2019 at 23:07, on Zulip):

Thanks, I'll give that a try.

Josh Triplett (Dec 18 2019 at 23:08, on Zulip):

Also, yeah, just a simple perf record is showing some massive scaling issues.

Josh Triplett (Dec 18 2019 at 23:08, on Zulip):

With the default threads=4, a parallel rustc has free and malloc at the top of the profile.

Josh Triplett (Dec 18 2019 at 23:08, on Zulip):

And various parts of libLLVM-9-rust-1.41.0-nightly.so.

Josh Triplett (Dec 18 2019 at 23:09, on Zulip):

With threads=72, all of the top hits look like scaling failures.

simulacrum (Dec 18 2019 at 23:10, on Zulip):

aha, yeah, so looks like we probably just weren't benchmarking with enough threads

simulacrum (Dec 18 2019 at 23:11, on Zulip):

the first thing I'd look at is probably bumping the shard bits here: https://github.com/rust-lang/rust/blob/master/src/librustc_data_structures/sharded.rs#L13-L17

simulacrum (Dec 18 2019 at 23:11, on Zulip):

(should be a matter of bumping that constant up)
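
Roughly what the linked lines contained at the time (quoted from memory, so check the link); bumping SHARD_BITS from 5 to, say, 7 would give 128 shards instead of 32:

#[cfg(parallel_compiler)]
const SHARD_BITS: usize = 5;

#[cfg(not(parallel_compiler))]
const SHARD_BITS: usize = 0;

pub const SHARDS: usize = 1 << SHARD_BITS;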

Alex Crichton (Dec 18 2019 at 23:13, on Zulip):

@Josh Triplett another issue is that there's a branch of rayon which in theory greatly improves threads going to sleep

Alex Crichton (Dec 18 2019 at 23:13, on Zulip):

that may help a bit here as well

Alex Crichton (Dec 18 2019 at 23:13, on Zulip):

this is great to have this system to test on though

Josh Triplett (Dec 18 2019 at 23:14, on Zulip):

At the moment, it looks like the bottleneck may be in crossbeam.

Alex Crichton (Dec 18 2019 at 23:15, on Zulip):

@Josh Triplett another possibility perhaps, if you can find one crate that isn't getting nearly the speedup you'd think, I was taking a look at -Z self-profile graphs and saw very little work actually being stolen

Alex Crichton (Dec 18 2019 at 23:15, on Zulip):

so it may just be that rustc isn't great at saturating cores right now

Josh Triplett (Dec 18 2019 at 23:15, on Zulip):

I'm less worried about that, and more worried about the massively increased user time (total CPU time spent) when increasing the number of threads.

Josh Triplett (Dec 18 2019 at 23:17, on Zulip):

Interestingly, it looks like the pile of CPU spent on TLS in std::thread::local::LocalKey<T>::try_with is coming from crossbeam_deque::Stealer<T>::steal.

simulacrum (Dec 18 2019 at 23:17, on Zulip):

that sounds like plausibly rayon not quite doing a good job

Josh Triplett (Dec 18 2019 at 23:19, on Zulip):

Would you recommend testing the same commit that nightly-2019-12-18 used, or testing latest master and just enabling parallelism?

simulacrum (Dec 18 2019 at 23:20, on Zulip):

shouldn't matter in practice

simulacrum (Dec 18 2019 at 23:20, on Zulip):

might be a bit harder to test latest master

simulacrum (Dec 18 2019 at 23:20, on Zulip):

but if you're setting threads up already then you should be fine

simulacrum (Dec 18 2019 at 23:20, on Zulip):

(i.e., current master does not default to 4 threads but rather 1)

Josh Triplett (Dec 18 2019 at 23:22, on Zulip):

I'm fine with manually specifying RUSTFLAGS=-Zthreads=4. :)

Josh Triplett (Dec 18 2019 at 23:22, on Zulip):

What about the pile of other changes that got made and reverted?

Josh Triplett (Dec 18 2019 at 23:22, on Zulip):

I saw a bunch of type-related changes, for instance (usize vs u64...).

Josh Triplett (Dec 18 2019 at 23:23, on Zulip):

@Alex Crichton What's the rayon branch that improves threads going to sleep?

simulacrum (Dec 18 2019 at 23:23, on Zulip):

shouldn't matter in practice

Josh Triplett (Dec 18 2019 at 23:24, on Zulip):

Alright.

Alex Crichton (Dec 18 2019 at 23:24, on Zulip):

@Josh Triplett the rayon branch should help with user time because the current bug for rayon is that threads take way too long to go to sleep, which would burn a lot of user time

Josh Triplett (Dec 18 2019 at 23:24, on Zulip):

That makes sense; that's absolutely worth testing.

Alex Crichton (Dec 18 2019 at 23:25, on Zulip):

the branch was last tested in https://github.com/rust-lang/rust/pull/66608

Alex Crichton (Dec 18 2019 at 23:25, on Zulip):

I'm not sure if it's easily switchable-to

Alex Crichton (Dec 18 2019 at 23:25, on Zulip):

we're mostly waiting on rayon to merge the changes itself :)

Josh Triplett (Dec 18 2019 at 23:26, on Zulip):

Are the changes solid enough that they're expected to be merged, or do they need further hammering-on?

Josh Triplett (Dec 18 2019 at 23:26, on Zulip):

It doesn't look too hard to switch to.

Zoxc (Dec 18 2019 at 23:29, on Zulip):

I'd expect some of https://tokio.rs/blog/2019-10-scheduler/ could apply well to Rayon too. I don't think anyone has optimized Rayon for 72 threads.

Josh Triplett (Dec 18 2019 at 23:32, on Zulip):

That does look interesting, thank you for the pointer.

Zoxc (Dec 18 2019 at 23:35, on Zulip):

Having a backoff from stealing stuff when idling would probably help a bit too. Rayon just tries to steal stuff in a loop currently
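
Something like this shape, as a sketch only (not Rayon's code): back off exponentially while steal attempts keep failing, and reset once work is found.

use std::time::Duration;

fn idle_loop(mut try_steal: impl FnMut() -> bool) {
    let mut backoff_us = 1u64;
    loop {
        if try_steal() {
            backoff_us = 1; // got work; reset the backoff
            continue;
        }
        std::thread::sleep(Duration::from_micros(backoff_us));
        backoff_us = (backoff_us * 2).min(1_000); // cap at 1ms
    }
}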

Josh Triplett (Dec 18 2019 at 23:36, on Zulip):

Which makes sense if you expect to saturate the system, but not if you expect to cooperate with other parallel work.

Josh Triplett (Dec 18 2019 at 23:37, on Zulip):

Successfully built rustc master (Build completed successfully in 0:12:13, most of which was LLVM), trying that out to see if I get comparable results.

Alex Crichton (Dec 18 2019 at 23:39, on Zulip):

@Josh Triplett I don't know the status of the rayon changes myself, @nikomatsakis would be able to speak more to that (the status of the rayon branch and how close it is to landing upstream)

Josh Triplett (Dec 19 2019 at 00:01, on Zulip):

@simulacrum I tried building a parallel rustc based on your instructions, but even with RUSTFLAGS=-Zthreads=72 (or 4) I don't seem to get any parallelism.

Josh Triplett (Dec 19 2019 at 00:01, on Zulip):

/me will try again later.

simulacrum (Dec 19 2019 at 00:02, on Zulip):

hm okay

simulacrum (Dec 19 2019 at 00:02, on Zulip):

you might need to x.py clean to get rid of artifacts, which might not be getting cleaned up for whatever reason after toggling on parallel-compiler = true in config.toml

Josh Triplett (Dec 19 2019 at 07:17, on Zulip):

@simulacrum I tried that, and rebuilt, and I'm still not getting parallelism.

Josh Triplett (Dec 19 2019 at 07:24, on Zulip):

That seems to have helped, I think.

Last update: Jul 02 2020 at 19:55UTC