@Alex Crichton What's the nature of the bug with highly parallel builds?
@Josh Triplett I think Alex was referring to the fact that at least on Linux we've measured fairly high contention inside the kernel due to our current strategy for limiting parallelism (i.e., GNU make jobserver, which on linux is a machine-global pipe), which can lead to abysmal performance with lots of threads
we have a few ideas for mitigation and are currently working on the likely solution which is making cargo be the source of truth for jobserver communication (though rustc will support, and always default to, normal jobserver operation)
yes this has to do with how we're limiting parallelism
the bug we've seen is that when you write to a pipe on linux, it wakes up everyone waiting on it
so, in the worst case, when you do -Zthreads=72
you have 72^2 threads
72 rustc's, each with 72 threads
that's a lot of people to wake up all the time
and a lot of wasted time churning around
we plan to fix this by not actually having 72^2 threads waiting
but only 72 :)
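(for context, the make-style jobserver protocol is just "one byte in a pipe per token": to take a token you read a byte, to give it back you write a byte. A rough sketch of that in Rust, assuming a libc dependency and fds obtained elsewhere -- not rustc's actual code, just the shape of the protocol:)
use std::os::unix::io::RawFd;

// Acquire one jobserver token by reading a single byte from the shared pipe.
// Every blocked worker sits in this read(); on Linux each token written wakes
// all of them, even though only one wins the byte -- the "thundering herd".
fn acquire_token(read_fd: RawFd) -> std::io::Result<u8> {
    let mut buf = [0u8; 1];
    loop {
        let n = unsafe { libc::read(read_fd, buf.as_mut_ptr() as *mut _, 1) };
        match n {
            1 => return Ok(buf[0]),
            -1 if std::io::Error::last_os_error().kind() == std::io::ErrorKind::Interrupted => continue,
            _ => return Err(std::io::Error::last_os_error()),
        }
    }
}

// Release the token by writing the byte back so some other worker can run.
fn release_token(write_fd: RawFd, token: u8) -> std::io::Result<()> {
    let buf = [token];
    let n = unsafe { libc::write(write_fd, buf.as_ptr() as *const _, 1) };
    if n == 1 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}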
@Alex Crichton This is actually a known bug that's being fixed in Linux right now.
There was a large pipe rework in upstream Linux, and in the course of evaluating that, Linus noticed that both the old and the new pipe behavior had the "thundering herd" problem.
So that's getting fixed now, which should substantially speed up make -j72 as well.
The only reason it isn't already fixed is that there's a bug in make that fixing it uncovered. :)
sounds promising!
Do we need to do anything on our side to get that behavior?
I expect we'll still want our changes since macOS etc probably won't get that fix for a while
@simulacrum You wouldn't need to do anything. The main thing would be that you could just drop the limit to 4 threads on recent Linux.
where recent is "unreleased"?
Also, does macOS actually have the thundering herd problem?
@simulacrum Where recent right now is "change not yet in git master, being evaluated".
We've not done benchmarking on macOS, but sort of assumed so -- it's also hard to test, since macOS machines with more than 4-6 cores are pretty hard to get (modulo the just-released Mac Pros)
Yeah, that's fair. :)
/me really wants someone to build a scalable macOS cloud.
I imagine we'll not really get any benefits from these changes for at least a year or two, right? That's my impression of the approximate timeline for getting kernel changes into end users' hands
i.e., there's no point in waiting or anything like that
so we probably want to move ahead with our fixes regardless
even if they get even more improved in the eventuality
Also @Alex Crichton -- I think you're not quite right that we'd have 72 threads waiting with our proposed fix, since at least I was expecting that we'd loosely have 1 thread per pipe
(just 72 threads total but that's not too interesting)
The changes should get into one of the next two or so kernels, I'd guess.
It won't take years.
Also, there's still value in interoperating with the standard jobserver.
@Josh Triplett whoa that's awesome! (that linux is fixing the root of the issue)
Yeah. I'm currently trying to grab the WIP patches and test them, to see how that does.
Well, to be clear, our proposal is entirely interoperable; rustc won't lose compat, it'll just be more limited with the default jobserver
@simulacrum oh true, it's more of a moral equivalent
@Josh Triplett so far we tested out switching to literal posix semaphores, and that simple change ended up fixing the scaling issues effectively on my 28-core machine
it's still not perfect because rustc immediately spawns ncores threads, which is quite a lot
so we need to fix that a bit, but the scaling was much better with posix semaphores
(where semaphores presumably don't have the thundering herd issue)
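(roughly, the semaphore experiment looks like the sketch below -- a named POSIX semaphore initialized to the job count, where sem_post wakes at most one sem_wait caller. The name and setup here are made up for illustration; this isn't the actual patch:)
use std::ffi::CString;

// Open (or create) a counting semaphore holding `jobs` tokens.
// Unlike a pipe, sem_post wakes at most one blocked sem_wait caller.
// (Error handling for SEM_FAILED is omitted in this sketch.)
unsafe fn open_job_semaphore(jobs: u32) -> *mut libc::sem_t {
    let name = CString::new("/rustc-jobserver-demo").unwrap();
    libc::sem_open(name.as_ptr(), libc::O_CREAT, 0o600 as libc::mode_t, jobs)
}

// Block until a token is available; retry if interrupted by a signal.
unsafe fn acquire(sem: *mut libc::sem_t) {
    while libc::sem_wait(sem) == -1 {}
}

// Return the token, waking exactly one waiter (if any).
unsafe fn release(sem: *mut libc::sem_t) {
    libc::sem_post(sem);
}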
but yeah we're unlikely to stick with vanilla jobservers for macos/windows which are likely to have similar problems
Windows has pipe-equivalents that guarantee to only wake up one waiter.
I don't know macOS internals well enough to know if it does.
I think I'm mostly trying to figure out if you can avoid doing that work by instead making sure you're using the right OS primitive.
Hypothetically, if Windows and macOS could work around this problem by using something pipe-like that isn't a POSIX pipe, and Linux worked fine with recent kernels and we could limit to 4 threads on older kernels, would you still want to do the extra architectural work for a workaround?
I at least expect that there are other advantages to letting Cargo know this information -- for example, I've had some thoughts about handing jobserver tokens down in a more intelligent way than any system primitive would give us, I think
How so?
/me is interested in ways to improve parallel builds.
e.g. right now if you're building lots of crates it might make sense to make sure each rustc has 2 threads (presuming you have the cores) rather than have one rustc with 10 threads and the rest stuck at 1 or something like that
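(roughly something like this, with made-up names -- not Cargo's actual logic, just the shape of the even-split heuristic:)
// Hypothetical even-split heuristic: each running rustc keeps at least one
// thread, and spare jobserver tokens are divided evenly across instances,
// instead of one process hoarding 10 while the rest are stuck at 1.
fn tokens_per_rustc(total_tokens: usize, active_rustcs: usize) -> usize {
    if active_rustcs == 0 {
        return 0;
    }
    std::cmp::max(1, total_tokens / active_rustcs)
}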
I mean, as long as you stay 100% CPU bound, you're doing work that will need to happen regardless. The problem comes in if you drop below 100% CPU.
well, yes, but it's not necessarily true that you want to e.g. run codegen immediately for some crate if you won't need it for a while
Yeah, scheduling at a crate level may make sense.
On that note, though...
At that point it might make sense to adopt a "pull" model where we try to build later crates and let those drive what we build next.
Simulating the idea of earliest-deadline-first.
yes, or something like that -- certainly right now rustc isn't really capable of communicating that information up in an on-demand fashion
For instance, almost any project I build that has syn in its dependency graph tends to wind up at a bottleneck where it's building only syn at some point.
but e.g. rust analyzer I think has a model like this
So it'd be nice to start syn as soon as possible.
I think that's kind of what I'm getting at -- we would potentially want to give syn as many threads as we can
and no system primitive would allow that level of prioritization I expect
True.
I think the ideal model effectively looks like "run everything at once, at two priority levels, higher for pull and lower for things we know we'll need eventually but we don't need yet".
You know the "readahead" trick on Linux, of "figure out everything you need to read from disk, and next time start reading it in the order you'll need it"? We could get a decent approximation of the same thing if we have an estimate for how long (in CPU time) each crate takes to build, assume that future builds will take similar amounts of time, and prioritize accordingly.
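(A toy sketch of what I mean, with invented names and numbers: if we had per-crate CPU-time estimates from a previous build, hand the next token to the ready crate with the longest remaining chain of downstream work, so long poles like syn start as early as possible. Not Cargo's scheduler, just the shape of the idea:)
use std::collections::HashMap;

// Estimated cost of `krate` plus the longest chain of crates that depend on it
// (its remaining critical path). `est_secs` would come from a prior build.
// (No memoization here; a real implementation would cache these values.)
fn critical_path(
    krate: &str,
    dependents: &HashMap<&str, Vec<&str>>,
    est_secs: &HashMap<&str, f64>,
) -> f64 {
    let own = est_secs.get(krate).copied().unwrap_or(1.0);
    let downstream = dependents
        .get(krate)
        .map(|ds| {
            ds.iter()
                .map(|d| critical_path(d, dependents, est_secs))
                .fold(0.0, f64::max)
        })
        .unwrap_or(0.0);
    own + downstream
}

// Of the crates whose dependencies are already built, pick the one whose
// completion unblocks the longest remaining chain of work.
fn pick_next<'a>(
    ready: &[&'a str],
    dependents: &HashMap<&str, Vec<&str>>,
    est_secs: &HashMap<&str, f64>,
) -> Option<&'a str> {
    ready.iter().copied().max_by(|a, b| {
        critical_path(a, dependents, est_secs)
            .partial_cmp(&critical_path(b, dependents, est_secs))
            .unwrap()
    })
}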
Anyway, I think I've found the Linux patch I need to test.
I would agree that we probably want our own scheduling system even if linux/windows are fixed
I think at the bare minimum I agree that this linux kernel change may take a very long time to propagate, and we want to ramp up default parallelism sooner
but I also agree that we can probably more cleverly prioritize crates
e.g. cargo already has some degree of scheduling heuristics
which can help keep everything saturated in theory
and we could apply similar heuristics to "we have a token, and N rustc instances want a token, who gets it?"
And we could always simplify the logic in the future, if we ever get to the point that we can start relying on fixed versions.
@Josh Triplett oh another thing you may want to do for more parallelism on a 72-core machine is to set -Ccodegen-units=100000000
or just set CARGO_INCREMENTAL=1
in release mode
which may drop perf a bit but not a huge amount due to thinlto
I'll give those a try too, once I have numbers for baseline and -Zthreads=72 with the patch I'm testing.
man that's a whole new world for me, compiling a new kernel and then running it locally :ghost:
I'm also using the compile of the kernel I'm testing to cross-check the results. :)
Kernel compile, on the kernel without the patch:
real    0m41.377s
user    10m1.718s
sys     2m25.655s
Kernel compile on the kernel with the patch:
real    0m40.463s
user    9m55.603s
sys     2m25.449s
Looks like it doesn't make that much difference to kernel compiles, but considering that half that time is serialized compressing/linking at the end anyway, that's still decent.
Now to try wasmtime...
Didn't help the wasmtime build with the default threads=4:
real    1m10.096s
user    17m38.136s
sys     0m24.415s
that looks pretty suspiciously high
I would expect system time to be far lower
maybe that's with cold disk cache?
Nope, that's the second build.
For reference, the wasmtime build with parallel rustc (and default threads) was:
real    1m9.191s
user    16m55.383s
sys     0m23.964s
Which is basically the same wall-clock (slightly better) and noticeably better user time.
The kernel patch did slightly improve the results with RUSTFLAGS=-Zthreads=72, but that's still much worse than 4:
real    1m27.163s
user    40m44.138s
sys     3m4.341s
hm, I think I was remembering when we were benchmarking the first 3 seconds of a build, which might explain my expectation of lower system times
could you try on both with timeout 3s cargo ...?
For comparison, -Zthreads=72 without this kernel patch gave 2m1s real-time and 50m user.
So the kernel patch to fix thundering herd wakeups for pipes is a huge relative improvement for -Zthreads=72, but there's still a problem there.
@simulacrum This system takes a while to reboot, so I'd like to get all the results I can with the kernel patch before I switch back to without.
ah, okay, wasn't sure if you were working on a separate machine or something
I'm doing these tests on a separate system than the one I'm chatting with, but that separate system takes minutes to reboot.
(server BIOSes, sigh)
fwiw, one thing that might be helpful is to get some loose syscall counts/timing
@simulacrum But purely subjectively, without the kernel patch, cargo spent a few seconds showing 0 crates compiled, while with the patch, that "slow startup" problem doesn't seem to happen anymore.
that's the primary behavior we're expecting to fix, so that's good to hear
I can easily get some -Ztimings data from this patched kernel.
Also, I'm trying some experiments with different -Zthreads values.
For instance, -Zthreads=16 gives:
real    1m7.279s
user    18m13.828s
sys     0m38.424s
That's a bit more user time, but a few seconds wall-clock improvement over 4 threads, and the improvement definitely isn't noise.
Subjectively, this feels like there's some critical non-linear scaling issue that dominates with high numbers of threads but not with low numbers of threads.
I tend to agree -- I think we've not yet tracked down what that is. It might be that we just don't have enough data or so, or maybe we're not yet parallel enough (i.e., the compiler itself is not sufficiently able to utilize that parallelism)
/me would be happy to help get the data you need.
Do you have experience or time to dig in? One of the problems we've had is that none of us really know how to figure out what the problem is (or what the tools are really either)
This is entirely subjective, but having worked on parallel programming for many years, this doesn't feel like "we don't have enough parallelism opportunities in the compiler". This feels like either "something is scaling quadratically or worse" or "something is blocking/spinning/waking-up unnecessarily".
I do have the experience, and some of the time.
(Parallel programming was my dissertation topic. ;) )
We'd be happy to receive help, and if you can even write up some quick "things to look at" or so that would be amazing
I can try and help out with how to get builds and such with the parallelism enabled
It depends on the nature of the problem you're debugging. But to a first approximation, if the problem is "too much user time", perf is great for this.
("too much blocking" is harder to debug, but possible.)
I think we've not had any major success with basic use of perf -- i.e., parallel compiler does not differ from non-parallel
(if you mean just perf record and friends)
That'd be really surprising. If user time is going from (in my case) 16 minutes to 40-50 minutes, perf really should point to something there...
A thought crosses my mind...
It's possible we've just not spent enough time on high enough core machines
Sometimes, these scaling issues show up much better on bigger systems, precisely because there's more contention. They also can show up more on multi-socket systems, because unnecessary communication is slower between sockets.
O(n^2) gets much more noticeable at 72 than 16. :)
(I also don't recall specifically looking at this in recent time, though I do recall doing something like this a few months ago, so things may have also changed since then)
So, in addition to me personally working on this: is there someone on the parallel rustc team who has room to host an 88-way box somewhere? I'll mail them one. :)
It's rackmount hardware, and produces enough noise that you don't want it under or atop your desk, but if someone has somewhere to rack it and make it available for everyone on the parallel rustc team to use it...
I am uncertain. I suspect the answer might be no -- and we might have more luck with just one-off running something on EC2, though I'd need to take a look at pricing there.
Pricing there is about $5/hour for a comparable box.
That's not counting storage and bandwidth.
(We do have some budget from AWS so we might be able to afford it, I'm just not sure off the top of my head :)
Just confirmed, c5.metal is what you want, which is $4.08/hour.
That's a 96-way parallel system.
(That said, I would love to give this hardware to Rust, where it'll be free for anyone to do this kind of testing on an ongoing basis.)
It's one of my desires to find a way for the infra team to be able to accept this sort of hardware donation :)
It doesn't have to be the infra team. I'd accept "someone in the project can arrange to rack it, one-off". So if anyone has an office near a lab...
(It can always become infra later when there's a path for that.)
sure, yeah, I agree with that
Are there debug symbols available that would allow directly running perf on the nightly-2019-12-18 build, or am I going to need to build from source?
you're going to need to build from source unfortunately
you can either directly check out the SHA of that nightly or toggle parallel-compiler = true in config.toml (not sure how much experience you have with rustc compiler dev)
Fairly little. Mind pointing me to the "build rustc and cargo from source and rustup link them 101"? :)
Even without source, though, I'm seeing some obvious perf issues:
13.07%  rustc  librustc_driver-0d78d9a30be443c5.so  [.] std::thread::local::LocalKey<T>::try_with
10.95%  rustc  librustc_driver-0d78d9a30be443c5.so  [.] crossbeam_epoch::internal::Global::try_advance
 6.93%  rustc  [unknown]                            [k] 0xffffffff91a00163
 5.86%  rustc  librustc_driver-0d78d9a30be443c5.so  [.] crossbeam_deque::Stealer<T>::steal
 4.14%  rustc  librustc_driver-0d78d9a30be443c5.so  [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
I'm going to grab a similar profile without -Zthreads=72 and see what things stand out as differences.
you shouldn't need to build cargo (at least yet) -- it doesn't change for this
Good to know.
otherwise the doc to look at is https://rust-lang.github.io/rustc-guide/building/how-to-build-and-run.html
which essentially boils down to cloning the repo, and then this sequence of commands or so:
cp config.toml.example config.toml
$EDITOR config.toml   # edit parallel-compiler to be "true" and uncomment it
./x.py build --stage 1 src/libtest
rustup toolchain link stage1 build/<host-triple>/stage1
# cd elsewhere
cargo +stage1 build ...
and you'll only need to run the link step once
ah, you might actually need to enable debug symbols as well
as https://rust-lang.github.io/rustc-guide/building/how-to-build-and-run.html#create-a-configtoml notes
@Josh Triplett ^
Thanks, I'll give that a try.
Also, yeah, just a simple perf record is showing some massive scaling issues.
With the default threads=4, a parallel rustc has free and malloc at the top of the profile.
And various parts of libLLVM-9-rust-1.41.0-nightly.so.
With threads=72, all of the top hits look like scaling failures.
aha, yeah, so it looks like we probably just weren't benchmarking with enough threads
the first thing I'd look at is probably bumping the shard bits here: https://github.com/rust-lang/rust/blob/master/src/librustc_data_structures/sharded.rs#L13-L17
(should be a matter of bumping that constant up)
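(the idea is roughly the shape below -- a simplified sketch, not the actual rustc code: the shared caches are split into 2^SHARD_BITS independently-locked shards, so a bigger constant means less lock contention at the cost of some memory:)
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARD_BITS: usize = 5;            // try bumping this (e.g. to 7 or 8) on a 72-way box
const SHARDS: usize = 1 << SHARD_BITS;  // number of independently-locked shards

struct Sharded<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> Sharded<K, V> {
    fn new() -> Self {
        Sharded { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    // Hash the key to choose a shard, then lock only that shard.
    fn insert(&self, key: K, value: V) {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let idx = (hasher.finish() as usize) & (SHARDS - 1);
        self.shards[idx].lock().unwrap().insert(key, value);
    }
}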
@Josh Triplett another issue is that there's a branch of rayon which in theory greatly improves threads going to sleep
that may help a bit here as well
it's great to have this system to test on though
At the moment, it looks like the bottleneck may be in crossbeam.
@Josh Triplett another possibility perhaps: if you can find one crate that isn't getting nearly the speedup you'd expect -- I was taking a look at -Z self-profile graphs and saw very little work actually being stolen
so it may just be that rustc isn't great at saturating cores right now
I'm less worried about that, and more worried about the massively increased user time (total CPU time spent) when increasing the number of threads.
Interestingly, it looks like the pile of CPU spent on TLS in std::thread::local::LocalKey<T>::try_with is coming from crossbeam_deque::Stealer<T>::steal.
that sounds like plausibly rayon not quite doing a good job
Would you recommend testing the same commit that nightly-2019-12-18 used, or testing latest master and just enabling parallelism?
shouldn't matter in practice
might be a bit harder to test latest master
but if you're setting threads up already then you should be fine
(i.e., current master does not default to 4 threads but rather 1)
I'm fine with manually specifying RUSTFLAGS=-Zthreads=4. :)
What about the pile of other changes that got made and reverted?
I saw a bunch of type-related changes, for instance (usize vs u64...).
@Alex Crichton What's the rayon branch that improves threads going to sleep?
shouldn't matter in practice
Alright.
@Josh Triplett the rayon branch should help with user time because the current bug for rayon is that threads take way too long to go to sleep, which would burn a lot of user time
That makes sense; that's absolutely worth testing.
the branch was last tested in https://github.com/rust-lang/rust/pull/66608
I'm not sure if it's easy to switch to
we're mostly waiting on rayon to merge the changes itself :)
Are the changes solid enough that they're expected to be merged, or do they need further hammering-on?
It doesn't look too hard to switch to.
I'd expect some of https://tokio.rs/blog/2019-10-scheduler/ could apply well to Rayon too. I don't think anyone has optimized Rayon for 72 threads.
That does look interesting, thank you for the pointer.
Having a backoff from stealing stuff when idling would probably help a bit too. Rayon just tries to steal stuff in a loop currently
Which makes sense if you expect to saturate the system, but not if you expect to cooperate with other parallel work.
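(Something like this toy sketch, say -- not Rayon's code, and the spin/sleep thresholds are invented: an idle worker yields, then sleeps, then asks to be parked instead of spinning on steal() forever:)
use crossbeam_deque::{Steal, Stealer};
use std::time::Duration;

// Try to steal a job from any other worker, backing off as attempts fail so an
// idle thread stops burning CPU that other rustc processes could be using.
fn steal_with_backoff<T>(stealers: &[Stealer<T>]) -> Option<T> {
    let mut attempts = 0u32;
    loop {
        for s in stealers {
            match s.steal() {
                Steal::Success(job) => return Some(job),
                Steal::Empty | Steal::Retry => {}
            }
        }
        attempts += 1;
        if attempts < 10 {
            std::thread::yield_now();                      // cheap backoff first
        } else if attempts < 20 {
            std::thread::sleep(Duration::from_micros(50u64 << (attempts - 10)));
        } else {
            return None;                                   // caller should park this thread
        }
    }
}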
Successfully built rustc master (Build completed successfully in 0:12:13, most of which was LLVM), trying that out to see if I get comparable results.
@Josh Triplett I don't know the status of the rayon changes myself, @nikomatsakis would be able to speak more to that (the status of the rayon branch and how close it is to landing upstream)
@simulacrum I tried building a parallel rustc based on your instructions, but even with RUSTFLAGS=-Zthreads=72 (or 4) I don't seem to get any parallelism.
/me will try again later.
hm okay
you might need to x.py clean to get rid of artifacts, which might not be getting cleaned up for whatever reason after toggling on parallel-compiler = true in config.toml
@simulacrum I tried that, and rebuilt, and I'm still not getting parallelism.
That seems to have helped, I think.