quick update on jobserver work: afaict, the new(?) rayon is buggy and releases jobserver tokens it didn't acquire
which may have led to the magical performance we were seeing, uncertain
in any case, investigating to see if I can fix now
As in rustc-rayon, used on master?
no, the one we're trialing
@Alex Crichton Am I correct that we expect rustc to never release a jobserver token until after it's acquired it itself? i.e., it should not release the token that cargo acquired to spawn it
(in particular, rayon does this before blocking -- I'm not sure yet what it blocks on)
rustc can and will release the implicit token given to it by cargo though, which it didn't explicitly acquire.
@Zoxc I'm confused
@Alex Crichton just said that we should never do that
(and I tend to agree, otherwise we can end up with a bunch of rustcs even with -j1)
Ideally we shouldn't do it, but it's not buggy
I feel like it is buggy?
i.e., if you release that token then we're basically ending up in a situation where we can deadlock, right? since no one is able to make progress
you need each process to always have at least one token, otherwise it shouldn't do anything
It will immediately (or before) acquire the token back with the Rayon threadpool though.
but it can lose that race
at which point no one is going to make progress
hm, okay, I should clarify -- eventually, it'll make progress
but it'll stall until (potentially) N other rustcs try to finish
in any case it seems pretty clear that this behavior is pretty suboptimal
(since it means that -j1 is no longer guaranteed to have at most one rustc process, etc)
and that to me even seems like just flat "buggy"
I guess I won't try to fix it, and will just hope things actually work, even if I can't really tell due to this
rustc can also "stall" in the same way due to Rayon giving up tokens for one of the threads it's using while it's idle, or due to rustc waiting on a query which is currently being computed on another thread.
It would be nice to ensure that rustc always keeps at least one token. There was code to do this in some of the cases, but I removed it since it was buggy (https://github.com/rust-lang/rust/pull/59804/files). To fix the transfer of the jobserver token from the main thread to the thread pool would probably require some less pretty Rayon changes (an API to enter a thread pool while holding a token to be given to the thread pool).
well, it'd mostly be a matter of using the thread pool on the side
e.g. you'd not have the 'main' thread be a threadpool thread
(which I think is largely true most of the time? e.g. in non-rustc land)
It has to be a threadpool thread though, for efficiency and for WorkerLocal to work.
Using it off the threadpool hits the slow paths in Rayon.
okay, but that seems like a solvable problem
maybe this is a good reason to not use rayon (as we've discussed a little)
No, any other thread pool should be designed the same way and we should be running the main rustc thread on it =P
Maybe an API for a rustc thread to become one of the Rayon threadpool threads would work? Or just an API for spawning and joining the threadpool at once should also probably suffice.
Actually we could just provide a parameter to the thread pool with how many jobserver tokens it starts with and skip acquiring tokens for some threads, and then have an API to enter the thread pool without giving up any tokens.
I don't really know what the APIs etc need to be like
but it does seem like we need to more deeply integrate token management into rayon
I guess we can just do nothing with jobserver tokens when entering a thread pool then. So we just need a flag which tells Rayon that one of its threads doesn't need to acquire/release a jobserver token on startup/exit
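something like this, as a toy version using rayon's real start/exit handlers plus the jobserver crate's acquire_raw/release_raw (which iirc it exposes for exactly rustc's kind of use) -- the pretokened knob is the part that doesn't exist today:
```rust
use jobserver::Client;

// toy sketch: the first `pretokened` workers are assumed to be covered
// by tokens the process already holds (e.g. the implicit token from
// cargo), so only the remaining workers touch the jobserver
fn build_pool(client: Client, threads: usize, pretokened: usize) -> rayon::ThreadPool {
    let (on_start, on_exit) = (client.clone(), client);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .start_handler(move |idx| {
            if idx >= pretokened {
                // block until a token is available for this worker
                on_start.acquire_raw().expect("jobserver acquire failed");
            }
        })
        .exit_handler(move |idx| {
            if idx >= pretokened {
                // give the token back when the worker shuts down
                on_exit.release_raw().expect("jobserver release failed");
            }
        })
        .build()
        .unwrap()
}
```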
Sorry haven't caught up on the discussion here, but it is incorrect for rustc to release its implicit token
and that is a bug we cannot ship with
a major part of the jobserver's purpose is to limit memory consumption
if you release the implicit token then it's possible to spawn unlimited rustc instances
and they're all just sitting idle consuming memory
for example if you run with -j1, only one will ever be doing work
but we may spawn hundreds of rustc instances in parallel
which can kill builds that rely on -j for limiting memory
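in jobserver-crate terms the invariant looks something like this (minimal sketch; rustc's real plumbing is obviously more involved, and from_env is unsafe in recent jobserver versions):
```rust
use jobserver::Client;

fn main() -> std::io::Result<()> {
    // inherit the jobserver cargo set up for us; being spawned at all
    // came with one *implicit* token, which we must never release
    let client = unsafe { Client::from_env() }.expect("no jobserver in env");

    // any *extra* parallelism needs an explicitly acquired token...
    let token = client.acquire()?;
    let helper = std::thread::spawn(move || {
        // ... a second thread's worth of work goes here ...
        drop(token); // released only when the extra work is done
    });

    // the implicit token covers this thread; just keep working
    helper.join().unwrap();
    Ok(())
}
```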
That doesn't seem to be an easy property to guarantee
no, it is not, and this is one reason why the llvm backend is so complicated right now, it has to juggle around this "implicit token"
I'm guessing it would require fibers in Rayon to avoid giving up jobserver tokens when waiting on queries, but I'm not sure if that covers all cases. Fibers can also cause excessive memory usage if not limited by waiting, so they're not a great fix.
@Alex Crichton in theory cf92e2ebf129f4fd8075650b21c612d196b7a2f8 should be a valid try commit (you want alt build as usual, and you'll need to rebuild cargo -- I'm not sure if we uploaded a copy)
or how to get that copy if we did
https://github.com/Mark-Simulacrum/cargo/commit/c8373fbde42777e389c889e8b16a1d5f823e7a68 should be enough to rebuild cargo so hopefully not too hard
@simulacrum awesome, thanks!
my computer with 28 cores got shut down today and is being shipped to me over the Thanksgiving break
so I'll get back to you next Monday on the numbers for that
okay, sounds good
I tried to do some loose profiling of locks and such but that mostly failed (wasn't able to get low enough overhead / enough samples)
mostly pointed at env::var calls in MIR interpret and Cargo, along with Cargo's interner
neither of which seemed... plausible to actually be a hot spot
hopefully Santiago will have more success
Hm well env accesses are actually locked
So not entirely implausible if something is slamming env vars
We may need to instrument rustc to try to learn about contended locks
I think this pretty much does not see parking_lot locks at all, which might explain the high percentage here (though unclear -- the numbers from perf stat indicate that it sees all calls to futex, at least)
<rustc::mir::interpret::error::InterpErrorInfo as core::convert::From<...>>,
I think the cargo ones probably don't matter in practice
like, maybe we could get wins, but ultimately that's not important, cargo is mostly idle
<rustc::mir::interpret::error::InterpErrorInfo as core::convert::From<...>> I plan to file a PR switching it out for a lazy static or something like that
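something like this is the shape I have in mind (just a sketch, not the actual PR; I believe RUSTC_CTFE_BACKTRACE is the var that From impl reads):
```rust
use lazy_static::lazy_static;

lazy_static! {
    // read the env var once; std::env::var takes a process-global
    // lock, so re-reading it on every error construction is what contends
    static ref CTFE_BACKTRACE: bool =
        std::env::var("RUSTC_CTFE_BACKTRACE").is_ok();
}

fn capture_backtrace() -> bool {
    // first call pays the locked read; later calls are a plain load
    *CTFE_BACKTRACE
}
```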
unfortunately our normal benchmarks (perf.rlo) probably won't notice, as the uncontended lock is presumably quite cheap
Cargo should be easy to fix yeah, the string one seems the most plausible
Can you swap out parking lot for libstd to get contention numbers?
hm, maybe, I can try that
std's APIs are different I think and sufficiently so that this might not be readily possible
@Alex Crichton also -- forgot to note this -- you'll want to run sudo rm -f /dev/shm/sem.jobserver-rust* between builds; obviously that's not tenable long-term but for now I believe it's necessary-ish
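(for context: the prototype uses POSIX named semaphores, which persist in /dev/shm/sem.<name> until someone unlinks them -- roughly this, via the libc crate, with a made-up name:)
```rust
use std::ffi::CString;

fn main() {
    let name = CString::new("/jobserver-rust-demo").unwrap();
    unsafe {
        // create (or open) a named semaphore with 4 tokens; this is
        // what shows up as /dev/shm/sem.jobserver-rust-demo on Linux
        let sem = libc::sem_open(
            name.as_ptr(),
            libc::O_CREAT,
            0o600 as libc::c_uint, // mode
            4 as libc::c_uint,     // initial token count
        );
        assert_ne!(sem, libc::SEM_FAILED);

        assert_eq!(libc::sem_wait(sem), 0); // acquire a token
        assert_eq!(libc::sem_post(sem), 0); // release it back

        libc::sem_close(sem);
        // without this, the file in /dev/shm outlives the process,
        // hence the `rm` between builds
        libc::sem_unlink(name.as_ptr());
    }
}
```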
sync.rs has a custom Lock wrapper, so the API doesn't matter too much (LockGuard is reused though)
parking lot parks ~2000 times total in the first 3 seconds from raw_mutex...
~6000 total futex calls from parking lot
so parking lot is not the source of trouble
(this is on a run with perf stat reporting 457,198 futex calls)
@Zoxc yeah, std RwLock/Mutex don't support mapping afaict
initial findings suggest that contention is happening in:
utilized this patch to libstd to gather data: https://gist.github.com/Mark-Simulacrum/6dfa7678f2d449175aa1f3d8856340f7
it has fairly high overhead, though, so measurements may not be super reliable
overhead is much lower if I adjust the elapsed higher
(initial measurements were with ~100 nanoseconds)
but I can get ~65 job units in the 3-second run if I adjust it to 300 nanoseconds, which is about the same as I get without this
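for reference, the general shape of the instrumentation (not the gist verbatim, just the idea -- Instant::now itself costs tens of nanoseconds, which is where the overhead comes from):
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard};
use std::time::{Duration, Instant};

static SLOW_ACQUIRES: AtomicU64 = AtomicU64::new(0);
const THRESHOLD: Duration = Duration::from_nanos(300);

struct TimedMutex<T>(Mutex<T>);

impl<T> TimedMutex<T> {
    fn lock(&self) -> MutexGuard<'_, T> {
        let start = Instant::now();
        let guard = self.0.lock().unwrap();
        // count acquisitions that took longer than the threshold;
        // those are the (probably) contended ones
        if start.elapsed() > THRESHOLD {
            SLOW_ACQUIRES.fetch_add(1, Ordering::Relaxed);
        }
        guard
    }
}
```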
trying to run some more benchmarks and get a good sense of how reliable this is
I am consistently seeing stalls for hundreds of milliseconds in rustc_rayon_core::sleep::Sleep::wake_specific_thread
but maybe that's expected?
e.g., 51.508068ms; 772.124223ms; 773.960212ms; 58.569277ms; 951.101171ms; 51.476288ms; 866.302171ms; 867.230675ms
to make sure I understand this right, @simulacrum cf92e2ebf129f4fd8075650b21c612d196b7a2f8 is a build which is new rayon plus semaphore-for-jobserver, right? no other changes?
this looks to be a massive improvement
Possibly modulo cargo update to latest master but I can't imagine that makes a difference
ok so the numbers aren't entirely exonerating
but I posted them to https://hackmd.io/bmB35-7oRzCeGOCanI0SkA?both
they look massively better than the prior version
I found a flag on epoll that might allow us to avoid the thundering herd problem with the pipe, too, which might be enough that we can use that for now
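(specifically EPOLLEXCLUSIVE -- registration would look roughly like this with the libc crate; each waiter epoll_waits on its own epoll fd and the kernel wakes only one of them when the pipe becomes readable, instead of all of them:)
```rust
// assumes a Linux target with the libc crate available
unsafe fn watch_pipe_exclusive(pipe_fd: libc::c_int) -> libc::c_int {
    let epfd = libc::epoll_create1(0);
    assert!(epfd >= 0);
    let mut ev = libc::epoll_event {
        events: (libc::EPOLLIN | libc::EPOLLEXCLUSIVE) as u32,
        u64: pipe_fd as u64,
    };
    // EPOLLEXCLUSIVE is only valid at EPOLL_CTL_ADD time
    assert_eq!(libc::epoll_ctl(epfd, libc::EPOLL_CTL_ADD, pipe_fd, &mut ev), 0);
    epfd // epoll_wait on this; only one exclusive waiter gets woken
}
```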
oh -- are you using the right cargo there?
you need to use the one from that toolchain
basically what the script snippets say
I think rustup-toolchain-install-master doesn't do that though
I may need to fix my thing then
i.e., you're not downloading cargo by default
also how did this
I must be using the equivalent of no jobserver then
ok I'll need to recollect data
yeah normal cargo would basically be no jobserver (well, cargo itself would limit to 28, but not internal parallelism)
I can't collect numbers right this second, will be a bit
sounds good -- I think you shouldn't need to build the cargo, but it's probably best to do so -- just to be safe -- the submodule in that commit was updated as needed to an appropriate commit I think
ok I think it's working now
I downloaded the precompiled cargo
and I confirmed that -j3 actually limits things
(it'll loosely not work due to new-rayon bugs that we spoke about last week)
i.e., I think we can get in a situation where we have more than N running rustcs and such, but we will (hopefully) never actually be doing more than N threads of work
ok those numbers look even better -- https://hackmd.io/bmB35-7oRzCeGOCanI0SkA?both
system time barely changes
top still shows a smidge of red, but nowhere near where it was before
14 threads seems to be a sweet spot for me
so not a huge amount of benefit from hyperthreads
do you know if you're using LLD by default locally?
I am not
okay, was going to suggest disabling threading if yes
When compiling one crate on 8 cores, SMT made things slower until I landed some PRs which reduced contention. I'm using a Ryzen R7 1700, so I don't know how SMT affects performance on Intel chips, but I know AMD generally performs better in multi-threaded workloads.
the handler code is basically the default handler code; we added the acquire and release stuff before and after spawning
That would release the token before the spawned thread finished, right?
ouch, we want that right after the run call
@Santiago Pastorino Wouldn't that cause rustc to wait until it has all the tokens before doing anything? Or did you do some Rayon changes to thread spawning too?
did this at the request of @simulacrum, but we didn't touch rayon thread spawning
Rayon also acquires tokens when spawning / closing threads, so you'll need to remove those too
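i.e. the ordering should be something like this (sketch using the jobserver crate; the point is that the token moves into the thread and is only dropped after the run call returns):
```rust
use jobserver::Client;

fn spawn_worker(
    client: &Client,
    run: impl FnOnce() + Send + 'static,
) -> std::thread::JoinHandle<()> {
    // acquire *before* spawning, so we never exceed the token count...
    let token = client.acquire().expect("failed to acquire jobserver token");
    std::thread::spawn(move || {
        run();
        // ...and release right after the run call, on the worker
        // itself -- not back on the spawning thread
        drop(token);
    })
}
```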
@Alex Crichton I've been looking into the EPOLLEXCLUSIVE improvement I mentioned, and it looks like it's definitely linux-specific (along with all of epoll). It also looks like kqueue on macOS is claimed to avoid the thundering herd problem, but I can find no evidence of this in manpages (just in random blog posts).
Right now as far as I can tell we can just break compatibility with make, and use semaphores (which are supported on macOS and Linux, at least, and are in POSIX so I expect pretty wide support). That's probably not really tenable though. The alternative then is to work on the Cargo integration to have a single process read/write the file descriptor I guess.
Do you feel that working on the Cargo integration would be reasonable? Should we consider 'just' breaking?
If we pursue the route of semaphores, here's what I think we should do:
well, let me just type it out
I think we should add dual support in the jobserver crate for semaphores and pipes. Cargo would then tell the compiler either (a) here's a semaphore, go wild, or (b) here's a pipe jobserver, you can spawn at most N threads where N is like 10.
Most compilations with Cargo do not integrate with an external jobserver
so that means most compilations would use the semaphore and can go wild
for those that do we'd limit rustc's parallelism to alleviate the thundering herd problem
and we can have a work item for later to fix this
basically Cargo just needs the ability to query a jobserver::Client if it's a semaphore or a pipe
jobserver crate auto-detects semaphores or pipes
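in sketch form (ClientKind and the cap numbers are just illustrative; the jobserver-rs addition would be a query returning something like this):
```rust
enum ClientKind {
    Pipe,      // make-compatible fd jobserver; reads thunder
    Semaphore, // breaks make compat, but wakeups are targeted
}

// imagined cargo-side decision once the crate can answer the query
fn rustc_thread_cap(kind: ClientKind, ncpus: usize) -> usize {
    match kind {
        ClientKind::Semaphore => ncpus,    // (a) go wild
        ClientKind::Pipe => ncpus.min(10), // (b) at most ~10 threads
    }
}
```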
@simulacrum does that sound reasonable?
I think that should give us a lot of bang for not a lot of buck, while leaving it possible to fix this in the future
my impression is that most compilations do shell out to cmake/make somewhere, which I thought limits what we can do
but yes, that sounds reasonable
I can work on a PR to jobserver-rs to make that logic happen
(it's worth noting that on Windows, initial measurements suggest a semaphore there still suffers from the thundering herd problem, though we don't have super knowledgeable people testing that yet)
but this does sound reasonable
@Alex Crichton will go ahead and start work on this
Oh wait right
I forgot that
The cmake make stuff
@simulacrum let's leave this on the backburner for now
and fix windows first
and see what that requires
if semaphores don't work there we'll likely require a system where cargo manages everything anyway
so fixing windows is basically a "shrug" from me -- I can look around at documentation for the primitives we use today
but if those -- already being semaphores -- are still leading to a bunch of wakeups, then there's plausibly nothing we can do
well the thing we can do to fix that is what we'd do to fix the integration with the jobserver
which is that each rustc instances has a separate ipc mechanism connecting it to cargo
and cargo selects which rustc to wake up, only waking up one instead of all of them
like there would be N jobservers instead of just 1
but not literally jobservers because cargo has to react to a request for a token
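roughly this shape, if it helps (std channels standing in for whatever IPC we'd actually use; all names made up):
```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

struct Token;

struct Dispatcher {
    /// one Sender per rustc instance, keyed by a hypothetical id
    waiters: VecDeque<(usize, Sender<Token>)>,
    free_tokens: usize,
}

impl Dispatcher {
    /// a rustc asked for a token
    fn request(&mut self, id: usize, reply: Sender<Token>) {
        if self.free_tokens > 0 {
            self.free_tokens -= 1;
            let _ = reply.send(Token);
        } else {
            self.waiters.push_back((id, reply));
        }
    }

    /// a rustc gave a token back: wake exactly one waiter, not all
    fn release(&mut self) {
        match self.waiters.pop_front() {
            Some((_id, reply)) => { let _ = reply.send(Token); }
            None => self.free_tokens += 1,
        }
    }
}
```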
I thought you meant fixing windows without doing the cargo thing
@Alex Crichton so do you expect to invest time into trying to fix windows semaphores, and then if that fails, we invest into making cargo be the one to dispatch all jobserver operations?
oh sorry, so to clarify: I mean that we should "fix windows" in that we should find some solution that is on par, performance-wise, with Linux
I don't think it matters what that is, and it could be the semaphores we have today with a tweak, but we should fix it somehow
and then we can go from there to see how to unify all the implementations and land this all for real
I don't think I'll personally have time to work on this, but I can allocate time if necessary
Windows won't really get on par with Linux though. It's pretty known to be slow, especially with multithreading =P
makes sense; I can look at the flags to the semaphore primitives we use today at least
and then failing that take a look at making the cargo refactor
but I don't know how much the jobserver is affecting the times, because when I look at e.g. the cargo crate there is no other crate running and I still get longer times with 32 threads compared with 4 threads, at least on Windows
There's likely contention with thread spawning in Windows and inside rustc too (miri, symbols, spans, unknown stuff)
@Zoxc fwiw, all the contention during normal execution we were able to detect with some loose benchmarks wasn't really at the syscall level -- of course, this may just be indicative of either bad measurement or lack of good parallelism in the compiler
as a quick update it looks like there's no better primitive to be using on windows at least that I can see
(based on MS docs)
I guess the next step then is to start working on making cargo act as the jobserver for rustc instances... I'll start working on that soon, I suppose, though it feels quite painful