I don't think I have any updates
I have not spent time on looking into making Cargo act as the jobserver for rustc (vs. the pipe, etc)
@WG-parallel-rustc do we think it's worth meeting today, given this?
(not sure if others have things to talk about)
not from my side, last week I was doing some MIR stuff
okay, let's tentatively cancel, but if someone feels that we should meet, then please say so in the next hour
Cancelling sounds ok with me
Hey all, sorry I'm slow
but I'm also ok :)
that said, I see there's been a lot of activity, @simulacrum is it possible to summarize what's up?
I saw some discussion of possible bugs in the new rustc scheduler, or at least the jobserver integration with it?
Yeah, I can provide some summary
We've loosely concluded that the current jobserver, at least on linux, is showing fairly high contention inside the kernel (for read/write calls on the pipe), which we believe is due to ~all rustc threads waiting for a jobserver token getting woken up on every new token (vs. just one). Using a POSIX IPC semaphore here seems to resolve things, but is not viable due to being incompatible with make (and cmake, etc.).
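To make the contention concrete, here's a minimal (Unix-only) sketch of the make-style jobserver protocol: acquire a token with a blocking single-byte read, release it by writing the byte back. It uses a `UnixStream` socketpair as a stand-in for the real anonymous pipe (the fd pair make advertises via `MAKEFLAGS`), since std has no portable pipe constructor; on the real pipe, the acquire `read` is exactly the spot where the kernel wakes every waiter for each byte written, even though only one read can succeed.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Run `workers` jobs gated by `tokens` jobserver tokens; returns the
/// peak number of jobs that ran concurrently.
fn run(tokens: usize, workers: usize) -> usize {
    // Stand-in for the jobserver pipe; data written to one end is read
    // from the other.
    let (reader, writer) = UnixStream::pair().expect("socketpair");
    // Seed it with the available tokens, like `make -jN` pre-filling the
    // pipe with N-1 bytes (each child also holds one implicit token).
    (&writer).write_all(&vec![b'+'; tokens]).unwrap();

    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let mut r = reader.try_clone().unwrap();
            let mut w = writer.try_clone().unwrap();
            let (in_flight, peak) = (in_flight.clone(), peak.clone());
            thread::spawn(move || {
                // Acquire: blocking single-byte read. With hundreds of
                // waiters on a real pipe, every written byte wakes them all.
                let mut token = [0u8; 1];
                r.read_exact(&mut token).unwrap();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // pretend to compile
                in_flight.fetch_sub(1, Ordering::SeqCst);
                // Release: write the byte back so another waiter can run.
                w.write_all(&token).unwrap();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let peak = run(2, 8);
    assert!(peak <= 2, "jobserver should cap concurrency at the token count");
    println!("peak concurrency: {}", peak);
}
```

The single-byte reads here are the same shape rustc's threads issue against the shared pipe; the semaphore experiment avoids the thundering herd precisely because a semaphore wakes only one waiter per post.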
We also discovered that at least the new rayon branch is incorrectly releasing the implicit (and only) token held by a rustc process when it goes to sleep due to lack of work; we have a tentative fix planned that'll be within rustc itself essentially avoiding the bug by just not ever releasing the implicit token.
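The shape of that mitigation might look something like the following sketch (hypothetical names, not the actual rustc patch): a token count that starts at one for the implicit token and refuses to hand that last token back, so an idle rustc can only ever return tokens it explicitly acquired.

```rust
/// Hypothetical token bookkeeping for one rustc process. Starts at 1:
/// the implicit token every spawned child holds under the make protocol.
struct TokenPool {
    held: usize,
}

impl TokenPool {
    fn new() -> Self {
        TokenPool { held: 1 }
    }

    /// Record a token explicitly acquired from the jobserver.
    fn acquired(&mut self) {
        self.held += 1;
    }

    /// Called when a worker thread goes idle. Returns true if a token
    /// may be written back into the jobserver pipe; returns false for
    /// the implicit token, which is never released -- giving it back on
    /// sleep is exactly the bug being mitigated.
    fn try_release(&mut self) -> bool {
        if self.held > 1 {
            self.held -= 1;
            true
        } else {
            false
        }
    }
}
```

Under this rule a fully idle rustc still pins one token, which slightly over-reserves parallelism but keeps the process able to resume without re-reading the pipe.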
For the contention, the current plan is that rustc will no longer connect directly to the jobserver, but will instead call into Cargo (via some sort of pipe, possibly, or some other mechanism -- this is a bit unclear right now), and Cargo will issue the blocking reads/writes on the jobserver pipe. This should give us flexibility in terms of which rustcs get tokens and, more generally, gives us control over scheduling tokens across the rustcs we spawn.
That work is planned for me to do, but I've been semi-avoiding it so far since it sounds like a relatively large task (and I need to spend some design time writing up a spec before digging in, I think).
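Since the mechanism is explicitly still undecided, here is only an assumed sketch of the broker shape using in-process channels (the `Request` enum and channel transport are illustrative stand-ins, not the planned Cargo/rustc IPC): a single broker loop is the only code that ever touches the jobserver, and it, rather than the kernel, picks which waiting rustc gets the next token.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages a rustc "client" sends to the broker (illustrative only).
enum Request {
    Acquire(mpsc::Sender<u8>), // reply channel; broker sends the token byte
    Release(u8),               // hand a token back to the broker
}

/// Broker loop: the one place that would do blocking reads/writes on the
/// real jobserver pipe. `initial_tokens` stands in for bytes already
/// sitting in that pipe.
fn broker(initial_tokens: usize, rx: mpsc::Receiver<Request>) {
    let mut free = initial_tokens;
    let mut waiters: Vec<mpsc::Sender<u8>> = Vec::new();
    for req in rx {
        match req {
            Request::Acquire(reply) => {
                if free > 0 {
                    free -= 1;
                    let _ = reply.send(b'+');
                } else {
                    // The broker decides who runs next -- this is where
                    // the scheduling flexibility comes from.
                    waiters.push(reply);
                }
            }
            Request::Release(tok) => match waiters.pop() {
                Some(w) => {
                    let _ = w.send(tok);
                }
                None => free += 1,
            },
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let b = thread::spawn(move || broker(1, rx));

    // Client 1 takes the only token.
    let (r1, got1) = mpsc::channel();
    tx.send(Request::Acquire(r1)).unwrap();
    let tok = got1.recv().unwrap();

    // Client 2 must wait until client 1 releases.
    let (r2, got2) = mpsc::channel();
    tx.send(Request::Acquire(r2)).unwrap();
    assert!(got2.try_recv().is_err()); // no token available yet
    tx.send(Request::Release(tok)).unwrap();
    assert_eq!(got2.recv().unwrap(), b'+');

    drop(tx); // broker loop exits once all senders are gone
    b.join().unwrap();
}
```

The key property is that only the broker thread blocks on the pipe, so a new token wakes one thread instead of all T*N of them.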
@Santiago Pastorino is planning on doing the rayon bug mitigation, I think.
It seems like this might also impact folks attempting to run rustc outside of cargo
Though I guess we've never really "approved" of them using the jobserver, right?
Ah, so, in general the thought is that we can always fallback on the old mode
that's not too hard
(I remember at some point us -- or maybe just me -- thinking that it'd be nice if they could integrate into "standard" workflows)
it's just that the old mode might basically limit you to, say, 2-4 threads at most per rustc
the main motivation for interjecting cargo
is... well, what is it? :)
being able to have a richer vocab than just "give me and I'll block"?
having just one thread/process doing the writing into the jobserver pipe to mitigate the contention
> We've loosely concluded that the current jobserver, at least on linux, is showing fairly high contention inside the kernel (for read/write calls on the pipe), which we believe is due to ~all rustc threads waiting for a jobserver token getting woken up on every new token (vs. just one).
do we believe this is because of rayon's "wake-up-the-world" scheduler behavior?
I think, well, maybe, but mostly no -- this is more because linux itself will wake up all T*N threads (in the large core count case, up to ~800) when a single byte is written into the jobserver pipe, where they're all doing `read(&mut [_; 1])` basically
Ok. Yeah I was thinking that I didn't quite see how it could be linked to rayon since rayon threads are basically all just blocked
i.e., the rayon events will be going out to the threads, and if they were sleeping they'd wake up, but they're not sleeping, they're blocking waiting for jobserver events
it's only linked to rayon insofar as rayon currently doesn't quite support the "don't spawn T threads until you need them" / slow-start behavior, but that's most likely just exacerbating the problem rather than causing it
yes. the new scheduler in principle would help with that specific part of the problem perhaps
to be clear, ~all benchmarks were done with your fork of rayon
it won't help beyond the benchmarks
since they're already using it :)
I think that's all -- I don't think we've made other progress etc
thanks @simulacrum :heart: