I was wondering, is this an alright place to discuss the perf impact of parallel rustc nowadays? Or is an issue a better place? I just built a parallel rustc locally and compared it with the previous nightly, and I was curious to try out
-Z timings and see what CPU usage looked like during the build, but it was unfortunately slower: parallel rustc took 74s for a full release build (nothing cached) compared to 62s on nightly
This seems like a perfect place.
That sounds somewhat expected, though -- I suspect we're pretty poorly tuned and overeager at grabbing jobserver tokens
You might see more success with e.g. RUSTFLAGS="-Zthreads=2"
Ok cool, so yeah taking a look at things I'm watching CPU usage on nightly and it's "all green", which I think means it's all in userspace
w/ parallel rustc it's "mostly red", which I think means most of the time is spent in the kernel
perf shows a huge amount of time in the kernel
trying to track down what's where
I think the jobserver management may also be wrong?
when I ran a cargo build I got "45 (jobs=28 ncpu=28)"
er, that means that there were at most 45 rustc instances running in parallel
but the default, -j28, should have made it such that no more than 28 rustc instances were running
nightly does indeed not exceed 28
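For context on why exceeding that looks like a bug: the jobserver protocol is supposed to enforce exactly this cap. Here's a toy model of what -j28 should guarantee — the `Tokens` type and `peak_concurrency` helper are made up for illustration; cargo actually implements the GNU make jobserver protocol via the `jobserver` crate, not anything like this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical model of jobserver accounting: `n` tokens exist, and each
// simulated rustc must hold one while it "compiles", so at most `n` of
// them can be running at any instant.
struct Tokens {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Tokens {
    fn new(n: usize) -> Self {
        Tokens { free: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }
    fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

// Run `jobs` fake rustcs against `n` tokens, report peak concurrency.
fn peak_concurrency(n: usize, jobs: usize) -> usize {
    let tokens = Arc::new(Tokens::new(n));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (t, r, p) = (tokens.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                t.acquire();
                let now = r.fetch_add(1, Ordering::SeqCst) + 1;
                p.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(1)); // the "compile"
                r.fetch_sub(1, Ordering::SeqCst);
                t.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    // 45 rustcs queued against a -j28 jobserver: never more than 28 live.
    let peak = peak_concurrency(28, 45);
    assert!(peak <= 28, "saw {} concurrent rustcs", peak);
    println!("peak concurrency: {}", peak);
}
```

So seeing 45 concurrent rustcs with jobs=28 means tokens are being created or leaked somewhere, not just distributed badly.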
yeah a huge amount of time is spent acquiring/releasing tokens
from perf: `- 21.09% rustc_rayon_core::sleep::Sleep::sleep`
I am not really surprised -- I forget what our current management strategy is, but it might be something like "let's release/reacquire tokens whenever a rayon thread goes idle", which presumably happens quite often
is this something I should open an issue for?
I think at this point I would say no
we're not really at the point where performance is a (sufficient) concern
the current jobserver management strategy is unknown (I just pulled one out of thin air)
I think Zoxc might know it but not even sure about that
@Alex Crichton It might be good to open an issue about the jobserver management strategy though -- I don't know how we should do that, and discussing on an issue seems good
in particular I think we're going to need some more intelligent server than the current model allows to evenly distribute tokens between rustc instances
ok, I'll open an issue
e.g. if we have something like 8 cores we probably want 4 rustcs each with 2 threads (approximately) rather than 1 rustc with 8 threads, I'd imagine
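Just to put numbers on that idea (purely illustrative, not anything rustc or cargo implements): if the jobserver were to split `cores` worth of tokens evenly across the rustcs currently running, each would get about cores/procs threads, with a floor of 1:

```rust
// Illustrative only: evenly split `cores` tokens across `procs`
// concurrently running rustc processes, at least 1 thread each.
fn threads_per_rustc(cores: usize, procs: usize) -> usize {
    (cores / procs.max(1)).max(1)
}

fn main() {
    // 8 cores, 4 rustcs -> 2 threads each, rather than 1 rustc with 8.
    assert_eq!(threads_per_rustc(8, 4), 2);
    assert_eq!(threads_per_rustc(8, 1), 8);
    // oversubscribed: floor of 1 thread per process
    assert_eq!(threads_per_rustc(8, 16), 1);
    println!("ok");
}
```

The hard part is that the set of running rustcs changes constantly, so this split would have to be renegotiated dynamically, which the current make-style protocol doesn't really support.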
so additionally, after focusing the call graph a bit more
80% of rustc's time is spent in
which I think is spawning threads
does rustc spawn threads on demand or immediately?
hm, I thought it was immediately, but I think we leave it up to rayon -- maybe rayon is spinning up threads if they're idle?
@cuviper might know
rayon creates threads for the global pool on first use
this is a small snippet of the perf output, sorted by self-time
or if you create a manual ThreadPool, it spawns new threads each time you do so
hm ok, I'll open an issue for that
@cuviper so to be clear we would expect num_cpus threads per ThreadPool creation, approximately, right?
@Alex Crichton I wonder if this is the jobserver threads -- IIRC, there's a thread spawn in that crate?
the number is tunable, but it defaults to number of cpus, yes
@simulacrum yes there's one thread in the jobserver crate
but I suspect the 28 threads spawned by each rustc is dwarfing that
this is an aggregate of all rustc processes near the start of the build
(that profile I linked above)
and so if you spawn 28 rustc's that each spawn 28 threads
that's a lot of threads to spawn very quickly
ah -- so we're spawning like 800 threads :)
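Spelling out the arithmetic behind that "like 800":

```rust
fn main() {
    // 28 rustc processes, each spinning up a ~28-thread rayon pool
    // (plus one jobserver helper thread per process on top of that)
    let total = 28 * 28;
    assert_eq!(total, 784);
    println!("{} threads", total);
}
```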
the crates at the beginning of the crate graph are also very quick to compile, on the order of a hundred or so ms
so we're thrashing thread creation quite a lot
I wonder if we could get rayon to sort of "slow start" thread spawning
e.g. create the pool with size 1 and then if it detects work after 1 second grow to num_cpus
we've talked about dynamic threads before, but that's ... challenging
I guess you're not talking fully dynamic though, just lazy
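Something like this toy pool is what "lazy, not fully dynamic" could look like, as a sketch — `LazyPool` and its methods are invented for illustration, and rayon's internals look nothing like this: no threads exist up front, and a worker is only spawned (up to a cap) when a job arrives and nobody already spawned is idle.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct State {
    jobs: VecDeque<Box<dyn FnOnce() + Send>>,
    idle: usize,
    spawned: usize,
}

// Toy "lazy start" pool: threads are spawned only when a job is queued
// and no existing worker is idle; idle workers are reused, never killed.
struct LazyPool {
    state: Mutex<State>,
    cv: Condvar,
    max: usize,
}

impl LazyPool {
    fn new(max: usize) -> Arc<Self> {
        Arc::new(LazyPool {
            state: Mutex::new(State { jobs: VecDeque::new(), idle: 0, spawned: 0 }),
            cv: Condvar::new(),
            max,
        })
    }

    fn spawned(&self) -> usize {
        self.state.lock().unwrap().spawned
    }

    fn execute(pool: &Arc<Self>, job: impl FnOnce() + Send + 'static) {
        let mut st = pool.state.lock().unwrap();
        st.jobs.push_back(Box::new(job));
        // Only pay for a new thread when no idle worker can take the job.
        if st.idle == 0 && st.spawned < pool.max {
            st.spawned += 1;
            let pool = Arc::clone(pool);
            thread::spawn(move || pool.worker());
        }
        pool.cv.notify_one();
    }

    fn worker(self: Arc<Self>) {
        loop {
            let job = {
                let mut st = self.state.lock().unwrap();
                while st.jobs.is_empty() {
                    st.idle += 1;
                    st = self.cv.wait(st).unwrap();
                    st.idle -= 1;
                }
                st.jobs.pop_front().unwrap()
            };
            job();
        }
    }
}

// Push 100 cheap jobs through a pool capped at 4, report threads used.
fn run_demo() -> usize {
    let pool = LazyPool::new(4);
    assert_eq!(pool.spawned(), 0); // lazy: nothing spawned before any work

    let done = Arc::new(AtomicUsize::new(0));
    for _ in 0..100 {
        let done = Arc::clone(&done);
        LazyPool::execute(&pool, move || {
            done.fetch_add(1, Ordering::SeqCst);
        });
    }
    while done.load(Ordering::SeqCst) < 100 {
        thread::sleep(Duration::from_millis(1));
    }
    pool.spawned()
}

fn main() {
    let spawned = run_demo();
    assert!(spawned <= 4);
    println!("100 jobs ran on {} worker thread(s)", spawned);
}
```

A quick crate that finishes in ~100ms would never grow the pool much, while a big crate would ramp up toward the cap, which is roughly the behavior we'd want here.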
28 threads is also probably not the ideal number to use for rustc due to contention, etc.
maybe rustc needs a heuristic max on its threads?
just like codegen-units uses 16 regardless
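i.e. something in the spirit of the codegen-units default, purely as a sketch (this is not an actual rustc heuristic):

```rust
// Hypothetical default, mirroring how codegen-units settles on 16:
// use all cores on small machines but stop scaling past 16 threads,
// where contention likely eats the gains anyway.
fn default_query_threads(ncpus: usize) -> usize {
    ncpus.min(16)
}

fn main() {
    assert_eq!(default_query_threads(8), 8);
    assert_eq!(default_query_threads(28), 16);
    println!("ok");
}
```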
We also should have a minimum on job tokens that a rustc holds
like a rustc should hold at least 1 token always.
avoids spawning 56 rustcs on an 8-core system.
btw I remain fairly unconvinced that rayon is a good fit for us, in general
much as I love rayon of course :)
I think it might eventually be a good fit, but I sort of suspect we might also do better with some simpler setup for the time being
I think we might get away with managing the thread pool ourselves but otherwise using rayon, though I don't know how possible/realistic that is