@nnethercote -- it occurred to me that you might be interested in helping to optimize the "single-threaded overhead" of rustc running in parallel mode?
/me tries to nerd-snipe :)
@nikomatsakis Possibly, but I've just done 2 solid months of rustc perf work and I kind of need to do some Firefox stuff for a bit :/
@nnethercote heh ok =)
How do I build a single-threaded parallel rustc?
parallel-compiler = true in
config.toml is obvious, but I can't see anything in there about the number of threads
rustc -Zthreads=N at runtime, perhaps
yes, the default is one thread today I believe, but I think it's good practice when benchmarking to always set -Zthreads
note that this only affects threads during query-parallelism, not codegen parallelism
Yes, default is 1 thread in librustc/session/config.rs, thanks
I did some basic comparisons between serial and parallel with one thread.
packed-simd has the greatest slowdown for a
check-clean build, of about 7%
That 7% is dominated by the fact that
get_query is more expensive
This is mostly due to the sharding of the query cache (which in a serial compiler doesn't really do anything). In particular, when a key is looked up it now gets hashed twice: once in
get_shard_by_value to find the shard, and then again to look up the hash table within the shard.
There is also some locking overhead, unsurprisingly.
I think the duplicate hashing is the biggest source of overhead.
It would be nice if we could compute the hash first, then say "get the shard/value using this hash value", but I don't think
HashMap allows that.
RawEntryBuilderMut can be used for this?
from_key_hashed_nocheck might do the trick
Indeed, I think
RawEntryBuilder::from_key_hashed_nocheck does exactly what I want! And
Sharded already has
get_shard_by_hash. I will try this out tomorrow, see how it goes.
Did you look at wall-times? 7% seems low compared to the last time this was benchmarked, unless non-check-clean builds have more overhead. Also, which processor do you use?
try_get hashes the key 3 times now. I tried reducing the 2 hashes to 1 before, but LLVM didn't inline stuff well, so it was a performance regression. Keys are usually integers too, which means hashing is pretty cheap.
Does every query type get its own QueryCache? I count 188 query types
And in the parallel compiler they are sharded into 32 pieces,
And each QueryCache has two hashmaps.
So that's a total of 188 * 32 * 2 = 12,032 hashmaps?
@nnethercote nice :)
@nikomatsakis: https://github.com/rust-lang/rust/pull/66013 is the PR that eliminates the repeated hashing
https://github.com/rust-lang/rust/pull/66012 is an interesting one -- I de-querified
trivial_dropck_outlives and it was a perf win for the serial compiler because the cost of querying was actually larger than the cost of re-computing. And the benefit for the parallel compiler could only be higher, because the cost of querying is higher in the parallel compiler
FWIW, here are my notes on parallel overhead from yesterday:
- Overhead is almost all in get_query()
- Lock (atomic.rs, parking_lot) (outweighs the reduction in RefCell use)
- codegen/regalloc due to more code in get_query
- memcpy: QueryJob is bigger (144 bytes), memcpy is needed to initialize it
- Note: 188 query kinds, 32 shards each = 6016, 2 HashMaps each: 12,032 HashMaps!
- __tls_get_addr: Rayon calls this a bit
Even in the serial compiler, get_query is hot, and just the instruction counts for entering/exiting it are often 2-3% of all instructions. But it's almost impossible to inline because it has hundreds (thousands?) of call sites, via the query getters
The whole query-based design of the compiler is a bit antithetical to parallelization -- everything centres around a giant chunk of shared state
the sharding helps, but still
I wonder if some per-thread caching might help
I definitely think we should explore whether e.g. lockless or eventually consistent hashmaps could be good for us -- a lot of the time, we're fine if we're not quite seeing the most current state for queries
(or at least most queries)
Am I correct that this is essentially append-only data? This is not the worst kind of shared state because at least you don't have the concurrent memory reclamation hell to take care of.
Well, for the hashtable elements at least. Bucket management would be another story...
we never delete from the hash tables and so forth so I think we could plausibly be able to come up with something without too much trouble :)
One thing that troubles me about the sharding is that it's one-size-fits-all, even though the various query types have very different properties
Also, it's not responsive to the number of threads.
In contrast, per-thread caching is responsive to the number of threads.
I don't think sharding is very problematic. There doesn't seem to be much performance overhead to having too many shards, so we just have to make sure there are enough shards to reduce contention at the core counts it makes sense to run rustc with.
The query-based design is pretty bad for performance overall (except for incremental performance). I wonder if we can have "thread-local" queries which use thread-local state instead of global state. It would be useful for cheap queries (like symbol names). I would have to think hard about how that would affect recursion detection and incremental things, though.
To clarify something:
When we say that the query-based design is "bad" for parallelism, we are specifically referring to the centralized caches? It seems pretty obviously good for parallelism to me in the sense that the more we can compose the compiler into side-effect-free functions, the better? I find this a bit confusing. It seems like it's more a question of how to arrange the execution than the query-based design itself.