Stream: t-compiler/wg-parallel-rustc

Topic: optimizing single-thread overhead


nikomatsakis (Oct 29 2019 at 19:13, on Zulip):

@nnethercote -- it occurred to me that you might be interested in helping to optimize the "single-threaded overhead" of rustc running in parallel mode?

/me tries to nerd-snipe :)

nnethercote (Oct 29 2019 at 21:46, on Zulip):

@nikomatsakis Possibly, but I've just done 2 solid months of rustc perf work and I kind of need to do some Firefox stuff for a bit :/

nikomatsakis (Oct 30 2019 at 15:33, on Zulip):

@nnethercote heh ok =)

nnethercote (Oct 30 2019 at 21:55, on Zulip):

How do I build a single-threaded parallel rustc? parallel-compiler = true in config.toml is obvious, but I can't see anything in there about the number of threads

nnethercote (Oct 30 2019 at 21:56, on Zulip):

rustc -Zthreads=N at runtime, perhaps

simulacrum (Oct 30 2019 at 21:57, on Zulip):

yes, the default is one thread today I believe, but I think it's good practice when benchmarking to always set -Zthreads
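Concretely, that setup looks something like this (the source file name is illustrative):

```
# config.toml ([rust] section), when building rustc itself:
parallel-compiler = true

# then benchmark with query parallelism pinned to one thread:
$ rustc -Zthreads=1 foo.rs
```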

simulacrum (Oct 30 2019 at 21:57, on Zulip):

note that this only affects threads during query-parallelism, not codegen parallelism

nnethercote (Oct 30 2019 at 22:00, on Zulip):

Yes, the default is 1 thread, per librustc/session/config.rs. Thanks

nnethercote (Oct 31 2019 at 04:48, on Zulip):

I did some basic comparisons between serial and parallel with one thread.

nnethercote (Oct 31 2019 at 04:49, on Zulip):

packed-simd shows the greatest slowdown for a check-clean build: about 7%

nnethercote (Oct 31 2019 at 04:49, on Zulip):

That 7% is dominated by the fact that get_query is more expensive

nnethercote (Oct 31 2019 at 04:53, on Zulip):

This is mostly due to sharding (which in a serial compiler doesn't really do anything) of the query cache. In particular, when a key is looked up it now gets hashed twice: once in get_shard_by_value to find the shard, and then again to look up the hash table within the shard.
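In pseudo-Rust, the lookup path looks roughly like this (a simplified sketch of the idea, not rustc's actual Sharded code):

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::sync::Mutex;

const SHARD_COUNT: usize = 32;

// Simplified stand-in for rustc's Sharded<T>.
struct Sharded<K, V> {
    hasher: RandomState,
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> Sharded<K, V> {
    fn get(&self, key: &K) -> Option<V> {
        // Hash #1: get_shard_by_value hashes the key to pick a shard.
        let mut h = self.hasher.build_hasher();
        key.hash(&mut h);
        let shard = &self.shards[h.finish() as usize % SHARD_COUNT];
        // Hash #2: HashMap::get hashes the very same key again internally.
        shard.lock().unwrap().get(key).cloned()
    }
}
```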

nnethercote (Oct 31 2019 at 04:54, on Zulip):

There is also some locking overhead, unsurprisingly.

nnethercote (Oct 31 2019 at 04:54, on Zulip):

I think the duplicate hashing is the biggest source of overhead.

nnethercote (Oct 31 2019 at 04:57, on Zulip):

It would be nice if we could compute the hash first, then say "get the shard/value using this hash value", but I don't think HashMap allows that.

nnethercote (Oct 31 2019 at 05:01, on Zulip):

Unless RawEntryBuilderMut can be used for this?

nnethercote (Oct 31 2019 at 05:06, on Zulip):

from_key_hashed_nocheck might do the trick

nnethercote (Oct 31 2019 at 05:12, on Zulip):

Indeed, I think RawEntryBuilder::from_key_hashed_nocheck does exactly what I want! And Sharded already has get_shard_by_hash. I will try this out tomorrow, see how it goes.
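For the record, here is a sketch of the hash-once pattern using hashbrown's raw-entry API as it existed at the time (later hashbrown versions replace it); the helper function and its signature are illustrative, not rustc's:

```rust
use hashbrown::hash_map::{HashMap, RawEntryMut};
use std::hash::{BuildHasher, Hash, Hasher};

// Look up `key`, inserting `make()` on a miss, hashing the key exactly once.
fn get_or_insert_with<K, V, S>(
    map: &mut HashMap<K, V, S>,
    key: K,
    make: impl FnOnce() -> V,
) -> V
where
    K: Hash + Eq,
    V: Clone,
    S: BuildHasher,
{
    // Compute the hash once up front. In rustc, this same hash value could
    // also be passed to Sharded::get_shard_by_hash to pick the shard.
    let mut hasher = map.hasher().build_hasher();
    key.hash(&mut hasher);
    let hash = hasher.finish();

    // from_key_hashed_nocheck reuses `hash` instead of re-hashing the key;
    // it only runs the Eq comparison against candidate entries.
    match map.raw_entry_mut().from_key_hashed_nocheck(hash, &key) {
        RawEntryMut::Occupied(entry) => entry.get().clone(),
        RawEntryMut::Vacant(entry) => {
            let v = make();
            entry.insert_hashed_nocheck(hash, key, v.clone());
            v
        }
    }
}
```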

Zoxc (Oct 31 2019 at 10:09, on Zulip):

Did you look at wall-times? 7% seems low compared to the last time this was benchmarked, unless non-check-clean builds have more overhead. Also, which processor did you use?

Zoxc (Oct 31 2019 at 10:32, on Zulip):

try_get hashes the key 3 times now. I tried reducing the 2 hashes to 1 before, but LLVM didn't inline things well, so it was a performance regression. Keys are usually integers too, which means hashing is pretty cheap.

nnethercote (Nov 01 2019 at 04:00, on Zulip):

Does every query type get its own QueryCache? I count 188 query types

nnethercote (Nov 01 2019 at 04:00, on Zulip):

And in the parallel compiler they are sharded into 32 pieces.

nnethercote (Nov 01 2019 at 04:00, on Zulip):

And each QueryCache has two hashmaps.

nnethercote (Nov 01 2019 at 04:01, on Zulip):

So that's a total of 188 * 32 * 2 = 12,032 hashmaps?

Zoxc (Nov 01 2019 at 04:25, on Zulip):

Yes

nikomatsakis (Nov 01 2019 at 20:39, on Zulip):

@nnethercote nice :)

nnethercote (Nov 01 2019 at 21:51, on Zulip):

@nikomatsakis: https://github.com/rust-lang/rust/pull/66013 is the PR that eliminates the repeated hashing

nnethercote (Nov 01 2019 at 21:52, on Zulip):

https://github.com/rust-lang/rust/pull/66012 is an interesting one -- I de-querified trivial_dropck_outlives, and it was a perf win for the serial compiler because the cost of querying was actually larger than the cost of re-computing the result. The benefit for the parallel compiler should be even larger, because querying costs more there
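The underlying trade-off, in miniature (the type and function here are hypothetical stand-ins, not rustc's actual definitions):

```rust
// A cheap structural check, standing in for trivial_dropck_outlives.
// Calling it directly costs a few instructions; routing it through the
// query system costs a key hash, a (sharded) lock, and a cache lookup on
// every call -- more than the computation itself.
enum SimpleTy {
    Bool,
    Int,
    Ref,
    Adt, // non-trivial case: would need the full analysis
}

fn trivially_ok(ty: &SimpleTy) -> bool {
    matches!(ty, SimpleTy::Bool | SimpleTy::Int | SimpleTy::Ref)
}
```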

nnethercote (Nov 01 2019 at 21:54, on Zulip):

FWIW, here are my notes on parallel overhead from yesterday:
- Overhead is almost all in get_query()
- Lock (atomic.rs, parking_lot) (outweighs the reduction in RefCell use)
- sharded.rs
- codegen/regalloc due to more code in get_query
- memcpy: QueryJob is bigger (144 bytes), memcpy is needed to initialize it
- Note: 188 query kinds × 32 shards = 6,016, with 2 HashMaps each: 12,032 HashMaps!
- __tls_get_addr: Rayon calls this a bit

nnethercote (Nov 01 2019 at 22:09, on Zulip):

Even in the serial compiler, get_query is hot; just the instructions for entering and exiting it often account for 2-3% of all instructions executed. But it's almost impossible to inline, because it has hundreds (thousands?) of call sites via the query getters

nnethercote (Nov 06 2019 at 22:44, on Zulip):

The whole query-based design of the compiler is a bit antithetical to parallelization -- everything centres around a giant chunk of shared state

nnethercote (Nov 06 2019 at 22:44, on Zulip):

the sharding helps, but still

nnethercote (Nov 06 2019 at 22:45, on Zulip):

I wonder if some per-thread caching might help

simulacrum (Nov 07 2019 at 01:07, on Zulip):

I definitely think we should explore whether e.g. lockless or eventually consistent hashmaps could be good for us -- a lot of the time, we're fine if we're not quite seeing the most current state for queries

simulacrum (Nov 07 2019 at 01:07, on Zulip):

(or at least most queries)

Hadrien Grasland (Nov 07 2019 at 17:17, on Zulip):

Am I correct that this is essentially append-only data? That's not the worst kind of shared state, because at least you don't have concurrent memory reclamation hell to deal with.

Hadrien Grasland (Nov 07 2019 at 17:18, on Zulip):

Well, for the hashtable elements at least. Bucket management would be another story...

simulacrum (Nov 07 2019 at 18:15, on Zulip):

we never delete from the hash tables and so forth, so I think we could plausibly come up with something without too much trouble :)

nnethercote (Nov 07 2019 at 22:44, on Zulip):

One thing that troubles me about the sharding is that it's one-size-fits-all, even though the various query types have very different properties

nnethercote (Nov 07 2019 at 22:44, on Zulip):

Also, it's not responsive to the number of threads.

nnethercote (Nov 07 2019 at 22:45, on Zulip):

In contrast, per-thread caching is responsive to the number of threads.
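A minimal sketch of that idea (all names hypothetical; real query keys and values are richer than u64 and String, and the shared cache would be sharded rather than a single Mutex):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Shared cache; a single Mutex keeps the sketch short.
static SHARED: OnceLock<Mutex<HashMap<u64, String>>> = OnceLock::new();

thread_local! {
    // Per-thread front cache. Query results are immutable once computed,
    // so a stale local copy is never wrong -- the data is append-only.
    static LOCAL: RefCell<HashMap<u64, String>> = RefCell::new(HashMap::new());
}

fn get_query(key: u64, compute: impl FnOnce() -> String) -> String {
    // Fast path: no lock, no shard selection; scales with the thread count.
    if let Some(v) = LOCAL.with(|l| l.borrow().get(&key).cloned()) {
        return v;
    }
    // Slow path: fill from the shared cache, then remember locally.
    let shared = SHARED.get_or_init(|| Mutex::new(HashMap::new()));
    let v = shared
        .lock()
        .unwrap()
        .entry(key)
        .or_insert_with(compute)
        .clone();
    LOCAL.with(|l| l.borrow_mut().insert(key, v.clone()));
    v
}
```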

Zoxc (Nov 13 2019 at 00:35, on Zulip):

I don't think sharding is very problematic. There doesn't seem to be much performance overhead to having too many shards, so we just have to make sure there are enough shards to reduce contention at the core counts it makes sense to run rustc with.

Zoxc (Nov 13 2019 at 00:38, on Zulip):

The query-based design is pretty bad for performance overall (except for incremental performance). I wonder if we can have "thread-local" queries which use thread-local state instead of global state. That would be useful for cheap queries (like symbol names). I would have to think hard about how it would affect recursion detection and incremental things, though.

nikomatsakis (Nov 14 2019 at 15:41, on Zulip):

To clarify something:

When we say that the query-based design is "bad" for parallelism, are we specifically referring to the centralized caches? It seems pretty obviously good for parallelism to me, in the sense that the more we can decompose the compiler into side-effect-free functions, the better. I find this a bit confusing; it seems like more a question of how the execution is arranged than of the query-based design itself.

Last update: Nov 17 2019 at 07:00 UTC