Stream: t-compiler/wg-parallel-rustc

Topic: Gathering parallel data


nikomatsakis (May 13 2019 at 19:24, on Zulip):

So we talked about trying to gather various bits of parallel data -- do we think we'll be able to do that by this Friday? It seems ok if not, but we should just put off the meeting

simulacrum (May 14 2019 at 01:02, on Zulip):

@nikomatsakis I'm waiting on flags from @Zoxc; I hope to start the implementation groundwork tomorrow morning. I am optimistic for Friday having relevant stats, though maybe not all (depending in part on how quickly I can get it done and get eddyb and you to run relevant commands)

Zoxc (May 14 2019 at 01:06, on Zulip):

You can use -Z threads=n for rustc and -j n for cargo

simulacrum (May 14 2019 at 01:16, on Zulip):

@Zoxc Should I use both simultaneously? Or test with -j{1,2,4,8} and -Zthreads={1,2,4,8} with both varying independently?

simulacrum (May 14 2019 at 01:16, on Zulip):

er, simultaneously meaning "same n"

Zoxc (May 14 2019 at 02:33, on Zulip):

Good question. I think just setting -Zthreads will suffice. Using -j1 might be more accurate if measuring overhead, though
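The sweep being discussed might look like the following. -Zthreads and -j are the real flags named in this thread; passing -Zthreads through RUSTFLAGS and the 1/2/4/8 grid are illustrative assumptions:

```shell
# Hypothetical benchmarking sweep: vary rustc's -Zthreads and cargo's -j
# together ("same n"). This only echoes the commands one would run against
# a parallel-enabled rustc; it does not perform any builds itself.
for n in 1 2 4 8; do
  echo "RUSTFLAGS=-Zthreads=$n cargo build -j$n"
done
```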

simulacrum (May 14 2019 at 03:03, on Zulip):

okay, will investigate

simulacrum (May 14 2019 at 03:04, on Zulip):

Depending on time will maybe try to do both

lwshang (May 14 2019 at 14:00, on Zulip):

Are we testing parallel-rustc on arbitrary crates/Rust projects, or can we narrow the list to some representative ones? This question comes from a confusion of mine: when we say we need to experiment on parallel-rustc, does that mean we need to guarantee that parallel-rustc will not add significant overhead to compiling any crate? Or do we primarily want parallel-rustc to perform consistently on any kind of hardware (CPU architecture, number of cores/threads, etc.)? I believe we should do both. The two questions are orthogonal to each other, so we can address them separately.

simulacrum (May 14 2019 at 14:10, on Zulip):

We are testing it for now on perf.r-l.o and hopefully some other hardware (but that is a bit up in the air), with the set of crates and the hardware we use there

lwshang (May 14 2019 at 14:12, on Zulip):

That makes sense. I want to understand the plan further: when we move on to the public experiment phases, I believe we will want comparison data for both kinds of testing I just mentioned.

simulacrum (May 14 2019 at 14:16, on Zulip):

Oh, yeah, definitely -- I don't think we've established exactly what our goals are but we would want to test on different hardware/OS/etc and such

lwshang (May 14 2019 at 14:18, on Zulip):

Have we gained enough confidence from perf.rlo to move on to the following phases?

simulacrum (May 14 2019 at 14:28, on Zulip):

It's our best bet at this point for initial evaluation and should be relatively representative IMO

simulacrum (May 14 2019 at 14:28, on Zulip):

at least for these purposes

lwshang (May 14 2019 at 18:40, on Zulip):

@Zoxc Which rustc/cargo builds have the parallel-rustc feature? I've tried nightly and a local build of the master branch; neither has it.

Zoxc (May 14 2019 at 18:42, on Zulip):

You need to set [rust] parallel-compiler = true in config.toml

lwshang (May 14 2019 at 18:58, on Zulip):

Should I clean the previous build? I just tried adding that config and building with ./x.py build -i --stage 1, but the feature is still not available. I could probably get it working with a completely fresh compilation of rustc, but that takes a long time (about 1 hour on my i5 desktop), and I want to avoid redundant compilation.

Zoxc (May 14 2019 at 19:04, on Zulip):

Yeah, you need to do x.py clean
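Collected from the exchange above, the build setup would be (the setting and commands are the ones quoted in this thread):

```toml
# config.toml at the root of the rust-lang/rust checkout
[rust]
parallel-compiler = true
```

After changing this, ./x.py clean followed by a rebuild (e.g. ./x.py build -i --stage 1) is needed for the flag to take effect.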

lwshang (May 14 2019 at 19:26, on Zulip):

While I'm waiting for the compilation, I filled in some info in the HackMD (https://hackmd.io/KmHulVmISKu7L2HmNgbPgg?edit) which @nikomatsakis created last week. Niko mentioned that "Mark" would provide some guidance on configuration/setup. Which Mark should we ask? I'm new here and don't want to bother the wrong person.

Zoxc (May 14 2019 at 19:36, on Zulip):

That would be @simulacrum

simulacrum (May 14 2019 at 19:38, on Zulip):

I've verified that the parallel compiler works (i.e., can build all of perf) and have initial stats here https://perf.rust-lang.org/compare.html?start=80e7cde2238e837a9d6a240af9a3253f469bb2cf&end=6f087ac1c17723a84fd45f445c9887dbff61f8c0 but they're mostly meaningless (i.e., no -Zthreads and such was specified -- so I guess this is the -j4 with -Zthreads=1 but not confident that's correct). I will be collecting further stats over tonight and tomorrow hopefully

Zoxc (May 14 2019 at 19:42, on Zulip):

That should be -j8 + Zthreads=8, https://perf.rust-lang.org/compare.html?start=80e7cde2238e837a9d6a240af9a3253f469bb2cf&end=6f087ac1c17723a84fd45f445c9887dbff61f8c0&stat=wall-time

simulacrum (May 14 2019 at 19:47, on Zulip):

ah okay so the default is num_cpus for threads, got it

nikomatsakis (May 15 2019 at 19:15, on Zulip):

@simulacrum so is there something you want me or @eddyb to test?

nikomatsakis (May 15 2019 at 19:15, on Zulip):

and do we think we'll have the set of data we were looking for ready? (or some subset?)

simulacrum (May 15 2019 at 20:21, on Zulip):

We should have at least a very good set of data

simulacrum (May 15 2019 at 20:21, on Zulip):

I want to confirm that the data I've currently gathered is correct (i.e., I'm gathering good data and not just spinning CPU cycles) and hope to get you two a script by tonight that'll dump relevant data in a directory

simulacrum (May 15 2019 at 20:22, on Zulip):

(But even if we don't get data from @eddyb or you I think we will have enough to discuss)

simulacrum (May 15 2019 at 23:37, on Zulip):

@nikomatsakis @eddyb If I could get you to run https://gist.github.com/Mark-Simulacrum/b5dd679b03ad3b979ffad2dfea2a7efc that'd be great; I would edit this line to go up to the number of cores you have

simulacrum (May 15 2019 at 23:38, on Zulip):

this will take approximately 2ish hours per "run" at least on perf.r-l.o's collector; the script as-is takes around 10 hours or so

simulacrum (May 15 2019 at 23:39, on Zulip):

if we don't get data though I've updated the doc with the data I've already collected

simulacrum (May 15 2019 at 23:39, on Zulip):

https://hackmd.io/KmHulVmISKu7L2HmNgbPgg?both#Measurements

simulacrum (May 15 2019 at 23:39, on Zulip):

That is not whole-crate-graph since I haven't had time to do that but is at least a start

mw (May 16 2019 at 10:03, on Zulip):

thanks for gathering that data, @simulacrum!

mw (May 16 2019 at 10:04, on Zulip):

I think we need whole-crate-graph data too, since that is the case where the jobserver comes into play

simulacrum (May 16 2019 at 11:59, on Zulip):

Yeah, I'm working on some manual runs

simulacrum (May 16 2019 at 12:02, on Zulip):

though I suspect the reality might be that whole-crate graph is quite rare since most people are incrementally recompiling 1-2 crates at the end

mw (May 16 2019 at 12:21, on Zulip):

for local builds yes, but for CI builds it's different

simulacrum (May 16 2019 at 12:24, on Zulip):

true -- I could imagine us having Cargo try to detect how many crates it's building and set different -Z flags based on that

simulacrum (May 16 2019 at 12:24, on Zulip):

(well, -Cthreads= I guess)

lwshang (May 16 2019 at 15:26, on Zulip):

Just read those compelling metrics. I have a question about the threshold: do we really care about the overhead when running with -Zthreads=1? The answer matters for how the compiler should behave in the future.

In my vision, the compiler would automatically pick a sensible number of threads for parallel queries. As long as that number is greater than 1, parallel-rustc will provide a significant performance gain to users.

When rustc is running on a one-core/one-thread machine, can we make the compiler run as the current "single-threaded" version instead of the real -Zthreads=1, which introduces overhead? Here I'm only talking about technical capability. Of course, if the user explicitly sets -Zthreads=1, we should let the compiler run as the real parallel compiler even though only one thread can be used.

If that is the case, then I believe we can definitely move on to the public experimental phases. Users would only encounter the bad overhead when they force the compiler to run with -Zthreads=1, and we can call that out in the announcement blog post.

nikomatsakis (May 16 2019 at 18:00, on Zulip):

@lwshang we do care about the overhead with -Zthreads=1, if for no other reason than sometimes there is only one core avail

nikomatsakis (May 16 2019 at 18:00, on Zulip):

In order to eliminate that overhead, we'd effectively have to ship two copies of the compiler

nikomatsakis (May 16 2019 at 18:00, on Zulip):

One compiled to avoid locks etc

nikomatsakis (May 16 2019 at 18:00, on Zulip):

and one not

nikomatsakis (May 16 2019 at 18:00, on Zulip):

not likely feasible

nikomatsakis (May 16 2019 at 18:00, on Zulip):

besides, the more we can optimize that overhead, the better for everyone

nikomatsakis (May 16 2019 at 18:01, on Zulip):

@simulacrum i'm giving your executable a try :)

nikomatsakis (May 16 2019 at 18:01, on Zulip):

your script, that is

lwshang (May 16 2019 at 18:59, on Zulip):

As shown in the HackMD file, -Zthreads=1 introduces an overhead of about 10% to 20%. Our consensus on the threshold suggests that such overhead should be smaller than 5%. May I therefore say that the parallel-rustc feature is not yet ready, judging by its performance with -Zthreads=1?

Further, I have several follow-up questions. Should we now focus on improving the implementation? Do we know how to improve it? Can we complete such improvements under our current roadmap?

simulacrum (May 16 2019 at 23:17, on Zulip):

@Zoxc mentioned some global locks that I suspect we can eliminate or at least look at measuring and reducing contention

simulacrum (May 16 2019 at 23:18, on Zulip):

I plan to have rustc stage 0 timing data with/without the parallel compiler as bootstrap and with various thread configurations; I am currently doing the same for Cargo. Depending on when that completes, I will try to get servo, cranelift, and/or ripgrep started so that it's ready by tomorrow morning

nikomatsakis (May 17 2019 at 00:49, on Zulip):

@simulacrum presuming that this process worked (don't see any obvious errors) the data files are at http://smallcultfollowing.com/perf-data.tar.gz

simulacrum (May 17 2019 at 03:22, on Zulip):

Will get those uploaded tonight or tomorrow morning. Thanks!

simulacrum (May 17 2019 at 03:23, on Zulip):

I've retrieved the file if you want to take it down.

nikomatsakis (May 17 2019 at 13:49, on Zulip):

@simulacrum cool; is the data workable? :)

simulacrum (May 17 2019 at 13:50, on Zulip):

I think so -- working on getting links into the doc now

nikomatsakis (May 17 2019 at 13:50, on Zulip):

Nice

nikomatsakis (May 17 2019 at 13:50, on Zulip):

Thanks :heart: for all your help on this

nikomatsakis (May 17 2019 at 13:50, on Zulip):

Further, I have several follow-up questions. Should we now focus on improving the implementation? Do we know how to improve it? Can we complete such improvements under our current roadmap?

BTW, @lwshang, I think these are exactly the kinds of questions I hope we will discuss this week! :)

simulacrum (May 17 2019 at 13:53, on Zulip):

it looks like some crates didn't compile on your machine for whatever reason (e.g., script servo) but we at least have some data

simulacrum (May 17 2019 at 13:53, on Zulip):

interestingly it looks like the 14 core machine is... slower? than perf's collection server

simulacrum (May 17 2019 at 13:53, on Zulip):

i.e. this is the perf vs your machine single-thread (i.e., current master) compiler https://perf.rust-lang.org/compare.html?start=single-threaded&end=niko-single-thread&stat=wall-time

mw (May 17 2019 at 13:55, on Zulip):

interesting

Zoxc (May 17 2019 at 13:55, on Zulip):

Spawning 28 threads probably accounts for the hello-world overhead. Clock speeds are likely lower too

Zoxc (May 17 2019 at 13:56, on Zulip):

Then there's some wins due to LLVM using 14 cores

simulacrum (May 17 2019 at 13:56, on Zulip):

maybe, yeah

Zoxc (May 17 2019 at 13:58, on Zulip):

Did niko only bench with 8 threads?

simulacrum (May 17 2019 at 13:58, on Zulip):

up to 8 threads

simulacrum (May 17 2019 at 13:58, on Zulip):

(but even that is quite a bit slower than on perf.rlo, so ...)

simulacrum (May 17 2019 at 13:59, on Zulip):

i.e. this is perf 8 threads vs Niko 8 threads: https://perf.rust-lang.org/compare.html?start=parallel-rustc-8&end=niko-parallel-rustc-8&stat=wall-time

Zoxc (May 17 2019 at 14:00, on Zulip):

There seems to be a 0.3s startup cost on niko's machine

Zoxc (May 17 2019 at 14:01, on Zulip):

I was mostly interested in contention though, probably should use 1 CGU and all cores for that

Zoxc (May 17 2019 at 14:03, on Zulip):

@mw Did you peek at https://github.com/rust-lang/rust/pull/60035 yet?

Zoxc (May 17 2019 at 14:03, on Zulip):

That would be helpful for incremental+parallel performance, but it's quite large

mw (May 17 2019 at 14:04, on Zulip):

only superficially. I figured I'd wait until half the commits weren't called "wip" anymore :)

Zoxc (May 17 2019 at 14:04, on Zulip):

Well I can squash them all together to one =P

mw (May 17 2019 at 14:04, on Zulip):

perf results looked promising, but not entirely uncontroversial

mw (May 17 2019 at 14:05, on Zulip):

I did review the "preliminaries" PR though

Zoxc (May 17 2019 at 14:07, on Zulip):

The key idea is to use the same DepNodeIndex across sessions; this avoids taking locks when marking nodes as green, since we no longer need to allocate new indices.

mw (May 17 2019 at 14:09, on Zulip):

If you think it's ready to review, I can take a look next week

lwshang (May 21 2019 at 21:50, on Zulip):

I'm wondering how cargo's -j flag interacts with parallel-rustc. Correct me if I have any wrong or inaccurate understanding below.

My intuition: -j controls the maximum number of concurrent jobs during cargo build, where jobs are crates that can be compiled at the same time (no inter-dependency). Meanwhile, parallel-rustc tries to make the compilation within a single crate concurrent.

On an N-core machine, we run cargo build -jN. If at some point during the build there are N dependencies that can be compiled at the same time (according to the dep graph), then roughly all of the machine's computational power is already in use. In that case, we should not expect a performance gain from enabling parallel-rustc, and the overhead the feature introduces may cause a regression. When the dep graph makes it impossible to run N concurrent jobs, we can expect parallel-rustc to provide a significant performance improvement.

mw (May 22 2019 at 08:25, on Zulip):

@lwshang yes, that is correct.

Last update: Nov 17 2019 at 06:55UTC