Stream: t-compiler

Topic: measuring parallel rustc


simulacrum (May 17 2019 at 14:45, on Zulip):

@pnkfelix I don't quite understand what you mean by that, no

pnkfelix (May 17 2019 at 14:45, on Zulip):

so you have a nested for loop

pnkfelix (May 17 2019 at 14:45, on Zulip):

with N cases in the outer loop and M cases in the inner one

pnkfelix (May 17 2019 at 14:45, on Zulip):

and my immediate reaction was

pnkfelix (May 17 2019 at 14:46, on Zulip):

it's not clear that we/you get much value in the short term from exploring all those cases

pnkfelix (May 17 2019 at 14:46, on Zulip):

instead, might be better to loop over all N thread values while varying job among { j=1, j=2 }, for example

simulacrum (May 17 2019 at 14:46, on Zulip):

ah, okay, yeah, that makes some sense

simulacrum (May 17 2019 at 14:47, on Zulip):

I think I'm also not quite sure yet what the "correct" / "interesting" -j/-t arguments are

pnkfelix (May 17 2019 at 14:47, on Zulip):

and then loop over all M job values while varying thread count over the most important cases (which are probably { t=1, t=4 }, is my guess)
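
The sampling plan pnkfelix sketches here (two linear sweeps instead of the full N x M grid) can be written out as follows; the concrete -j/-t values are illustrative, not taken from the discussion.

```python
# Instead of measuring the full N x M grid, sample two linear sweeps:
# all thread counts at a couple of job counts, plus all job counts at
# the most important thread counts. Values below are illustrative.
thread_counts = [1, 2, 4, 8, 16]   # candidate -t values (N of them)
job_counts = [1, 2, 4, 8, 16]      # candidate -j values (M of them)

full_grid = [(j, t) for j in job_counts for t in thread_counts]

# all thread counts while varying job among { j=1, j=2 } ...
sweep_t = [(j, t) for j in (1, 2) for t in thread_counts]
# ... and all job counts at the most important thread counts, { t=1, t=4 }
sweep_j = [(j, t) for j in job_counts for t in (1, 4)]

plan = sorted(set(sweep_t + sweep_j))
print(len(plan), "runs instead of", len(full_grid))  # 16 runs instead of 25
```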

pnkfelix (May 17 2019 at 14:47, on Zulip):

yeah I agree it's not clear what the interesting cases are

pnkfelix (May 17 2019 at 14:47, on Zulip):

but we can dive in gathering more data later

pnkfelix (May 17 2019 at 14:47, on Zulip):

I just don't want to swamp your machine time gathering lots of data for one crate

simulacrum (May 17 2019 at 14:47, on Zulip):

I'm going to try to get full N x M data on Cargo with lower counts since that's pretty fast -- Cargo builds in a couple minutes

pnkfelix (May 17 2019 at 14:47, on Zulip):

when we want to spend less time and get data for more crates

pnkfelix (May 17 2019 at 14:48, on Zulip):

okay, I can understand that.

simulacrum (May 17 2019 at 14:48, on Zulip):

I think that should be done in an hour or so

simulacrum (May 17 2019 at 14:48, on Zulip):

From there it seems viable that we'll/I'll be able to get a better sense of interesting cases and switch to 2xN + 2xM or something like that

nikomatsakis (May 17 2019 at 14:54, on Zulip):

I think @simulacrum the high order bit is that the "order of exploration" matters -- we probably want "breadth-first" more than "depth-first"

nikomatsakis (May 17 2019 at 14:54, on Zulip):

in other words, we'd rather see the "single thread overhead" and "two thread win" for all crates

nikomatsakis (May 17 2019 at 14:54, on Zulip):

before seeing the "four thread win" for any crate

nikomatsakis (May 17 2019 at 14:54, on Zulip):

at least that's my guess
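
The "breadth-first" ordering niko describes can be sketched as iterating over settings in the outer loop and crates in the inner one, so every crate's t=1 overhead and t=2 win is measured before any crate's t=4 win. Crate names and thread counts below are illustrative.

```python
# Breadth-first exploration: measure each thread-count setting across every
# crate before moving on to the next setting, rather than exhausting all
# settings for one crate at a time.
crates = ["cargo", "ripgrep", "servo", "cranelift"]  # illustrative
thread_counts = [1, 2, 4]

order = [(t, crate) for t in thread_counts for crate in crates]
print(order[0])  # the t=1 ("single thread overhead") runs come first
```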

simulacrum (May 17 2019 at 14:54, on Zulip):

hm, that's a good point

nikomatsakis (May 17 2019 at 14:54, on Zulip):

I feel like if we're regressing on 2 threads, that's a problem

nikomatsakis (May 17 2019 at 14:55, on Zulip):

and presumably it'll only get better from there (...famous last words...)

simulacrum (May 17 2019 at 14:56, on Zulip):

that would make sense

simulacrum (May 17 2019 at 14:56, on Zulip):

So we're thinking maybe -j{1,2,8,16} and -t{0,1,2} where -t0 is the single-threaded compiler?

simulacrum (May 17 2019 at 14:56, on Zulip):

or even less data than that

simulacrum (May 17 2019 at 14:56, on Zulip):

I could imagine not trying to optimize -j since presumably users won't set that

simulacrum (May 17 2019 at 14:57, on Zulip):

i.e., just do -j16 and -t{0,1,2}
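
The matrix being settled on here can be sketched as a fixed -j with a small sweep over thread counts. How "-t" is actually spelled on the real command line is left as an assumption; the strings below are just a dry run of the plan.

```python
# Sketch of the proposed measurement matrix: hold -j fixed at 16 and vary
# the compiler's internal thread count over {0, 1, 2}, where t=0 stands for
# the single-threaded compiler. The command strings are illustrative only.
jobs = 16
thread_counts = [0, 1, 2]

runs = [f"cargo build -j{jobs}  # t={t}" for t in thread_counts]
for cmd in runs:
    print(cmd)
```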

mw (May 17 2019 at 14:57, on Zulip):

that sounds reasonable

simulacrum (May 17 2019 at 14:58, on Zulip):

@mw which option -- the latter -j16 or the first bit?

mw (May 17 2019 at 14:58, on Zulip):

the latter

mw (May 17 2019 at 14:58, on Zulip):

t > j never makes sense anyway, right?

simulacrum (May 17 2019 at 14:59, on Zulip):

hm, presumably no

Zoxc (May 17 2019 at 14:59, on Zulip):

It does not

mw (May 17 2019 at 14:59, on Zulip):

except for testing the jobserver :)

simulacrum (May 17 2019 at 14:59, on Zulip):

okay, so the plan is ripgrep, servo, cranelift, maybe some more rustc measurements with -j16 and varying -t{0,1,2} for now

simulacrum (May 17 2019 at 15:00, on Zulip):

@pnkfelix @nikomatsakis does that sound good to you? Should we add anything?

simulacrum (May 17 2019 at 15:04, on Zulip):

And to what extent do we want -opt, -check, -debug runs?

simulacrum (May 17 2019 at 15:05, on Zulip):

Currently I've just been doing -opt runs but those are both slower and maybe not as important?

simulacrum (May 17 2019 at 15:05, on Zulip):

i.e., measuring LLVM to an extent

nikomatsakis (May 17 2019 at 16:10, on Zulip):

@simulacrum that sounds good -- re: -opt, -check, and -debug, I think we definitely want more than just -opt -- but it's a good question. I guess I would prioritize doing:

in that order.

nikomatsakis (May 17 2019 at 16:10, on Zulip):

Mostly because we know check is the best case

simulacrum (May 17 2019 at 16:10, on Zulip):

Okay -- I'm putting data in https://docs.google.com/spreadsheets/d/1vadQWQQqTODU1_cAENnUjLyXM6cxms-tiCf2kCiNGGM/edit?usp=sharing

nikomatsakis (May 17 2019 at 16:11, on Zulip):

I'm not actually sure if the order I wrote makes sense, I guess I could see an argument for opt, check, debug

nikomatsakis (May 17 2019 at 16:11, on Zulip):

worst, best, middle :)

simulacrum (May 17 2019 at 16:13, on Zulip):

mhm, makes sense

simulacrum (May 17 2019 at 16:14, on Zulip):

so far data collection is going fairly well

simulacrum (May 17 2019 at 16:15, on Zulip):

I also have potentially interesting non-timing data as a side-effect of using perf to collect:

cargo-t1-check:
 Performance counter stats for 'cargo +6f087ac1c17723a84fd45f445c9887dbff61f8c0 check' (2 runs):

     181022.616812      task-clock (msec)         #    4.248 CPUs utilized            ( +-  0.20% )
             17903      context-switches          #    0.099 K/sec                    ( +-  0.37% )
              4038      cpu-migrations            #    0.022 K/sec                    ( +-  0.35% )
           3348166      page-faults               #    0.018 M/sec                    ( +-  0.01% )
      669715893598      cycles                    #    3.700 GHz                      ( +-  0.19% )  (33.58%)
       87652401496      stalled-cycles-frontend   #   13.09% frontend cycles idle     ( +-  1.26% )  (33.51%)
       69564486867      stalled-cycles-backend    #   10.39% backend cycles idle      ( +-  0.33% )  (33.45%)
      595648635822      instructions              #    0.89  insn per cycle
                                                  #    0.15  stalled cycles per insn  ( +-  0.20% )  (33.38%)
      102489636524      branches                  #  566.170 M/sec                    ( +-  0.41% )  (33.27%)
        3548257867      branch-misses             #    3.46% of all branches          ( +-  0.14% )  (33.12%)
      315611174080      L1-dcache-loads           # 1743.490 M/sec                    ( +-  0.24% )  (33.05%)
         683586581      L1-dcache-load-misses     #    0.22% of all L1-dcache hits    ( +-  0.05% )  (33.05%)
                 0      LLC-loads                 #    0.000 K/sec                    (33.11%)
                 0      LLC-load-misses           #    0.00% of all LL-cache hits     (33.14%)
      145804411502      L1-icache-loads           #  805.449 M/sec                    ( +-  0.10% )  (33.20%)
       10244177108      L1-icache-load-misses                                         ( +-  0.49% )  (33.28%)
      312502970939      dTLB-loads                # 1726.320 M/sec                    ( +-  0.01% )  (33.37%)
         916415446      dTLB-load-misses          #    0.29% of all dTLB cache hits   ( +-  1.34% )  (33.44%)
      145367320817      iTLB-loads                #  803.034 M/sec                    ( +-  0.27% )  (33.45%)
         346043620      iTLB-load-misses          #    0.24% of all iTLB cache hits   ( +-  0.63% )  (33.48%)
           1437970      L1-dcache-prefetches      #    0.008 M/sec                    ( +-  0.02% )  (33.53%)
           3763609      L1-dcache-prefetch-misses #    0.021 M/sec                    ( +-  0.06% )  (33.61%)

      42.616428285 seconds time elapsed                                          ( +-  1.16% )

simulacrum (May 17 2019 at 16:15, on Zulip):

but not sure if any of it is useful/helpful so going to leave it out of spreadsheet for now

simulacrum (May 17 2019 at 20:00, on Zulip):

@Alex Crichton Do you have any insight into measuring jobserver "contention"? I'm interested in whether we can get better performance if we restrict rustc invocations but still have all cores "available" to the internal threading within rustc

simulacrum (May 17 2019 at 20:01, on Zulip):

basically say that only 4 rustc processes should run at once but each can use up to 4 cores (since I have 16)

Alex Crichton (May 17 2019 at 20:18, on Zulip):

@simulacrum not currently, the protocol is basically an IPC semaphore which means there's not really any data other than "can I get something at this point" and "I can release something at this point"

Alex Crichton (May 17 2019 at 20:18, on Zulip):

but no data on like "how many waiters are there"

Alex Crichton (May 17 2019 at 20:19, on Zulip):

we'd have to instrument rustc for more information like that (or make a better protocol)
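
The protocol Alex describes is the make-style jobserver: a pipe pre-loaded with tokens that acts as an IPC semaphore. A minimal sketch in Python, for illustration, of why there's no waiter count to observe:

```python
import os

# A pipe pre-loaded with tokens is the whole protocol: acquiring a job slot
# reads one byte, releasing writes it back. Nothing in the protocol records
# how many processes are blocked in the read, so "contention" is invisible
# without instrumenting the clients themselves.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"+" * 3)  # 3 tokens, roughly "-j4" (one slot is implicit)

token = os.read(read_fd, 1)   # acquire: blocks when no tokens are left
# ... run one parallel job here ...
os.write(write_fd, token)     # release: return the token to the pipe
```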

simulacrum (May 17 2019 at 20:19, on Zulip):

hm, okay -- I seemed to recall some graphs from you when we were just adding this but I guess I'm remembering the wrong thing or creating memories

simulacrum (May 17 2019 at 20:20, on Zulip):

thanks!

simulacrum (May 17 2019 at 21:12, on Zulip):

@nikomatsakis (and maybe others?) -- we have data for cargo, ripgrep, servo, and cranelift now (opt, debug, check, with -t{0,1,2} and -j16)

simulacrum (May 17 2019 at 21:12, on Zulip):

https://docs.google.com/spreadsheets/d/1vadQWQQqTODU1_cAENnUjLyXM6cxms-tiCf2kCiNGGM/edit#gid=0

simulacrum (May 17 2019 at 21:13, on Zulip):

Overall -t2 is pretty much a win, if slight, across the board

simulacrum (May 17 2019 at 21:13, on Zulip):

and -t1 is almost always a ~4% regression (ranging between ~1 second to 30 seconds at max)

simulacrum (May 17 2019 at 21:14, on Zulip):

the 30 second case is possibly spurious but I've gotten pretty consistent measurements around that point after ~4 runs so probably not, even though it's odd how much it stands out

simulacrum (May 17 2019 at 21:14, on Zulip):

it might point at some underlying problem that would be good to solve (cc @Zoxc )

simulacrum (May 18 2019 at 03:15, on Zulip):

Gathered data to build curves for essentially all meaningful values of -t while still keeping -j constant for Cargo

simulacrum (May 18 2019 at 03:15, on Zulip):

https://docs.google.com/spreadsheets/d/1vadQWQQqTODU1_cAENnUjLyXM6cxms-tiCf2kCiNGGM/edit#gid=1621301791&range=A1

mw (May 20 2019 at 09:05, on Zulip):

(I adapted the Y-axis in the charts to start at zero)

nikomatsakis (Jun 10 2019 at 19:59, on Zulip):

@simulacrum thanks, those charts are super interesting. The data there is raw time values?

simulacrum (Jun 10 2019 at 20:08, on Zulip):

@nikomatsakis Yes - in seconds.

nikomatsakis (Jun 10 2019 at 20:09, on Zulip):

OK. It doesn't look too impressive, does it? :)

nikomatsakis (Jun 10 2019 at 20:10, on Zulip):

But I guess these are the whole crate graph numbers, the "just tip crate" numbers look better

simulacrum (Jun 10 2019 at 20:45, on Zulip):

Yes, indeed

simulacrum (Jun 10 2019 at 20:45, on Zulip):

@nikomatsakis also, shaving 5-6 seconds off e.g. Cargo compilation isn't actually that bad

simulacrum (Jun 10 2019 at 20:46, on Zulip):

plus this is presumably still with "single-threaded" looking code (i.e., we have some global locks that Zoxc mentioned a while back)

Last update: Nov 22 2019 at 05:05 UTC