Stream: t-compiler/wg-self-profile

Topic: perf integration


simulacrum (Sep 20 2019 at 00:56, on Zulip):

@Wesley Wiser So, if I want to store the self-profile results in what needs to be long-term backwards compatible -- should I store the JSON results of summarize summarize --json?

simulacrum (Sep 20 2019 at 00:56, on Zulip):

or is that a pretty unstable format, and I should load that up and try and parse it down to "what perf wants"

simulacrum (Sep 20 2019 at 00:57, on Zulip):

or is there some non-summary method of getting events out? Should we be storing the 'raw' event files?

simulacrum (Sep 20 2019 at 00:58, on Zulip):

I'll probably come back to this tomorrow or so

Wesley Wiser (Sep 20 2019 at 10:33, on Zulip):

I don't think we should store the raw events files because they're very large. The output of summarize summarize --json should be pretty stable so I think we can store it as is.

simulacrum (Sep 20 2019 at 10:53, on Zulip):

Okay, sounds good. Should be fairly easy to implement that.

simulacrum (Sep 21 2019 at 00:50, on Zulip):

Okay, I have an initial draft (no diff, just view) of the self-profile data live. Any recent (currently, I think last two) commits should have self-profile data recorded. If you go to the comparison page and pop open one of the drop downs, any of the links in the second column should work -- e.g. https://perf.rust-lang.org/detailed-query.html?commit=97e58c0d32bcb8730f8246d25f3d2fa8092b450a&benchmark=inflate-check&run_name=baseline%20incremental. Two problems I noted: we always have 0s in the incremental column and seem to get no reuse (i.e., cache hits are 0). I've not yet investigated if this is perf doing something wrong and not running in incremental mode (unlikely, given lower instruction counts) or self-profile data being incorrectly reported.

@Wesley Wiser @mw -- I've ticked the boxes in the tracking issue per the above. Tomorrow's agenda is to try and rework how perf stores and publishes data as right now we've ~20x increased each commit's data with the self-profile info, which is unacceptable (we've previously been getting closer to this, but only just now hit this point). I don't anticipate this being a concern for self-profile WG, though, as we can take the hit for now. I'm interested in hearing feedback on the table (it's sortable on every column!) and if we want to spend time implementing a diff view. I'm not sure how a diff view should look, either.

simulacrum (Sep 21 2019 at 00:51, on Zulip):

(the above will be live in 15 minutes or so once I finish compiling)

simulacrum (Sep 21 2019 at 00:53, on Zulip):

I suspect the lack of incremental might be a perf collection bug but I'll investigate tomorrow probably

simulacrum (Sep 21 2019 at 12:05, on Zulip):

Aha, cache hits are not recorded by default

simulacrum (Sep 21 2019 at 12:05, on Zulip):

I've now passed -Zself-profile-events=all as a (possibly temporary) stopgap

Wesley Wiser (Sep 21 2019 at 13:30, on Zulip):

Yeah, we found recording cache hits incurs a big performance hit

simulacrum (Sep 21 2019 at 14:53, on Zulip):

hm, interesting

simulacrum (Sep 21 2019 at 14:53, on Zulip):

I would not expect that

simulacrum (Sep 21 2019 at 14:54, on Zulip):

and based on the perf graphs it appears to be at most a few percent -- at least for perf, I think acceptable

simulacrum (Sep 21 2019 at 15:17, on Zulip):

hm, so it looks like I'm also hitting some ICEs when enabling all events -- e.g.

thread 'rustc' panicked at 'assertion failed: pos.checked_add(num_bytes).unwrap() <= self.mapped_file.len()', /cargo/registry/src/gi
thub.com-1ecc6299db9ec823/measureme-0.3.0/src/mmap_serialization_sink.rs:38:9
simulacrum (Sep 21 2019 at 15:17, on Zulip):

@Wesley Wiser is that because we're only allocating 1 GB of output? Seems like we should allocate more to avoid this

simulacrum (Sep 21 2019 at 15:18, on Zulip):

or come up with some sort of resizing / splitting scheme

Wesley Wiser (Sep 22 2019 at 14:33, on Zulip):

Wesley Wiser is that because we're only allocating 1 GB of output? Seems like we should allocate more to avoid this

Yeah, that seems likely to me. IIRC, there isn't a huge downside to allocating more space because we shrink the file to the correct size at the end after we've recorded all of the data.

Wesley Wiser (Sep 22 2019 at 14:35, on Zulip):

Ok, I'm not remembering correctly. For *nix, we mmap() an anonymous file and then create a real file at the end and dump everything to it. So yeah, increasing it is probably not a big deal.

Wesley Wiser (Sep 22 2019 at 14:36, on Zulip):

Since it's lazily allocated, it shouldn't actually cost us anything unless we end up using all of the space.

mw (Sep 23 2019 at 08:02, on Zulip):

Awesome work, @simulacrum!

mw (Sep 23 2019 at 08:02, on Zulip):

Wow, more than 1 GB of events :)

mw (Sep 23 2019 at 08:03, on Zulip):

Increasing that shouldn't be a problem, as @Wesley Wiser stated.

mw (Sep 23 2019 at 08:04, on Zulip):

However, we don't want to record all events, since the query cache hits don't contribute anything to the summarized data but increase the overhead by something like 400%.

mw (Sep 23 2019 at 08:06, on Zulip):

I think we should change the default event filter instead

mw (Sep 23 2019 at 08:07, on Zulip):

this would be part of my event-kind review (which I hope to get to later this week)

mw (Sep 23 2019 at 08:08, on Zulip):

until then it's fine if the incremental cache hits only shows zeros

mw (Sep 23 2019 at 08:10, on Zulip):

Tomorrow's agenda is to try and rework how perf stores and publishes data as right now we've ~20x increased each commit's data with the self-profile info, which is unacceptable

Can you elaborate on that? What is this additional data?

mw (Sep 23 2019 at 08:16, on Zulip):

Correction: Query cache hits do contribute to the summarized data. I wonder if it is worth the overhead though.

mw (Sep 23 2019 at 08:17, on Zulip):

And the possible skew we get from recording them: a query cache hit is a very cheap operation, so the recording overhead will be a much larger percentage of it's actual execution time then for other kinds of events.

mw (Sep 23 2019 at 08:41, on Zulip):

regarding the details view:

- awesome :)
- great that one can sort by column
- I think it would be good to have a column for percentage of total compile time
- would it make sense to automatically omit columns that are all zeros? Are those columns present in the JSON output?
- in the future, once we have it, it would be cool to also show some metadata about the benchmark, like total execution time (need to implement https://github.com/rust-lang/measureme/issues/15 for that)

mw (Sep 23 2019 at 08:48, on Zulip):

regarding a comparison view:

simulacrum (Sep 23 2019 at 11:41, on Zulip):

Tomorrow's agenda is to try and rework how perf stores and publishes data as right now we've ~20x increased each commit's data with the self-profile info, which is unacceptable

Can you elaborate on that? What is this additional data?

The self profile data itself -- basically all the string "labels" for the fields and such - since we're storing it as JSON

simulacrum (Sep 23 2019 at 11:41, on Zulip):

I didn't get a chance to investigate too much though (got distracted with other things) -- I don't think it's a problem for the time being though as I said

simulacrum (Sep 23 2019 at 11:42, on Zulip):

And the possible skew we get from recording them: a query cache hit is a very cheap operation, so the recording overhead will be a much larger percentage of it's actual execution time then for other kinds of events.

Sure -- I also think that the current query hits aren't really too useful since there's not much we can do about them (it's also unclear how many are due to "poor incremental" or what)

simulacrum (Sep 23 2019 at 11:42, on Zulip):

So I'll switch back to the default set of events I guess -- and probably drop the now-always-0 columns from the display

simulacrum (Sep 23 2019 at 11:43, on Zulip):

regarding a comparison view:

Would it be sufficient for now to just always compare with the same "benchmark + run" pair (basically, not support commit A, benchmark B, run C with commit A1, benchmark D, run E)? That should be relatively easy ("just another column")

simulacrum (Sep 23 2019 at 12:14, on Zulip):

Correction: Query cache hits do contribute to the summarized data. I wonder if it is worth the overhead though.

Hm, so actually -- if we're collecting a total invocation count, and cache misses, we get cache hits for free, right?

simulacrum (Sep 23 2019 at 12:15, on Zulip):

@mw In any case, though, displaying them is somewhat pointless -- they're equivalent to total count - misses or so

simulacrum (Sep 23 2019 at 12:15, on Zulip):

which I _think_ can be reached by sorting? not quite sure

simulacrum (Sep 23 2019 at 12:15, on Zulip):

I'll remove the hit columns for now though

mw (Sep 23 2019 at 14:41, on Zulip):

Would it be sufficient for now to just always compare with the same "benchmark + run" pair (basically, not support commit A, benchmark B, run C with commit A1, benchmark D, run E)? That should be relatively easy ("just another column")

@simulacrum, I don't think I understand what you mean here

simulacrum (Sep 23 2019 at 14:42, on Zulip):

The inputs to the query compare page can just be commit (A, B) + the run identifier (benchmark name, run name)

simulacrum (Sep 23 2019 at 14:42, on Zulip):

Right?

simulacrum (Sep 23 2019 at 14:45, on Zulip):

@mw anyway, here's what I have so far: https://perf.rust-lang.org/detailed-query.html?commit=4ff32c07da9d97e6dc315a4a5c9ffbb797cb27bb&base_commit=ef906d0e3c50ba0833c5a135d705ab4f6bd93aea&benchmark=syn-opt&run_name=clean

simulacrum (Sep 23 2019 at 14:47, on Zulip):

which is what you get when clicking the RHS of any compare page, LHS leads to a page that's basically what we had before (i.e., no deltas)

simulacrum (Sep 23 2019 at 14:47, on Zulip):

Sorting for the delta columns and color coding isn't implemented since both are based on percentages and percentages are a bit hard

mw (Sep 23 2019 at 14:48, on Zulip):

yes, this looks good. I think for now, it's fine that one is not able to compare e.g. clean vs patched incremental

simulacrum (Sep 23 2019 at 14:48, on Zulip):

yeah, that seemed ... a bit nonsensical to me, anyway -- like, they'd ~always have different queries and such

mw (Sep 23 2019 at 14:48, on Zulip):

it might make sense in the future (not sure) but for spotting regressions you'd always want to compare the same kind of run

simulacrum (Sep 23 2019 at 14:49, on Zulip):

in theory though you'd not use this to spot the regression, it'd be more so to say "what actually regressed", right?

mw (Sep 23 2019 at 14:49, on Zulip):

I think it would be good to show the deltas as percentages

simulacrum (Sep 23 2019 at 14:49, on Zulip):

i.e., the regression would be spotted on the compare or the graph page and then you click in for more info

mw (Sep 23 2019 at 14:49, on Zulip):

yes, "pinpoint the regression" would be more accurate

simulacrum (Sep 23 2019 at 14:50, on Zulip):

One concern with percentages is that I don't quite know how to show the times

simulacrum (Sep 23 2019 at 14:50, on Zulip):

right now the "raw data" is seconds + nanoseconds which we're converting to the x.xxx format; I show the "raw" data in the tooltip

simulacrum (Sep 23 2019 at 14:50, on Zulip):

with the percents idea I'm not quite sure how to show the "raw" data then

simulacrum (Sep 23 2019 at 14:50, on Zulip):

I guess I could just show both somehow?

mw (Sep 23 2019 at 14:51, on Zulip):

I would just show x.xxx, I think

mw (Sep 23 2019 at 14:52, on Zulip):

the measurements are not that accurate anyway, I guess?

simulacrum (Sep 23 2019 at 14:53, on Zulip):

probably? no idea :) higher variance, at least, I presume

mw (Sep 23 2019 at 14:54, on Zulip):

So, I would do any calculations with the raw data, but render things in x.xxx format

simulacrum (Sep 23 2019 at 14:55, on Zulip):

makes sense -- shouldn't be too hard

mw (Sep 23 2019 at 14:55, on Zulip):

I'm excited by how far along this already is, btw :D

mw (Sep 23 2019 at 14:56, on Zulip):

very nice!

mw (Sep 23 2019 at 14:56, on Zulip):

regarding the links that lead to the details view: I think the LHS and the RHS show both lead to a non-comparison page

simulacrum (Sep 23 2019 at 14:57, on Zulip):

@mw do we care about sorting by delta? currently we calculate them frontend so a bit harder to implement

mw (Sep 23 2019 at 14:57, on Zulip):

I would expect the LHS to give the details view of the baseline commit

mw (Sep 23 2019 at 14:57, on Zulip):

and the RHS to the details view of the new commit

simulacrum (Sep 23 2019 at 14:58, on Zulip):

It's "hard" to recover the baseline commit then -- right now the RHS will have a link to the details view for just the RHS (and just the LHS)

mw (Sep 23 2019 at 14:58, on Zulip):

and then there could be a compare details button to the right, or the change percentage could be the link

simulacrum (Sep 23 2019 at 15:01, on Zulip):

Yeah, we could make the subtraction the link to the diff, I like that

mw (Sep 23 2019 at 15:03, on Zulip):

regarding sorting by delta: It might be useful. but if it's significantly harder to have that then we don't need it initially

mw (Sep 23 2019 at 15:04, on Zulip):

color coding (or otherwise highlighting) the bigger changes becomes more important then

Wesley Wiser (Sep 23 2019 at 15:09, on Zulip):

I think it would also help readability a bit if the was styled a little more.

Wesley Wiser (Sep 23 2019 at 15:09, on Zulip):

For example

Wesley Wiser (Sep 23 2019 at 15:09, on Zulip):

pasted image

simulacrum (Sep 23 2019 at 15:09, on Zulip):

@mw okay, deltas are now percents and I made the 3 compare columns be links to 'just left' 'just right' and 'diff'

simulacrum (Sep 23 2019 at 15:10, on Zulip):

@Wesley Wiser I'm no great shakes at design -- presuming you did that by editing CSS though I'm happy to take a PR

simulacrum (Sep 23 2019 at 15:10, on Zulip):

(or a pasted diff :)

Wesley Wiser (Sep 23 2019 at 15:10, on Zulip):

I've still got it open in the Inspector... one sec

Wesley Wiser (Sep 23 2019 at 15:12, on Zulip):
perf.css
table {
+     border-collapse: collapse;
}
+ #primary-table td {
+     border: 1px lightgrey solid;
+ }
+ #primary-table tr:nth-child(2n+1) {
+     background-color: #FBFBFB;
}
Wesley Wiser (Sep 23 2019 at 15:13, on Zulip):

I'll open a PR.

simulacrum (Sep 23 2019 at 15:13, on Zulip):

ah, let's actually add it inline in the detailed-query.html

simulacrum (Sep 23 2019 at 15:13, on Zulip):

but looks good

simulacrum (Sep 23 2019 at 15:19, on Zulip):

@Wesley Wiser merged and deployed :)

Wesley Wiser (Sep 23 2019 at 15:19, on Zulip):

Thanks :)

Wesley Wiser (Sep 23 2019 at 15:20, on Zulip):

This is :fire: Really nice work @simulacrum!!

simulacrum (Sep 23 2019 at 15:21, on Zulip):

it'd be great to somewhat fast track the mmap size fix so that we stop breaking some of the benchmarks on perf, btw (or let me know that it's hard and I can disable self-profile on them, that's just painful work since it's going to get deleted in a bit)

Wesley Wiser (Sep 23 2019 at 15:21, on Zulip):

Do you think 4gb would be sufficient? If you have any kind of gut level feeling

simulacrum (Sep 23 2019 at 15:24, on Zulip):

heh, unfortunately, no

simulacrum (Sep 23 2019 at 15:24, on Zulip):

I think a stop the world pause as we push out events when we hit the GB marker would work though

simulacrum (Sep 23 2019 at 15:24, on Zulip):

Based on past efforts designing something like that is a bit non-trivial though

simulacrum (Sep 23 2019 at 15:25, on Zulip):

I would be cautious about allocating the 4gb in one big mmap, too, that seems likely to not work / be bad

Wesley Wiser (Sep 23 2019 at 15:25, on Zulip):

Gotcha

simulacrum (Sep 23 2019 at 15:25, on Zulip):

(although I guess mmap doesn't guarantee the _pages_ underlying your virtual memory are consecutive, so it's probably fine?)

Wesley Wiser (Sep 23 2019 at 15:26, on Zulip):

Maybe we can have a "standby" mmap and when we fill up, then we'll switch to that one. When we do the switch, we'll also allocate another mmap for standby.

simulacrum (Sep 23 2019 at 15:27, on Zulip):

sure, that'd work fine I think -- not sure if that's all that much easier to implement -- we want to avoid locks around the mmaps in the happy path

mw (Sep 23 2019 at 15:37, on Zulip):

so, massively overallocating shouldn't be a problem. that's what the MMU is for.

mw (Sep 23 2019 at 15:37, on Zulip):

but I also suspect that we'll stay under 1gb once we don't record cache hits anymore

mw (Sep 23 2019 at 15:39, on Zulip):

I don't want to do some kind of ring buffer because it requires locking

mw (Sep 23 2019 at 15:40, on Zulip):

if we decided to switch to something like a ring buffer though, then I wouldn't want to use mmaps anymore because there's no need for that then

simulacrum (Sep 23 2019 at 15:50, on Zulip):

Makes sense. I've reverted to just the default collection list for now, we'll see soon if that fixes the failures

mw (Sep 23 2019 at 15:52, on Zulip):

some observations:

mw (Sep 23 2019 at 15:54, on Zulip):

maybe everything below a certain threshold should just be grayed out anyway

mw (Sep 23 2019 at 15:55, on Zulip):

if it takes under 1 microsecond, I don't care :)

mw (Sep 23 2019 at 15:55, on Zulip):

Another note:

mw (Sep 23 2019 at 15:56, on Zulip):

/me needs to run :wave:

simulacrum (Sep 23 2019 at 22:03, on Zulip):

I arbitrarily chose 0.05 as minimum realistic absolute delta (i.e., anything less is noise) and we don't color any percents that are representing anything less than this delta. This isn't right for anything but time, but seems fine -- other measurements are counts, which are always going to be non-noisy since we should be deterministic, though possibly modulo parallelism, not sure

mw (Sep 24 2019 at 08:15, on Zulip):

sounds good! query invocations should be deterministic even with parallelism enable, yes

simulacrum (Sep 24 2019 at 11:05, on Zulip):

I was thinking we might have a mode of "race to completion" between two threads if both need results and they're not available but I guess there's probably not much point as we're waiting for the sameish amount of time anyway

Wesley Wiser (Sep 24 2019 at 21:09, on Zulip):

Are the diffs right? They kind of look backwards to me.

https://perf.rust-lang.org/detailed-query.html?commit=8be3622ad74755484fd9b9e401d0ee96837be244&base_commit=b4ba2a3953ea9ec28f01c314be315d46673bd782&benchmark=clap-rs-debug&run_name=clean

Reads like an increase of 46.9% but if you check, it's actually a decrease.

Wesley Wiser (Sep 24 2019 at 21:09, on Zulip):

Base time:

LLVM_emit_obj 7.779 4 4 0.000

New time:

LLVM_emit_obj 5.296 4 4 0.000

simulacrum (Sep 24 2019 at 21:11, on Zulip):

Uh, yes, that was a bug -- should be fixed now.

simulacrum (Sep 24 2019 at 21:11, on Zulip):

@Wesley Wiser ^

Wesley Wiser (Sep 24 2019 at 21:12, on Zulip):

Just as I was starting to look at the code... it's already fixed. Fast as always @simulacrum :)

simulacrum (Sep 24 2019 at 21:13, on Zulip):

I have no real idea if the percentages are "good" though -- I'm never sure what my denominator should be

simulacrum (Sep 24 2019 at 21:13, on Zulip):

I guess in part because it depends on what you want a percent _of_

Wesley Wiser (Sep 24 2019 at 21:14, on Zulip):

Yeah, I had the same question when I added the diff subcommand to summarize

Wesley Wiser (Sep 24 2019 at 21:14, on Zulip):

It just outputs the differences and orders results by the absolute values

Wesley Wiser (Sep 24 2019 at 21:14, on Zulip):

https://github.com/rust-lang/measureme/tree/master/summarize#the-diff-sub-command

simulacrum (Sep 24 2019 at 21:16, on Zulip):

The easy way out :)

pnkfelix (Sep 26 2019 at 08:36, on Zulip):

if you ask me, the problem is using percent difference rather than just presenting the ratio

pnkfelix (Sep 26 2019 at 08:39, on Zulip):

i.e., I personally would rather see 0.60 than -66.66% or -40%

pnkfelix (Sep 26 2019 at 08:42, on Zulip):

(this is me starting from a hypothetical change from 100.0 to 60.0)

pnkfelix (Sep 26 2019 at 08:43, on Zulip):

see also http://mathcentral.uregina.ca/QQ/database/QQ.09.06/s/carolyn1.html

pnkfelix (Sep 26 2019 at 08:43, on Zulip):

In particular "percentage change. This is the change from an early value to a later value, and by convention, it is done with respect to the earlier value."

simulacrum (Sep 26 2019 at 11:40, on Zulip):

Seems reasonable, and link is super handy -- I'll look into that and try and migrate perf over entirely

mw (Sep 26 2019 at 13:42, on Zulip):

@simulacrum Is perf.rlo doing self-profiling with the default event filter at the moment?

simulacrum (Sep 26 2019 at 13:43, on Zulip):

I believe so yes

simulacrum (Sep 26 2019 at 13:43, on Zulip):

I can switch that to a non-default though (maybe we want a dedicated one for perf?)

mw (Sep 26 2019 at 14:02, on Zulip):

Thanks for the info. The default is fine for now.

Last update: Nov 17 2019 at 07:15UTC