Stream: t-compiler/wg-self-profile

Topic: event filter / level of detail


mw (Apr 10 2019 at 09:47, on Zulip):

I think it makes sense to allow some control over what events the compiler records in order to keep profiling overhead as low as possible.

mw (Apr 10 2019 at 09:48, on Zulip):

one obvious example is query keys. recording those will have quite a bit of overhead.

mw (Apr 10 2019 at 09:51, on Zulip):

but another example that is already relevant in the current version is query cache hits. As @Wesley Wiser mentioned here they make up a huge portion of events overall, but for most kinds of analysis they are not needed.

Wesley Wiser (Apr 10 2019 at 11:39, on Zulip):

Yeah, I think we should offer some kind of control here. One option would be to allow complete control over the event types that will be captured a la -Z profiler-events=query,activity,incremental,etc. Another (better?) option would be some kind of "slider" which offers less control but doesn't require as much knowledge to use: -Z profiler-setting={fast|normal|detailed}.

mw (Apr 10 2019 at 11:56, on Zulip):

we could do both where fast, normal, detailed just map to a defined set of actual event-types

mw (Apr 10 2019 at 11:57, on Zulip):

e.g. -Z profiler-events=query,activity is exactly the same as -Z profiler-events=fast

mw (Apr 10 2019 at 11:58, on Zulip):

we'd have to decide what happens if one does -Z profiler-events=fast,incremental

mw (Apr 10 2019 at 11:58, on Zulip):

creating the union of all given sets would be a straightforward option

Wesley Wiser (Apr 10 2019 at 11:59, on Zulip):

"Adding" them together makes sense to me

Wesley Wiser (Apr 10 2019 at 11:59, on Zulip):

Yeah

Wesley Wiser (Apr 10 2019 at 12:00, on Zulip):

(btw, can I get you to sign off on https://github.com/rust-lang/measureme/pull/21 which bumps the version number and adds some missing Cargo.toml fields?)

simulacrum (Apr 10 2019 at 12:01, on Zulip):

@Wesley Wiser signed off

Wesley Wiser (Apr 10 2019 at 12:01, on Zulip):

Thanks :smile:

mw (Apr 10 2019 at 12:01, on Zulip):

merged

mw (Apr 11 2019 at 12:53, on Zulip):

here is a thought on the implementation: it would be good the filter mechanism could also take care of the case where the profiler is completely disabled.

mw (Apr 11 2019 at 12:54, on Zulip):

that is, instead of storing the profiler as Option<Arc<SelfProfiler>> it would something like SelfProfiler { filter_mask: FilterBitFlags, Arc<SelfProfilerData> }

mw (Apr 11 2019 at 12:54, on Zulip):

where filter_mask would be all zeros for the disabled-case

mw (Apr 11 2019 at 12:55, on Zulip):

that way we would avoid checking twice whether we should record an event

mw (Apr 11 2019 at 12:59, on Zulip):

in order to implement this we would need to be able to create a Profiler "null object", which is simple for the MmapSerializationSink (allocate the memory maps as usual, just don't write anything to disk at the end) but trickier for the FileSerializationSink because that expects to create the file on disk immediately

Wesley Wiser (Apr 11 2019 at 13:08, on Zulip):

On *nix, perhaps FileSerializationSink could write to /dev/null?

Wesley Wiser (Apr 11 2019 at 13:09, on Zulip):

I believe Windows has a NUL device as well

mw (Apr 11 2019 at 13:11, on Zulip):

Yes, I've thought of /dev/null. If Windows has an equivalent then that would work

Wesley Wiser (Apr 11 2019 at 13:13, on Zulip):

https://stackoverflow.com/questions/313111/is-there-a-dev-null-on-windows

mw (Apr 11 2019 at 13:19, on Zulip):

OK, this works on Linux: let _file = std::fs::File::create("/dev/null").unwrap();

mw (Apr 11 2019 at 13:19, on Zulip):

i.e. there's no error reported.

mw (Apr 11 2019 at 13:24, on Zulip):

libstd seems to use Path::new("NUL"): https://github.com/rust-lang/rust/blob/ee1474acc43fbf657b0fc910c139cf63cef34dc8/src/libstd/sys/windows/process.rs#L284-L301

simulacrum (Apr 11 2019 at 13:35, on Zulip):

Hm, so one concern with just writing to /dev/null is that we're still paying some cost for the event recording

simulacrum (Apr 11 2019 at 13:36, on Zulip):

Unless I misunderstood

Wesley Wiser (Apr 11 2019 at 13:39, on Zulip):

I think @mw is suggesting that when the profiler is disabled, that state will be the same as manually disabling all of the event types. So no events will actually be recorded. However, the FileSerializationSink creates its profile files when it is constructed. We can resolve that when the profiler is disabled by having it "write" to /dev/null in which case it doesn't matter because no events will be recorded.

simulacrum (Apr 11 2019 at 13:40, on Zulip):

Ah, okay, that sounds better. Though I still half expect that the cost of option.is_none() might be lower than some set of bit operations for each event

Wesley Wiser (Apr 11 2019 at 13:40, on Zulip):

Definitely something to test

mw (Apr 11 2019 at 13:44, on Zulip):

It might well be that the optimization is not worth it but the idea is to replace

if let Some(ref profiler) = xyz.profiler {
   if (profiler.filter_mask & SOME_EVENT) != 0 {
      profiler.record_some_event(...);
   }
}

with

   if (xyz.profiler.filter_mask & SOME_EVENT) != 0 {
      profiler.record_some_event(...);
   }

and set filter_mask to 0 when the profiler is disabled completely.

mw (Apr 11 2019 at 13:47, on Zulip):

it's probably not something we'd have to do in the initial implementation of event filtering :)

simulacrum (Apr 11 2019 at 13:48, on Zulip):

I would indeed be interested if that would be slower (technically, CPU etc could optimize if filter_mask is 0) and the & is quite cheap these days...

mw (Apr 11 2019 at 13:49, on Zulip):

I suspect it not to make much of a difference

mw (Apr 11 2019 at 13:49, on Zulip):

then again, doing an if that always evaluates to true is cheap too these days

mw (Apr 11 2019 at 13:50, on Zulip):

so maybe I'm trying to be a bit too clever here

simulacrum (Apr 11 2019 at 13:50, on Zulip):

eh, at least the code will be a bit clearer

mw (Apr 11 2019 at 13:51, on Zulip):

yes

mw (Apr 11 2019 at 13:51, on Zulip):

that might be more important

mw (Apr 11 2019 at 13:57, on Zulip):

one thing I do suggest doing in the first implementation is storing the filter_mask next to the Option<Arc<SelfProfiler>> instead of in SelfProfiler itself. That way the filtering can be done without having to dereference the pointer, leading to better cache behaviour

mw (Apr 11 2019 at 14:00, on Zulip):

or in other words, split SelfProfiler into something like

struct SelfProfilerInternal {
    ... stuff
}

#[derive(Clone, Send, Sync)]
pub struct SelfProfiler {
    event_filter_mask: u32,
    internal: Option<Arc<SelfProfilerInternal>>,
}
simulacrum (Apr 11 2019 at 14:11, on Zulip):

hm, I'd expect either to be about equally cached

mw (Apr 11 2019 at 14:21, on Zulip):

another thing that I'd like to have numbers on :)

mw (Apr 16 2019 at 10:51, on Zulip):

So, I actually ran some perf tests here: https://github.com/rust-lang/rust/pull/59915

mw (Apr 16 2019 at 10:52, on Zulip):

but the numbers are not very reliable because there are also other changes in there that might affect performance

mw (Apr 16 2019 at 10:53, on Zulip):

anyway, filtering itself is very helpful in keeping overhead low: https://perf.rust-lang.org/compare.html?start=99da733f7f38ce8fe68453e859b7ac96bf7caa0f&end=30ce13219eef2589a769b5d694bdc1c0aec0befc

mw (Apr 16 2019 at 10:56, on Zulip):

@Wesley Wiser, @simulacrum: how would you do the commandline interface for this? one option is to do it like the PR does:

mw (Apr 16 2019 at 10:57, on Zulip):

another option would be to

mw (Apr 16 2019 at 10:59, on Zulip):

I'd prefer the second option but then we'd have no way to specify the directory to store the profiling data in

Wesley Wiser (Apr 16 2019 at 13:20, on Zulip):

I'd like to have a "sensible default" which provides a useful set of events but doesn't have too much overhead. This would be the flag we'd use to get performance bug reports from users. I think having the ability to tweak the events captured is also a good idea so I'd propose something like this:

mw (Apr 17 2019 at 08:37, on Zulip):

that makes sense to me. I'll update the event filter PR later. @simulacrum, I imagine that perf.rlo will want the ability to specify the output directory right from the beginning, right?

mw (Apr 17 2019 at 08:39, on Zulip):

(side note: it would make sense to write data like this to a ram-disk)

simulacrum (Apr 17 2019 at 12:13, on Zulip):

@mw I don't think perf cares too much if the default is working directory

simulacrum (Apr 17 2019 at 12:13, on Zulip):

(or like /tmp)

simulacrum (Apr 17 2019 at 12:13, on Zulip):

so long as the filename is either settable or relatively obvious we should be fine

mw (Apr 17 2019 at 13:05, on Zulip):

OK, good to know.

simulacrum (Apr 17 2019 at 13:39, on Zulip):

@mw fwiw here's how the current profilers are integrated: https://github.com/rust-lang-nursery/rustc-perf/blob/master/collector/src/bin/rustc-perf-collector/execute.rs#L476 and https://github.com/rust-lang-nursery/rustc-perf/blob/master/collector/src/bin/rustc-fake.rs#L18

simulacrum (Apr 17 2019 at 13:39, on Zulip):

so something loosely similar would probably be wanted

Wesley Wiser (Apr 17 2019 at 13:54, on Zulip):

What kind of output would be easiest for rustc-perf to interpret? Json, csv, something else?

Wesley Wiser (Apr 17 2019 at 13:55, on Zulip):

I'm working on the analysis tool and looking to add machine readable output for collection

simulacrum (Apr 17 2019 at 14:07, on Zulip):

@Wesley Wiser Probably JSON? We already emit it so the necessary parts are there

simulacrum (Apr 17 2019 at 14:08, on Zulip):

but doesn't matter all that much since IIRC we'll be using a library to load the data anyway

Wesley Wiser (Apr 17 2019 at 14:08, on Zulip):

Ok. Easy enough

Wesley Wiser (Apr 17 2019 at 14:09, on Zulip):

Is there a prefered encoding for Durations? I was thinking of outputting nanoseconds

simulacrum (Apr 17 2019 at 14:09, on Zulip):

That seems reasonable -- what does Serde serialize them as?

simulacrum (Apr 17 2019 at 14:10, on Zulip):

https://github.com/serde-rs/serde/blob/master/serde/src/ser/impls.rs#L589

Wesley Wiser (Apr 17 2019 at 14:11, on Zulip):

I'll just serde on our end so it will match up

simulacrum (Apr 17 2019 at 14:11, on Zulip):

Hm, interestingly, nanoseconds are somewhat not lossless to convert back it seems

simulacrum (Apr 17 2019 at 14:11, on Zulip):

well, needs work, anyway

simulacrum (Apr 17 2019 at 14:40, on Zulip):

@Wesley Wiser Duration::new takes u32 for nanoseconds but duration can store up to u128 nanoseconds it seems

Wesley Wiser (Apr 17 2019 at 14:42, on Zulip):

Oh ok. It looks like serde calls the Duration::new(secs, nanos) ctor so that shouldn't be an issue
https://github.com/serde-rs/serde/blob/79a20e9e33a7cdb68d32cd990b638ce5357230d6/serde/src/de/impls.rs#L1921

simulacrum (Apr 17 2019 at 14:44, on Zulip):

yeah, seems fine because of the seconds part I think

Wesley Wiser (Apr 17 2019 at 14:46, on Zulip):

Yeah, nanos is only used for storing fractional seconds and 2^32 -1 ns is ~4 seconds so there shouldn't be any issue there.

simulacrum (Apr 17 2019 at 14:48, on Zulip):

sounds good

Wesley Wiser (Apr 18 2019 at 01:10, on Zulip):

PR for the summarization tool is up: https://github.com/rust-lang/measureme/pull/17

It also includes a --json flag to have it generate a json version of the analysis. Some sample output:

{
    "query_data": [
        {
            "label": "codegen 2i9de11lfqzk0ltu",
            "self_time": {
                "secs": 0,
                "nanos": 922041
            },
            "number_of_cache_misses": 1,
            "number_of_cache_hits": 0,
            "invocation_count": 1,
            "blocked_time": {
                "secs": 0,
                "nanos": 0
            },
            "incremental_load_time": {
                "secs": 0,
                "nanos": 0
            }
        },
        {
            "label": "CrateHash",
            "self_time": {
                "secs": 0,
                "nanos": 21245
            },
            "number_of_cache_misses": 20,
            "number_of_cache_hits": 1,
            "invocation_count": 21,
            "blocked_time": {
                "secs": 0,
                "nanos": 0
            },
            "incremental_load_time": {
                "secs": 0,
                "nanos": 0
            }
        }
    ],
    "total_time": {
        "secs": 13,
        "nanos": 351586715
    }

cc @mw @simulacrum

Last update: Nov 17 2019 at 07:20UTC