Stream: t-compiler/wg-mir-opt

Topic: Status of NRVO-like optimizations


ecstatic-morse (Apr 19 2020 at 20:09, on Zulip):

@Jonas Schievink @eddyb I was excited to see some recent movement towards copy propagation into the return place in #71003, but that PR seems to have pretty poor worst-case performance, and is now closed. Is the worst-case performance of the MIR transformation the only blocker for this class of optimizations? I saw that there are also some outstanding questions around MIR semantics now that the return place can be written to even if the function eventually unwinds, but I think this can be resolved independently?

Jonas Schievink (Apr 19 2020 at 20:11, on Zulip):

@eddyb also cited implementation complexity as a concern (which would arise if we tried to make this optimization fast enough), but I think it wouldn't be much more complicated than the generator transform, which does similar things.

Jonas Schievink (Apr 19 2020 at 20:12, on Zulip):

The MIR semantics issue is mostly resolvable independently, but I did need to modify const eval to avoid ICEs, which is in #71005

Jonas Schievink (Apr 19 2020 at 20:15, on Zulip):

One thing I wanted to play with is a borrow-aware liveness analysis, which is also used by the generator transform, and your dataflow framework changes to support backwards dataflow would help make that simpler, so I'm looking forward to that

ecstatic-morse (Apr 19 2020 at 20:17, on Zulip):

Yeah, that was one of my motivations. I'll try to push a bit harder to get that reviewed so you can make use of it. Lemme know if I can assist in other ways. I want to read through dag_nrvo.rs so I can be more helpful in this area.

ecstatic-morse (Apr 19 2020 at 20:17, on Zulip):

Thanks for writing the module summary BTW

Jonas Schievink (Apr 19 2020 at 20:18, on Zulip):

Awesome, thanks!

ecstatic-morse (Apr 21 2020 at 00:07, on Zulip):

I read through your PR today. It was much more like #47954 than I expected, basically a robust, global copy propagator. This is cool, but there must be a simpler approach to the NRVO specifically (as opposed to global copy propagation in general) that will work for most users.

ecstatic-morse (Apr 21 2020 at 00:12, on Zulip):

I'm going to try to come up with a more "bottom-up" approach to the NRVO; one that tries to propagate only the small subset of copies that assign to _0 and reach a Return. I'd be happy to have a discussion partner if you think this approach might have merit, but it might be more fruitful to spend time optimizing your existing approach if you think you're close.

ecstatic-morse (Apr 21 2020 at 00:31, on Zulip):

To elaborate, I don't care so much about propagating long chains of copies all the way into the return place. I expect that most functions that would benefit from NRVO look like the one in #62446, where a large value is assigned to a local, then that value is mutated in place before it is returned.

ecstatic-morse (Apr 21 2020 at 00:34, on Zulip):

So I'm thinking that the first step is computing the set of reaching definitions of the return place that are live at each Return terminator in the MIR. This can be done robustly with dataflow, or we can just traverse the predecessor graph and abort if we find a basic block with more than two predecessors before finding a copy/move into the return place (_0 = _1).
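A minimal sketch of that predecessor-graph traversal, using a hypothetical `Block` type (acyclic CFG assumed; this version gives up at any merge point or at the entry block rather than counting predecessors precisely):

```rust
// Hypothetical toy CFG model, not real MIR: each block records its
// predecessors and, optionally, a copy into the return place, stored as the
// source local's index (i.e. `copy_into_ret: Some(1)` models `_0 = _1;`).
#[derive(Clone)]
struct Block {
    preds: Vec<usize>,
    copy_into_ret: Option<usize>,
}

/// Walk backwards from the block containing `return`, looking for a copy into
/// `_0`. Abort (return `None`) as soon as the walk is no longer unambiguous:
/// a block with zero or multiple predecessors.
fn find_simple_nrvo_candidate(blocks: &[Block], return_block: usize) -> Option<usize> {
    let mut cur = return_block;
    loop {
        if let Some(src) = blocks[cur].copy_into_ret {
            return Some(src);
        }
        match blocks[cur].preds.as_slice() {
            [single] => cur = *single,
            _ => return None, // entry block or a merge point: give up
        }
    }
}
```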

ecstatic-morse (Apr 21 2020 at 00:36, on Zulip):

In the simplest case, we will have MIR like:

_1 = [0; 128];
_1[_2] = 1;   // Mutate `_1` in place
_0 = _1;
return;
ecstatic-morse (Apr 21 2020 at 00:37, on Zulip):

I like to think of a return as an implicit "use" of _0. Writing that explicitly, we get:

_1 = [0; 128];
// Mutate `_1` in place
_0 = _1;
return _0;
ecstatic-morse (Apr 21 2020 at 00:40, on Zulip):

Those last two statements are the canonical form for copy propagation. Then the question is whether it's safe to rename _1 to _0 to eliminate the copy.

ecstatic-morse (Apr 21 2020 at 00:44, on Zulip):

In what I expect is the common case, all return statements will have a single reaching definition of the return place, each of which is a copy of the same local. In this case, you could pretend that return takes an arbitrary local and do the transformation the other way at all return terminators.

_1 = [0; 128];
// _1 = _1; // removed
return _1
ecstatic-morse (Apr 21 2020 at 00:47, on Zulip):

At that point, you can simply swap the names for _0 and _1, although I believe there will be no more uses of _0 remaining due to the way MIR is currently built.
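The "delete the copy, then swap the names" step could look something like this sketch over a hypothetical toy statement type (locals as plain indices, with `0` standing for `_0`; this is not the real MIR representation):

```rust
// Hypothetical statement model: `Copy` is `dest = src;`, `Init` is
// `dest = <some rvalue>;`, `Return` is the return terminator.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Stmt {
    Copy { dest: usize, src: usize },
    Init(usize),
    Return,
}

/// Delete the `_0 = _from;` copy and rename `_from` to `_0` everywhere,
/// modelling the final renaming step. Assumes the safety conditions discussed
/// above (non-conflicting locals) have already been checked.
fn apply_nrvo(stmts: &mut Vec<Stmt>, from: usize) {
    stmts.retain(|s| *s != Stmt::Copy { dest: 0, src: from });
    for s in stmts.iter_mut() {
        match s {
            Stmt::Copy { dest, src } => {
                if *dest == from {
                    *dest = 0;
                }
                if *src == from {
                    *src = 0;
                }
            }
            Stmt::Init(d) => {
                if *d == from {
                    *d = 0;
                }
            }
            Stmt::Return => {}
        }
    }
}
```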

Jonas Schievink (Apr 21 2020 at 11:30, on Zulip):

I think it can definitely be helpful to explore simpler approaches, yeah. My approach tried to simplify #47954 by only handling acyclic CFGs, which gets rid of all the complexity related to computing conflict matrices. The approach with reaching definitions sounds plausible, but it's hard for me to judge soundness of this approach (even my simplified NRVO pass needed a lot of iteration to work well, during which I've found some soundness bugs in the original approach as well).

Hanna Kruppe (Apr 21 2020 at 12:04, on Zulip):

I don't feel like it's a good idea to jump straight to ad-hoc approximations with no obvious relation to the soundness conditions already identified. Even in #71003 as it is, the part that really killed performance (according to comments there) is the particular way in which the "live ranges do not overlap" check is converted into a potentially-whole-CFG traversal for every single candidate. I don't know why this is done in the first place, instead of "just" (devil is in the details, of course) computing a representation of the live range of every relevant place once and then doing pairwise overlap checks on these representations. After this problem is addressed, it may still be too slow e.g. because too many candidate pairs are generated (only trying to unify locals with _0, not arbitrary pairs of locals, would help with that). But currently I get the impression that the immediate problem isn't too few approximations, but approximations that aren't actually faster than "the proper way".
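As a sketch of the "compute live ranges once, then do cheap pairwise overlap checks" idea, with the hypothetical simplification that a live range is a single half-open interval of statement indices (which only holds for straight-line code; the real pass would need a representation that handles branching control flow):

```rust
/// Half-open live range [start, end) in statement order, computed once per
/// local. After that, each candidate pair is an O(1) overlap check instead of
/// a fresh whole-CFG traversal.
#[derive(Clone, Copy)]
struct LiveRange {
    start: usize,
    end: usize,
}

fn overlaps(a: LiveRange, b: LiveRange) -> bool {
    a.start < b.end && b.start < a.end
}

/// All locals whose live range does not overlap local `a`'s: the locals that
/// `a` could, by this criterion alone, be unified with.
fn unify_candidates(ranges: &[LiveRange], a: usize) -> Vec<usize> {
    (0..ranges.len())
        .filter(|&b| b != a && !overlaps(ranges[a], ranges[b]))
        .collect()
}
```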

ecstatic-morse (Apr 21 2020 at 15:54, on Zulip):

@Hanna Kruppe Noted. My concern was that, because NRVO is not useful in most cases, using a better-optimized global copy propagation pass to achieve it may cause too large a regression in compile time to ultimately be worth it. I think it's valuable to explore alternatives, since propagating copies into the return place specifically should have a greater benefit than doing the same for locals residing within our stack frame.

Hanna Kruppe (Apr 21 2020 at 16:10, on Zulip):

I share that concern and actually suggested the same restriction on Discord a few days ago. Besides compile time, I also believe merging locals too eagerly could have negative effects on some LLVM optimizations (one reason why stack coloring is traditionally done very late). I was only objecting to starting by fiddling further with how we check if the optimization is legal; I see a risk of losing sight of what (and why) is actually approximated, and of zooming in on presumed optimizations that may be unnecessary if we took a step back first.

ecstatic-morse (Apr 21 2020 at 16:18, on Zulip):

That's fair. I'll put my energy into ensuring the soundness of the current approach while extending it to borrowed locals. I suspect that most functions that would benefit from NRVO borrow the local they ultimately return at some point. See the example in #62466.

Hanna Kruppe (Apr 21 2020 at 17:00, on Zulip):

Did you mean to link a different issue? #62466 is about trait impls for tuples with >12 elements.

ecstatic-morse (Apr 21 2020 at 17:09, on Zulip):

Whoops, yes #62446

ecstatic-morse (Apr 21 2020 at 21:11, on Zulip):

I'm trying to put a finger on why I'm having a hard time coming up with the proper set of restrictions on the general case. The canonical example for copy propagation looks something like:

fn foo() {
    let B = 42;
    let A = B;    // Copy

    // No intervening definitions of A or B

    return A; // This use of A can safely be replaced with B
}

Note that while in this example, the use of A is a return statement, it could be any kind of use, e.g. let C = A.

Jonas Schievink (Apr 21 2020 at 21:13, on Zulip):

Is the necessary restriction more than "A and B are never live at the same time"?

Jonas Schievink (Apr 21 2020 at 21:14, on Zulip):

(plus some MIR-specific oddities, of course)

ecstatic-morse (Apr 21 2020 at 21:15, on Zulip):

What's somewhat unique about NRVO is that the name of the local that is returned is always the same. Previous attempts were actually trying to solve a slightly different problem:

fn foo() {
    let B = 42; // Can B be replaced with A?

    // What are the restrictions on statements between the initial definition of B and the copy?

    let A = B;    // Copy

     // What about between the copy and the use of A?

    return A;
}
ecstatic-morse (Apr 21 2020 at 21:16, on Zulip):

So that depends on your definition of liveness. If B becomes "live" as soon as it is borrowed, then that's a deal-breaker

Jonas Schievink (Apr 21 2020 at 21:17, on Zulip):

It would, yes

ecstatic-morse (Apr 21 2020 at 21:17, on Zulip):

I want to look at "indirect definitions" (e.g. *p = 42 or arbitrary function calls) and consider them a "use" of a given local in the liveness analysis here only if a borrow to that local could exist at that point
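A single straight-line-block sketch of that borrow-aware liveness, using a hypothetical statement type (`St`). `Opaque` stands for an indirect definition such as `*p = 42` or an arbitrary function call, and counts as a use of a local only while that local may be borrowed, as determined by a forward pre-pass:

```rust
use std::collections::HashSet;

// Hypothetical straight-line statement model, not real MIR.
#[derive(Clone)]
enum St {
    Def(usize),    // overwrites the local
    Use(usize),    // reads the local
    Borrow(usize), // takes a reference to the local (also a use of its place)
    Opaque,        // `*p = ...` or a call: may touch anything that is borrowed
}

/// Backward liveness over one block. An `Opaque` statement gens liveness only
/// for locals that may be borrowed at that point.
fn live_on_entry(stmts: &[St]) -> HashSet<usize> {
    // Forward pre-pass: which locals may be borrowed before each statement.
    let mut borrowed = HashSet::new();
    let mut borrowed_at = Vec::with_capacity(stmts.len());
    for s in stmts {
        borrowed_at.push(borrowed.clone());
        if let St::Borrow(l) = s {
            borrowed.insert(*l);
        }
    }
    // Backward pass: standard gen/kill liveness. A "maybe" write by `Opaque`
    // does not kill anything in a may-liveness analysis.
    let mut live = HashSet::new();
    for (i, s) in stmts.iter().enumerate().rev() {
        match s {
            St::Def(l) => {
                live.remove(l);
            }
            St::Use(l) | St::Borrow(l) => {
                live.insert(*l);
            }
            St::Opaque => live.extend(borrowed_at[i].iter().copied()),
        }
    }
    live
}
```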

Jonas Schievink (Apr 21 2020 at 21:18, on Zulip):

Yeah, that's what the original attempt tried to do

ecstatic-morse (Apr 21 2020 at 21:19, on Zulip):
fn foo() {
    let B = 42;

    let _ = &mut B; // B becomes borrowed here

    let A = B;

    // But there is no intervening definition, direct or indirect, here
    // foo();

    return A;
}
Jonas Schievink (Apr 21 2020 at 21:20, on Zulip):

But wouldn't you have to give up as soon as the borrow escapes the local stack frame?

ecstatic-morse (Apr 21 2020 at 21:20, on Zulip):

Right, so it's quite clear to me how that applies to the first example above. It's less clear how it applies to the second example.

ecstatic-morse (Apr 21 2020 at 21:20, on Zulip):

Our locals only exist as long as our stack frame is live

ecstatic-morse (Apr 21 2020 at 21:21, on Zulip):

Do you mean escapes into callees? Yes.

Jonas Schievink (Apr 21 2020 at 21:22, on Zulip):

Yeah. But if the reference is passed via a function call you wouldn't just have a single use, you'd potentially have continued uses (due to threads) until the local is deallocated.

Jonas Schievink (Apr 21 2020 at 21:22, on Zulip):

Or moved out of, but that isn't currently used by optimizations

ecstatic-morse (Apr 21 2020 at 21:23, on Zulip):

That's why we have to treat any function call as an indirect definition. As soon as we have a function call between the copy and the use, we can't do copy propagation unless we can prove that the local is not (mutably) borrowed.

Jonas Schievink (Apr 21 2020 at 21:25, on Zulip):

Ah

ecstatic-morse (Apr 21 2020 at 21:27, on Zulip):

I guess my question is: Is the same condition that is necessary to replace the use of A with one of B in the first example sufficient to justify replacing the definition of B with one of A in the second?

ecstatic-morse (Apr 21 2020 at 21:28, on Zulip):

The condition being: "no intervening definitions of A or B between the copy and the use of A"

ecstatic-morse (Apr 21 2020 at 21:28, on Zulip):

(I'm currently trying to find a counter-example)

ecstatic-morse (Apr 21 2020 at 21:46, on Zulip):

Obviously, as soon as you have other uses of B, you would need to rename these as well. I'm not sure what conditions would make this renaming legal.

ecstatic-morse (Apr 21 2020 at 21:47, on Zulip):

Your mental model, checking whether live ranges overlap, works quite nicely because then there can be no problems coalescing two locals into one.

ecstatic-morse (Apr 21 2020 at 21:48, on Zulip):

So probably I need to adopt that one.

Jonas Schievink (Apr 21 2020 at 21:48, on Zulip):

Hmm, yeah, that's why we've used the model based on live ranges of the variables so far. Is there a reason you want to use a different condition?

Jonas Schievink (Apr 21 2020 at 21:48, on Zulip):

("that" being "it seems pretty clearly sound")

ecstatic-morse (Apr 21 2020 at 21:49, on Zulip):

I'm thinking it's not as precise in the presence of immutable borrows.

ecstatic-morse (Apr 21 2020 at 21:50, on Zulip):
fn foo() {
    let B = 42;
    let p = &B; // B becomes borrowed here
    let A = B;
    bar(p); // B is "used" here (`bar` is some opaque function)
    return A;
}
ecstatic-morse (Apr 21 2020 at 21:52, on Zulip):

Replacing the use of A with one of B is still sound in this case, despite the fact that the live ranges of A and B overlap.

ecstatic-morse (Apr 21 2020 at 21:52, on Zulip):

(again with most definitions of "liveness")

ecstatic-morse (Apr 21 2020 at 21:54, on Zulip):

That's why I think the "nonoverlapping live ranges" mental model is not strictly better than the "no intervening definitions between copy and use" mental model

ecstatic-morse (Apr 21 2020 at 21:55, on Zulip):

Although one may be better in practice.

Jonas Schievink (Apr 21 2020 at 21:58, on Zulip):

That's true, but "no intervening definitions" does not seem sufficient for soundness due to unrelated uses of A and B in other places in the function

ecstatic-morse (Apr 21 2020 at 22:00, on Zulip):

The no intervening definitions approach is perfectly sound, but it only considers a single use of a variable at a time.

ecstatic-morse (Apr 21 2020 at 22:01, on Zulip):

Whereas "nonoverlapping live ranges" considers all uses/definitions of a pair of variables.

ecstatic-morse (Apr 21 2020 at 22:04, on Zulip):

So what I've been considering is a sort of two-phase approach. First is global copy propagation, where we consider return a "use" of the local _0 and allow a global copy propagation pass to replace those uses with other locals. Then we visit all possible return places to see which local is most commonly used as the return value, rename that local to _0 and then insert copies to _0 for all return statements that don't use the most common local as the return place.
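The second phase's "pick the most common return local" step could be sketched with a hypothetical helper like this (the renaming and the insertion of compensating copies into `_0` are not shown):

```rust
use std::collections::HashMap;

/// Given the local used as the return value at each `return` site (after copy
/// propagation into `_0`), pick the most common one to be renamed to `_0`.
/// The remaining sites then get an explicit copy into `_0` reinserted.
/// Ties are broken towards the lower-numbered local for determinism.
fn pick_nrvo_local(return_locals: &[usize]) -> Option<usize> {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &l in return_locals {
        *counts.entry(l).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(local, n)| (n, std::cmp::Reverse(local)))
        .map(|(local, _)| local)
}
```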

ecstatic-morse (Apr 21 2020 at 22:53, on Zulip):

(I call this "traditional" because it's in https://www.amazon.com/Advanced-Compiler-Design-Implementation-Muchnick/dp/1558603204)

eddyb (Apr 27 2020 at 06:44, on Zulip):

argh this is a long thread

eddyb (Apr 27 2020 at 06:46, on Zulip):

but I think it wouldn't be much more complicated than the generator transform, which does similar things.

I just wanted to point out that the generator transform is simpler because it doesn't care about accesses, only storage live ranges, and fundamentally it overlaps unrelated locals, whereas NRVO needs to overlap locals with transfers between them (which requires they both be live at the same time)

eddyb (Apr 27 2020 at 06:47, on Zulip):

also every time you unify two locals, the conflict matrix needs to be updated to account for "shallow transitivity"

eddyb (Apr 27 2020 at 06:49, on Zulip):

also, NRVO and copy propagation aren't the same, they're more like...duals

eddyb (Apr 27 2020 at 06:50, on Zulip):

what is usually called "copy propagation" is "(forward) source propagation" while NRVO in the sense that I use it in is "(backwards) destination propagation"
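The duality can be shown on a toy statement model (hypothetical, not real MIR): for `_2 = <rvalue>; _1 = _2; use(_1)`, source propagation rewrites the *use* forward (`use(_1)` becomes `use(_2)`), while destination propagation rewrites the *definition* backward (`_2 = <rvalue>` becomes `_1 = <rvalue>` and the copy is deleted).

```rust
// Hypothetical statement model: `Init` is `dest = <rvalue>;`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum S {
    Init(usize),
    Copy { dest: usize, src: usize },
    Use(usize),
}

/// Forward "source propagation": after `dest = src;`, later reads of `dest`
/// are replaced by `src`, as long as neither has been redefined since.
fn source_propagate(stmts: &mut [S]) {
    let mut alias: Option<(usize, usize)> = None; // (dest, src)
    for s in stmts.iter_mut() {
        match *s {
            S::Copy { dest, src } => alias = Some((dest, src)),
            S::Use(l) => {
                if let Some((dest, src)) = alias {
                    if l == dest {
                        *s = S::Use(src);
                    }
                }
            }
            S::Init(d) => {
                if let Some((dest, src)) = alias {
                    if d == dest || d == src {
                        alias = None; // redefinition invalidates the alias
                    }
                }
            }
        }
    }
}

/// Backward "destination propagation" (the NRVO flavour): the definition that
/// feeds the first `dest = src;` is rewritten to write `dest` directly, and
/// the copy disappears.
fn dest_propagate(stmts: &mut Vec<S>) {
    if let Some(pos) = stmts.iter().position(|s| matches!(s, S::Copy { .. })) {
        if let S::Copy { dest, src } = stmts[pos] {
            if pos > 0 && stmts[pos - 1] == S::Init(src) {
                stmts[pos - 1] = S::Init(dest);
                stmts.remove(pos);
            }
        }
    }
}
```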

eddyb (Apr 27 2020 at 06:53, on Zulip):

and unifying locals is not exactly "propagation", i.e. it's global and the resulting MIR has one less local, whereas "destination propagation" could be understood to be more local and skip a copy without getting rid of the intermediary local from other places in the MIR

eddyb (Apr 27 2020 at 06:55, on Zulip):

I think forward source propagation is harder in some respects, because you have to check that you can move an Rvalue all the way down to a use of a local

eddyb (Apr 27 2020 at 06:57, on Zulip):

so it's kind of like the conflict check of replacing b = a by unifying a and b when their live ranges don't overlap, except for swapping two statements

eddyb (Apr 27 2020 at 06:58, on Zulip):

but in that sense, b = a is a "local" check (after the live ranges have been computed), whereas source propagation requires you to effectively swap statements along the entire path between def and use

eddyb (Apr 27 2020 at 06:59, on Zulip):

you wouldn't literally implement it as such, but there's more to check overall IIUC

eddyb (Apr 27 2020 at 07:00, on Zulip):

for SSA-like locals though, you can do a better job at propagating them into uses, as long as the original is still live

eddyb (Apr 27 2020 at 07:00, on Zulip):

IMO this is most useful (ignoring consts which have their own propagation) for arguments

eddyb (Apr 27 2020 at 07:01, on Zulip):

and I suppose non-Use Rvalues if you want to do that

eddyb (Apr 27 2020 at 07:03, on Zulip):

MIR statements of the form local = local; can be optimized by either unifying the locals ("NRVO") or propagating the RHS to uses of the LHS, and you can't do the former if the RHS is an argument

eddyb (Apr 27 2020 at 07:03, on Zulip):

or, wait, you can?! did I get confused at some point in the past?

eddyb (Apr 27 2020 at 07:04, on Zulip):

just like you can only unify _0 with a local by renaming that local to _0, not the other way around, you can unify arguments with other locals by renaming those locals to the argument

eddyb (Apr 27 2020 at 07:04, on Zulip):

the difference being that _0 is usually a destination and arguments are usually a source

eddyb (Apr 27 2020 at 07:06, on Zulip):

so IOW, copyprop aka "source propagation" is most useful when you have non-Use Rvalues, if I'm not mistaken

eddyb (Apr 27 2020 at 07:06, on Zulip):

or I guess places with projections

eddyb (Apr 27 2020 at 07:09, on Zulip):

@ecstatic-morse also, MIR working with places rather than values means a bunch of textbook stuff (e.g. SSA) doesn't apply as cleanly, and some of the tradeoffs are different too

eddyb (Apr 27 2020 at 07:10, on Zulip):

I much prefer approaches like my "local unification" in how it completely eliminates reasoning about "code motion"

Félix Fischer (Apr 27 2020 at 16:42, on Zulip):

ecstatic-morse said:

Hanna Kruppe Noted. My concern was that, because NRVO is not useful in most cases, using a better-optimized global copy propagation pass to achieve it may cause too large a regression in compile time to ultimately be worth it. I think it's valuable to explore alternatives, since propagating copies into the return place specifically should have a greater benefit than doing the same for locals residing within our stack frame.

Maybe it is possible to compute a quick score/heuristic that tells us how likely it is that NRVO is going to be useful. That way we don't run it most of the time, but when we do, it gives big perf wins. If that makes sense?

Jonas Schievink (May 07 2020 at 00:09, on Zulip):

One thing that might become a performance issue is that we need to know about liveness and "borrowed-ness" of all locals at every single statement to build the conflict matrix, but liveness is a backwards analysis, while "borrowed-ness" (and MaybeInitializedLocals) are forward analyses, so for at least one analysis we will have O(n²) visiting of all statements, at least with the current cursor API.

(Generators don't have this problem because they only use forward analyses to build their conflict matrix – liveness doesn't factor into which locals may or may not overlap in their state)

This is likely to be an issue at least for the tuple-stress benchmark, which has 1 basic block with 458745 statements.

@ecstatic-morse you might have some ideas for how to make the cursor API more performant for cases like this.

ecstatic-morse (May 07 2020 at 00:11, on Zulip):

Can you go once backward through the basic block and once forward?

Jonas Schievink (May 07 2020 at 00:12, on Zulip):

Oh, that's right. We just | the bits in the conflict matrix. Well that was easy.

ecstatic-morse (May 07 2020 at 00:13, on Zulip):

NP lemme know if anything else comes up.

Félix Fischer (May 07 2020 at 03:10, on Zulip):

Good job you guys! :D

Jonas Schievink (May 08 2020 at 20:45, on Zulip):

Heh, apparently the liveness analysis does not consider _0 live, ever. Maybe the MIR visitor code should treat return like a use of _0 to fix this, potentially with special handling in other passes?

Jonas Schievink (May 08 2020 at 20:47, on Zulip):

Oh, no, that's not true either, _0 is live, but only at the start of the function? But it doesn't contain any value there...

Jonas Schievink (May 08 2020 at 20:50, on Zulip):
pub struct BigTest {
    arr: [u32; 128]
}

impl BigTest {
    pub fn new() -> BigTest {
        BigTest {
            arr: [123; 128],
        }
    }
}

yields:

// MIR for `<impl at nrvo.rs:9:1: 15:2>::new`  MergePlaces-dataflow

fn <impl at nrvo.rs:9:1: 15:2>::new() -> BigTest {
    let mut _0: BigTest;                 // return place in scope 0 at nrvo.rs:10:21: 10:28
    let mut _1: [u32; 128];              // in scope 0 at nrvo.rs:12:18: 12:28

    bb0: {
        // init: []
        // live: [_0]
        StorageLive(_1);                 // scope 0 at nrvo.rs:12:18: 12:28
        // init: []
        // live: [_0, _1]
        _1 = [const 123u32; 128];        // scope 0 at nrvo.rs:12:18: 12:28
                                         // ty::Const
                                         // + ty: u32
                                         // + val: Value(Scalar(0x0000007b))
                                         // mir::Constant
                                         // + span: nrvo.rs:12:19: 12:22
                                         // + literal: Const { ty: u32, val: Value(Scalar(0x0000007b)) }
        // init: [_1]
        // live: []
        (_0.0: [u32; 128]) = move _1;    // scope 0 at nrvo.rs:11:9: 13:10
        // init: [_0]
        // live: []
        StorageDead(_1);                 // scope 0 at nrvo.rs:13:9: 13:10
        // init: [_0]
        // live: []
        return;                          // scope 0 at nrvo.rs:14:6: 14:6
        // init: [_0]
        // live: []
    }
}
Jonas Schievink (May 08 2020 at 20:51, on Zulip):

(where init is MaybeInitializedLocals and live is MaybeLiveLocals)

Matthew Jasper (May 08 2020 at 20:55, on Zulip):

"Maybe the MIR visitor code should treat return like a use of _0 to fix this" - I would like to see this.

Matthew Jasper (May 08 2020 at 20:56, on Zulip):

Isn't "live before initialized" going to be a problem for anything initialized by an aggregate rvalue, which becomes a bunch of field assignments.

Jonas Schievink (May 08 2020 at 20:58, on Zulip):

(my printing code was a bit broken, I've edited the output)

Jonas Schievink (May 08 2020 at 20:59, on Zulip):

Certainly I wouldn't expect any locals to be live at the start of bb0 in any case, that implies a use of uninitialized data if I understand liveness correctly

Jonas Schievink (May 08 2020 at 21:00, on Zulip):

Ah, well except arguments of course

Matthew Jasper (May 08 2020 at 21:06, on Zulip):

"Live and initialized" is generally what's needed. Borrowck used to have an issue where uninitialized variables were considered to be live because there was a Drop of them along some unwind path (which drop elaboration would fix). Here liveness believes that assigning to _0.0 is a use, even though it isn't (this behavior is correct for borrowck).

Jonas Schievink (May 08 2020 at 21:27, on Zulip):

Oof, yeah, the interaction with aggregates is bad...

ecstatic-morse (May 08 2020 at 21:32, on Zulip):

This comes from marking MutatingUseContext::Projection as a "use". According to the docs, that context is used for both &mut x.y (which is indeed a use of x) and x.y = ... (which is not a use of x but is also not a def of the entirety of x). Looks like we'll need to stop relying on PlaceContext for MIR liveness.

Jonas Schievink (May 08 2020 at 21:33, on Zulip):

Yeah, it kinda does make sense

Jonas Schievink (May 08 2020 at 21:34, on Zulip):

Somehow I'm also getting different liveness after a block's terminator when compared to after the block itself, in the function above...

ecstatic-morse (May 08 2020 at 21:34, on Zulip):

But yeah, we need to handle return properly. That's a bug and maybe caused the unexpected changes in #71956?

Jonas Schievink (May 08 2020 at 21:36, on Zulip):

I somewhat doubt that, as I don't think _0 can ever be live across a yield currently.

ecstatic-morse (May 08 2020 at 21:39, on Zulip):

I say this only because there's a mention of NRVO in the comment above one of the generator tests whose size changed (ui/generator/size-moved-locals.rs). I think it might be outdated; it seems wrong to me.

ecstatic-morse (May 08 2020 at 21:40, on Zulip):

Jonas Schievink said:

Somehow I'm also getting different liveness after a block's terminator when compared to after the block itself, in the function above...

Can you be more specific? Remember seek_after_primary_effect(terminator_loc) and seek_to_block_end are not the same for backward dataflow analyses. That's why the name of the first changed from seek_after.

Jonas Schievink (May 08 2020 at 21:43, on Zulip):

Ah, right, I've been printing the wrong dataflow state then

ecstatic-morse (May 08 2020 at 21:46, on Zulip):

Matthew Jasper said:

"Maybe the MIR visitor code should treat return like a use of _0 to fix this" - I would like to see this.

I also think this is a good idea.

Jonas Schievink (May 08 2020 at 21:51, on Zulip):

So, is this the correct way to print the dataflow state between statements?

PassWhere::BeforeLocation(loc) => {
    init.seek_before_primary_effect(loc);
    live.seek_after_primary_effect(loc);

    writeln!(w, "        // init: {:?}", init.get())?;
    writeln!(w, "        // live: {:?}", live.get())?;
}
PassWhere::AfterTerminator(bb) => {
    let loc = body.terminator_loc(bb);
    init.seek_after_primary_effect(loc);
    live.seek_before_primary_effect(loc);

    writeln!(w, "        // init: {:?}", init.get())?;
    writeln!(w, "        // live: {:?}", live.get())?;
}
Jonas Schievink (May 08 2020 at 21:53, on Zulip):

(it looks correct now, at least)

ecstatic-morse (May 08 2020 at 21:54, on Zulip):

Yes. That should give you the correct results (it's slow for liveness, but you already knew that)

Jonas Schievink (May 08 2020 at 22:32, on Zulip):

One issue with making return read from _0 is that MutVisitor receives a &mut Local, but this would be the only Local it cannot actually mutate.

Jonas Schievink (May 08 2020 at 22:32, on Zulip):

I can put an assertion in the visitor code, but maybe there's a better solution?

Jonas Schievink (May 08 2020 at 22:33, on Zulip):

(the assertion is needed, CopyProp started miscompiling things otherwise)

Jonas Schievink (May 09 2020 at 13:09, on Zulip):

Regarding the interaction of liveness with aggregates, maybe that could be fixed by making liveness act on places instead of locals?

Jonas Schievink (May 09 2020 at 13:11, on Zulip):

Or maybe my MergeLocals pass should run before the deaggregator. The generator transform already does that, interestingly. I wonder if it'd break if we run it afterwards.

Jonas Schievink (May 09 2020 at 13:20, on Zulip):

Jonas Schievink said:

Or maybe my MergeLocals pass should run before the deaggregator. The generator transform already does that, interestingly. I wonder if it'd break if we run it afterwards.

Ah, well this doesn't work since the pass then can't propagate destinations at all

Jonas Schievink (May 09 2020 at 23:33, on Zulip):

I've now hit a peculiar issue where my NRVO pass apparently misoptimizes join_generic_copy, but looking at how the MIR changes, it seems that the input to my pass contains an Operand::Copy that copies from a local of type &mut [T], and according to the docs that is not supposed to happen. Is that correct? If so, is there any way to run a sort of sanitizer pass between each optimization that could find these issues? Maybe the MIR typecheck pass?

Félix Fischer (May 10 2020 at 02:45, on Zulip):

Ohh

Félix Fischer (May 10 2020 at 02:45, on Zulip):

Yes

Félix Fischer (May 10 2020 at 02:45, on Zulip):

That's a good idea

Félix Fischer (May 10 2020 at 02:45, on Zulip):

I mean

Félix Fischer (May 10 2020 at 02:45, on Zulip):

We should never need them if the optimizations are correct

Félix Fischer (May 10 2020 at 02:45, on Zulip):

Like, sound

Félix Fischer (May 10 2020 at 02:46, on Zulip):

But maybe we could run that sanitizer while in testing mode

Félix Fischer (May 10 2020 at 02:46, on Zulip):

Or no. Even better: we could run that sanitizer in a special mode where we basically run crater

Félix Fischer (May 10 2020 at 02:46, on Zulip):

And we check that our transformations are sound

Jonas Schievink (May 10 2020 at 21:00, on Zulip):

This appears to be caused by the &*x => x transform made by InstCombine

Jonas Schievink (May 10 2020 at 23:45, on Zulip):

#72093 seems to fix the issues I was seeing... nevermind

scottmcm (May 13 2020 at 18:48, on Zulip):

Jonas Schievink said:

This appears to be caused by the &*x => x transform made by InstCombine

You might be interested in https://github.com/rust-lang/rust/issues/46420

Jonas Schievink (May 13 2020 at 18:51, on Zulip):

Ah, indeed

Jonas Schievink (May 14 2020 at 20:07, on Zulip):

I wonder if it makes sense to create an A-mir-opt-nrvo label to track issues that would be fixed by this...

Wesley Wiser (May 14 2020 at 20:10, on Zulip):

I think that would be good. We have A-mir-opt-inlining which is similar

Jonas Schievink (May 14 2020 at 20:14, on Zulip):

Created one

Wesley Wiser (May 14 2020 at 20:15, on Zulip):

Looking through the tagged issues is also helpful when trying to stabilize the optimization because we can extract test cases.

ecstatic-morse (May 14 2020 at 20:19, on Zulip):

Yay! I find these kinds of things make writing test suites much easier.

Jonas Schievink (May 15 2020 at 20:20, on Zulip):

Status update: I've been hacking on dataflow-based NRVO in my spare time, but still have to work out some kinks. I'm hoping to make more progress over the weekend (and if not, next weekend).

oli (May 15 2020 at 20:40, on Zulip):

please sync with @eddyb because they also have such a pass

eddyb (May 15 2020 at 20:41, on Zulip):

ehm

eddyb (May 15 2020 at 20:41, on Zulip):

I'm pretty sure everyone here knows about my history with NRVO?

eddyb (May 15 2020 at 20:41, on Zulip):

at least @Jonas Schievink who I've talked extensively to about it

Jonas Schievink (May 15 2020 at 20:42, on Zulip):

yeah, @eddyb guided me through my last attempt :)

eddyb (May 15 2020 at 20:46, on Zulip):

(some of the recent experiments were based on my hopes that there are simpler ways to correctly approximate the conditions for NRVO than my original attempt at it which I don't feel great about due to the complexity, although now that I've discussed it with @Hanna Kruppe and @Jonas Schievink, I could see ways to break it up into understandable "standard compiler concept" pieces. sadly I'm again not working on optimizations so... yeah I can't do much about it atm)

ecstatic-morse (May 15 2020 at 21:31, on Zulip):

Jonas Schievink said:

Status update: I've been hacking on dataflow-based NRVO in my spare time, but still have to work out some kinks. I'm hoping to make more progress over the weekend (and if not, next weekend).

@Jonas Schievink Let me know if you see any room for improvement within the dataflow framework. You're basically the first person using it to write new stuff, so your feedback is very useful.

Jonas Schievink (May 24 2020 at 14:47, on Zulip):

@ecstatic-morse So, my destination propagation pass now makes your NRVO pass fail this assertion: https://github.com/rust-lang/rust/blob/7726070fa755f660b5da3f82f46e07d9c6866f69/src/librustc_mir/transform/nrvo.rs#L206

Jonas Schievink (May 24 2020 at 14:48, on Zulip):

Does your pass work when _0 is "just another local"?

ecstatic-morse (May 24 2020 at 15:13, on Zulip):

You can turn it off if you're already renaming the return place

Jonas Schievink (May 24 2020 at 15:29, on Zulip):

That is true I guess

Jonas Schievink (May 24 2020 at 15:30, on Zulip):

Hmm, can NRVO create Operand::Move after replacement, but have the moved place still be used afterwards?

Jonas Schievink (May 24 2020 at 15:32, on Zulip):

Seems like yes? If it propagates this:

_3 = _4;
(_2.0: i32) = move _3;
(_2.1: i32) = move _4;

It can produce

(_2.0: i32) = move _4;
(_2.1: i32) = move _4;

Seems bad if we do want to use Operand::Move vs. Operand::Copy for optimizations. This might also be true for the existing pass.

Jonas Schievink (May 24 2020 at 15:39, on Zulip):

Ah nevermind, it would just produce

(_2.0: i32) = _4;
nop;
(_2.1: i32) = move _4;

Félix Fischer (May 24 2020 at 19:12, on Zulip):

Oof, that's a relief

Félix Fischer (May 24 2020 at 19:12, on Zulip):

:)

Jonas Schievink (May 25 2020 at 22:19, on Zulip):

Jonas Schievink said:

Oh, that's right. We just | the bits in the conflict matrix. Well that was easy.

Turns out this is wrong, since we have to & the bits from liveness and initializedness before adding the result to the matrix. I guess I can collect the states into a Vec and & with that though.

Jonas Schievink (May 25 2020 at 23:30, on Zulip):

I've tried to make the dataflow traversal take linear time, but it still seems quadratic. https://github.com/rust-lang/rust/commit/f99570bd551a34f5864792ef8a685e0cff4c58d7

Probably just a brain fart on my side though, will revisit tomorrow.

Jonas Schievink (May 26 2020 at 10:55, on Zulip):

Ah, adding stuff to the conflict matrix has O(n²) performance too, gotta find a better way to do that (this code was copied straight from the generator transform, so it's possible that it leaves some performance behind as well).

ecstatic-morse (May 26 2020 at 15:02, on Zulip):

I guess I can collect the states into a Vec and & with that though.

I also don't see a better way than this. I forgot we needed the intersection of two dataflow analyses in different directions.

ecstatic-morse (May 26 2020 at 15:07, on Zulip):

You seem to have independently figured this out, but I think we should be specific about the n when talking about the big-O notation for destination propagation going forward. AFAICT, the best we can do is O(L^2 * S) in time and O(L^2 + S*L) in memory, where L is the number of locals and S is the number of statements in a basic block. Does this sound right? Obviously there's a very low constant factor for the L terms since they use bitsets.

ecstatic-morse (May 26 2020 at 15:11, on Zulip):

I guess it would actually be O(L^2 * S * B) in time for an entire function, where B is the number of basic blocks. And memory is O(L^2 + S*L + B*L), since we need to consider the entry set for each basic block (B*L) as well as the intermediate buffer used to do the forward/backward intersection (S*L).

oli (May 26 2020 at 15:18, on Zulip):

can the algorithm be degraded to just look at a few locals in case there are too many?

oli (May 26 2020 at 15:19, on Zulip):

Or can we prefilter the locals to just look at those that are relevant at all? We have a few locals-heavy tests in the perf suite; they may actually notice such a quadratic algorithm

ecstatic-morse (May 26 2020 at 15:22, on Zulip):

@oli We can always bail out, or fall back to my dumb version, which is linear.

ecstatic-morse (May 26 2020 at 15:26, on Zulip):

(I don't immediately see a subset of locals you could ignore, but perhaps @Jonas Schievink can do better)

Jonas Schievink (May 26 2020 at 15:36, on Zulip):

So, if m out of n total locals are in conflict at any point, adding those m conflicts to the matrix takes O(m²) time (since it |s the BitSet of the m conflicting locals into each of their m rows in the matrix). I believe this means that the O(L^2 * S) figure is correct, since all locals can conflict at every point in the function.

I think I can replace the conflict matrix with a unification table though, which should only take almost-linear time for this "add conflicts" operation, unless I missed something. I already use one in another place, so this should simplify the code too.

ecstatic-morse (May 26 2020 at 15:48, on Zulip):

What's a "unification table"?

Jonas Schievink (May 26 2020 at 15:49, on Zulip):

This thing: https://docs.rs/ena/0.14.0/ena/unify/struct.UnificationTable.html

It's already used by the type inferencer

Jonas Schievink (May 26 2020 at 15:51, on Zulip):

Ah, maybe I should've just said union-find

ecstatic-morse (May 26 2020 at 15:51, on Zulip):

Does it require transitivity? The use of the phrase "union-find" suggests it does, but I'm not sure. I don't think you have that when detecting live-range overlaps, since "A overlaps with B" and "B overlaps with C" does not imply "A overlaps with C". Maybe you're thinking of applying it in a different way than I am, though.

Jonas Schievink (May 26 2020 at 15:52, on Zulip):

Ah, that is true

ecstatic-morse (May 26 2020 at 15:55, on Zulip):

I wouldn't worry too much about the L^2 part, since you handle 64 locals per instruction. We can always bail out for massive ones as oli mentioned. I think you already ran into this in your earlier PR.

Jonas Schievink (May 26 2020 at 15:56, on Zulip):

Yeah, I'm benchmarking with said perf test (tuple-stress)

Jonas Schievink (May 26 2020 at 15:56, on Zulip):

It's true that we could just bail out when there's too many locals, just seems a bit unfortunate

ecstatic-morse (May 26 2020 at 15:58, on Zulip):

Frankly, if you write code that gets lowered to 100,000 MIR locals, that's on you, not on us :smiley:

ecstatic-morse (May 26 2020 at 15:58, on Zulip):

/me looks directly at keccak

ecstatic-morse (May 26 2020 at 16:03, on Zulip):

FWIW, I do think there are ways of doing this that are linear (or at least quasi-linear) in time and memory, but they will be more complex than the conflict matrix and have higher constant factors to boot.

simulacrum (May 26 2020 at 16:30, on Zulip):

I would agree that we should care more about getting good performance for low N, but we should also make sure to avoid exponentially increasing runtime (even if that means just switching some optimization off; eventually perhaps we can emit a warning on this "function has too many variables").

oli (May 26 2020 at 16:34, on Zulip):

simulacrum said:

perhaps we can emit a warning on this "function has too many variables").

I like this

Félix Fischer (May 26 2020 at 17:12, on Zulip):

XD

Félix Fischer (May 26 2020 at 17:13, on Zulip):

I dunno

Félix Fischer (May 26 2020 at 17:14, on Zulip):

From what I used to hear when I was part of the Swift community, the fact that their compiler gave up sometimes and said "this expression is too complex" left a bitter taste in _everyone_ who got to see it

Félix Fischer (May 26 2020 at 17:14, on Zulip):

But on the other hand that happened with fairly normal code

Félix Fischer (May 26 2020 at 17:15, on Zulip):

If you're telling me that this would happen only with code at like, 200 local variables per function...

Félix Fischer (May 26 2020 at 17:15, on Zulip):

Well, that seems actually quite fair

Félix Fischer (May 26 2020 at 17:15, on Zulip):

:P

Félix Fischer (May 26 2020 at 17:16, on Zulip):

Ah, wait. You're thinking of emitting a warning. Okay, my bad. Yeah, that seems pretty good :)

Félix Fischer (May 26 2020 at 17:17, on Zulip):

I thought you were talking about error-ing, but that was not the case at all :)

Wesley Wiser (May 26 2020 at 17:22, on Zulip):

I'm not sure we'd even want a warning on that. It's not like the code is wrong; it's just bloated and not necessarily going to be optimized well.

I think something like optimization remarks where we provide // annotations if requested when emitting MIR files might be more appropriate. LLVM has a much more sophisticated system that we could steal ideas from: https://llvm.org/docs/Remarks.html

Wesley Wiser (May 26 2020 at 17:24, on Zulip):

Maybe something like:

// opt-remark: [NRVO] - gave up due to too many locals!
fn bar(_1: u32, _2: &[u32]) -> (u32, u32, u32) {
    debug a => _1;                       // in scope 0 at src/lib.rs:16:5: 16:10
    debug w => _2;                       // in scope 0 at src/lib.rs:17:5: 17:6
    let mut _0: (u32, u32, u32);         // return place in scope 0 at src/lib.rs:18:6: 1
    ...

Félix Fischer (May 26 2020 at 19:04, on Zulip):

Ohh, I kinda like that!

Jonas Schievink (May 26 2020 at 21:28, on Zulip):

https://github.com/rust-lang/rust/pull/72632

RalfJ (May 30 2020 at 15:38, on Zulip):

oh nice :D

RalfJ (May 30 2020 at 15:39, on Zulip):

@Jonas Schievink if you want to get some extra checking of whether the MIR still makes sense, you could try

MIRI_TEST_FLAGS="-Z mir-opt-level=3" ./x.py test --stage 0 src/tools/miri

Jonas Schievink (May 30 2020 at 15:43, on Zulip):

Thanks, no failures with that though

Jonas Schievink (May 30 2020 at 15:45, on Zulip):

I've been trying to extend the MIR validation pass to check the MIR against the live and initialized locals from dataflow, and it reports a lot of errors. I'll try to narrow those down next.

Jonas Schievink (May 30 2020 at 18:50, on Zulip):

Whoa, just noticed this immense amount of test failures https://github.com/rust-lang/rust/pull/72635/checks?check_run_id=711136389

I can't reproduce any of that locally

Jonas Schievink (May 30 2020 at 22:15, on Zulip):

Great, my InstCombine changes break dest-prop and MaybeInitializedLocals #72797

Not entirely sure what to do, but I guess we can revert the Operand::Move change for now?

Jonas Schievink (May 31 2020 at 00:21, on Zulip):

Found another issue, this time with SimplifyArmIdentity: #72800

(that MIR validation pass I added is really starting to pay off!)

Félix Fischer (May 31 2020 at 06:50, on Zulip):

@Jonas Schievink I cannot thank you enough for that pass. Anything that helps to ensure that our opts are, at the end of the day, sound, is imo incredibly important.

Félix Fischer (May 31 2020 at 06:51, on Zulip):

Although MIR validation isn't fool-proof (you could make a transform that changes program behavior and still generates valid MIR), I still think it's a great tool

Félix Fischer (May 31 2020 at 06:53, on Zulip):

What I eventually would love to have is some sort of proof that the opts are all valid transformations over a MIR program. But that feels more like end-game stuff

Jonas Schievink (Jun 01 2020 at 22:01, on Zulip):

Ah, just found this gem, this isn't sound of course:

-        _44 = const std::option::Option::<ExpandError>::or(move _45, move _46) -> [return: bb20, unwind: bb49]; // scope 12 at crates/ra_mbe/src/mbe_expander/transcriber.rs:97:23: 97:32
+        _45 = const std::option::Option::<ExpandError>::or(move _45, move _46) -> [return: bb20, unwind: bb49]; // scope 12 at crates/ra_mbe/src/mbe_expander/transcriber.rs:97:23: 97:32

Jonas Schievink (Jun 01 2020 at 23:37, on Zulip):

Looks like we can do a lot more "must not overlap" validation in the MIR validator, this would have easily caught this case

Jonas Schievink (Jun 01 2020 at 23:38, on Zulip):

Luckily the rust-analyzer test suite started crashing, or I would have to bisect the entire rustc

Jonas Schievink (Jun 02 2020 at 19:32, on Zulip):

Okay, that fixed the rust-analyzer crash, but not the rustc miscompilation. In fact, there is now an additional spurious error emitted :(

RalfJ (Jun 20 2020 at 10:23, on Zulip):

@Jonas Schievink so regarding the overlap check, IMO before we add lots of static approximations of what the actual allowed and disallowed overlap behavior of assignments is, and certainly before we have any optimizations that rely on it, we should figure out the dynamic MIR semantics regarding overlap: https://github.com/rust-lang/rust/issues/68364

RalfJ (Jun 20 2020 at 10:24, on Zulip):

once we know those, we can determine if a static check is correct by seeing if it conservatively approximates the true dynamic overlap requirement

Jonas Schievink (Jun 20 2020 at 11:27, on Zulip):

Makes sense to me, but I don't know what's required to settle that. Your comment https://github.com/rust-lang/rust/issues/68364#issuecomment-614862820 makes it sound like there's a good solution already.

RalfJ (Jun 20 2020 at 11:46, on Zulip):

there's a reasonable candidate that is neither implemented nor ratified by the lang team

Félix Fischer (Jun 21 2020 at 20:04, on Zulip):

Is that candidate described somewhere @RalfJ? :3

RalfJ (Jun 21 2020 at 22:51, on Zulip):

@Félix Fischer yeah it's in jonas' message above: https://github.com/rust-lang/rust/issues/68364#issuecomment-614862820

Félix Fischer (Jun 22 2020 at 00:01, on Zulip):

Thank you @RalfJ :two_hearts:

Félix Fischer (Jun 22 2020 at 00:04, on Zulip):

While we're at it... is there a guide on MIR? Like...

Félix Fischer (Jun 22 2020 at 00:04, on Zulip):

MIR semantics

Félix Fischer (Jun 22 2020 at 00:04, on Zulip):

And

Félix Fischer (Jun 22 2020 at 00:04, on Zulip):

What the Rust~>MIR transform does

Félix Fischer (Jun 22 2020 at 00:05, on Zulip):

Because while reading these documents I feel like I don't quite understand what's going on, like, I'm missing a lot of context

Jonas Schievink (Jun 22 2020 at 00:28, on Zulip):

I only have superficial understanding of how MIR is built, but the docs at https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir_build/build/index.html have been helpful for me, in case you haven't read them yet

RalfJ (Jun 22 2020 at 07:22, on Zulip):

Félix Fischer said:

MIR semantics

this has been on my todo list since forever^^

RalfJ (Jun 22 2020 at 07:23, on Zulip):

It's something I hope the UCG will get to eventually

RalfJ (Jun 22 2020 at 07:23, on Zulip):

so far we just have some bits and pieces, and none of that is normative or ratified:
https://github.com/rust-lang/unsafe-code-guidelines/tree/master/wip

Jonas Schievink (Jun 22 2020 at 22:25, on Zulip):

RalfJ said:

Jonas Schievink so regarding the overlap check, IMO before we add lots of static approximations of what the actual allowed and disallowed overlap behavior of assignments is, and certainly before we have any optimizations that rely on it, we should figure out the dynamic MIR semantics regarding overlap: https://github.com/rust-lang/rust/issues/68364

Circling back to this – would it really be such a bad idea to have the validator check for cases that codegen already assumes today won't alias? I'm fairly sure this would have caught almost every recent destination propagation miscompile I had to debug, and if we figure out later that codegen is doing something wrong then we can always relax the validator too. (these assumptions wouldn't be exploited by optimization passes either, in fact I need to do the exact opposite right now)

RalfJ (Jun 23 2020 at 07:50, on Zulip):

it is probably a good idea to already check statically for things that we anyway rely on. I am just worried that we'll stop there, when I think we do need a precise dynamic description of what happens if we want to actually understand things.

RalfJ (Jun 23 2020 at 07:50, on Zulip):

(these assumptions wouldn't be exploited by optimization passes either, in fact I need to do the exact opposite right now)

interesting, I thought they could be useful but maybe we don't have such optimizations yet?

Jonas Schievink (Jun 23 2020 at 09:35, on Zulip):

In a way, codegen relying on this could count as an optimization. I don't think we have anything beyond that at the moment.

RalfJ (Jun 23 2020 at 16:56, on Zulip):

Jonas Schievink said:

In a way, codegen relying on this could count as an optimization.

fair

Jonas Schievink (Jun 24 2020 at 21:35, on Zulip):

#72632 should now be ready for review

Félix Fischer (Jun 25 2020 at 00:51, on Zulip):

Holy crap, good job!

Jonas Schievink (Sep 18 2020 at 23:04, on Zulip):

#72632 is now waiting on bors.

Jonas Schievink (Sep 18 2020 at 23:07, on Zulip):

And because I can't help myself, I wrote another copy propagation pass: #76723

This one propagates forwards, and only within individual basic blocks, but runs so fast that it doesn't regress compiler performance measurably. The result is faster incremental builds due to less MIR being stored in the cache, and faster CTFE. The pass is also pretty small and should be relatively easy to understand.

Last update: Jan 22 2021 at 13:15UTC