Stream: t-compiler

Topic: xLTO performance


Eric Rahm (Jul 30 2019 at 22:19, on Zulip):

Hey compiler team, we're seeing some interesting performance regressions when enabling cross-language ThinLTO in Firefox. The most prominent one is a 25% slowdown on a style benchmark (for Stylo, one of our Rust components). One of our devs, dmajor, tracked this down to a codegen issue for a specific function: before xLTO, a 7-byte nop instruction was used to 16-byte align the following instruction, but with xLTO enabled it appears to be generating seven 1-byte nop instructions instead. We were wondering if someone from the compiler team could help us look into this further (perhaps someone knows where this logic exists, at what point in compilation this happens, etc.). Any help or pointers on who to talk to would be really great! If the answer is just mw, then we can wait, but it would be nice to get a head start!
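
(For illustration only; the object file name below is hypothetical. The difference dmajor spotted is visible directly in a disassembly, since the same 7 bytes of alignment padding can be encoded either as one long nop or as seven 1-byte nops.)

```
# Illustrative only: inspect how the padding before the 16-byte-aligned
# instruction is encoded. The object file name is made up.
objdump -d stylo_hot_function.o | grep -B1 -A1 nop
# Pre-xLTO build: one 7-byte nop fills the padding:
#   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
# xLTO build: the same padding is emitted as seven 1-byte nops:
#   90                      nop        (repeated 7 times)
```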

simulacrum (Jul 30 2019 at 22:21, on Zulip):

@Eric Rahm Since it's almost certainly an LLVM issue, I would try to dump the LLVM IR with --emit=llvm-ir (not entirely sure how xLTO works, but presumably there's some point at which that can happen) and then try to manually run LLVM's opt tool on it
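
(A minimal sketch of that workflow, assuming a hypothetical reproducer file name: capture the IR rustc hands to LLVM, then replay LLVM's middle-end and codegen on it outside the xLTO link step.)

```
# File names here are hypothetical; the flags are standard rustc/LLVM options.
rustc -O -C codegen-units=1 --crate-type=lib --emit=llvm-ir stylo_repro.rs
opt -O2 -S stylo_repro.ll -o stylo_repro.opt.ll   # re-run LLVM's IR passes
llc -O2 stylo_repro.opt.ll -o stylo_repro.s       # lower to assembly and check the nops
```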

simulacrum (Jul 30 2019 at 22:22, on Zulip):

AFAIK we don't generate any assembly ourselves, we always go through LLVM, so it's probably an upstream bug

simulacrum (Jul 30 2019 at 22:24, on Zulip):

I don't really know the details of how Firefox builds, but I would also make sure that it's the same version of LLVM on both builds -- notably, we recently upgraded from 8 to 9
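
(One quick way to check: rustc reports its bundled LLVM in the verbose version output, so the two toolchains can be compared directly.)

```
rustc --version --verbose | grep 'LLVM version'
clang --version
```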

Eric Rahm (Jul 30 2019 at 22:24, on Zulip):

@simulacrum Thanks, looking at the IR is a great idea! For this build we're using clang 8 + Rust 1.36, so I think we're in sync on LLVM 8

simulacrum (Jul 30 2019 at 22:25, on Zulip):

Another thing you can potentially try is to see whether you can reproduce with -Ccodegen-units=1 passed to rustc
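
(A sketch of one way to do that; assuming the Rust side of the build goes through cargo, the flag can be injected via RUSTFLAGS.)

```
# Passes -C codegen-units=1 to every rustc invocation in the build.
RUSTFLAGS="-C codegen-units=1" cargo build --release
```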

simulacrum (Jul 30 2019 at 22:25, on Zulip):

(it's possible that xLTO is changing codegen unit boundaries somehow, I guess?)

simulacrum (Jul 30 2019 at 22:26, on Zulip):

@Alex Crichton might have more experience tracking down this sort of thing, not sure. also cc @WG-llvm

Eric Rahm (Jul 30 2019 at 22:26, on Zulip):

@simulacrum yeah I feel like there was a discussion on that for xLTO stabilization, but maybe that was PGO...

simulacrum (Jul 30 2019 at 22:27, on Zulip):

codegen units definitely can have a major effect on perf for the better (though also a major impact on compile times); however, since here we're not actually seeing differences in the "real IR" but rather in assembly lowering, I'd sort of expect that not to affect it?

simulacrum (Jul 30 2019 at 22:27, on Zulip):

(and the assembly is, well, nops)

Mike Hommey (Jul 30 2019 at 22:31, on Zulip):

We do build with codegen-units=1.

nagisa (Jul 30 2019 at 22:33, on Zulip):

aligning the code occurs after LTO happens, so whatever generates the machine code from LTO IR is to blame here.

nagisa (Jul 30 2019 at 22:34, on Zulip):

Looking at generated LLVM IR won’t help in any way

simulacrum (Jul 30 2019 at 22:35, on Zulip):

@nagisa but presumably can help with getting smaller IR to then run LLVM on?

simulacrum (Jul 30 2019 at 22:35, on Zulip):

(perhaps in e.g. gdb or similar to track down the error)

nagisa (Jul 30 2019 at 22:35, on Zulip):

As for the logic, it will likely be the x86 backend in LLVM. I suspect that the linker plugin will simply have different options for the backend compared to rustc (e.g. lower backend optimisation level)

Eric Rahm (Jul 30 2019 at 22:36, on Zulip):

Interesting, dmajor found the relevant x86 ASM backend code. So we're wondering if for some reason we lose track of the fact that we support long nop codes in LTO builds

nagisa (Jul 30 2019 at 22:37, on Zulip):

As for my last observation, I’m amused that this causes slowdowns, as all the modern Intel and AMD CPUs discard nops in the instruction decoder, and the decoder can process up to 6 instructions per cycle depending on the specific arch.

nagisa (Jul 30 2019 at 22:38, on Zulip):

So at most this should cause one extra cycle of delay for every such nop chain.

nagisa (Jul 30 2019 at 22:40, on Zulip):

@Eric Rahm it is plausible that CPU features could get lost somewhere in the linker plugin.

nagisa (Jul 30 2019 at 22:40, on Zulip):

That would explain the slowdown better IMO as e.g. SSE would no longer be used by default either.
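
(One way to check for that, as a sketch with a hypothetical file name: the per-function "target-cpu" and "target-features" attributes are recorded in the IR, so they can be inspected in whatever bitcode actually reaches the linker plugin.)

```
# With xLTO the "object" files in the Rust staticlib are really LLVM bitcode,
# so llvm-dis can disassemble them and the attributes can be grepped out.
llvm-dis gkrust_object.o -o - | grep -o '"target-cpu"="[^"]*"' | sort -u
llvm-dis gkrust_object.o -o - | grep -o '"target-features"="[^"]*"' | sort -u
```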

simulacrum (Jul 30 2019 at 22:43, on Zulip):

fwiw we do try to pass something into LLVM I think https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_ssa/back/linker.rs#L213

nagisa (Jul 30 2019 at 22:44, on Zulip):

/me reads the last comment on hex editing and finds it funny how other people were sceptical of the same thing as I was :D

nagisa (Jul 30 2019 at 22:45, on Zulip):

> fwiw we do try to pass something into LLVM I think https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_ssa/back/linker.rs#L213

Well, again, it's a question of what the Firefox build system looks like. I don't believe they link a bunch of C++ code into Rust; rather, they do the reverse.

nagisa (Jul 30 2019 at 22:45, on Zulip):

At which point what rustc does in its backend when invoking the linker is irrelevant.

simulacrum (Jul 30 2019 at 22:46, on Zulip):

Ah, yeah, that makes sense

Eric Rahm (Jul 30 2019 at 22:46, on Zulip):

@Mike Hommey might know, but yeah I think it's the reverse

Mike Hommey (Jul 30 2019 at 22:48, on Zulip):

The non-xLTO setup builds a static library for the Rust code, and that code is LTOed by rustc, but that shouldn't involve the linker.

Alex Crichton (Jul 31 2019 at 14:09, on Zulip):

I don't have much to add to this that nagisa hasn't already said; I agree that it's very likely the code generator and the settings passed to the code generator, which in this case is probably happening in LLD or the LLVM plugin for gold (or something like that). It may have to do with optimization levels configured for the code generator itself (which in LLVM IIRC can be separate from the IR optimization level)

Eric Rahm (Aug 06 2019 at 15:59, on Zulip):

Closing the loop here: it looks like we tracked this down to rustc using reasonable defaults for the CPU type when linking, but with xLTO we're calling lld directly, which has different, less optimal defaults. If we add flags to our lld call, the regression goes away.
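
(A sketch of the kind of fix described above; the CPU value and input/output names are illustrative, and the exact flag spelling depends on whether lld is invoked directly or through the compiler driver, e.g. as -Wl,-plugin-opt=... .)

```
# Tell lld's LTO codegen which CPU and opt level to target instead of
# relying on its defaults (file names below are made up).
ld.lld -shared -o libxul.so unified.o libgkrust.a \
    -plugin-opt=O2 -plugin-opt=mcpu=x86-64
```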

Eric Rahm (Aug 06 2019 at 15:59, on Zulip):

@mw FYI, this is what we mentioned during our team meeting today

mw (Aug 07 2019 at 07:50, on Zulip):

yes, that explanation makes sense to me

Last update: Nov 16 2019 at 01:40 UTC