Stream: t-compiler/wg-llvm

Topic: NaNs in windows-gnu float arthitmetic


Mateusz Mikuła (Oct 21 2019 at 22:56, on Zulip):

Hey,
Did my best about https://github.com/rust-lang/rust/issues/53254#issuecomment-544741814 on the Rust side.
How can I reduce it further to report it on LLVM bugzilla?

Nikita Popov (Oct 22 2019 at 08:13, on Zulip):

Looks like generated IR is the same apart from the GOT flag. In the backend it seems like everything is also the same up until fpow legalization, where an additional 32-byte call stack adjustment is emitted on pc-windows-gnu.

Mateusz Mikuła (Oct 22 2019 at 09:10, on Zulip):

When I replace transmute(0_i64) with __m64(0_i64) all platforms I tested give correct result.
I don't know enough about it but could it be alignment issue?

rkruppe (Oct 22 2019 at 11:52, on Zulip):

Do you have a godbolt link ready for the verson with __m64(0)?

Mateusz Mikuła (Oct 22 2019 at 11:53, on Zulip):

Sure, https://godbolt.org/z/Hmshca

rkruppe (Oct 22 2019 at 11:54, on Zulip):

So that doesn't use any MMX instructions. It just does a 64b scalar store to the stack (x86_mmx-typed) stack slot

rkruppe (Oct 22 2019 at 11:54, on Zulip):

More evidence that the MMX instructions are to blame

Mateusz Mikuła (Oct 22 2019 at 11:55, on Zulip):

Ah, it's mm0

Mateusz Mikuła (Oct 22 2019 at 11:56, on Zulip):

Could be another weird msvcrt issue, I'll see if I can try with ucrt.

gnzlbg (Oct 22 2019 at 12:32, on Zulip):

Honestly, I think the best fix would be to not use MMX intrinsics ever.

IIUC, the test programs are not using MMX, but the math library is, right ?

Mateusz Mikuła (Oct 22 2019 at 12:32, on Zulip):

Replacing msvcrt with ucrt doesn't help so it must be something else different between msvc and mingw.

gnzlbg (Oct 22 2019 at 12:33, on Zulip):

Ah no, @Mateusz Mikuła example uses pmovmskb

Mateusz Mikuła (Oct 22 2019 at 12:34, on Zulip):

Mingw powf just calls msvcrt pow, have fun guessing what it does internally.
pmovmskb comes from "unwrapped" _mm_movemask_pi8

gnzlbg (Oct 22 2019 at 12:35, on Zulip):

The thing is, the problem I reported happens even if you don't use MMX in Rust

gnzlbg (Oct 22 2019 at 12:35, on Zulip):

Your example uses MMX without calling emms

gnzlbg (Oct 22 2019 at 12:35, on Zulip):

We know such code doesn't work: https://github.com/rust-lang/rust/issues/57831

gnzlbg (Oct 22 2019 at 12:36, on Zulip):

It has UB of the form "calling a function with an incompatible ABI"

Mateusz Mikuła (Oct 22 2019 at 12:36, on Zulip):

All my code is extraction from corearch and packed_simd.

gnzlbg (Oct 22 2019 at 12:36, on Zulip):

The pmovmskb ? Then that's a bug in packed_simd (packed_simd should call emms when it uses it, or not use MMX at all like @rkruppe suggests)

gnzlbg (Oct 22 2019 at 12:37, on Zulip):

Notice that my original example does not use MMX though

gnzlbg (Oct 22 2019 at 12:37, on Zulip):

So it might be a different issue

Mateusz Mikuła (Oct 22 2019 at 12:38, on Zulip):

Original example has been fixed long time ago, I said it in one of the comments.
pmovmskb came from https://docs.rs/core_arch/0.1.3/x86_64-unknown-linux-gnu/src/core_arch/x86/sse.rs.html#2396-2398

Mateusz Mikuła (Oct 22 2019 at 12:38, on Zulip):

Now where _mm_movemask_pi8 did came from...

gnzlbg (Oct 22 2019 at 12:39, on Zulip):

Ah, I did not got that. I thought that some fix landed on LLVM that did not fix the original example, hence why the issue is still open.

gnzlbg (Oct 22 2019 at 12:39, on Zulip):

So yeah, that's a different bug.

gnzlbg (Oct 22 2019 at 12:39, on Zulip):

That code has undefined behavior, and the fix is either not to use MMX or to call emms

rkruppe (Oct 22 2019 at 12:41, on Zulip):

calling emms is not a fix tbh

rkruppe (Oct 22 2019 at 12:41, on Zulip):

I honestly think stdarch should just kill all the MMX intrinsics

rkruppe (Oct 22 2019 at 12:41, on Zulip):

Their usefulness was always controversial and they are actually impossible to use correctly

gnzlbg (Oct 22 2019 at 12:43, on Zulip):

They are not stable I think, so we could just do that.

gnzlbg (Oct 22 2019 at 12:44, on Zulip):

The problem is that there is no way for LLVM to generate reasonable codegen for when they are used in packed_simd, but at that point using inline assembly might be a better solution.

rkruppe (Oct 22 2019 at 12:44, on Zulip):

Why would they be used in packed_simd anyway?

gnzlbg (Oct 22 2019 at 12:44, on Zulip):

It supports 64-bit wide vectors

gnzlbg (Oct 22 2019 at 12:44, on Zulip):

e.g. f32x2

rkruppe (Oct 22 2019 at 12:44, on Zulip):

But why

rkruppe (Oct 22 2019 at 12:44, on Zulip):

Who uses those

gnzlbg (Oct 22 2019 at 12:45, on Zulip):

2D code using (x, y)

gnzlbg (Oct 22 2019 at 12:45, on Zulip):

where x and y are f32s

rkruppe (Oct 22 2019 at 12:46, on Zulip):

Does such code actually want to use MMX? Does it benefit from it vs scalar code or padding to f32x4?

gnzlbg (Oct 22 2019 at 12:46, on Zulip):

That code wants to efficiently test whether (x, y) < (z, w) returns (0, 0) or (0xff, 0xff)

rkruppe (Oct 22 2019 at 12:47, on Zulip):

That doesn't answer my question

gnzlbg (Oct 22 2019 at 12:47, on Zulip):

the scalar thing was much slower than the pmovmaskb operation, which does it at once

rkruppe (Oct 22 2019 at 12:47, on Zulip):

Ok, and what about the SSE equivalent?

rkruppe (Oct 22 2019 at 12:48, on Zulip):

because that's what you'd get with https://github.com/rust-lang/rust/issues/58168 anyway

gnzlbg (Oct 22 2019 at 12:48, on Zulip):

what would be the SSE equivalent ?

rkruppe (Oct 22 2019 at 12:49, on Zulip):

I don't know pmovmaskb or my SSE intrinsics well enough to tell you that, but I would be extremely surprised if there wasn't a reasonably short instruction sequence for it

gnzlbg (Oct 22 2019 at 12:49, on Zulip):

the three operations where this is used is the boolean m32x2::{all, any, none} operations where you can just get a mask, and test it - one can build that on top of m32x4, e.g., by using a shuffle to mix in the m32x2 with two other elements, and then do the SSE equivlaent

gnzlbg (Oct 22 2019 at 12:51, on Zulip):

e.g. for all you do a shuffle!(m32x2, {true, true});, for none you do a shuffle!(m32x2, {false, false}), etc. and then you follow both by a __mm_movmask_epi8 or similar

gnzlbg (Oct 22 2019 at 12:53, on Zulip):

@rkruppe wait.. _mm_movemask_pi8

gnzlbg (Oct 22 2019 at 12:53, on Zulip):

that's an SSE intrinsic..

gnzlbg (Oct 22 2019 at 12:54, on Zulip):

but it takes an MMX register as its argument..

rkruppe (Oct 22 2019 at 12:54, on Zulip):

lol

gnzlbg (Oct 22 2019 at 12:54, on Zulip):

I guess one can use __mm_movemask_ps instead there by doing the widening

rkruppe (Oct 22 2019 at 12:57, on Zulip):

To be clear, you are talking about a future where m32x2 is no longer mapped to the x86_mmx type, right? Because if that type is still involves, LLVM will still sometimes use mmN registers, which besides corrupting the x87 stack will cause unnecessary moves between register banks

rkruppe (Oct 22 2019 at 12:59, on Zulip):

Although I'm curious whether mapping it to e.g. i64 would be any better for performance.

rkruppe (Oct 22 2019 at 12:59, on Zulip):

Let's check what GCC can do

gnzlbg (Oct 22 2019 at 12:59, on Zulip):

@rkruppe m32x2 is not mapped to x86_mmx right now

rkruppe (Oct 22 2019 at 12:59, on Zulip):

It's not?

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

no

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

only i64x1 is... for some reason..

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

but packed_simd does not support that type

rkruppe (Oct 22 2019 at 13:00, on Zulip):

But f32x2 is, right?

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

no

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

only __m64 is mapped to x86_mmx

gnzlbg (Oct 22 2019 at 13:00, on Zulip):

but that intrinsic happens to use it, so when the transmute happens, the x86_mmx type is used

rkruppe (Oct 22 2019 at 13:01, on Zulip):

Oh, does packed_simd only cast to __m64 internally and temporarily

gnzlbg (Oct 22 2019 at 13:01, on Zulip):

yes

gnzlbg (Oct 22 2019 at 13:01, on Zulip):

because that SSE intrinsic takes an __m64

rkruppe (Oct 22 2019 at 13:01, on Zulip):

Even better

gnzlbg (Oct 22 2019 at 13:01, on Zulip):

https://godbolt.org/z/CTv-VG

gnzlbg (Oct 22 2019 at 13:01, on Zulip):

so there I use _mm_movemask_ps instead and everything just works nicely

gnzlbg (Oct 22 2019 at 13:01, on Zulip):

at least for that intrinsic

rkruppe (Oct 22 2019 at 13:02, on Zulip):

Great

gnzlbg (Oct 22 2019 at 13:09, on Zulip):

For the m32x2::all operation things are a bit different

gnzlbg (Oct 22 2019 at 13:09, on Zulip):

https://godbolt.org/z/X35ZR2

gnzlbg (Oct 22 2019 at 13:09, on Zulip):

The none and any operation can make use of the fact that, in the calling convention, the 32x2 vector will contain another two zeroed 32x2 elements behind it

gnzlbg (Oct 22 2019 at 13:11, on Zulip):

for the all mask there are a couple of things one can do to map the m32x2 into a m32x4, but all of them appear to generate unnecessary instructions, although arguably the MMX code has UB

gnzlbg (Oct 22 2019 at 13:11, on Zulip):

so it is not a fair comparison either

rkruppe (Oct 22 2019 at 13:12, on Zulip):

Now that you mention it, looking at this in isolation might actually be a bit misleading. The calling convention for <float x 2> on x86+SSE targets is pretty funky and in reality you'd probably always have those operations inlined

rkruppe (Oct 22 2019 at 13:12, on Zulip):

Since you apparently have a concrete use case, a larger snippet that uses the any/all/none operations in a representative way might be more instructive

gnzlbg (Oct 22 2019 at 13:12, on Zulip):

Yeah, you have a point

gnzlbg (Oct 22 2019 at 13:13, on Zulip):

In isolation, this is the best: https://godbolt.org/z/iysHeL

gnzlbg (Oct 22 2019 at 13:13, on Zulip):
pub unsafe fn m32x2_allSSE(x: m32x2) -> bool {
    let x: u64 = transmute(x);
    x == u64::max_value()
}
rkruppe (Oct 22 2019 at 13:13, on Zulip):

lol that's fair

gnzlbg (Oct 22 2019 at 13:13, on Zulip):

I think packed_simd does that in many cases

rkruppe (Oct 22 2019 at 13:14, on Zulip):

but also particularly misleading in isolation since it doesn't look so great when the mask is produced from SSE instructions

gnzlbg (Oct 22 2019 at 13:14, on Zulip):

https://github.com/rust-lang-nursery/packed_simd/blob/master/src/codegen/reductions/mask/fallback_impl.rs#L66

gnzlbg (Oct 22 2019 at 13:15, on Zulip):

IIRC in some targets and for some vectors LLVM was able to generate quite good code for that, when used in larger examples.

gnzlbg (Oct 22 2019 at 13:15, on Zulip):

But there were also cases where it wasn't due to a variety of llvm bugs, and then workarounds were introduced

gnzlbg (Oct 22 2019 at 13:15, on Zulip):

https://github.com/rust-lang-nursery/packed_simd/blob/master/src/codegen/reductions/mask/x86/sse.rs#L51

rkruppe (Oct 22 2019 at 13:15, on Zulip):

Anyway, I suspect that for the actual application/benchmark, some higher level algorithmic changes could allow doing everything in SSE registers directly (loads, compares, and reductions).

gnzlbg (Oct 22 2019 at 13:16, on Zulip):

I think that not dealing with EMMS is worth a performance hit

rkruppe (Oct 22 2019 at 13:17, on Zulip):

That too, but best if there's no performance hit to begin with

gnzlbg (Oct 22 2019 at 13:17, on Zulip):

and if that is not acceptable, then we'd just use inline assembly instead

gnzlbg (Oct 22 2019 at 13:17, on Zulip):

but packed_simd should not use MMX registers anywhere

rkruppe (Oct 22 2019 at 13:17, on Zulip):

You mean inline asm for the any/all/none intrinsics?

gnzlbg (Oct 22 2019 at 13:17, on Zulip):

not in the general case, but as a workaround over an LLVM bug

gnzlbg (Oct 22 2019 at 13:17, on Zulip):

all that parent module is workarounds over llvm bugs

rkruppe (Oct 22 2019 at 13:18, on Zulip):

Sure but still, such small inline asm seems like it would likely cause more performance problems (blocking combines, blinding regalloc and scheduling) than it's worth

rkruppe (Oct 22 2019 at 13:18, on Zulip):

But shrug try it if you have to

rkruppe (Oct 22 2019 at 13:18, on Zulip):

I would love to be able to kill __m64

gnzlbg (Oct 22 2019 at 13:23, on Zulip):

I think there is an open issue to at least restrict that, but yeah, should be killed

rkruppe (Oct 22 2019 at 13:25, on Zulip):

Restrict how?

rkruppe (Oct 22 2019 at 13:26, on Zulip):

For that matter, restrict what?

gnzlbg (Oct 22 2019 at 13:33, on Zulip):

To restrict __m64 to be the only SIMD type that lowers to x86_mmx

gnzlbg (Oct 22 2019 at 13:34, on Zulip):

right now, any #[repr(simd)] ADT containing a single field, with type i64, is lowered to x86_mmx

rkruppe (Oct 22 2019 at 13:35, on Zulip):

Ah

gnzlbg (Oct 22 2019 at 13:36, on Zulip):

I'd be happier if that were restricted further, e.g., to types with a #[repr(simd(x86_mmx))] attribute or similar.

I wouldn't oppose just completely removing all support for x86_mmx, and either removing the MMX intrinsics, or implementing them on top of SSE2.

gnzlbg (Oct 22 2019 at 13:36, on Zulip):

But I'd rather just remove them

gnzlbg (Oct 22 2019 at 13:36, on Zulip):

They are not in the path towards stabilization anyways

rkruppe (Oct 22 2019 at 13:36, on Zulip):

yes pls

gnzlbg (Oct 22 2019 at 13:36, on Zulip):

And MMX causes more trouble than it solves, those who need it can use inline assembly instead if they so desire

gnzlbg (Oct 22 2019 at 13:37, on Zulip):

and that's just as unstable as the MMX intrinsics

rkruppe (Oct 22 2019 at 13:37, on Zulip):

or just leave the cursed algorithm in C land and call it through FFI

gnzlbg (Oct 22 2019 at 13:39, on Zulip):

https://github.com/rust-lang/stdarch/issues/823

Mateusz Mikuła (Oct 22 2019 at 13:48, on Zulip):

So in the end we should close https://github.com/rust-lang/rust/issues/53254, fix packed_simd and nuke mmx from orbit?

rkruppe (Oct 22 2019 at 13:48, on Zulip):

Different order (fix packed_simd, then close issue) but yes that would be my preferred approach

gnzlbg (Oct 22 2019 at 15:02, on Zulip):

I think the issue can be closed, the packed_simd bug is a different bug, only tangentially related to the issue

Mateusz Mikuła (Oct 22 2019 at 16:47, on Zulip):

Thank you everyone for the investigation and sorry for cluttering this wg.

gnzlbg (Oct 24 2019 at 09:05, on Zulip):

@Mateusz Mikuła thank you for all your work reducing the examples and tracking this down too!

gnzlbg (Oct 24 2019 at 09:06, on Zulip):

That packed_simd bug has been opened for a long time, and now we can finally close it!

Last update: Nov 15 2019 at 09:45UTC