Stream: t-compiler

Topic: fast-math


Alexander Droste (Sep 27 2019 at 19:26, on Zulip):

Hi all!

I was wondering if there's any ongoing work on -ffast-math support in Rust?

This would be a great addition, as fast-math can lead to significant performance gains, e.g. in DSP code. In particular, it is sometimes crucial for vectorizing code such as the following snippet:

pub fn sum(arr: &[f32; 512]) -> f32 {
    let mut result = 0.0;
    for idx in 0..arr.len() {
        result += arr[idx];
    }
    result
}

I'm aware of std::intrinsics::{fadd_fast, fdiv_fast, ...}, but those can easily get verbose compared to the single-character operators (+, -, ...).
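
For reference, a minimal sketch of the same loop written against those intrinsics (nightly-only, and unsafe because the fast intrinsics require finite operands and results):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;

pub fn sum_fast(arr: &[f32; 512]) -> f32 {
    let mut result = 0.0;
    for idx in 0..arr.len() {
        // UB per fadd_fast's contract if an operand or the result
        // is NaN or infinite.
        result = unsafe { fadd_fast(result, arr[idx]) };
    }
    result
}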

Josh Triplett (Sep 27 2019 at 22:14, on Zulip):

I don't think we will ever want an option exactly equivalent to C -ffast-math that combines safe and unsafe optimizations. But there's interest in providing access to faster floating-point in a smoother manner than individual intrinsics.

centril (Sep 27 2019 at 22:43, on Zulip):

I would be fine with providing types other than f32 and f64 (e.g. r32 and r64) that would offer such faster FP. I am, however, against introducing global flags or changing the behavior of the existing types.

Josh Triplett (Sep 27 2019 at 23:19, on Zulip):

I'm aware of that position. I personally feel there should be an option to change algorithms written in terms of the standard floating point types to use safe optimizations, with careful definition of "safe".

centril (Sep 27 2019 at 23:21, on Zulip):
#[cfg(feature = "fast_math")]
type fp32 = r32;

#[cfg(not(feature = "fast_math"))]
type fp32 = f32;

and then cargo build --features fast_math
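
For concreteness, the matching Cargo.toml entry would use the standard features syntax (a sketch):

# Cargo.toml of the crate defining the alias
[features]
fast_math = []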

centril (Sep 27 2019 at 23:21, on Zulip):

I think that satisfies "option to change algorithms"

Josh Triplett (Sep 27 2019 at 23:29, on Zulip):

Not if you have to put that at the top of every file using floating point, and in every crate.

centril (Sep 27 2019 at 23:42, on Zulip):

That code only needs to be in one crate; the main work is in Cargo.toml instead. Cargo could also add --global-feature fast_math to propagate the cfg. Alternatively, this could be built on option_env!(..)
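
A sketch of the option_env! route (hypothetical here: selecting a type from such a const would still need the enhanced CTFE mentioned above):

// Read at compile time; true if FAST_MATH was set in the build environment.
const FAST_MATH: bool = option_env!("FAST_MATH").is_some();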

centril (Sep 27 2019 at 23:48, on Zulip):

I could see a global flag for r32 to switch off the default fast-math behavior, but not a flag to switch it on for f32

Josh Triplett (Sep 27 2019 at 23:50, on Zulip):

Why does that distinction matter given that it's opt-in either way? I can see an argument for why f32 couldn't default to it and r32 could (though I don't necessarily agree with that argument), but I don't see the argument against an opt-in for f32.

centril (Sep 27 2019 at 23:53, on Zulip):

Because it becomes opt-in for the crate author, who was the one who made the assumptions, not the end user of the crate somewhere deep in the dependency graph.

Also, I think there should be a type which is deterministic-ish and standards compliant and since f32 is that today, it seems to me it should stay that way.

centril (Sep 27 2019 at 23:53, on Zulip):

I also don't think floating point numbers are so special that they deserve so much more attention from the language. There are other things which could get global flags.

simulacrum (Sep 28 2019 at 00:56, on Zulip):

I think an attribute might make sense, which could then be applied at the crate/module/function/block level, modulo possible hygiene interactions with macros and such.
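
Hypothetical syntax only, to make the idea concrete (no such attribute exists, so this does not compile today):

#[fast_math] // could equally sit on a block, module, or the whole crate
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}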

centril (Sep 28 2019 at 01:06, on Zulip):

Scoped attributes seem like costly solutions spec-wise

simulacrum (Sep 28 2019 at 01:27, on Zulip):

Hm, I don't see how so? Like, attributes are pretty simple I feel

centril (Sep 28 2019 at 01:30, on Zulip):

How do they interact with generic functions and inlining?
They may be relatively easy to implement in a stateful compiler, but that doesn't mean it's easy to specify.
Moreover, to my knowledge, we do not have any such "propagate runtime behavior" attribute at the moment, so it would "open a new axis" in the language design.

centril (Sep 28 2019 at 01:31, on Zulip):

Meanwhile, types are a known concept which is not fundamentally different from what we already have, and which we know how to specify

simulacrum (Sep 28 2019 at 02:24, on Zulip):

rustc_inherit_overflow_checks :)

simulacrum (Sep 28 2019 at 02:24, on Zulip):

but true.

centril (Sep 28 2019 at 02:24, on Zulip):

That was bound to be brought up lol :D

simulacrum (Sep 28 2019 at 02:25, on Zulip):

But "syntactically in the attributed scope" seems feasible?

centril (Sep 28 2019 at 02:27, on Zulip):

I think it's implementable (rustc_inherit_overflow_checks does so) but I don't think it's a good idea

simulacrum (Sep 28 2019 at 02:27, on Zulip):

My understanding is that fast math is purely for operators, basically - from the type system POV it's no different from adding unsafe(?) methods

simulacrum (Sep 28 2019 at 02:28, on Zulip):

Is that right? Like, we could expose the relevant llvm intrinsics and people could newtype implement this today
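
A rough sketch of that newtype route, using the existing nightly fadd_fast intrinsic as a stand-in for properly exposed operations (addition only; the other operators would follow the same pattern):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct R32(pub f32);

impl Add for R32 {
    type Output = R32;
    #[inline]
    fn add(self, rhs: R32) -> R32 {
        // Safety: fadd_fast's contract requires finite operands and result.
        R32(unsafe { fadd_fast(self.0, rhs.0) })
    }
}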

centril (Sep 28 2019 at 02:30, on Zulip):

taking the expose-intrinsics route seems sensible

centril (Sep 28 2019 at 02:30, on Zulip):

at least as a start

simulacrum (Sep 28 2019 at 02:30, on Zulip):

And f32 isn't special-cased anywhere, right? Other than having primitive literal syntax (1.0) for it.

centril (Sep 28 2019 at 02:31, on Zulip):

that probably needs to be checked to answer with authority =)

centril (Sep 28 2019 at 02:31, on Zulip):

but it seems likely

centril (Sep 28 2019 at 02:31, on Zulip):

aside from pattern matching, but that's on its way out

simulacrum (Sep 28 2019 at 02:31, on Zulip):

Right yeah

simulacrum (Sep 28 2019 at 02:32, on Zulip):

Do people want this as a global switch?

simulacrum (Sep 28 2019 at 02:33, on Zulip):

Like, to me that seems not great

simulacrum (Sep 28 2019 at 02:33, on Zulip):

From a usability standpoint - even ignoring our desire to avoid it in the language

centril (Sep 28 2019 at 02:37, on Zulip):

I believe these are the options (some mutually compatible):

  1. Global switch for f32 [opt-in, opt-out]

  2. Expose LLVM intrinsics but keep f32 as the type.

    a) Newtype f32 with those.

  3. Introduce a primitive type r32

    a) Global switch for r32 [opt-in, opt-out]

    b) Add the ability for Cargo to have --global-feature fast_math
    c) Use option_env!(...) instead, with enhanced CTFE

  4. Scoped attributes

centril (Sep 28 2019 at 02:38, on Zulip):

I'm OK with 2. and 3.

centril (Sep 28 2019 at 02:38, on Zulip):

I do believe many people want 1.

Josh Triplett (Sep 28 2019 at 09:58, on Zulip):

@simulacrum Clarification: this isn't just for operators, it's also for code generation between functions and similar.

Josh Triplett (Sep 28 2019 at 09:59, on Zulip):

So, for instance, you could do multiple operations and not truncate/round between them.

Josh Triplett (Sep 28 2019 at 10:00, on Zulip):

That isn't just about addition or multiplication (though that's important), it's also about any other floating point function.
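
One concrete instance is contraction: fusing the two roundings in a * b + c currently requires asking for the fma explicitly (a sketch using the stable f32::mul_add):

fn fma_demo(a: f32, b: f32, c: f32) -> (f32, f32) {
    let two_roundings = a * b + c;      // rounds after the mul and after the add
    let one_rounding = a.mul_add(b, c); // fused multiply-add: a single rounding
    (two_roundings, one_rounding)
}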

Alexander Droste (Sep 28 2019 at 10:20, on Zulip):

Thanks everyone for giving this some thought!

I much agree that it would be nice to scope fast-math and not have it globally enabled.

Looking at the options @centril listed:
1) Nah, I agree, a global switch would most likely apply fast-math to code where it's not anticipated, breaking things
2) Would this require casting when interacting with f32?
3) still processing this variant :)
4) As far as I can tell, I'd probably prefer this variant as it would be explicit which part of the code is using fast-math. fast-math could be seen as a property which a function or module could be designed for/robust against.

rkruppe (Sep 28 2019 at 10:25, on Zulip):

Newtypes are not great for this. "fast/deterministic" already has problems (some discussed above; I'd also add the extra annotation burden for both library code that tries to be generic over the different types and its users), but that binary choice is an illusion. In reality there are at least four different kinds of fast-math flags (FMFs) that should be offered independently: contraction, algebraic rewrites, approximating built-in functions, existence of nans/infs/signed zeros. Many of those can usefully be subdivided further (for reference, LLVM currently has 7 and would have more if not for storage space limitations).

So the newtyping approach leads to a combinatorial explosion of types. This can cause various problems, but I am especially worried about the code size impact it has. Besides multiplying any generic code that handles float values without caring for the fast-math flags (e.g., Vec and slice methods), we'd also probably need a massive (also generic, but massively monomorphized) matrix of operator overloads to allow some interoperability between values computed with different FMFs. And this will be imperfect, e.g. if cond { /* compute f32 with some FMFs */ } else { /* compute f32 with slightly different FMFs */ } will need an explicit cast.
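
A hypothetical illustration of both points (all type and function names here are invented): one newtype per flag combination, plus conversions to move between them.

#[derive(Clone, Copy)]
struct F32Contract(f32); // hypothetical: contraction allowed
#[derive(Clone, Copy)]
struct F32Reassoc(f32);  // hypothetical: algebraic rewrites allowed

impl From<F32Contract> for F32Reassoc {
    fn from(x: F32Contract) -> Self { F32Reassoc(x.0) }
}

fn pick(cond: bool, a: F32Contract, b: F32Reassoc) -> F32Reassoc {
    // Both arms hold a plain f32 underneath, yet an explicit
    // conversion is needed to give the `if` a single type.
    if cond { F32Reassoc::from(a) } else { b }
}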

There's also a related problem with libms, which mostly have a monomorphic interface in terms of f32 and f64, but the optimizer should see FMFs applied to those calls (so it can perform optimizations) even if the libm itself never sees the FMFs. There's no good way to expose this with intrinsics or newtypes AFAIK.

rkruppe (Sep 28 2019 at 10:30, on Zulip):

This is not to say I endorse a special new language feature such as global or scoped (sets of) flags. Propagating through function boundaries is necessary, and trying to do that with ad-hoc flags doesn't play nicely with the compilation model (and also with abstraction boundaries: only some functions want to inherit FMFs). But if a satisfactory solution to that can be found, it would solve the problems newtypes have.

centril (Sep 28 2019 at 20:46, on Zulip):

2) Would this require casting when interacting with f32?

Since it's a newtype (struct R32(pub f32);), you'd need to wrap manually. The main benefit of a primitive would be to feel more built-in. Though maybe we should consider user-defined literals...
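
That manual wrapping, sketched against the R32 newtype from earlier:

fn roundtrip(raw: f32) -> f32 {
    let fast = R32(raw) + R32(1.0); // wrap both operands to use the fast ops
    fast.0                          // unwrap to hand a plain f32 back out
}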

I'd also add the extra annotation burden for both library code that tries to be generic over the different types and its users) but that binary choice is an illusion.

Ostensibly f32 and f64 already incur such a burden, because some authors want to deal with both, so it's not a novel burden, though it is likely that the desire to be generic would increase. I think we can reduce that burden through impl Trait, associated_type_defaults, associated_type_bounds, trait_aliases, and so on, to make the syntactic overhead substantially smaller.
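
A sketch of such generic code, using the num_traits crate's Float trait as a stand-in (an r32 could implement the same trait):

use num_traits::Float;

// One bound covers f32 and f64 today, instead of duplicated code.
fn sum_of_squares<T: Float>(xs: &[T]) -> T {
    xs.iter().fold(T::zero(), |acc, &x| acc + x * x)
}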

In reality there are at least four different kinds of fast-math flags (FMFs) that should be offered independently: contraction, algebraic rewrites, approximating built-in functions, existence of nans/infs/signed zeros. [...]

That's a lot of complexity. I'm not sure they should all be offered independently, except perhaps as exposed intrinsics.

So the newtyping approach leads to a combinatorial explosion of types.

The newtype / r32 approach is intended as a fairly blunt instrument that says "prioritize performance at the cost of standards compliance and reliability, within the limits soundness provides". I'm not sure we have to provide control over every niche through types. We could provide some global flags to tweak the semantics of r32 if we communicate from the get-go that crate authors using these types should be ready for such semantic changes through flags.

rkruppe (Sep 29 2019 at 07:15, on Zulip):

In reality there are at least four different kinds of fast-math flags (FMFs) that should be offered independently: contraction, algebraic rewrites, approximating built-in functions, existence of nans/infs/signed zeros. [...]

That's a lot of complexity. I'm not sure all should be offered independently except as perhaps exposed intrinsics.

I am quite sure they should. Most users will still default to enabling all or none for convenience (as they do in C and Fortran, where these distinctions exist), but a greater level of control is absolutely needed sometimes -- some aspects of "fast-math" break your algorithm while others are essential for its performance. The step from "use this newtype" to "write ALL your math out as intrinsics" seems too steep to be acceptable, so I fear people will either fall back to no FMFs at all (bad, since they won't get C-competitive performance) or use the blunt "fast-math" hammer and try to work around the issues it causes as they go (bad, since they jeopardize their program's reliability).

I also don't think it is that much complexity. No matter what approach we take precisely, there will definitely be aliases for common groups of FMFs (so the UX is about the same for those who don't care), and on the implementation side it's just a bitset instead of a boolean to track. Also, alternative implementations that don't care can always collapse FMFs (e.g., into a binary fast/precise) or outright ignore them. Embracing finer-grained FMFs does shift the balance towards somewhat more complex ways for users to choose FMFs, but personally I think those approaches are strong candidates even with a binary switch (e.g., due to the "libm problem" I described earlier).

Another consideration is that fast-math is not the only "modification of float rules" knob, there's also much space to go in the opposite direction and constrain optimizations more to allow users to change rounding mode, inspect and modify fp exception flags, install non-default fp exception handling, and preserve NaN bit patterns. Nobody has even sketched a design of what this would look like in Rust, but since it similarly needs to affect all primitive operations and some function calls within a certain scope or program slice, it seems plausible to me we could use the same mechanism as for FMFs.

rkruppe (Sep 29 2019 at 07:27, on Zulip):

The newtype / r32 approach is intended as a fairly blunt instrument that says "prioritize performance at the cost of standards compliance and reliability, within the limits soundness provides". I'm not sure we have to provide control over every niche through types. We could provide some global flags to tweak the semantics of r32 if we communicate from the get-go that crate authors using these types should be ready for such semantic changes through flags.

A bit tangential but global flags (as in, affecting all crates in the crate graph) would be useless IMO. The whole reason for finer-grained flags is to allow different parts of the program different sets of optimizations. If it's a whole-program switch, all code using r32 has to be correct under the full set of FMFs, so it might as well use those and get more optimization potential. But of course, in reality library authors won't be perfect about this, so leaving this choice to the user who compiles the crate graph causes the same problems (to a lesser degree) as a global flag that changes all f32-using code to "fast-math mode".

Note that even C and Fortran let you choose on a file-by-file basis. For contraction, there is even a pragma in the C standard (#pragma STDC FP_CONTRACT <on/off>) that allows one to choose on a statement-by-statement basis.

gnzlbg (Oct 01 2019 at 09:26, on Zulip):

@Alexander Droste as you probably already noticed, that would be a new language feature and would need an RFC

gnzlbg (Oct 01 2019 at 09:33, on Zulip):

It is possible to make correct unsafe Rust code exhibit undefined behavior by changing floating point arithmetic, so solving the problem isn’t as easy as just adding compiler flags

Alexander Droste (Oct 01 2019 at 12:02, on Zulip):

@gnzlbg Makes sense and thanks for the heads up.

Josh Triplett (Oct 01 2019 at 12:45, on Zulip):

You can make correct unsafe code exhibit undefined behavior by changing any defined feature in the language.

Josh Triplett (Oct 01 2019 at 12:45, on Zulip):

Just by having a conditional testing the existing behavior and performing undefined behavior if it goes the other way.

centril (Oct 01 2019 at 12:46, on Zulip):

I don't see how that negates @gnzlbg's point.

Josh Triplett (Oct 01 2019 at 12:47, on Zulip):

I don't think that, by itself, implies we can never ever change any behavior. The question is whether anyone is relying on that behavior, as well as whether that behavior was actually defined in practice or whether the documentation said one thing but the code said another.

Josh Triplett (Oct 01 2019 at 12:48, on Zulip):

For instance, the documentation says one thing about floating-point precision, but if you rely on that as an ironclad spec, you will find different behavior on 32-bit x86.

centril (Oct 01 2019 at 12:48, on Zulip):

then there's a bug in 32-bit x86

pnkfelix (Oct 01 2019 at 12:48, on Zulip):

Indeed; we have other cases (like integer overflow) where one can observe different behaviors depending on the compiler; but we have specified the range of behaviors there.

Josh Triplett (Oct 01 2019 at 12:49, on Zulip):

In practice, it isn't actually something you can rely on.

Josh Triplett (Oct 01 2019 at 12:49, on Zulip):

You could say there's a bug in 32-bit x86. Or you could say there's a bug in the docs.

rkruppe (Oct 01 2019 at 12:49, on Zulip):

I think you'd actually be hard-pressed to find any explicitly specified behavior that is inconsistent with -ffast-math or x87 precision issues. To some degree this is annoying language lawyering but e.g. we never actually say anywhere in the reference that fp addition is correctly rounded

pnkfelix (Oct 01 2019 at 12:50, on Zulip):

@Josh Triplett can you be more specific about what the docs say w.r.t. floating point precision that is contradicted by 32-bit x86?

pnkfelix (Oct 01 2019 at 12:50, on Zulip):

(not that I doubt you; at this point I'm just curious since you seem to have something concrete in mind)

Josh Triplett (Oct 01 2019 at 12:50, on Zulip):

I would say there's a bug in the docs, and a cautious and correct programmer would observe the actual behavior.

rkruppe (Oct 01 2019 at 12:51, on Zulip):

nitpick: it's only the tier 1 i586-* target where these issues crop up, all the mainstream 32 bit x86 targets require SSE2 and thus have correctly rounded f32 and f64 arithmetic

rkruppe (Oct 01 2019 at 12:51, on Zulip):

Before anyone tries to reproduce this with an i686-* rustc

Josh Triplett (Oct 01 2019 at 12:52, on Zulip):

Sure. The specs don't quite say it outright, but they imply that you won't get excess precision: they define the floating-point types by reference to IEEE 754, and while 754 doesn't prohibit excess precision, a strict reading of 754 says you should have exactly the specified precision.

Josh Triplett (Oct 01 2019 at 12:53, on Zulip):

And yes, sorry, i586. Though you'd get the same behavior with i686 if you actually build for a target where you can't assume SSE.
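
An illustrative sketch of where this bites (not guaranteed to reproduce; it depends on how the x87 code is generated and when values spill to memory):

fn excess_precision_demo(a: f32, b: f32, c: f32) -> f32 {
    // On an x87-only target the intermediate a * b may be held at
    // extended precision, so the final f32 result can differ from a
    // target with correctly rounded f32 arithmetic (e.g. SSE2).
    a * b + c
}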

Josh Triplett (Oct 01 2019 at 12:54, on Zulip):

My point is, when the spec and the implementation differ, sometimes the spec is wrong, sometimes the implementation is wrong, and sometimes you could argue either case.

Last update: Nov 16 2019 at 01:05 UTC