Stream: project-safe-transmute

Topic: performance


Ryan Levick (Dec 10 2019 at 15:40, on Zulip):

One of the concerns with the current proposal is that it will not be performant enough for many use cases and thus people will be forced to resort back to unsafe transmute instead.

It does seem like in a large amount of cases the compiler is sufficiently smart enough to boil away the runtime checks through constant folding since most of the checks are const. For example, the compiler is able to boil away all runtime checks between two types that are not trivially related (i.e., transmuting works only because they happen to share the same layout). There is at least one case however where not all runtime checks are eliminated despite there actually being no need for them.

Given this, there might be sufficient reason to reject the current proposal. Not only does the user often have to deal with unwrapping results from casting where it's obvious that the cast cannot fail, but also the compiler cannot always guarantee that unnecessary runtime checks are eliminated. Thoughts and/or questions?

nikomatsakis (Dec 10 2019 at 16:01, on Zulip):

This doesn't worry me very much. I'd be curious to see numbers.

Reasons that it doesn't worry me:

However, it seems like the first two arguments are ones that are best made with experience and examples.

I guess I also hope that we will be able to grow the language to address things like the UB that can result from uninitialized memory and padding, which I imagine is responsible for a lot of the cases where this API would "almost" apply but doesn't quite (is that inaccurate?).

Ryan Levick (Dec 10 2019 at 16:07, on Zulip):

The counter argument to that is that if performance is not of utmost concern to you, why not just copy? I would imagine that those who want to use this API would like for it to have the same runtime characteristics as std::mem::transmute with the addition of runtime checks when those checks would be necessary in the unsafe code. I think this is a perfect example of a "zero-cost abstraction". Ideally there would be no better way to write this by hand using std::mem::transmute. There is definitely a way for us to have an API that does not require runtime checks unless completely necessary. The only question is can we make it nice to use.

The pre-RFC in its current form doesn't really _directly_ address uninitialized memory or padding, it just ensures that the user does not accidentally transmute to or from memory that would cause UB.

nikomatsakis (Dec 10 2019 at 16:29, on Zulip):

@Ryan Levick I guess it depends on the scope too =)

I'm assuming that there might be something like Vec<T> => Vec<U> conversions, which seems very powerful

Lokathor (Dec 10 2019 at 20:06, on Zulip):

It's both powerful and also fairly easy. Crates already exist that can do it (though I cannot guarantee they do it with 0 overhead).

Yato (Dec 10 2019 at 20:45, on Zulip):

My crate vec-utils maps from Vec<T0>, Vec<T1>, Vec<T2>, ... to Vec<U> (when given a mappi g functuon) which reuses the largest valid allocation and also provides the functionality to reuse the allocation of a Vec with 0 extra cost (sound Vec transmute).

Lokathor (Dec 10 2019 at 20:45, on Zulip):

@Ryan Levick In the godbolt you linked, example and beispiel are identical. If you simply replace the body of example2 and beispiel2 with calls to example and beispiel then the "2" functions will also end up with identical code. You can also remove a jump from the example2 that you posted by replacing the panic!("This can't happen") with unreachable!.

Basically the optimizer is (and always will be) a quirky beast, but I don't think that means that the design itself has any inherent flaws.

comex (Dec 11 2019 at 01:26, on Zulip):

There is at least one case however where not all runtime checks are eliminated despite there actually being no need for them.

In this case, the only runtime check that isn't eliminated is the check to see if the value is > 1. This happens because rustc only communicates the <= 1 assumption to LLVM when loading from a bool-typed reference, whereas in your example the load doesn't happen until after the conversion.

comex (Dec 11 2019 at 01:27, on Zulip):

But that seems like an edge case. The only time that the > 1 check would be redundant is when converting from bool to bool, as in your example, but (a) that's useless, (b) to handle situations where it might come up in generic code, the standard library could add a specialization to skip the checks when the source and target types are the same.

comex (Dec 11 2019 at 01:31, on Zulip):

That said, I also think that the idea of an overridable trait method that validates the slice data (i.e. from_bytes) is not super useful as a whole. What is it useful for other than &[bool]?

rkruppe (Dec 11 2019 at 09:16, on Zulip):

The case where source and target type have the same invariants does matter. It doesn't only happen when the types are identical (which is indeed a corner case and can be optimized transparently by specialization) but also whenever newtypes are involved. Being able to safely type-pun in those cases without performance overhead would be a great enabler for type-safer abstractions (as the status quo is needing to copy potentially large amounts of memory or needing unsafe). And libstd can't specialize those cases to be zero-cost as they involve downstream types and user-defined invariants.

Taking this consideration to the logical extreme probably leads to a design where "validate whether these bytes are OK for that type" is not the central operation (maybe one that happens occasionally at system boundaries). I'm not saying such a design should be adopted (don't have to solve every use case at once) but there are use cases for such conversions.

Ryan Levick (Dec 11 2019 at 10:07, on Zulip):

@Lokathor @comex I don't believe this is an edge case. This also applies to other conversions (e.g., numbers to c-style enums). I will spend some more time thinking of examples to look over. In any case, given the performance oriented nature of this feature, I think it's important that we try to eliminate performance foot-guns. The code as it reads right now does not make it obvious that runtime checks will not be performed in many cases _and_ it certainly does not make it obvious where the compiler can eliminate all runtime checks and where it cannot. Ideally bool_new_type.cast::<bool>() would not return a Result and would be guaranteed to not perform any runtime checks.

Ryan Levick (Dec 11 2019 at 10:27, on Zulip):

For posterity here's a non-trivial example of Rust doing the right thing with regards to eliminating unnecessary runtime checks (only bool is checked when it has to be): https://godbolt.org/z/4M3dGp

Ryan Levick (Dec 11 2019 at 10:34, on Zulip):

And here's an example where two structs are structurally identical to each other and should therefore be directly transmutable to and from each other but Rust is not smart enough to see that and so it inserts unnecessary runtime checks: https://godbolt.org/z/evPTtK

rkruppe (Dec 11 2019 at 10:41, on Zulip):

For posterity here's a non-trivial example of Rust doing the right thing with regards to eliminating unnecessary runtime checks (only bool is checked when it has to be): https://godbolt.org/z/4M3dGp

I'm not sure what this example is supposed to illustrate. The B -> A direction goes through a manual impl that does nothing but check the bool (so there's nothing to optimize out in those examples), while the B -> A direction only has size and align checks to optimize out and it's not news that those can be optimized out very well. The more interesting case would be if (1) B also had a bool field in the same place, and (2) the impls were more mechanical, closer to how they would actually be generated by rustc/derives. In particular note that the impl FromBytes for A is missing size and alignment checks and hard-codes the knowledge that fields a and c are FromAnyBytes.

Ryan Levick (Dec 11 2019 at 11:23, on Zulip):

It was just meant to be the same example as we’ve seen before just in a non-trivial case to show that the compiler can easily see through aggregate types. If that’s obvious to you than there’s no need to look at it.

Ryan Levick (Dec 11 2019 at 11:23, on Zulip):

I’ll fix the second example. I don’t expect it to change anything.

Ryan Levick (Dec 11 2019 at 11:24, on Zulip):

@rkruppe can you explain what you mean by “more mechanical “?

rkruppe (Dec 11 2019 at 11:33, on Zulip):

I would expect from the pre-RFCs that the impls for aggregates that require some validation would be auto-generated by straight-forwardly composing calls to validate each field in turn. The impl in your example, on the other hand, looks like a lot of thought was put into identifying the least amount of work to do, from only considering some fields to manually computing the offset of each field to inlining the validity check for it. Even if some compiler magic is used to ensure e.g. absence of padding for ToBytes types, I don't think we should work under the assumption that the generation of the trait impls happens late enough in the compiler to be able to modify its behavior based on e.g. whether field types are FromAnyBytes or use layout computation results.

Ryan Levick (Dec 11 2019 at 12:36, on Zulip):

I believe that's a limitation of the current proposal that was meant to be addressed by what Josh posted (and what you, @rkruppe helped with I believe). Without that addendum, I believe each from_bytes implementation would need to be generated separately. How that would be done is not specified in the proposal so far. This add more support for the need for a separate validation function IMO.

rkruppe (Dec 11 2019 at 13:31, on Zulip):

FWIW I think it would be interesting to see if just doing the conversion for each field and checking if it succeeds (like the first option in Josh's post) does optimize well. But separate validation has other advantages so maybe it's not worth the trouble.

Ryan Levick (Dec 11 2019 at 13:41, on Zulip):

The example I posted above is incorrect because the A struct has padding (a fact I didn't recognize at first and a good reminder of why this RFC is needed). It would not implement ToBytes and thus cannot be used to cast from. I'll change that

Ryan Levick (Dec 11 2019 at 13:46, on Zulip):

This should be a correct version: https://godbolt.org/z/3fZwMQ This shows that there is a runtime cost to casting between types that require validation even when the validation is the same between those types and a sufficiently smart compiler could remove those checks.

Ryan Levick (Dec 11 2019 at 13:47, on Zulip):

@rkruppe I will try the per field validation and see what happens.

Ryan Levick (Dec 11 2019 at 13:58, on Zulip):

@rkruppe looks like doing validation per field optimizes away: https://godbolt.org/z/4uG-mY

Ryan Levick (Dec 11 2019 at 14:00, on Zulip):

So far the only runtime cost you seem to pay above what could be written by hand is when two types require validation and they are structurally equal.

rkruppe (Dec 11 2019 at 17:05, on Zulip):

That makes sense and I would expect it to generalize. When the to_bytes + from_bytes pair is fully inlined (necessary for low overhead anyway) then for the from_bytes code all the sizes being compared are constants and the alignment checks are all on constant offsets from the source reference (whose ABI alignment is usually known) and thus amendable to peephole arithmetic simplifications that LLVM is good at.

That just leaves the bitstring validation. I don't think it's realistic to expect that to be optimized away, so if the costs of redundant validation matter then some deeper redesign of the traits is probably required.

Lokathor (Dec 11 2019 at 17:44, on Zulip):

Honestly, I wouldn't have otherwise expected that types with any special bit pattern requirements would be validat all for using with a safe transmute.

I would compare this to get_unchecked instead of index. Indexing isn't really free, but you can bypass the safety regulations if you want extra speed in some contexts, and then it's your fault if you do it wrong.

Ryan Levick (Dec 11 2019 at 18:08, on Zulip):

Is the right thing to do to suggest people model their types different (e.g., booleans as u8), if they don't want runtime checks at all? I think it's a valid question to ask if we should support types that have special validation requirements at all, but it feels like the right thing to do - The performance hit is not so great that it represents a potential performance cliff that people could accidentally run off it.

@rkruppe one thing I'm not sure about is that in most relevant use cases if the length and alignment will be known. zerocopy has a length and alginment verified slice which allows them to skip those checks once they've been performed. For type to type casting, I believe inlining should always make that obvious to LLVM, but I can imagine some scenarios where it might be harder for rustc/LLVM to know. For instance, when reading from different parts of a buffer. Requiring this makes the API much more complex though...

rkruppe (Dec 11 2019 at 18:16, on Zulip):

Good point, I was only thinking about the type casting use case. In the general case I don't expect from_bytes with full (size+align+...) validation per field will be optimized so well, since that requires propagating information from branch conditions into the downstream code, which LLVM isn't good at.

Ryan Levick (Dec 11 2019 at 18:44, on Zulip):

I'll try to make a worse case scenario test case.

comex (Dec 11 2019 at 23:09, on Zulip):

I don't believe this is an edge case. This also applies to other conversions (e.g., numbers to c-style enums).

For numbers to C-style enums, though, you usually need a runtime check, just like numbers to bools. So not optimizing it out wouldn't be a problem.

comex (Dec 11 2019 at 23:10, on Zulip):

But @rkruppe's mention of newtypes is more convincing.

comex (Dec 11 2019 at 23:11, on Zulip):

I'm definitely interested in changing the API to not just be about bytes.

comex (Dec 11 2019 at 23:13, on Zulip):

To quote myself from the Discourse thread about a possible alternative:

trait CastSliceTo<T> {
    fn cast_slice(slice: &[Self]) -> Result<&[T], TransmuteError>;
}

The name isn't the best, but basically T: FromAnyBytes would be equivalent to u8: CastSliceTo<T>, and T: ToBytes would be equivalent to T: CastSliceTo<u8>.

comex (Dec 11 2019 at 23:13, on Zulip):

But really there should probably be infallible and fallible variants.

comex (Dec 11 2019 at 23:17, on Zulip):

Then you could imagine that a #[repr(transparent)] struct could somehow derive the impls to infallibly cast between it and its base type. (Not sure what the syntax would be; it would need to produce two separate impls since the trait would be directional. It needs to be directional since e.g. it's safe to cast from [u8] to [Struct] but not vice versa if there's padding.)

comex (Dec 11 2019 at 23:20, on Zulip):

zerocopy has a length and alginment verified slice which allows them to skip those checks once they've been performed.

Interesting. I like this design.

Lokathor (Dec 12 2019 at 04:45, on Zulip):

Regarding the C-style enums: One thing that a friend is demoing into bytemuck (hope to have something to show soon!) is the concept of a "ContiguousEnum", which you often see in both Rust and C. It's some integer values that are all named, with a min..=max range that's valid where all value in that range are used, and all other integer values are simply invalid. In this specific subcase, if you have a raw value you can get it into the enum type with a check that it's in range and then a transmute. You don't have to match on the value with 6 cases or 20 cases or however many (which LLVM will sometimes do right, but usually not handle well).

I think a lot of the transmutations might be of the general form, "how many claims can we make about this type? the most we know the faster we can make the conversion." This might seem obvious of course, but I think that we should keep in mind that something like a specialization mechanism might make a big difference here when the compiler is doing all these generations of conversions. If we can do that it's good because we can jump on a chance to make code faster, but also it can lead to those accidental performance cliffs that Ryan mentioned if it's not obvious to the user that a seemingly small change to a struct (like adding a bool) will silently give them a different derived impl.

Incidentally, is the latest version of any proposal still the discourse thread? Perhaps we should open an issue or PR or similar in the new repo now that it's created.

Ryan Levick (Dec 12 2019 at 13:32, on Zulip):

For numbers to C-style enums, though, you usually need a runtime check, just like numbers to bools. So not optimizing it out wouldn't be a problem.

A runtime check is not needed when converting between two types that require validation where both types only have values that are valid for the other type. For instance between bool and #[repr(u8)] enum Foo { Bar, Baz } . Rust is not capable of eliminating that unnecessary runtime check.

Ryan Levick (Dec 12 2019 at 13:37, on Zulip):

To quote myself from the Discourse thread about a possible alternative:

trait CastSliceTo<T> {
    fn cast_slice(slice: &[Self]) -> Result<&[T], TransmuteError>;
}

How would you represent &mut [Self] to &mut [T] conversions?

Also, this proposal requires all pairs of convertible types to be matched together. That requires a lot impls. Would this have a negative impact on compiler performance?
Also, what happens if you want to cast between a type not in your crate and some type in our crate? I don't believe this is possible because of orphan rules.

Ryan Levick (Dec 12 2019 at 13:39, on Zulip):

Incidentally, is the latest version of any proposal still the discourse thread? Perhaps we should open an issue or PR or similar in the new repo now that it's created.

Yes it is, I need to work on updating it a bit with some of the things we've learned since then.

Ryan Levick (Dec 12 2019 at 14:47, on Zulip):

I'm going to remove safe unions from the initial proposal (though I will mention them as a likely future addition in a follow up RFC). Any objections? cc @Josh Triplett

Josh Triplett (Dec 12 2019 at 15:38, on Zulip):

No objection.

Josh Triplett (Dec 12 2019 at 15:39, on Zulip):

Also, the discourse thread is not quite the latest, as we have some additional improvements.

Ryan Levick (Dec 12 2019 at 15:40, on Zulip):

@Josh Triplett we do have additional improvements but those are only captured in comments, the discourse thread is the latest formal full write up. I'm working on making a new version with the additional improvements.

gnzlbg (Dec 13 2019 at 12:13, on Zulip):

Also, this proposal requires all pairs of convertible types to be matched together. That requires a lot impls. Would this have a negative impact on compiler performance?

Yes. I think a constraint on a solution to this problem is not having this issue.

gnzlbg (Dec 13 2019 at 12:13, on Zulip):

Alternative proposals like the Compatible<T> RFC do not have this problem.

gnzlbg (Dec 13 2019 at 12:14, on Zulip):

Also, what happens if you want to cast between a type not in your crate and some type in our crate? I don't believe this is possible because of orphan rules.

I think not having this issue should also be a constrain on the solution.

gnzlbg (Dec 13 2019 at 12:15, on Zulip):

Since this is a performance-oriented feature, I think that guaranteed no-runtime overhead should also be a constraint.

gnzlbg (Dec 13 2019 at 12:16, on Zulip):

If a runtime over-head is acceptable, then there are arguably many alternatives to solving this problem, some of which do not require any unsafe code or trait impls at all.

Ryan Levick (Dec 13 2019 at 12:28, on Zulip):

Alternative proposals like the Compatible<T> RFC do not have this problem.

@gnzlbg Why would Compatible<T> not have this issue as well? You also need a trait impl for every type pair that can be converted between (i.e., if type the user wants to convert between type A and type B there must be a impl Compatible<B> for A. I would also imagine this quickly gets out of hand. Also, as far as I understand it, it has the same issues with the orphan rule as the other proposal.

gnzlbg (Dec 13 2019 at 12:37, on Zulip):

@gnzlbg Why would Compatible<T> not have this issue as well?

@Ryan Levick That's explained in the Compatible<T> RFC: https://gist.github.com/gnzlbg/4ee5a49cc3053d8d20fddb04bc546000

gnzlbg (Dec 13 2019 at 12:37, on Zulip):

A trait bound of the form T: Compatible<U> is "satisfied" iff given any T: Compatible<U0> there is a type sequence [U_0, U_1, ..., U_N] such that for i in range [1, N) the query U_{i}: Compatible<U_{i+1}> is satisfied and there is a impl of Compatible<U> for U_N. Notice that multiple such sequences could exist, but it suffices that one exists for the query to be satisfied.

gnzlbg (Dec 13 2019 at 12:40, on Zulip):

Notice that for Compatible<T> it doesn't matter which conversion sequence is picked since they are all zero-overhead. However, for TryCompatible, the RFC does not provide this guarantee and requires the user to pick a conversion sequence.

gnzlbg (Dec 13 2019 at 12:41, on Zulip):

We could try to be cleverer for TryCompatible, and pick one of the shortest sequences, but I'm wary of doing that.

Last update: Jan 28 2020 at 00:45UTC