Stream: general

Topic: how much should I worry about alignment on x86?


Jake Goulding (Jul 28 2019 at 17:43, on Zulip):

XXHash (and thus twox-hash) derives a large part of its speed from treating &[u8] as &[u64] (or &[u32]). I recently rewrote it to take advantage of slice::align_to, which made the code much nicer (hooray!). However, it also tanked the performance for unaligned input data (e.g. 10.5 Gb/sec -> 4.5 Gb/sec).

Would it be (a) correct and (b) memory safe to ignore this alignment on X86 / X86_64 only, falling back to the align_to version on other platforms?
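A minimal sketch of the align_to pattern being described (an illustrative sum-of-words function, not the actual twox-hash code):

    // align_to splits the byte slice into an unaligned head, a maximal
    // aligned &[u64] middle, and an unaligned tail.
    fn sum_words(bytes: &[u8]) -> u64 {
        // SAFETY: u64 is valid for any bit pattern, so reinterpreting the
        // aligned middle of the byte slice as u64s is sound.
        let (head, aligned, tail) = unsafe { bytes.align_to::<u64>() };
        let mut acc = 0u64;
        for &b in head.iter().chain(tail) {
            acc = acc.wrapping_add(b as u64);
        }
        for &word in aligned {
            acc = acc.wrapping_add(word);
        }
        acc
    }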

RalfJ (Jul 28 2019 at 18:05, on Zulip):

Doing an unaligned load is UB in Rust no matter the target architecture

RalfJ (Jul 28 2019 at 18:06, on Zulip):

we set the appropriate attributes so LLVM will optimize assuming pointers are aligned

RalfJ (Jul 28 2019 at 18:06, on Zulip):

I don't know exactly what this entails, @rkruppe would likely know more -- but for now, the answer to your question is "no" on both accounts

RalfJ (Jul 28 2019 at 18:07, on Zulip):

are you saying that XXHash/twox-hash currently performs such unaligned loads? Or why did things slow down with align_to?

rkruppe (Jul 28 2019 at 18:17, on Zulip):

Using ptr::read_unaligned instead of references/normal accesses is necessary to avoid the UB Ralf mentioned. That will kill some generic peephole optimizations which probably aren't important for this kind of code, and for x86 it shouldn't influence the final machine code (except the trivial movaps -> movups if any SIMD is introduced).
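A small sketch of the contrast rkruppe describes; read_u64_ne is an illustrative helper, not an existing API:

    use std::ptr;

    fn read_u64_ne(bytes: &[u8]) -> u64 {
        assert!(bytes.len() >= 8);
        // UB if bytes.as_ptr() is not 8-byte aligned:
        //   unsafe { *(bytes.as_ptr() as *const u64) }
        // Fine at any alignment; on x86 it still compiles to a single mov:
        unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u64) }
    }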

RalfJ (Jul 28 2019 at 18:18, on Zulip):

@rkruppe I guess one question here is how costly it would be to make the "unaligned deref is UB" be a target-dependent property and not require this for x86

rkruppe (Jul 28 2019 at 18:19, on Zulip):

Making it target-dependent would be pretty gross

RalfJ (Jul 28 2019 at 18:24, on Zulip):

fair. I am asking mostly because I saw code in the wild in rust-lang crates (rand) that assumed it actually was target-dependent... so this is likely a common misconception.

rkruppe (Jul 28 2019 at 18:25, on Zulip):

I've been thinking about this for a couple minutes and none of my concerns are about LLVM optimizations (well, except one which is presumably temporary and avoidable)

RalfJ (Jul 28 2019 at 18:30, on Zulip):

one issue with allowing it on x86 is that if people only run Miri on x86, they wouldn't get the error, thus missing this UB -- of course this applies to all target-dependent UB, but this seems like a more subtle one. The issue in rand was only found because I ran it in Miri.

rkruppe (Jul 28 2019 at 18:31, on Zulip):

Some things that worry me are:
- the effect this will have on the ecosystem (code that's non-portable in undetectable ways -- currently, even if unaligned accesses are not "miscompiled" on x86, Miri and lints and docs can steer away from it unconditionally);
- how people would probably request structs to be packed automatically on such targets;
- what that would do to type-pun-ability between atomics and normal data (since atomics do require alignment even on x86);
- the fact that in today's software unaligned accesses are rather limited, so there might still be some non-obvious performance cliffs in some hardware

rkruppe (Jul 28 2019 at 18:33, on Zulip):

Oh also, I can feel in my bones that as soon as we open this Pandora's box, people are going to request it as an option for other targets too, including those where only some chips support unaligned loads/stores (efficiently). E.g. doing this for the baseline risc-v targets is probably not a good idea because implementations that trap & emulate on unaligned accesses are encouraged, but if someone's targeting only chips that support it efficiently and natively, they'll want the non-UB as much as x86 people do

Jake Goulding (Jul 28 2019 at 18:43, on Zulip):

are you saying that XXHash/twox-hash currently performs such unaligned loads

Version 1.3 took a &[u8] and got a *const u64, then did ptr::read. I guess that's incorrect and should have been read_unaligned, but it was presumably close to being right.

Version 1.4 switched to slice::align_to, which should be easier to verify as correct. The problem is that the unaligned data "percolates" through the entire computation because it gets buffered internally. It can only speed up again if we finish that internal buffer and the remainder is aligned (not super likely).

Jake Goulding (Jul 28 2019 at 18:44, on Zulip):

However, XXHash is C code. It's relatively readable, but I'm not 100% sure I'll state what it does correctly. I think it basically says "am I x86? if so, just cast the buffer"

Jake Goulding (Jul 28 2019 at 18:51, on Zulip):

It sounds like I need to change my abstraction to a platform-specific type that, given a &[u8], can return an iterator of leading u8, an iterator of u64, and an iterator of trailing u8. I can implement that with align_to on non-x86, and raw pointers and read_unaligned on x86.
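One possible shape for that abstraction, with invented names and a Vec standing in for the u64 iterator (a sketch under the assumptions above, not the eventual twox-hash API):

    use std::ptr;

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fn split_words(bytes: &[u8]) -> (&[u8], Vec<u64>, &[u8]) {
        // On x86/x86_64, treat the whole multiple-of-8 prefix as the u64
        // body and read it with unaligned loads; there is never a head.
        let body_len = bytes.len() - bytes.len() % 8;
        let (body, tail) = bytes.split_at(body_len);
        let words = body
            .chunks_exact(8)
            .map(|c| unsafe { ptr::read_unaligned(c.as_ptr() as *const u64) })
            .collect();
        (&[], words, tail)
    }

    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    fn split_words(bytes: &[u8]) -> (&[u8], Vec<u64>, &[u8]) {
        // Elsewhere, only the aligned middle becomes u64s; the unaligned
        // head and tail stay as bytes.
        let (head, aligned, tail) = unsafe { bytes.align_to::<u64>() };
        (head, aligned.to_vec(), tail)
    }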

nagisa (Jul 28 2019 at 18:54, on Zulip):

I’m surprised align_to would cause a 50% perf hit, unless your test buffers are super small.

nagisa (Jul 28 2019 at 18:54, on Zulip):

Aligning u8 -> u64 is at most 10 cycles of instructions.

RalfJ (Jul 28 2019 at 18:55, on Zulip):

@rkruppe I generally agree. Those are some good arguments. Also see https://github.com/rust-lang/rust/issues/53871 and https://github.com/rust-lang/rust/issues/54915 which would be undermined by this.
Though we could do what we do for overflow, which is to say that on some platforms an unaligned access will either panic or succeed -- but not cause UB.

RalfJ (Jul 28 2019 at 18:56, on Zulip):

@Jake Goulding

Version 1.3 took a &[u8] and got a *const u64, then did ptr::read. I guess that's incorrect and should have been read_unaligned, but it was presumably close to being right.

not sure what your measure of distance is here, but this is indeed an unaligned access and hence UB. but using read_unaligned instead is probably an easy fix (and maybe "how hard is it to fix" was your measure of distance here?)

nagisa (Jul 28 2019 at 18:58, on Zulip):

FWIW x86 does not universally support unaligned loads/stores transparently once you get to SSE/AVX. If the compiler thinks it should be using an aligned SSE load from an address and that address ends up being insufficiently aligned, the code will not run.

oli (Jul 28 2019 at 19:00, on Zulip):

there's also ptr::align_offset which gives you the basics of align_to if align_to is overkill for your situation

nagisa (Jul 28 2019 at 19:01, on Zulip):

In general I haven't found a situation where align_to would be inferior to align_offset. The compiler will aggressively eliminate the portions of align_to that end up computing unnecessary stuff

nagisa (Jul 28 2019 at 19:02, on Zulip):

(such as would be the case if you did not use the last slice of unaligned elements)

nagisa (Jul 28 2019 at 19:06, on Zulip):

My suspicion as to why the perf hit occurred is that the implementation ends up calling align_to in a fairly hot loop.

nagisa (Jul 28 2019 at 19:07, on Zulip):

https://github.com/shepmaster/twox-hash/commit/da2c45433688eb1117bf84f4f3532397b089198f#diff-0fddf2104b2e6fa6ab0ec32025ee9f1bR166 for example appears to be using align_to mostly to "just" extract the aligned portion of the data that’s expected to be 4*u64.

Jake Goulding (Jul 28 2019 at 19:08, on Zulip):

I’m surprised align_to would cause a 50% perf hit, unless your test buffers are super small.

Basically, the algo wants to deal with [u64; 4] chunks. If we read from a &[u8] that is one byte misaligned from that, then we slurp in X bytes to fill the buffer, then process it. The remaining data is still misaligned by one byte, so the process repeats.

In the old ptr::read situation, if we had data in the buffer, we could fill and finish the buffer and then treat the remainder of bytes as big chunks.

Jake Goulding (Jul 28 2019 at 19:10, on Zulip):

read_unaligned instead is probably an easy fix (and maybe "how hard is it to fix" was your measure of distance here?)

Indeed.

nagisa (Jul 28 2019 at 19:10, on Zulip):

Ah, I see. Yeah, to use align_to well, you must have means to process the bytes in head and tail independently.

Jake Goulding (Jul 28 2019 at 19:11, on Zulip):

Right. AFAIK XXHash simply doesn't have that, instead opting to always buffer the data up.

Now, 5 GB/sec is nothing to sneeze at, but knowing that I'm leaving another 5 GB/sec behind is aggravating :wink:

nagisa (Jul 28 2019 at 19:12, on Zulip):

I think in your case @Jake Goulding doing the reverse is what you want to do: memcpy your bytes into a [u64; 4], basically.
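A minimal sketch of that suggestion, assuming an illustrative load_lanes helper:

    use std::convert::TryInto;

    fn load_lanes(chunk: &[u8]) -> [u64; 4] {
        assert_eq!(chunk.len(), 32);
        let mut lanes = [0u64; 4];
        // Effectively a 32-byte memcpy into aligned storage, one lane at a time.
        for (lane, bytes) in lanes.iter_mut().zip(chunk.chunks_exact(8)) {
            *lane = u64::from_le_bytes(bytes.try_into().unwrap());
        }
        lanes
    }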

Jake Goulding (Jul 28 2019 at 19:13, on Zulip):

FWIW x86 does not universally support unaligned loads/stores transparently [...] code will not run.

Could you expand a bit on what you mean by "not run"? There are various ways I could imagine that presenting.

Jake Goulding (Jul 28 2019 at 19:14, on Zulip):

Do you mean to always copy into the buffer? I believe that's effectively what the code does now.

nagisa (Jul 28 2019 at 19:14, on Zulip):

well, there are separate instructions for aligned and unaligned loads/stores in SSE, and the CPU will trap if an aligned instruction is used with a misaligned address.

nagisa (Jul 28 2019 at 19:15, on Zulip):

That’s probably the most plausible way assumed-to-be-aligned-but-really-misaligned-reads-UB can manifest itself on x86.

Jake Goulding (Jul 28 2019 at 19:31, on Zulip):

Just to clarify for myself, on x86, I can do &[u8] -> *const u8 -> *const [u64; 4] and then ptr::read_unaligned and this shouldn't trigger UB (so long as I stay in bounds of the &[u8])?

Jake Goulding (Jul 28 2019 at 19:33, on Zulip):

read_unaligned doesn't make mention of platform-specific things; what happens if I use it on a platform that is more strict about alignment?

nagisa (Jul 28 2019 at 19:33, on Zulip):

Just to clarify for myself, on x86, I can do &[u8] -> *const u8 -> *const [u64; 4] and then ptr::read_unaligned and this shouldn't trigger UB (so long as I stay in bounds of the &[u8])?

Yes, that should work.
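That pattern spelled out as a sketch; read_block is an illustrative name, and the lanes come back in native byte order:

    use std::ptr;

    fn read_block(bytes: &[u8]) -> [u64; 4] {
        assert!(bytes.len() >= 32);
        // read_unaligned has no alignment requirement, only an in-bounds
        // requirement, so this avoids the UB discussed above.
        unsafe { ptr::read_unaligned(bytes.as_ptr() as *const [u64; 4]) }
    }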

nagisa (Jul 28 2019 at 19:35, on Zulip):

read_unaligned doesn't make mention of platform-specific things; what happens if I use it on a platform that is more strict about alignment?

The compiler will have to split reads into aligned parts and stitch the result back into what you want. This could be done as a memcpy under the covers.

Jake Goulding (Jul 28 2019 at 19:36, on Zulip):

hmm. So if I can't do something better than what the compiler does, it might be better for me to use read_unaligned on all the platforms anyway...?

Jake Goulding (Jul 28 2019 at 19:40, on Zulip):

Since I'm basically buffering unaligned data in a way that is reminiscent of memcpy

nagisa (Jul 28 2019 at 21:35, on Zulip):

@Jake Goulding probably. If you had no bounds on space, you could allocate a properly aligned buffer and copy everything in one go

nagisa (Jul 28 2019 at 21:35, on Zulip):

which would be significantly faster in certain scenarios
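A rough sketch of that idea, under the stated assumption that an extra allocation is acceptable (a Vec<u64> is always 8-byte aligned):

    fn copy_to_aligned(bytes: &[u8]) -> Vec<u64> {
        let mut words = vec![0u64; (bytes.len() + 7) / 8];
        // One bulk copy of the input into aligned storage; every later read
        // is then an ordinary aligned load.
        unsafe {
            std::ptr::copy_nonoverlapping(
                bytes.as_ptr(),
                words.as_mut_ptr() as *mut u8,
                bytes.len(),
            );
        }
        words
    }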

RalfJ (Jul 28 2019 at 21:43, on Zulip):

seems like a bad API for this case if you have to re-align every 32-byte buffer, instead of being able to process some prefix and then a sequence of properly aligned 32-byte buffers

RalfJ (Jul 28 2019 at 21:44, on Zulip):

I mean the algorithm already has to support smaller chunks, right? what if I hash a u16?

Jake Goulding (Jul 28 2019 at 22:41, on Zulip):

if you have to re-align every 32-byte buffer

@RalfJ certainly, but as I understand it, the algorithm itself was designed assuming x86 semantics. Thus the API itself has some of these assumptions baked-in. The algorithm effectively requires that it be cheap to "realign"

what if I hash a u16?

That's fine, but there's special code both for "finalizing" the hash and for the specific case where you only hashed data of length less than the internal buffer. AFAIK, we can't just "finish" the buffer and then keep adding on afterwards.

Jake Goulding (Jul 28 2019 at 22:42, on Zulip):

allocate a properly aligned buffer and copy everything in one go

@nagisa I should probably add that to the documentation for those on other architectures.

Jake Goulding (Jul 29 2019 at 00:26, on Zulip):

Thank you, RalfJ, rkruppe, and nagisa!

Last update: Nov 20 2019 at 13:50UTC