Stream: t-lang/wg-unsafe-code-guidelines

Topic: UnsafeCell


Carl Lerche (Oct 14 2019 at 19:50, on Zulip):

Hi, I'm not sure if this is the best place to ask (or even how to use zulip). I've used UnsafeCell<AtomicU32> in order to perform unsynchronized loads when I can guarantee that other threads only load from that atomic (never store). Is this safe? https://github.com/tokio-rs/tokio/blob/3a941e99a112f0058d2ac1f295da29491ba4b232/tokio-executor/src/loom/atomic_u32.rs#L25-L27

Carl Lerche (Oct 14 2019 at 19:50, on Zulip):

I am pretty sure it is, but got a comment on HN and I was not sure how to respond: https://news.ycombinator.com/item?id=21251778

Hadrien Grasland (Oct 14 2019 at 19:55, on Zulip):

It is UB, and I got some actual miscompilation on one of my programs while trying something similar so it's not a theoretical concern.

Hadrien Grasland (Oct 14 2019 at 19:56, on Zulip):

Further, UnsafeCell<AtomicXyz> is redundant because AtomicXyz already contains an UnsafeCell<Xyz>.
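
A rough sketch of that layering, for illustration (simplified; the real standard-library definition also carries platform-specific alignment attributes):

// What UnsafeCell<AtomicU32> amounts to: two nested layers of interior
// mutability, since AtomicU32 already wraps an UnsafeCell<u32> internally.
pub struct AtomicU32Sketch {
    v: core::cell::UnsafeCell<u32>,
}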

Carl Lerche (Oct 14 2019 at 19:57, on Zulip):

@Hadrien Grasland what part is UB?

Hadrien Grasland (Oct 14 2019 at 19:57, on Zulip):

Loading from an UnsafeCell data that may ever have been touched by another thread.

Carl Lerche (Oct 14 2019 at 19:58, on Zulip):

@Hadrien Grasland By "touched", do you mean mutated? In this case, there are no mutations from other threads

Hadrien Grasland (Oct 14 2019 at 19:58, on Zulip):

The problem is that even with UnsafeCell, the compiler can assume that if it knows about all memory accesses in the current thread, it knows about all memory accesses, period.

Hadrien Grasland (Oct 14 2019 at 19:58, on Zulip):

So it can optimize out "redundant" loads and so on.
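
A hypothetical sketch of the kind of optimization being described (the writer thread is assumed to exist elsewhere):

use std::cell::UnsafeCell;

// Even through an UnsafeCell, a non-atomic load lets the compiler assume
// the value cannot change behind this thread's back, so it may hoist the
// read out of the loop and effectively compile this to
// `if !flag { loop {} }`.
fn spin_until_set(flag: &UnsafeCell<bool>) {
    while unsafe { !*flag.get() } {}
}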

Carl Lerche (Oct 14 2019 at 19:58, on Zulip):

@Hadrien Grasland in this case, it does know about all stores in the thread

Hadrien Grasland (Oct 14 2019 at 19:58, on Zulip):

And there are no stores from other threads, and there never have been?

Carl Lerche (Oct 14 2019 at 19:59, on Zulip):

@Hadrien Grasland all stores are guarded by locks and the unsync_load happens within the lock

Hadrien Grasland (Oct 14 2019 at 19:59, on Zulip):

Oh, if you have a lock, then that is your synchronization.

Hadrien Grasland (Oct 14 2019 at 19:59, on Zulip):

In that case, you don't need Atomic, you could use UnsafeCell<u32> directly.

Hadrien Grasland (Oct 14 2019 at 20:00, on Zulip):

(unless you sometimes want to access the atomic without going through the lock)

Hadrien Grasland (Oct 14 2019 at 20:00, on Zulip):

By the way, what's wrong with just doing atomic loads and stores with Relaxed ordering?

Carl Lerche (Oct 14 2019 at 20:00, on Zulip):

100% of mutations are in the lock, only loads are outside of the lock

Hadrien Grasland (Oct 14 2019 at 20:01, on Zulip):

> 100% of mutations are in the lock, only loads are outside of the lock

But how do you make sure that the mutations are propagated to the threads doing the load?

Hadrien Grasland (Oct 14 2019 at 20:03, on Zulip):

At the hardware level, load(Relaxed) should compile down to a normal load instruction on all current CPUs. But at the compiler level, it serves as a warning that the value may have changed as a result of another thread modifying it, and the compiler should go check it from time to time.
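
For illustration, a minimal Relaxed load as described:

use std::sync::atomic::{AtomicU32, Ordering};

// On mainstream hardware this compiles to an ordinary 32-bit load; the
// Relaxed ordering constrains the compiler (the value may change between
// calls), not the CPU.
fn read_counter(counter: &AtomicU32) -> u32 {
    counter.load(Ordering::Relaxed)
}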

Hadrien Grasland (Oct 14 2019 at 20:04, on Zulip):

(Currently, LLVM pessimistically interprets this as a command to go check the value every single time, but most other experts I know agree with me that this is an LLVM optimizer bug)

Carl Lerche (Oct 14 2019 at 20:06, on Zulip):

There is heavier synchronization elsewhere that guarantees it

Hadrien Grasland (Oct 14 2019 at 20:06, on Zulip):

When you acquire a lock (or use an atomic with Acquire ordering, which is what the lock implementation does under the hood), the compiler automatically treats every shared variable that you're going to manipulate inside of the lock like this. But without a lock, you need to hint to the compiler that it must refresh its local cache of the shared variable.

Hadrien Grasland (Oct 14 2019 at 20:07, on Zulip):

Oh, if you have heavier synchronization then you should probably be fine.

Carl Lerche (Oct 14 2019 at 20:07, on Zulip):

@Hadrien Grasland the details are in this post: https://tokio.rs/blog/2019-10-scheduler/ (including how other threads are notified)

Hadrien Grasland (Oct 14 2019 at 20:10, on Zulip):

Reading through...

// safety: this is the **only** thread that updates this cell.
let tail = self.tail.unsync_load();

I agree with the safety note stating that this should work, but I would just use load(Relaxed) unless I have a benchmark proving that there's a performance benefit in not doing so, for the sake of sticking with familiar abstractions.

Hadrien Grasland (Oct 14 2019 at 20:11, on Zulip):

If I had to do it, then I would do it by transmuting &AtomicU32 into &UnsafeCell<u32> or just a plain *const u32 (I believe there are some repr(transparent) attributes in the right places that make those transmutes okay).
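
A minimal sketch of that approach, assuming (as stated above) that the layouts line up:

use std::sync::atomic::AtomicU32;

// Safety: the caller must guarantee that no other thread is concurrently
// writing to this atomic; concurrent reads are fine.
unsafe fn unsync_load(atomic: &AtomicU32) -> u32 {
    unsafe { (atomic as *const AtomicU32 as *const u32).read() }
}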

Carl Lerche (Oct 14 2019 at 20:13, on Zulip):

I've run benchmarks where the unsync load is measurably faster than a relaxed load

Hadrien Grasland (Oct 14 2019 at 20:13, on Zulip):

Most likely the aforementioned LLVM optimizer bug, sigh.

Carl Lerche (Oct 14 2019 at 20:13, on Zulip):

there was a previous thread w/ @RalfJ on a similar optimization in the bytes crate

Carl Lerche (Oct 14 2019 at 20:13, on Zulip):

:(

Carl Lerche (Oct 14 2019 at 20:14, on Zulip):

(the one in bytes was definitely UB though... )

Hadrien Grasland (Oct 14 2019 at 20:14, on Zulip):

Well, then again, casting &AtomicU32 to *const u32 followed by a ptr::read() should work if you are the only thread writing there.

Hadrien Grasland (Oct 14 2019 at 20:18, on Zulip):

The unsync_load() in pop() should work for the same reason.

Hadrien Grasland (Oct 14 2019 at 20:20, on Zulip):

Do you have a pointer to the steal() impl? It's not featured in the blog post...

Hadrien Grasland (Oct 14 2019 at 20:27, on Zulip):

(PS: I agree with what you say in the blog post. Sometimes, people who discover atomics undergo a bit of a "kid in a candy store" phase and forget that locking an uncontended mutex is just an atomic swap and unlocking an uncontended mutex is just an atomic store, both of which are really cheap. As the Preshing blog says, well-implemented locks are not slow; contention is.)

Hadrien Grasland (Oct 14 2019 at 20:28, on Zulip):

(Be it contention on the lock, or contention on the CPU cache lines that it protects, which are invalidated on every write)

Carl Lerche (Oct 14 2019 at 20:31, on Zulip):

@Hadrien Grasland https://github.com/tokio-rs/tokio/blob/new-scheduler/tokio-executor/src/thread_pool/queue/local.rs#L201-L240

Hadrien Grasland (Oct 14 2019 at 20:32, on Zulip):

dst.tail.unsync_load() looks wrong.

Carl Lerche (Oct 14 2019 at 20:32, on Zulip):

@Hadrien Grasland it's confusing, but self & dst are probably ordered wrong

Carl Lerche (Oct 14 2019 at 20:32, on Zulip):

from an API pov

Carl Lerche (Oct 14 2019 at 20:32, on Zulip):

dst is owned by the calling thread

Carl Lerche (Oct 14 2019 at 20:32, on Zulip):

whereas self is the remote thread

Hadrien Grasland (Oct 14 2019 at 20:33, on Zulip):

Yup, that's definitely confusing.

Hadrien Grasland (Oct 14 2019 at 20:34, on Zulip):

Consider changing it so that self is the receiver and the input parameter is a src queue that is being stolen from.

Carl Lerche (Oct 14 2019 at 20:34, on Zulip):

@Hadrien Grasland my thought would be to make steal a free fn

Hadrien Grasland (Oct 14 2019 at 20:34, on Zulip):

Works as well.

Carl Lerche (Oct 14 2019 at 20:34, on Zulip):

or, it could be steal_from(self, src)

Hadrien Grasland (Oct 14 2019 at 20:35, on Zulip):

Also, consider replicating your safety note on this unsync_load() (you did make this function unsafe, right?)

Carl Lerche (Oct 14 2019 at 20:36, on Zulip):

unsync_load is unsafe

Hadrien Grasland (Oct 14 2019 at 20:38, on Zulip):

I am a bit surprised by this:

            let src_head = self.head.load(Acquire);
            let src_tail = self.tail.load(Acquire);

            // Number of available tasks to steal
            let n = src_tail.wrapping_sub(src_head);
            let n = n - n / 2;

            if n == 0 {
                return 0;
            }

            if n > self.buffer.len() as u32 / 2 {
                atomic::spin_loop_hint();
                // inconsistent, try again
                continue;
            }

Hadrien Grasland (Oct 14 2019 at 20:39, on Zulip):

When do you expect the if to fail?

Hadrien Grasland (Oct 14 2019 at 20:46, on Zulip):

Oh, well, I'm being reminded that I should probably go to sleep because it's getting late here and I have a job interview to take care of early tomorrow.

Hadrien Grasland (Oct 14 2019 at 20:46, on Zulip):

But I would definitely enjoy spending more time reviewing this later on, hopefully will find some time tomorrow.

Carl Lerche (Oct 14 2019 at 20:57, on Zulip):

good luck!

RalfJ (Oct 14 2019 at 21:00, on Zulip):

so the high-level bit is that if all the racing accesses are reads, then all is fine and there's no UB. (I mean, that's a pattern that can even be written in safe code.)
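
For instance, this safe program has several threads reading the same location concurrently, which is not a data race:

use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(vec![1u32, 2, 3]);
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let data = Arc::clone(&data);
            // Every thread only reads; no synchronization is needed.
            thread::spawn(move || data.iter().sum::<u32>())
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 6);
    }
}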

RalfJ (Oct 14 2019 at 21:01, on Zulip):

but you'll need some mechanism to ensure a happens-before edge between any critical section (that mutates the data) and any read
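
A minimal sketch of such an edge, using an illustrative Release store paired with an Acquire load (READY and DATA are made-up names):

use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Writer: mutate, then publish with a Release store.
fn publish() {
    DATA.store(42, Ordering::Relaxed);
    READY.store(true, Ordering::Release);
}

// Reader: an Acquire load that observes `true` establishes a
// happens-before edge with the Release store, so the subsequent read
// of DATA is guaranteed to see 42.
fn consume() -> Option<u32> {
    if READY.load(Ordering::Acquire) {
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}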

RalfJ (Oct 14 2019 at 21:01, on Zulip):

@Carl Lerche it'd help if you had a "minimized example" (even in pseudo-code) of the kind of pattern you are worried about

Carl Lerche (Oct 14 2019 at 21:24, on Zulip):

@RalfJ from a concurrency POV, I'm fairly confident it is correct. The original HN comment was saying that the compiler could insert random data in the field though... which did not sound correct to me.

RalfJ (Oct 14 2019 at 21:25, on Zulip):

uh... that sounds weird

RalfJ (Oct 14 2019 at 21:25, on Zulip):

I usually take Reddit comments seriously but ignore HN comments. ;)

Hadrien Grasland (Oct 14 2019 at 23:21, on Zulip):

Meh. Looks like my stupid stressed out brain doesn't want to sleep tonight :(

Hadrien Grasland (Oct 14 2019 at 23:22, on Zulip):

Then I guess I'll just either (a) take a look at this code further or (b) try to draft some kind of pre-RFC document for atomic volatile. Hmmm...

Hadrien Grasland (Oct 15 2019 at 13:53, on Zulip):

Got time to take another look. Overall, what you have here looks pretty reasonable, but would probably be better with a bit of documentation here and there so that things need to be inferred less often.

Hadrien Grasland (Oct 15 2019 at 13:55, on Zulip):

For example, Local::mask should be documented as a power-of-2 modulo optimization for head/tail % buffer.len() computations.
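
Something along these lines (illustrative values; assumes the buffer capacity is a power of two):

fn main() {
    // With a power-of-two capacity, `index & mask` equals `index % len`
    // but avoids a division.
    let len: u32 = 256;       // buffer.len(), must be 2^k
    let mask: u32 = len - 1;  // the Local::mask field
    let head: u32 = 1027;
    assert_eq!(head & mask, head % len);
}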

Hadrien Grasland (Oct 15 2019 at 13:59, on Zulip):

It is not immediately obvious that the ptr::read(ptr) in pop() isn't racy. It actually is safe because both push() and pop() may only be called from the local queue's owner thread. The documentation of that ptr::read could just point that out.

Hadrien Grasland (Oct 15 2019 at 14:10, on Zulip):

The concurrent correctness of steal2() is also far from obvious. Intuitively, it seems that this can happen:

- Thread A is the owner of the queue, Thread B is the thief.
- Thread B starts stealing, enters steal2(), reads the head/tail info of thread A's local work queue, then for some strange reason the stupid OS task scheduler decides to put it to sleep and schedule something else.
- Thread A furiously pushes and pops tasks on its local queue until the tail (pushing end) gets close to the point where thread B measured that thread A's head (popping end) was.
- Thread B wakes up and starts reading what it assumes to be unprocessed thread A tasks, while thread A is simultaneously overwriting the corresponding memory region. This is a data race.

Hadrien Grasland (Oct 15 2019 at 14:15, on Zulip):

Now, Thread B has a good chance of figuring out that this happened and restarting the transaction later on, when it tries to claim the tasks with a CAS on thread A's head pointer. I say good chance, because there is a tiny chance of an ABA issue masking the problem if thread A managed to make its head counter overflow and do a complete round trip. With AtomicU32, the odds of this happening are probably too small to be worth considering, but this shows that the sizes of the head and tail pointers are correctness-critical in this algorithm, which is not documented anywhere for now.
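
Concretely, the indices behave as free-running 32-bit counters, so the length computation stays correct across wraparound, and an ABA mix-up would require the head to advance by a full 2^32 between the thief's two observations:

fn main() {
    // Free-running u32 indices: wrapping_sub remains correct across overflow.
    let head: u32 = u32::MAX - 1;
    let tail: u32 = head.wrapping_add(5); // has wrapped past zero
    assert_eq!(tail.wrapping_sub(head), 5);
}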

Hadrien Grasland (Oct 15 2019 at 14:17, on Zulip):

And all this does not eliminate the fact that even if thread B eventually figures out that something went wrong and restarts the transaction, a data race has still occurred, so the program has undefined behavior.

Hadrien Grasland (Oct 15 2019 at 14:17, on Zulip):

Basically, you are trying to implement semantics akin to those of a seqlock, and I'm not sure we actually have the right tools for building those without UB in Rust at this point in time.

Hadrien Grasland (Oct 15 2019 at 14:21, on Zulip):

Given that this is a super edge case, I would personally be fine with a comment saying "this is UB, but we don't have the right language tools for building it correctly and efficiently right now". But right now, this comment is missing.

Hadrien Grasland (Oct 15 2019 at 14:31, on Zulip):

Again, overall, good job! I've tried to write a work-stealing scheduler myself in the past, and I'm well aware of how painful it can get when things like blocking come into play. All I am saying is that perhaps you should document your achievement a little bit better so that future reviewers of the code get a more solid grasp of what's going on and why.

comex (Oct 16 2019 at 01:54, on Zulip):

maybe not AtomicU8, but an AtomicU64 loop would be decently fast
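
A hypothetical sketch of such a loop (each individual load is well-defined even if it races with a writer, though the copy as a whole may observe a torn mixture of old and new words):

use std::sync::atomic::{AtomicU64, Ordering};

fn racy_word_copy(src: &[AtomicU64], dst: &mut [u64]) {
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = s.load(Ordering::Relaxed);
    }
}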

comex (Oct 16 2019 at 01:54, on Zulip):

I don't understand why atomic memcpy is needed here though

Hadrien Grasland (Oct 16 2019 at 05:23, on Zulip):

Because ptr::read racing with ptr::write is generally not allowed, even if you discard the result after the fact.

Hadrien Grasland (Oct 16 2019 at 05:25, on Zulip):

(IIUC, it's actually OK at the LLVM layer since they use a "data race returns undef" localized UB formalism rather than a notion of global UB destroying the whole program, so another way to handle this would be to integrate similar data race semantics into Rust... but I'm not sure if we are prepared to do that)

Hadrien Grasland (Oct 16 2019 at 06:03, on Zulip):

In a way, this is somewhat related to my thread about padding bytes and memcpy: our current rules have a nasty tendency to turn some specific memory read and write operations into global program UB, and I'm not sure if this is actually fine.

Hadrien Grasland (Oct 16 2019 at 06:06, on Zulip):

(in the padding bytes case, it's more egregious since making it UB to observe padding bytes makes it UB to move a Rust object in an efficient/obvious way... but I think this case is no less interesting)

Hadrien Grasland (Oct 16 2019 at 06:16, on Zulip):

At the same time, it's definitely true that some specific load/store operations are global program UB. Dereferencing a null pointer is, for example. So differentiating between local and global UB would likely end up making the rules more complex ;)

RalfJ (Oct 16 2019 at 17:41, on Zulip):

> Because ptr::read racing with ptr::write is generally not allowed, even if you discard the result after the fact. Whereas with a form of atomic memcpy the result is "well-defined": you get randomly torn garbage data.

I don't think we have anything in the Rust semantics that gives you randomly torn garbage data. That would be freeze, and we don't have it.

Hadrien Grasland (Oct 16 2019 at 18:12, on Zulip):

Let me reformulate. It's not randomly torn. It's the perfectly reasonable result of a sequence of atomic reads targeting different memory locations, which happen to return the values that were atomically written by different sequences of writes.

Hadrien Grasland (Oct 16 2019 at 18:13, on Zulip):

And from a specification point of view, that's totally different.

Hadrien Grasland (Oct 16 2019 at 18:13, on Zulip):

From a developer's point of view, however, it's randomly torn garbage data.

RalfJ (Oct 16 2019 at 18:14, on Zulip):

I see. fair :)

Last update: Nov 19 2019 at 18:35UTC