Hi, I'm not sure if this is the best place to ask (or even how to use zulip). I've used UnsafeCell<AtomicU32> in order to perform unsynchronized loads when I can guarantee that other threads only load from that atomic (never store). Is this safe? https://github.com/tokio-rs/tokio/blob/3a941e99a112f0058d2ac1f295da29491ba4b232/tokio-executor/src/loom/atomic_u32.rs#L25-L27
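Roughly, the wrapper looks like this (paraphrased sketch, not the exact code behind that link; names and details are approximate):
```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

// Paraphrased sketch of the wrapper in question; see the linked file for
// the real thing.
pub(crate) struct CausalU32(UnsafeCell<AtomicU32>);

// SAFETY: AtomicU32 is already Sync; the UnsafeCell is only there to allow
// the unsynchronized load below.
unsafe impl Sync for CausalU32 {}

impl CausalU32 {
    pub(crate) fn load(&self, order: Ordering) -> u32 {
        unsafe { (*self.0.get()).load(order) }
    }

    /// Load the value without any synchronization.
    ///
    /// SAFETY: the caller must guarantee that no other thread stores to
    /// this cell concurrently.
    pub(crate) unsafe fn unsync_load(&self) -> u32 {
        *(self.0.get() as *const u32)
    }
}
```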
I am pretty sure it is, but got a comment on HN and I was not sure how to respond: https://news.ycombinator.com/item?id=21251778
It is UB, and I got some actual miscompilation on one of my programs while trying something similar so it's not a theoretical concern.
Further, UnsafeCell<AtomicXyz> is redundant, because AtomicXyz already contains an UnsafeCell<Xyz>.
@Hadrien Grasland what part is UB?
Loading from an UnsafeCell some data which may have ever been touched by another thread.
@Hadrien Grasland by touched, you mean mutated? In this case, there are no mutations from other threads
The problem is that even with UnsafeCell, the compiler can assume that if it knows about all memory accesses in the current thread, it knows about all memory accesses, period.
So it can optimize out "redundant" loads and so on.
@Hadrien Grasland in this case, it does know about all stores in the thread
And there are no stores from other threads and have never been?
@Hadrien Grasland all stores are guarded by locks and the unsync_load happens within the lock
Oh, if you have a lock, then that is your synchronization.
In that case, you don't need Atomic, you could use UnsafeCell<u32> directly.
(unless you sometimes want to access the atomic without going through the lock)
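I mean something like this (rough sketch with hypothetical names, not your actual code):
```rust
use std::cell::UnsafeCell;
use std::sync::Mutex;

// Sketch of the lock-only alternative: plain UnsafeCell<u32>, no atomics.
struct Guarded {
    lock: Mutex<()>,
    value: UnsafeCell<u32>,
}

// SAFETY: `value` is only ever accessed while `lock` is held.
unsafe impl Sync for Guarded {}

impl Guarded {
    fn set(&self, v: u32) {
        let _guard = self.lock.lock().unwrap();
        unsafe { *self.value.get() = v }
    }

    fn get(&self) -> u32 {
        let _guard = self.lock.lock().unwrap();
        unsafe { *self.value.get() }
    }
}
```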
By the way, what's wrong with just doing atomic loads and stores with Relaxed ordering?
100% of mutations are in the lock, only loads are outside of the lock
But how do you make sure that the mutations are propagated to the threads doing the load?
At the hardware level, load(Relaxed) should compile down to a normal load instruction on all current CPUs. But at the compiler level, it serves as a warning that the value may have changed as a result of another thread modifying it, and the compiler should check it again from time to time.
(Currently, LLVM pessimistically interprets this as a command to go check the value every single time, but most other experts I know agree with me that this is an LLVM optimizer bug)
There is heavier synchronization elsewhere that guarantees it
When you acquire a lock (or use an atomic with Acquire ordering, which is what the lock implementation does under the hood), the compiler automatically treats every shared variable that you're going to manipulate inside of the lock like this. But without a lock, you need to hint to the compiler that it must refresh its local cache of the shared variable.
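In other words, something like this (sketch with made-up names): all stores happen under the lock, and the loads outside of it are Relaxed atomics.
```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, Ordering};

struct Shared {
    lock: Mutex<()>,
    value: AtomicU32,
}

impl Shared {
    fn bump(&self) {
        let _guard = self.lock.lock().unwrap();
        // Store under the lock; the lock serializes all writers.
        let v = self.value.load(Ordering::Relaxed);
        self.value.store(v + 1, Ordering::Relaxed);
    }

    fn peek(&self) -> u32 {
        // Relaxed load outside the lock: tells the compiler the value may
        // be changed by other threads, so it cannot cache it, though the
        // load may still observe a stale value.
        self.value.load(Ordering::Relaxed)
    }
}
```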
Oh, if you have heavier synchronization then you should probably be fine.
@Hadrien Grasland the details are in this post: https://tokio.rs/blog/2019-10-scheduler/ (including how other threads are notified)
Reading through...
```rust
// safety: this is the **only** thread that updates this cell.
let tail = self.tail.unsync_load();
```
I agree with the safety note stating that this should work, but I would just use load(Relaxed) unless I had a benchmark proving that there's a performance benefit in not doing so, for the sake of sticking with familiar abstractions.
If I had to do it, then I would do it by transmuting &AtomicU32 into &UnsafeCell<u32> or just plain *const u32 (I believe there are some repr(transparent) attributes in the right places that make those transmutes okay).
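Something along these lines (a sketch; the layout assumption is the part I'm hedging on):
```rust
use std::sync::atomic::AtomicU32;

/// Sketch of the cast-based unsynchronized load (hypothetical helper).
///
/// SAFETY: the caller must be the only thread that ever stores to `cell`
/// while this load happens, and this leans on AtomicU32 having the same
/// layout as u32 (the "repr(transparent) in the right places" assumption).
unsafe fn unsync_load(cell: &AtomicU32) -> u32 {
    std::ptr::read(cell as *const AtomicU32 as *const u32)
}
```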
I've run benchmarks where the unsync load is measurably faster than a relaxed load
Most likely the aforementioned LLVM optimizer bug, sigh.
there was a previous thread w/ @RalfJ on a similar optimization in the bytes crate
:(
(the one in bytes was definitely UB though... )
Well, then again, casting &AtomicU32 to *const u32 followed by a ptr::read() should work if you are the only thread writing there.
The unsync_load() in pop() should work for the same reason.
Do you have a pointer to the steal() impl? It's not featured in the blog post...
(PS: I agree with what you say in the blog post. Sometimes, people who discover atomics undergo a bit of a "kid in a candy store" phase and forget that locking an uncontended mutex is just an atomic swap and unlocking an uncontended mutex is just an atomic store, both of which are really cheap. As the Preshing blog says, well-implemented locks are not slow, contention is.)
(Be it contention on the lock, or contention on the CPU cache lines that it protects, which are invalidated on every write)
@Hadrien Grasland https://github.com/tokio-rs/tokio/blob/new-scheduler/tokio-executor/src/thread_pool/queue/local.rs#L201-L240
dst.tail.unsync_load() looks wrong.
@Hadrien Grasland it's confusing, but self & dst are probably ordered wrong from an API pov
dst is owned by the calling thread, whereas self is the remote thread
Yup, that's definitely confusing.
Consider changing it so that self is the receiver and the input parameter is a src queue that is being stolen from.
@Hadrien Grasland my thought would be to make steal a free fn
Works as well.
or, it could be steal_from(self, src)
Also, consider replicating your safety note on this unsync_load() (you did make this function unsafe, right?)
unsync_load is unsafe
I am a bit surprised by this:
```rust
let src_head = self.head.load(Acquire);
let src_tail = self.tail.load(Acquire);

// Number of available tasks to steal
let n = src_tail.wrapping_sub(src_head);
let n = n - n / 2;

if n == 0 {
    return 0;
}

if n > self.buffer.len() as u32 / 2 {
    atomic::spin_loop_hint(); // inconsistent, try again
    continue;
}
```
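(My reading of that arithmetic, as a runnable sketch with made-up numbers and an assumed 256-entry buffer:)
```rust
fn main() {
    // Made-up snapshot of the remote queue.
    let src_head: u32 = 10;
    let src_tail: u32 = 30;

    let n = src_tail.wrapping_sub(src_head); // 20 tasks currently queued
    let n = n - n / 2;                       // steal half of them, rounded up
    assert_eq!(n, 10);

    // Neither early-exit branch fires for this snapshot.
    assert!(n != 0);
    assert!(n <= 256 / 2);
}
```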
When do you expect the if to fail?
Oh, well, I'm being reminded that I should probably go to sleep because it's getting late here and I have a job interview to take care of early tomorrow.
But I would definitely enjoy spending more time reviewing this later on, hopefully will find some time tomorrow.
good luck!
so the high-level bit is that if all the racing accesses are reads, then all is fine and there's no UB. (I mean, that's a pattern that can even be written in safe code.)
but you'll need some mechanism to ensure a happens-before edge between any critical section (that mutates the data) and any read
@Carl Lerche it'd help if you had a "minimized example" (even in pseudo-code) of the kind of pattern you are worried about
@RalfJ from a concurrency POV, I'm fairly confident it is correct. The original HN comment was saying that the compiler could insert random data in the field though... which did not sound correct to me.
uh... that sounds weird
I usually take Reddit comments seriously but ignore HN comments. ;)
Meh. Looks like my stupid stressed out brain doesn't want to sleep tonight :(
Then I guess I'll just either a/take a look at this code further or b/try to draft some kind of pre-RFC document for atomic volatile. Hmmm...
Got time to take another look. Overall, what you have here looks pretty reasonable, but would probably be better with a bit of documentation here and there so that things need to be inferred less often.
For example, Local::mask should be documented as a power-of-2 modulo optimization for head/tail % buffer.len() computations.
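i.e. something like this (sketch, assuming a power-of-two buffer length of 256):
```rust
fn main() {
    // Power-of-two modulo trick.
    const LEN: u32 = 256;
    const MASK: u32 = LEN - 1; // 0b1111_1111

    let head: u32 = 260;
    // A single AND replaces the division implied by `head % LEN`.
    assert_eq!(head & MASK, head % LEN);
    assert_eq!((head & MASK) as usize, 4);
}
```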
It is not immediately obvious that the ptr::read(ptr) in pop() isn't racy. It actually is fine, because both push() and pop() may only be called from the local queue's owner thread. The documentation of that ptr::read could just point that out.
The concurrent correctness of steal2() is also far from obvious. Intuitively, it seems that this can happen:
- Thread A is the owner of the queue, Thread B is the thief.
- Thread B starts stealing, enters steal2(), reads the head/tail info of thread A's local work queue, then for some strange reason the stupid OS task scheduler decides to put it to sleep and schedule something else.
- Thread A furiously pushes and pops tasks on its local queue until the tail (pushing end) gets close to the point where thread B measured thread A's head (popping end) to be.
- Thread B wakes up and starts reading what it assumes to be unprocessed thread A tasks, while thread A is simultaneously overwriting the corresponding memory region. This is a data race.
Now, Thread B has a good chance of figuring out that this happened and restarting the transaction later on, when it tries to claim the tasks with a CAS of thread A's head pointer. I say good chance, because there is a tiny chance of an ABA issue masking the problem if thread A managed to make its head counter overflow and do a complete round trip. With AtomicU32, the odds of this happening are probably too small to bother considering, but this shows that the size of the head and tail pointers is correctness-critical in this algorithm, which is not documented anywhere for now.
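(Toy illustration of the wraparound scenario, with u32 arithmetic:)
```rust
fn main() {
    // After exactly 2^32 head increments, the counter comes back to the
    // same bit pattern, so a stale CAS cannot tell the difference
    // (the classic ABA problem).
    let observed_head: u32 = 7;
    let head_after_full_wrap = observed_head
        .wrapping_add(u32::MAX)
        .wrapping_add(1); // += 2^32, which wraps back to the same value
    assert_eq!(observed_head, head_after_full_wrap);
}
```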
And all this does not eliminate the fact that even if thread B eventually figures out that something went wrong and restarts the transaction, a data race has still occurred, so the program has undefined behavior.
Basically, you are trying to implement semantics akin to those of a seqlock, and I'm not sure we actually have the right tools for building those without UB in Rust at this point in time.
Given that this is a super edge case, I would personally be fine with a comment saying "this is UB, but we don't have the right language tools for building it correctly and efficiently right now". But right now, this comment is missing.
Again, overall, good job! I've tried to write a work-stealing scheduler myself in the past, and I'm well aware of how painful it can get when things like blocking come into play. All I am saying is that perhaps you should document your achievement a little better so that future reviewers of the code get a more solid grasp of what's going on and why.
maybe not AtomicU8, but an AtomicU64 loop would be decently fast
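Something like this, per word (hypothetical helper, just to illustrate the shape of the idea):
```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical word-by-word "atomic memcpy": copy `len` u64 words out of
/// a buffer that other threads may be overwriting concurrently. Each word
/// is a Relaxed atomic load, so a race yields a mix of old and new words
/// rather than a data-race UB.
///
/// SAFETY: `src` and `dst` must be valid for `len` u64 words.
unsafe fn atomic_copy_words(src: *const AtomicU64, dst: *mut u64, len: usize) {
    for i in 0..len {
        let word = (*src.add(i)).load(Ordering::Relaxed);
        dst.add(i).write(word);
    }
}
```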
I don't understand why atomic memcpy is needed here though
Because ptr::read racing with ptr::write is generally not allowed, even if you discard the result after the fact.
(IIUC, it's actually OK at the LLVM layer since they use a "data race returns undef" localized UB formalism rather than a notion of global UB destroying the whole program, so another way to handle this would be to integrate similar data race semantics into Rust... but I'm not sure if we are prepared to do that)
In a way, this is somewhat related to my thread about padding bytes and memcpy: our current rules have a nasty tendency to turn some specific memory read and write operations into global program UB, and I'm not sure if this is actually fine.
(in the padding bytes case, it's more egregious since making it UB to observe padding bytes makes it UB to move a Rust object in an efficient/obvious way... but I think this case is no less interesting)
At the same time, it's definitely true that specific load/store operations are global program UB. Dereferencing a null pointer is, for example. So differentiating between local and global UB would likely end up making the rules more complex ;)
> Because ptr::read racing with ptr::write is generally not allowed, even if you discard the result after the fact.

Whereas with a form of atomic memcpy the result is "well-defined": you get randomly torn garbage data.
I don't think we have anything in the Rust semantics that gives you randomly torn garbage data. That would be freeze,
and we don't have it.
Let me reformulate. It's not randomly torn. It's the perfectly reasonable result of a sequence of atomic reads targeting different memory locations, which happen to return the values that were atomically written by different sequences of writes.
And from a specification point of view, that's totally different.
From a developer's point of view, however, it's randomly torn garbage data.
I see. fair :)