I haven't managed to reproduce it yet, but I just had rustc hang completely while doing a parallel build. I was testing with increasing numbers of threads, so I could plot threads vs. user time and see the growth rate. And while doing an 88-way build, rustc just completely hung. I was doing builds of syn, and it hung with two crate builds in progress. One is hung on a read of the pipe. The other is hung on a futex.
Looking further, some threads are hung in jobserver::imp::Client::acquire (by way of rayon: rustc_rayon_core::registry::Registry::acquire_thread called from rustc_rayon_core::sleep::Sleep::no_work_found), and others are hung elsewhere.
Some backtraces would be useful, but 88 threads would probably make things a bit confusing.
Happy to provide them; my message above was based on looking through backtraces.
This might help:
```
$ grep '#1\>' bt | sed 's/buf=0x[^,]*/buf=(omitted)/' | sort | uniq -c
     86 #1  __GI___libc_read (fd=4, buf=(omitted), nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:24
      3 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x563a85f2ced0, cond=0x563a85f2cf00) at pthread_cond_wait.c:502
```
(the sed was to handle all the different buffer addresses)
In this 88-way build, that rustc process has 89 rustc threads, of which 86 are blocked in read on the pipe and the other 3 are blocked on a cond var:
```
     86 #6  0x00007efe03bb8019 in jobserver::Client::acquire_raw (self=<optimized out>) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/jobserver-0.1.16/src/lib.rs:363
      3 #6  rustc_rayon_core::sleep::Sleep::sleep (self=<optimized out>, worker_index=<optimized out>, registry=0x563a85f2cf80) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/rustc-rayon-core-0.3.0/src/sleep/mod.rs:335
```
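For context, the acquire those 86 threads are blocked in is the make jobserver protocol: taking a token is a blocking one-byte read from the inherited pipe (fd 4 above), and releasing one writes the byte back. Here is a minimal sketch of that protocol, not the jobserver crate's actual code, with a socket pair standing in for the real pipe:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // Stand-in for the inherited jobserver pipe; a socket pair keeps the
    // sketch self-contained and has the same read/write-one-byte semantics.
    let (mut read_end, mut write_end) = UnixStream::pair()?;

    // The parent (make/cargo) pre-loads one byte per job slot it will grant.
    let slots = 2;
    write_end.write_all(&vec![b'|'; slots])?;

    // Acquire: block until a single token byte can be read.
    let mut token = [0u8; 1];
    read_end.read_exact(&mut token)?;
    println!("acquired a token");

    // ... do one unit of parallel work while holding the token ...

    // Release: write the byte back so another thread or process can run.
    write_all_back(&mut write_end, &token)?;
    println!("released the token");
    Ok(())
}

// Tiny helper just to keep the release step labeled; it only writes the byte back.
fn write_all_back(pipe: &mut UnixStream, token: &[u8; 1]) -> std::io::Result<()> {
    pipe.write_all(token)
}
```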
@Zoxc The hung build is still around; what other information can I gather?
Are the cond vars all in Rayon?
Maybe just run that command for frames #1-#7, for both processes?
In that process, yes, all the cond vars are in rayon.
I just grabbed a backtrace for the other process; checking it now.
...the other process is a lot more interesting.
```
      2 #1  0x00007fdc02953089 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (self=<optimized out>, ts=...) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.6.2/src/thread_parker/linux.rs:111
     72 #1  0x00007fdc029536b9 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (self=<optimized out>, ts=...) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.6.2/src/thread_parker/linux.rs:111
      1 #1  0x00007fdc0295a5a5 in jobserver::imp::Client::acquire (self=0x7fdbf4000b90) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/jobserver-0.1.16/src/lib.rs:548
     13 #1  __GI___libc_read (fd=4, buf=(omitted), nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:24
      1 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x560d777f9e80, cond=0x560d777f9eb0) at pthread_cond_wait.c:502
```
Why don't I send you the backtraces of both processes?
Can you filter the backtraces to only show symbols, so we can get just the unique backtraces?
(both obtained by attaching with gdb and running thread apply all bt full)
What do you mean by "only show symbols"? Not addresses?
Or not parameters?
Not parameters; those could vary.
I don't know how to hide those in gdb, but I can sed them away easily enough.
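For reference, a rough Rust equivalent of that normalization step, purely illustrative rather than what was actually run here, reading the saved gdb output from a file named bt:

```rust
use std::collections::HashMap;
use std::fs;

// Strip addresses and argument lists from a gdb frame line so that identical
// call sites collapse together, e.g.
//   "#1  0x00007f... in __GI___libc_read (fd=4, ...)" -> "#1 __GI___libc_read"
fn normalize(frame: &str) -> String {
    let no_args = frame.split(" (").next().unwrap_or(frame);
    no_args
        .split_whitespace()
        .filter(|tok| !tok.starts_with("0x") && *tok != "in")
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() -> std::io::Result<()> {
    // "bt" is the saved output of gdb's `thread apply all bt`.
    let text = fs::read_to_string("bt")?;
    let mut counts: HashMap<String, usize> = HashMap::new();
    for line in text.lines().filter(|l| l.trim_start().starts_with('#')) {
        *counts.entry(normalize(line)).or_insert(0) += 1;
    }
    // Print the most common frames first, like sort | uniq -c | sort -rn.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (frame, n) in sorted {
        println!("{:>5} {}", n, frame);
    }
    Ok(())
}
```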
OK, I managed to get the unique backtraces.
In the less interesting process, there are only two unique backtraces, both as you'd expect.
The more interesting process has no two backtraces identical.
It kind of looks like the first process is waiting for a jobserver token. Maybe the other process grabbed all 88 tokens? Or some tokens are leaking somehow.
It would be nice if you could reliably recreate it, so we could see if it would still happen with no jobserver used.
I tried many times and didn't recreate it. That's why I'm keeping it around.
It does look like the other process is in the midst of all the work.
I managed to common them up a bit more.
Well, thread 89 looks interesting already, since it's in monomorphization and not Rayon.
So it's probably some deadlock there.
(Well, it is in Rayon, but it's doing actual parallel work, and then taking a lock.)
Everything is in a Rayon thread, but the lock it's stuck on is not a Rayon lock.
Those are the unique backtraces in the second process, for frames 0-13.
That shows which ones are in monomorphization without getting into the non-unique frames.
I need to go to bed. I'll leave the processes running until tomorrow morning. If you think of any other information you want from them, let me know.
Oh, it's only 72? I was expecting more.
I think this should suffice to track it down, unless there are multiple bugs.
This was run with
Note that since I only have 72 cores, perhaps only 72 were getting to run?
Thanks for digging into this!
I guess cargo would limit to 72, unless you did -j88 too?
72 would make sense then
The problem is that is_inlining_candidate is called inside the iterator, which is enumerated inside the lock. It should instead be called when appending the item.
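Schematically, the shape of the problem and of that suggested fix looks something like this (all names and types here are illustrative stand-ins, not the actual rustc code):

```rust
use std::sync::Mutex;

#[derive(Clone)]
struct MonoItem(u32);

// Stand-in predicate; in the real code this is nontrivial work.
fn is_inlining_candidate(item: &MonoItem) -> bool {
    item.0 % 2 == 0
}

// Problematic shape: the predicate runs on every neighbor while the shared
// collection's lock is held, so every other worker thread queues up behind it.
fn record_neighbors_bad(shared: &Mutex<Vec<MonoItem>>, neighbors: &[MonoItem]) {
    let mut out = shared.lock().unwrap();
    for item in neighbors.iter().filter(|i| is_inlining_candidate(i)) {
        out.push(item.clone());
    }
}

// Suggested shape: evaluate the predicate while building the list of items to
// append, outside the lock, and only serialize the final append.
fn record_neighbors_better(shared: &Mutex<Vec<MonoItem>>, neighbors: &[MonoItem]) {
    let selected: Vec<MonoItem> = neighbors
        .iter()
        .filter(|i| is_inlining_candidate(i))
        .cloned()
        .collect();
    shared.lock().unwrap().extend(selected);
}

fn main() {
    let shared = Mutex::new(Vec::new());
    let neighbors: Vec<MonoItem> = (0..8).map(MonoItem).collect();
    record_neighbors_bad(&shared, &neighbors);
    record_neighbors_better(&shared, &neighbors);
    println!("{} items recorded", shared.lock().unwrap().len());
}
```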
That makes sense.