Stream: t-compiler/wg-parallel-rustc

Topic: Hang


Josh Triplett (Dec 19 2019 at 07:44, on Zulip):

I haven't managed to reproduce it yet, but I just had rustc hang completely while doing a parallel build. I was testing with increasing numbers of threads, so I could plot threads vs user time and see the growth rate. And while doing an 88-way build, rustc just completely hung. I was doing builds of syn, and it hung with two crate builds in progress. One is hung on read of the pipe. The other is hung on a futex.

Josh Triplett (Dec 19 2019 at 07:49, on Zulip):

Looking further, some threads are hung in jobserver::imp::Client::acquire (by way of rayon: rustc_rayon_core::registry::Registry::acquire_thread, called from rustc_rayon_core::sleep::Sleep::no_work_found), and others are hung in futex_wait_cancelable via std::sync::condvar::Condvar::wait, also via rustc_rayon_core::sleep::Sleep::no_work_found.
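
To make those two blocked states concrete, here is a rough, self-contained sketch of that idle path. It is not rustc-rayon's actual code: the struct, field names, and exact ordering are illustrative, and it assumes the jobserver crate (the one appearing in the backtraces) as a dependency. The idea is that a worker which finds no work hands its jobserver token back, parks on a condvar until work is announced, and then must re-acquire a token before it can run anything, so an "idle" worker can end up blocked inside the jobserver rather than in Rayon itself.

use std::sync::{Condvar, Mutex};

// Illustrative stand-in for rustc_rayon_core::sleep::Sleep; not the real type.
struct Sleep {
    work_available: Mutex<bool>,
    condvar: Condvar,
}

impl Sleep {
    // Roughly what the backtraces suggest no_work_found does: release the
    // jobserver token while idle, wait to be woken, then re-acquire a token.
    fn no_work_found(&self, jobserver: &jobserver::Client) {
        // Hand the token back so cargo can give it to another rustc process.
        jobserver.release_raw().unwrap();

        // Threads hung in futex_wait_cancelable via Condvar::wait are parked here.
        let mut available = self.work_available.lock().unwrap();
        while !*available {
            available = self.condvar.wait(available).unwrap();
        }
        drop(available);

        // Threads hung in jobserver::imp::Client::acquire are blocked here,
        // waiting for a token byte to reappear on the jobserver pipe.
        jobserver.acquire_raw().unwrap();
    }
}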

Zoxc (Dec 19 2019 at 07:50, on Zulip):

Some backtraces would be useful, but 88 threads would probably make things a bit confusing.

Josh Triplett (Dec 19 2019 at 07:51, on Zulip):

Happy to provide them; my message above was based on looking through backtraces.

Josh Triplett (Dec 19 2019 at 07:58, on Zulip):

This might help:

$ grep '#1\>' bt | sed 's/buf=0x[^,]*/buf=(omitted)/' | sort | uniq -c
     86 #1  __GI___libc_read (fd=4, buf=(omitted), nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:24
      3 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x563a85f2ced0, cond=0x563a85f2cf00) at pthread_cond_wait.c:502

Josh Triplett (Dec 19 2019 at 07:59, on Zulip):

(the sed was to handle all the different buffer addresses)

Josh Triplett (Dec 19 2019 at 07:59, on Zulip):

In this 88-way build, that rustc process has 89 rustc threads, of which 86 are blocked in read on the pipe and the other 3 are blocked on pthread_cond_wait.
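
For context, those blocked read frames are exactly what acquiring a make-style jobserver token looks like: the token pool is a pipe pre-filled with one byte per job slot, acquiring a token is a blocking one-byte read, and releasing is writing the byte back. A minimal sketch of that protocol follows; the fd handling and function names are illustrative, not taken from the jobserver crate.

use std::fs::File;
use std::io::{Read, Write};
use std::mem::ManuallyDrop;
use std::os::unix::io::FromRawFd;

// Acquire one token: a blocking one-byte read from the shared pipe. When no
// tokens are available the read simply never returns, which is what the
// __GI___libc_read(fd=4, ..., nbytes=1) frames above show.
fn acquire_token(read_fd: i32) -> std::io::Result<u8> {
    // ManuallyDrop so this sketch does not close an fd it does not own.
    let mut pipe = ManuallyDrop::new(unsafe { File::from_raw_fd(read_fd) });
    let mut token = [0u8; 1];
    pipe.read_exact(&mut token)?;
    Ok(token[0])
}

// Release the token by writing the byte back, unblocking one waiting reader
// in this or any other process sharing the pipe.
fn release_token(write_fd: i32, token: u8) -> std::io::Result<()> {
    let mut pipe = ManuallyDrop::new(unsafe { File::from_raw_fd(write_fd) });
    pipe.write_all(&[token])
}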

Josh Triplett (Dec 19 2019 at 08:00, on Zulip):
     86 #6  0x00007efe03bb8019 in jobserver::Client::acquire_raw (self=<optimized out>) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/jobserver-0.1.16/src/lib.rs:363
      3 #6  rustc_rayon_core::sleep::Sleep::sleep (self=<optimized out>, worker_index=<optimized out>, registry=0x563a85f2cf80) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/rustc-rayon-core-0.3.0/src/sleep/mod.rs:335

Josh Triplett (Dec 19 2019 at 08:01, on Zulip):

@Zoxc The hung build is still around; what other information can I gather?

Zoxc (Dec 19 2019 at 08:01, on Zulip):

Are the cond vars all in Rayon?

Zoxc (Dec 19 2019 at 08:03, on Zulip):

Maybe just run that command for frames 1-7, for both processes?

Josh Triplett (Dec 19 2019 at 08:04, on Zulip):

In that process, yes, all the cond vars are in rayon.

Josh Triplett (Dec 19 2019 at 08:04, on Zulip):

I just grabbed a backtrace for the other process; checking it now.

Josh Triplett (Dec 19 2019 at 08:04, on Zulip):

...the other process is a lot more interesting.

Josh Triplett (Dec 19 2019 at 08:06, on Zulip):
      2 #1  0x00007fdc02953089 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (self=<optimized out>, ts=...) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.6.2/src/thread_parker/linux.rs:111
     72 #1  0x00007fdc029536b9 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (self=<optimized out>, ts=...) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.6.2/src/thread_parker/linux.rs:111
      1 #1  0x00007fdc0295a5a5 in jobserver::imp::Client::acquire (self=0x7fdbf4000b90) at /home/josh/.cargo/registry/src/github.com-1ecc6299db9ec823/jobserver-0.1.16/src/lib.rs:548
     13 #1  __GI___libc_read (fd=4, buf=(omitted), nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:24
      1 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x560d777f9e80, cond=0x560d777f9eb0) at pthread_cond_wait.c:502

Josh Triplett (Dec 19 2019 at 08:09, on Zulip):

Why don't I send you the backtraces of both processes?

Josh Triplett (Dec 19 2019 at 08:10, on Zulip):

bt1.gz

Josh Triplett (Dec 19 2019 at 08:10, on Zulip):

bt2.gz

Zoxc (Dec 19 2019 at 08:11, on Zulip):

Can you filter the backtraces to only show symbols, so we can get just the unique backtraces?

Josh Triplett (Dec 19 2019 at 08:11, on Zulip):

(both obtained by attaching with gdb and running thread apply all bt full)

Josh Triplett (Dec 19 2019 at 08:11, on Zulip):

What do you mean by "only show symbols"? Not addresses?

Josh Triplett (Dec 19 2019 at 08:11, on Zulip):

Or not parameters?

Zoxc (Dec 19 2019 at 08:12, on Zulip):

Not parameters, those could vary

Josh Triplett (Dec 19 2019 at 08:16, on Zulip):

I don't know how to hide those in gdb, but I can sed them away easily enough.

Josh Triplett (Dec 19 2019 at 08:22, on Zulip):

One moment...

Josh Triplett (Dec 19 2019 at 08:26, on Zulip):

OK, I managed to get the unique backtraces.

Josh Triplett (Dec 19 2019 at 08:29, on Zulip):

In the less interesting process, there are only two unique backtraces, both as you'd expect.

Josh Triplett (Dec 19 2019 at 08:30, on Zulip):

bt1-uniq

Josh Triplett (Dec 19 2019 at 08:32, on Zulip):

In the more interesting process, no two backtraces are identical.

Zoxc (Dec 19 2019 at 08:37, on Zulip):

It kind of looks like the first process is waiting for a jobserver token. Maybe the other process grabbed all 88 tokens? Or some tokens are leaking somehow.

Zoxc (Dec 19 2019 at 08:38, on Zulip):

It would be nice if you could reliably recreate it, so we could see whether it still happens with no jobserver in use.

Josh Triplett (Dec 19 2019 at 08:39, on Zulip):

I tried many times and didn't recreate it. That's why I'm keeping it around.

Josh Triplett (Dec 19 2019 at 08:39, on Zulip):

It does look like the other process is in the midst of all the work.

Josh Triplett (Dec 19 2019 at 08:41, on Zulip):

I managed to deduplicate them a bit further.

Zoxc (Dec 19 2019 at 08:47, on Zulip):

Well, thread 89 looks interesting already, since it's in monomorphization and not Rayon.

Zoxc (Dec 19 2019 at 08:49, on Zulip):

So it's probably some deadlock there

Josh Triplett (Dec 19 2019 at 08:51, on Zulip):

(Well, it is in Rayon, but it's doing actual parallel work, and then taking a lock.)

Zoxc (Dec 19 2019 at 08:53, on Zulip):

Everything is in a Rayon thread, but the lock it's stuck on is not a Rayon lock.

Josh Triplett (Dec 19 2019 at 08:53, on Zulip):

bt2-uniq

Josh Triplett (Dec 19 2019 at 08:53, on Zulip):

That's the unique backtraces in the second process, for frames 0-13.

Josh Triplett (Dec 19 2019 at 08:54, on Zulip):

That shows which ones are in monomorphization without getting into the non-unique frames.

Josh Triplett (Dec 19 2019 at 08:56, on Zulip):

I need to go to bed. I'll leave the processes running until tomorrow morning. If you think of any other information you want from them, let me know.

Zoxc (Dec 19 2019 at 08:56, on Zulip):

Oh, it's only 72? I was expecting more

Zoxc (Dec 19 2019 at 08:57, on Zulip):

I think this should suffice to track it down, unless there are multiple bugs.

Josh Triplett (Dec 19 2019 at 08:59, on Zulip):

This was run with -Zthreads=88.

Josh Triplett (Dec 19 2019 at 09:00, on Zulip):

Note that since I only have 72 cores, perhaps only 72 were getting to run?

Josh Triplett (Dec 19 2019 at 09:00, on Zulip):

Thanks for digging into this!

Zoxc (Dec 19 2019 at 09:07, on Zulip):

I guess cargo would limit to 72, unless you did -j88 too?

Zoxc (Dec 19 2019 at 09:07, on Zulip):

72 would make sense then

Zoxc (Dec 19 2019 at 09:50, on Zulip):

The problem is that record_accesses calls is_inlining_candidate inside the iterator, which is enumerated inside the lock. is_inlining_candidate should instead be called when appending the item to neighbors in collect_items_rec.
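
A hedged sketch of the shape of that problem and fix follows. The types and bodies are made up for illustration (this is not rustc's mono-item collector), but they show the pattern being described: per-item work like is_inlining_candidate, which can block or take other locks, must not run while the shared lock is held; compute it first, then take the lock only to append. Zoxc's suggestion hoists the call even further, into collect_items_rec where items are appended to neighbors, but the principle is the same.

use std::sync::Mutex;

// Hypothetical stand-ins; the names mirror the discussion, the bodies do not.
#[derive(Clone)]
struct MonoItem(String);

fn is_inlining_candidate(item: &MonoItem) -> bool {
    // In rustc this can run queries and take other locks, so it must not be
    // called while the shared lock below is held.
    item.0.len() < 8
}

struct AccessMap {
    neighbors: Mutex<Vec<(MonoItem, bool)>>,
}

impl AccessMap {
    // Problematic shape: classifying each item while the lock is held, so any
    // blocking inside is_inlining_candidate stalls every thread waiting on it.
    fn record_accesses_in_lock(&self, items: &[MonoItem]) {
        let mut neighbors = self.neighbors.lock().unwrap();
        for item in items {
            neighbors.push((item.clone(), is_inlining_candidate(item)));
        }
    }

    // Fixed shape: do the per-item work first, then take the lock only long
    // enough to append the already-computed results.
    fn record_accesses(&self, items: &[MonoItem]) {
        let classified: Vec<(MonoItem, bool)> = items
            .iter()
            .map(|item| (item.clone(), is_inlining_candidate(item)))
            .collect();
        self.neighbors.lock().unwrap().extend(classified);
    }
}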

Josh Triplett (Dec 19 2019 at 15:34, on Zulip):

That makes sense.

Josh Triplett (Dec 19 2019 at 15:34, on Zulip):

Nice catch!

Last update: Jul 02 2020 at 18:55 UTC