[PATCH] RFC: console: hack up console_lock more v3

Thu May 9 16:43:12 UTC 2019

On Thu, May 9, 2019 at 4:56 PM Petr Mladek <pmladek at suse.com> wrote:
>
> On Thu 2019-05-09 14:09:03, Daniel Vetter wrote:
> > console_trylock, called from within printk, can be called from pretty
> > much anywhere. Including try_to_wake_up. Note that this isn't common,
> > usually the box is in pretty bad shape at that point already. But it
> > really doesn't help when then lockdep jumps in and spams the logs,
> > potentially obscuring the real backtrace we're really interested in.
> > One case I've seen (slightly simplified backtrace):
> >
> >  Call Trace:
> >   <IRQ>
> >   console_trylock+0xe/0x60
> >   vprintk_emit+0xf1/0x320
> >   printk+0x4d/0x69
> >   __warn_printk+0x46/0x90
> >   native_smp_send_reschedule+0x2f/0x40
> >   check_preempt_curr+0x81/0xa0
> >   ttwu_do_wakeup+0x14/0x220
> >   try_to_wake_up+0x218/0x5f0
> >   pollwake+0x6f/0x90
> >   credit_entropy_bits+0x204/0x310
> >   add_interrupt_randomness+0x18f/0x210
> >   handle_irq+0x67/0x160
> >   do_IRQ+0x5e/0x130
> >   common_interrupt+0xf/0xf
> >   </IRQ>
> >
> > This alone isn't a problem, but the spinlock in the semaphore is also
> > still held while waking up waiters (up() -> __up() -> try_to_wake_up()
> > callchain), which then closes the runqueue vs. semaphore.lock loop,
> > and upsets lockdep, which issues a circular locking splat to dmesg.
> > Worse it upsets developers, since we don't want to spam dmesg with
> > clutter when the machine is dying already.
> >
> > Fix this by creating a prinkt_safe_up() which calls wake_up_process
> > outside of the spinlock. This isn't correct in full generality, but
> > good enough for console_lock:
> >
> > - console_lock doesn't use interruptible or killable or timeout down()
> >   calls, hence an up() is the only thing that can wake up a process.
> >   Hence the process can't get woken and killed and reaped while we try
> >   to wake it up too.
> >
> > - semaphore.c always updates the waiter list while under the spinlock,
> >   so there's no other races. Specifically another process that races
> >   with a quick console_lock/unlock while we've dropped the spinlock
> >   already won't see our own waiter.
> >
> > Note that we only have to break the recursion for the semaphore.lock
> > spinlock of the console_lock. Recursion within various scheduler
> > related locks is already prevented by the printk_safe_enter/exit pair
> > in __up_console_sem().
>
> This is not fully true. printk_safe() helps only when
> the first try_to_wake_up() is called from printk_safe() context.
>
> > --- a/kernel/locking/semaphore.c
> > +++ b/kernel/locking/semaphore.c
> > @@ -197,6 +197,37 @@ struct semaphore_waiter {
> >       bool up;
> >  };
> >
> > +/**
> > + * printk_safe_up - release the semaphore in console_unlock
> > + * @sem: the semaphore to release
> > + *
> > + * Release the semaphore.  Unlike mutexes, up() may be called from any
> > + * context and even by tasks which have never called down().
> > + *
> > + * NOTE: This is a special version of up() for console_unlock only. It is only
> > + * safe if there are no killable, interruptible or timing out down() calls.
> > + */
> > +void printk_safe_up(struct semaphore *sem)
> > +{
> > +     unsigned long flags;
> > +     struct semaphore_waiter *waiter = NULL;
> > +
> > +     raw_spin_lock_irqsave(&sem->lock, flags);
> > +     if (likely(list_empty(&sem->wait_list))) {
> > +             sem->count++;
> > +     } else {
> > +             waiter = list_first_entry(&sem->wait_list,
> > +                                       struct semaphore_waiter, list);
> > +             list_del(&waiter->list);
> > +             waiter->up = true;
> > +     }
> > +     raw_spin_unlock_irqrestore(&sem->lock, flags);
> > +
> > +     if (waiter) /* protected by being sole wake source */
> > +             wake_up_process(waiter->task);
>
> I still do not see how this could help. Let's take the above
> backtrace as example:
>
>    <IRQ>
>    console_trylock+0xe/0x60
>    vprintk_emit+0xf1/0x320
>    printk+0x4d/0x69
>    __warn_printk+0x46/0x90
>    native_smp_send_reschedule +0x2f/0x40
>    check_preempt_curr+0x81/0xa0
>    ttwu_do_wakeup+0x14/0x220
>    try_to_wake_up+0x218/0x5f0
>    pollwake+0x6f/0x90
>    credit_entropy_bits+0x204/0x310
>    add_interrupt_randomness+0x18f/0x210
>    handle_irq+0x67/0x160
>    do_IRQ+0x5e/0x130
>    common_interrupt+0xf/0xf
>    </IRQ>
>
> We have the following chain of calls:
>
>   + do_IRQ()
>     ...
>       + try_to_wake_up()    # takes p->pi_lock
>         + ttwu_remote()     # takes rq lock
>           + ttwu_do_wakeup()
>             + check_preempt_curr()
>               + native_smp_send_reschedule()
>                 + __warn_printk()
>                   + printk()
>                     + vprintk_emit()
>                       + console_trylock() # success
>                       + console_unlock()
>                         + up_console_sem()
>                           + up() # wait list in not empty
>                             + __up()
>                               + wake_up_process()
>                                 + try_to_wake_up()
>
> !BANG! Deadlock on p->pi_lock.

Hm right ... I only looked at this starting with console_unlock.

> It does not matter if the nested try_to_wake_up() was called
> under sem->lock or outside.
>
> By other words. The patch removed one lockdep warning. But it just
> just delayed the deadlock. It will not happen on sem->lock but
> later on p->pi_lock.
>
> I am repeating myself. But IMHO, the only solution is to introduce
> printk deferred context and use it in WARN_DEFERRED().

One thing to keep in mind is that the kernel is already dying, and
things will come crashing down later on (I've seen this only in dmesg
tails capture in pstore in our CI, i.e. the box died for good). I just
want to make sure that the useful information isn't overwritten by
more dmesg splats that happen as a consequence of us somehow trying to
run things on an offline cpu. Once console_unlock has completed in
your above backtrace and the important stuff has gone out I'm totally
fine with the kernel just dying. Pulling the wake_up_process out from
under the semaphore.lock is enough to prevent lockdep from dumping
more stuff while we're trying to print the important things, and I
think the untangling of the locking hiararchy is useful irrespective
of this lockdep splat. Plus Peter doesn't sound like he likes to roll
out more printk_deferred kind of things.

But if you think I should do the printk_deferred thing too I can look
into that. Just not quite sure what that's supposed to look like now.
-Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch