[PATCH] RFC: console: hack up console_trylock more

Mon May 6 08:40:32 UTC 2019

On Mon, May 6, 2019 at 9:48 AM Petr Mladek <pmladek at suse.com> wrote:
> On Mon 2019-05-06 09:11:37, Daniel Vetter wrote:
> > On Fri, May 3, 2019 at 5:14 PM Petr Mladek <pmladek at suse.com> wrote:
> > > On Thu 2019-05-02 16:16:43, Daniel Vetter wrote:
> > > > console_trylock, called from within printk, can be called from pretty
> > > > much anywhere. Including try_to_wake_up. Note that this isn't common,
> > > > usually the box is in pretty bad shape at that point already. But it
> > > > really doesn't help when then lockdep jumps in and spams the logs,
> > > > potentially obscuring the real backtrace we're really interested in.
> > > > One case I've seen (slightly simplified backtrace):
> > > >
> > > >  Call Trace:
> > > >   <IRQ>
> > > >   console_trylock+0xe/0x60
> > > >   vprintk_emit+0xf1/0x320
> > > >   printk+0x4d/0x69
> > > >   __warn_printk+0x46/0x90
> > > >   native_smp_send_reschedule+0x2f/0x40
> > > >   check_preempt_curr+0x81/0xa0
> > > >   ttwu_do_wakeup+0x14/0x220
> > > >   try_to_wake_up+0x218/0x5f0
> > > >   pollwake+0x6f/0x90
> > > >   credit_entropy_bits+0x204/0x310
> > > >   add_interrupt_randomness+0x18f/0x210
> > > >   handle_irq+0x67/0x160
> > > >   do_IRQ+0x5e/0x130
> > > >   common_interrupt+0xf/0xf
> > > >   </IRQ>
> > > >
> > > > This alone isn't a problem, but the spinlock in the semaphore is also
> > > > still held while waking up waiters (up() -> __up() -> try_to_wake_up()
> > > > callchain), which then closes the runqueue vs. semaphore.lock loop,
> > > > and upsets lockdep, which issues a circular locking splat to dmesg.
> > > > Worse it upsets developers, since we don't want to spam dmesg with
> > > > clutter when the machine is dying already.
> > > >
> > > > Fix this by creating a __down_trylock which only trylocks the
> > > > semaphore.lock. This isn't correct in full generality, but good enough
> > > > for console_lock:
> > > >
> > > > - there's only ever one console_lock holder, we won't fail spuriously
> > > >   because someone is doing a down() or up() while there's still room
> > > >   (unlike other semaphores with count > 1).
> > > >
> > > > - console_unlock() has one massive retry loop, which will catch anyone
> > > >   who races the trylock against the up(). This makes sure that no
> > > >   printk lines will get lost. Making the trylock more racy therefore
> > > >   has no further impact.
> > >
> > > To be honest, I do not see how this could solve the problem.
> > >
> > > The circular dependency is still there. If the new __down_trylock()
> > > succeeds then console_unlock() will get called in the same context
> > > and it will still need to call up() -> try_to_wake_up().
> > >
> > > Note that there are many other console_lock() callers that might
> > > happen in parallel and might appear in the wait queue.
> >
> > Hm right. It's very rare we hit this in our CI and I don't know how to
> > repro otherwise, so just threw this out at the wall to see if it
> > sticks. I'll try and come up with a new trick then.
>
> Single messages are printed from scheduler via printk_deferred().
> WARN() might be solved by introducing printk deferred context,
> see the per-cpu variable printk_context.

I convinced myself that I can take the wake_up_process out from under
the spinlock, for the limited case of the console lock. I think that's
a cleaner and more robust fix than leaking printk_context trickery
into the console_unlock code.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch