[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output

Thu Mar 14 09:35:35 UTC 2019

On Thu, Mar 14, 2019 at 05:45:32AM +0100, Ahmed S. Darwish wrote:
> Hi,
> 
> On Wed, Mar 13, 2019 at 09:35:05AM +0100, Daniel Vetter wrote:
> > On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote:
> > > On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> > > > Den 11.03.2019 20.23, skrev Daniel Vetter:
> [……]
> > > > >
> > > > > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > > > > think that's good enough. Comment to that effect would be good e.g.
> > > > >
> > > > > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > > > > 	 * assume still works */
> > > > >
> > > > > If we aim for perfect this should be a trylock still, maybe using our own
> > > > > device list.
> > > > >
> > >
> > > I definitely agree here.
> > >
> > > The lock may already be locked either by a stopped CPU, or by the
> > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > printing CPU).
> > >
> > > This is why it's very common for example in serial consoles, which
> > > are usually careful about re-entrance and panic contexts, to do:
> > >
> > >   xxxxxx_console_write(...) {
> > > 	if (oops_in_progress)
> > > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > > 	else
> > > 		spin_lock_irqsave(&port->lock, flags);
> > >   }
> > >
> > > I'm quite positive we should do the same for panic drm drivers.
> >
> > Yeah Ideally all the locking in the drm path would be trylock only.
> >
> > I wonder whether lockdep could help us validate this, with some "don't
> > allow anything except trylocks in this context". It's easy to audit the
> > core code with review, but drivers are much tougher. And often end up with
> > really deep callchains to get at the backing buffers.
> [……]
> > > > > Instead what I had in mind is to recreate the worst
> > > > > possible panic context as much as feasible (disabling interrupts should be
> > > > > a good start, maybe we can even do an nmi callback), and then call our
> > > > > panic implementation. That way we can test the panic handler in a
> > > > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > > > tainting).
> > > >
> > > > The interrupt case I can do, nmi I have no idea.
> > > >
> > >
> > > I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> > > would be a nice non-destructive test-case emulation.
> >
> > See above, if we can somehow emulate "all locks are held, only allow
> > trylock" with lockdep that would be great too.
> >
> 
> Hmmmm, to prototype things faster, I'll do it as a hook:
> 
>    spin_lock(spinlock_t *lock)
>    {
>         might_spin_forever_under_panic_path();
>         ...
>    }
> 
> Just like how mutexes and DEBUG_ATOMIC_SLEEP do it today:
> 
>    void mutex_lock(struct mutex *lock)
>    {
>         might_sleep();
>         ...
>    }
> 
> Hopefully the locking maintainers can accept something like this.
> If not, let's play with lockdep (which is a bit complex).
> 
> ===> Thanks to Sebastian for suggesting this!

Ah yes this is a very nice idea. We only need to make sure that we don't
enable any of this for a real oops. Or we just scroll away the real oops
with tons of warnings.
-Daniel

> 
> > Plus nmi context, once/if that somehow becomes relevant.
> >
> > The thing that killed the old drm panic handling code definitely was that
> > we flat out couldnt' test it except with real oopses. And that's just
> > whack-a-mole and bug reporter frustration if you first have a few patch
> > iterations around "oh sry, it's still some oops in the panic handler, not
> > the first one that we're seeing" :-/
> >
> 
> Oh, that's a critical point:
> 
>    full data > some data > no data > invalid/useless data
> 
> First impressions for the feature will definitely count. Hopefully
> the hooks above and some heavy DRM CI testing __beforehand__ can
> improve things before publishing to a wider audience..

Yup, let's get this right!
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch