[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output
Ahmed S. Darwish
darwish.07 at gmail.com
Thu Mar 14 02:51:13 UTC 2019
[[ Adding Sebastian, who is quite experienced in intricate
locking situations due to daily PREEMPT_RT work.. ]]
On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
> On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote:
> > On 2019-03-12, Ahmed S. Darwish <darwish.07 at gmail.com> wrote:
> > > On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
[……]
> > > >
> > > > If we aim for perfect this should be a trylock still, maybe using
> > > > our own device list.
> > > >
> > >
> > > I definitely agree here.
> > >
> > > The lock may already be locked either by a stopped CPU, or by the
> > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > printing CPU).
> > >
> > > This is why it's very common for example in serial consoles, which
> > > are usually careful about re-entrance and panic contexts, to do:
> > >
> > > xxxxxx_console_write(...) {
> > > if (oops_in_progress)
> > > locked = spin_trylock_irqsave(&port->lock, flags);
> > > else
> > > spin_lock_irqsave(&port->lock, flags);
> > > }
> > >
> > > I'm quite positive we should do the same for panic drm drivers.
> >
> > This construction will continue, even if the trylock fails. It only
> > makes sense to do this if the driver has a chance of being
> > successful. Ignoring locks is a serious issue. I personally am quite
> > unhappy that the serial drivers do this, which was part of my motivation
> > for the new printk design I'm working on.
> >
> > If the driver is not capable of doing something useful on a failed
> > trylock, then I recommend just skipping that device. Maybe trying it
> > again later after trying all the devices?
>
> Ah yes missed that. If the trylock fails anywhere, we must bail out.
>
> Not sure retrying is useful, my experience from at least drm is that
> either you're lucky, and drm wasn't doing anything right when the machine
> blew up, and then the trylocks will all go through. Or you're unlucky, and
> most likely that means drm itself blew up, and no amount of retrying is
> going to help. I wouldn't bother.
>
Ack, retrying most probably won't help.
I see that, at least in x86:
kernel/panic.c::panic()
=> kernel/panic.c::smp_send_stop()
=> arch/x86/incluse/smp.h::smp_send_stop(0)
=> arch/x86/kernel/smp.c::native_stop_other_cpus(wait)
/*
* Don't wait longer than a second if the caller
* didn't ask us to wait.
*/
timeout = USEC_PER_SEC;
while (num_online_cpus() > 1 && (wait || timeout--))
udelay(1);
So the panic()-ing core wait by default for _a full second_ until
all other cores finish their non-preemptible regions. If that did
not work out, it even tries more aggressively with NMIs.
So if one of the other non panic()-ing cores was holding a DRM
related lock through spin_lock_irqsave() for more than one second,
well, it's a bug in the DRM layer code anyway.. [A]
Problem is, what will happen if the non panic()-ing core was
holding a DRM lock through barebone spin_lock()? Interrupts are
enabled there, so the IPI will be handled, and thus that core will
effectively die while the lock is forever held :-( [B]
But, well, yeah, I don't think there's a solution for the [B] part
except by spin_trylock() the fewest amount of locks possible, and
bailing out if anyone of them is held.
For reference, I asked over IRC why not just resetting GPU state,
but it's not easy as it sounds to people outside of the gfx world:
[abridged]
<darwi> So it seems Windows is forcing driver writers to
implement a gpu reset method [*]
<darwi> since Windows vista
<danvet> yeah
<danvet> not going to do that from panic
<danvet> there's a non-zero chance that gpu reset takes down
the system with at least lots of hw/driver combos
<danvet> I think all our drivers also have gpu reset support
<danvet> if something is still rendering while we panic, mea culpa
<danvet> only way around that is switching planes
<danvet> which we tried with the old approach, and that's just
too hard
<danvet> you end up running a few 100kloc of code worst case
<danvet> no way to audit/validate that for panic context
<danvet> but the rendering shouldn't really matter
<anholt> yeah, actually modesetting in a panic seems hopeless.
<danvet> at least for modern compositors
<danvet> anholt, yeah I killed that idea
<danvet> so they only render to the backbuffer
<danvet> and we won't show that with a panic
<anholt> yep. just draw on a plane. for everything but
uncomposited x11, it'll already be idle.
<danvet> so only really an issue on boot splash and frontbuffer x11
<danvet> and there a few cursors/characters drawn won't really
matter in most cases
<danvet> anholt, yeah that's the idea, we pick the currently
showing plane and have a simple draw_xy functions which
knows how to draw bright/dark pixels for any drm_fourcc
<danvet> so you can see the oops even if it's yuv or whatever
<anholt> nice
Thanks!
[*] DXGKDDI_RESETFROMTIMEOUT callback function:
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/d3dkmddi/nc-d3dkmddi-dxgkddi_resetfromtimeout
--
darwi
http://darwish.chasingpointers.com
More information about the Intel-gfx
mailing list