[Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output

Fri Mar 15 10:58:15 UTC 2019

On Thu, Mar 14, 2019 at 12:44:06PM +0000, Kazlauskas, Nicholas wrote:
> On 3/14/19 5:50 AM, Daniel Vetter wrote:
> > On Wed, Mar 13, 2019 at 05:41:52PM +0000, Kazlauskas, Nicholas wrote:
> >> On 3/13/19 1:33 PM, Michel Dänzer wrote:
> >>> On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
> >>>> On 3/13/19 11:54 AM, Christian König wrote:
> >>>>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
> >>>>>> On 2019-03-13 2:37 p.m., Christian König wrote:
> >>>>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
> >>>>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
> >>>>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> >>>>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> >>>>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >>>>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>>>>>>>>>>>> This adds support for outputting kernel messages on panic().
> >>>>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
> >>>>>>>>>>>>> iterates
> >>>>>>>>>>>>> over each DRM device and it's crtc's to find suitable
> >>>>>>>>>>>>> framebuffers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
> >>>>>>>>>>>>> Only atomic drivers are supported.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf at tronnes.org>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>      [...]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> b/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
> >>>>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>>>>>>>>>>>                   struct drm_file *file_priv, unsigned flags,
> >>>>>>>>>>>>>                   unsigned color, struct drm_clip_rect *clips,
> >>>>>>>>>>>>>                   unsigned num_clips);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +    /**
> >>>>>>>>>>>>> +     * @panic_vmap:
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * Optional callback for panic handling.
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
> >>>>>>>>>>>>> Must
> >>>>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * RETURNS:
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
> >>>>>>>>>>>>> which is
> >>>>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
> >>>>>>>>>>>>> structure
> >>>>>>>>>>>>> +     * with more details, just a few flags, ...
> >>>>>>>>>>>>> +     */
> >>>>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>>>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
> >>>>>>>>>>>> amdgpu/radeon
> >>>>>>>>>>>> drivers:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
> >>>>>>>>>>>> results in
> >>>>>>>>>>>> garbled output.
> >>>>>>>>>>>>
> >>>>>>>>>> In which case the driver needs to support the ->panic_draw_xy
> >>>>>>>>>> callback,
> >>>>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
> >>>>>>>>> I'm afraid that won't help, at least not without porting big chunks of
> >>>>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
> >>>>>>>>> into the kernel, none of which will be used for anything else.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>> There would need to be a mechanism for switching scanout to a
> >>>>>>>>>>>> linear,
> >>>>>>>>>>>> CPU accessible framebuffer.
> >>>>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
> >>>>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
> >>>>>>>>>>> over to the real fb.
> >>>>>>>>> Copy how? Using a GPU engine?
> >>>>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
> >>>>>>>> accesible :/
> >>>>>>> Well we do have a debug path for accessing invisible memory with the
> >>>>>>> CPU.
> >>>>>>>
> >>>>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
> >>>>>>> just read/write DATA over and over again if you want to access some
> >>>>>>> memory.
> >>>>>> Right. I assume that'll be very slow, but I guess it could do when the
> >>>>>> memory isn't directly CPU accessible.
> >>>>>
> >>>>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
> >>>>> that interfaces takes about 13 seconds.
> >>>>>
> >>>>> IIRC we don't use the auto increment optimization yet, so that can
> >>>>> probably be improved by a factor of 3 or more.
> >>>
> >>> I'd assume only writes are needed, no reads.
> >>>
> >>>
> >>>>>>> But turning of tilling etc is still extremely tricky when the system is
> >>>>>>> already unstable.
> >>>>>> Maybe we could add a little hook to the display code, which just
> >>>>>> disables tiling for scanout and maybe disables non-primary planes, but
> >>>>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
> >>>>>>
> >>>>>>
> >>>>>> I'm coming around from "this is never going to work" to "it might
> >>>>>> actually work" with our hardware...
> >>>>>
> >>>>> Yeah, agree. It's a bit tricky, but doable.
> >>>>
> >>>> A "disable_tiling" hook or something along those lines could work for
> >>>> display. It's a little bit non trivial when you want to start dealing
> >>>> with locking and any active DRM commits, but we have a global lock
> >>>> around all our hardware programming anyway that makes that easier to
> >>>> deal with.
> >>>
> >>> This code only runs when there's a kernel panic, so it doesn't have to
> >>> care about such things. :) In fact, it should rather avoid them and only
> >>> do the absolute minimum register writes needed to do the job.
> >>
> >> Good point. I guess it doesn't really matter at that point.
> >>
> >>>
> >>>
> >>>> I think we can just re-commit and update the existing hardware state
> >>>> with only the tiling info for every plane reset to off.
> >>>
> >>> There's also a concern that non-primary planes might prevent the output
> >>> from being visible.
> >>
> >> I would just hope that the overlay doesn't cover the full screen, or
> >> that the panic handler writes to all buffers if possible.
> >>
> >> DC would have to swap the current context in order to disable planes and
> >> we'd have to do memory allocations / frees etc in order to do so.
> > 
> > You can't. And I'm with Michel, you don't want to run any of the DC code.
> > Pulling in ~100kloc of dependency in panic context just doesn't work
> > (that's what we tried before essentially).
> > 
> > What you can do is
> > - trylock, and bail out if any lock is locked (since that means undefined
> >    hw or sw state, and then you really shouldn't touch anything)
> > - maybe write a few very specific registers with the most minimal amount
> >    of code (since we need to validate every line that runs in oops/panic
> >    context to make sure it still works in nmi context). Stuff like
> >    disabling tiling (on i915 that's just flipping a bit in a register).
> > - memory writes (or that slow indirect write interface you have for
> >    cpu invisible vram) to draw the actual oops.
> > 
> > Everything else is out of question. Especially calling existing display
> > code just doesn't work, it's ime impossible to keep allocations/spinlocks
> > and stuff like that out of it.
> > -Daniel
> 
> At some point we'll have to run at least some DC code since disabling 
> tiling and compression is ASIC specific and you need to know what 
> registers to actually write.
> 
> There's no fast path that DC exposes specifically for this usecase, but 
> locking is controlled outside of DC anyway (so we don't have to lock) 
> and it DC itself won't be doing any memory allocations if it's just 
> disabling tiling and compression.
> 
> It's not actually that many register reads/writes, and it could probably 
> stripped down to just writes since the reads are only used for sanity 
> checking and the system is going down anyway. That part might need a 
> separate interface for dc_commit_updates_for_stream or for amdgpu_dm to 
> disable sanity checking before the call through DC's debug options.

I'd copypaste that code for oops handling. Resetting to untiled shouldn't
be that much code. And the risk of accidentally pulling in a lot of other
bits are imo too high.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch