[PATCH v2 1/2] drm: Add GPU reset sysfs event

Thu Mar 17 15:34:21 UTC 2022

On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm it's memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.

We do one GEM obj allocation in the snapshot path (the hw has a
mechanism to snapshot it's own state into a gpu buffer.. not sure if
nice debugging functionality like that is a commentary on the blob
driver quality, but I'm not complaining)

I suppose we could pre-allocate this buffer up-front.. but it doesn't
seem like a problem, ie. if allocation fails we just skip snapshotting
stuff that needs the hw crashdumper.  I guess since vram is not
involved, perhaps that makes the situation a bit more straightforward.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)

Note that the bo's associated with the batch are still pinned at this
point, from the bo lifecycle the batch is still active.  So from the
point of view of shrinker, there should be no interaction.  We aren't
doing anything with mmu notifiers (yet), so not entirely sure offhand
the concern there.

Currently we just use GFP_KERNEL and bail if allocation fails.

BR,
-R

> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch