[PATCH v2 1/2] drm: Add GPU reset sysfs event

Christian König christian.koenig at amd.com
Thu Mar 17 09:46:27 UTC 2022


Am 17.03.22 um 10:29 schrieb Daniel Vetter:
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
>> Am 16.03.22 um 16:36 schrieb Rob Clark:
>>> [SNIP]
>>> just one point of clarification.. in the msm and i915 case it is
>>> purely for debugging and telemetry (ie. sending crash logs back to
>>> distro for analysis if user has crash reporting enabled).. it isn't
>>> used for triggering any action like killing app or compositor.
>> By the way, how does msm do its memory management for the devcoredumps?
> GFP_NORECLAIM all the way. It's purely best effort.

Ok, good to know that it's as simple as that.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens.

> This is what vk wants

That's exactly what I've been telling an internal team for a couple of
years now as well. Good to know that this is not totally crazy.

>   and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.

Interesting idea, so basically we initially capture only the state that
the reset is about to destroy, and grab a reference on the killed
application's buffers to gather the rest before we clean them up.
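
A rough sketch of that split, written as a plain userspace C model rather
than real driver code. All names here (struct bo, struct crash_dump,
dump_capture_early(), dump_capture_late()) are made up for illustration,
and malloc() only stands in for a GFP_KERNEL allocation done from a worker.
The assumption is a non-recoverable context, so nothing overwrites the bo
between the two phases.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_REGS 4

/* Hypothetical buffer object; the refcount keeps it alive past teardown. */
struct bo {
	int refcount;
	size_t size;
	uint8_t *data;
};

/* Hypothetical dump record. The register array is embedded, so the
 * reset path never has to allocate anything. */
struct crash_dump {
	uint32_t regs[NUM_REGS];	/* filled in the reset path */
	struct bo *pending;		/* bo to capture, reference held */
	uint8_t *bo_copy;		/* filled later by the worker */
	size_t bo_copy_size;
};

static struct bo *bo_create(const void *src, size_t size)
{
	struct bo *bo = malloc(sizeof(*bo));

	if (!bo)
		return NULL;
	bo->data = malloc(size);
	if (!bo->data) {
		free(bo);
		return NULL;
	}
	memcpy(bo->data, src, size);
	bo->size = size;
	bo->refcount = 1;
	return bo;
}

static void bo_get(struct bo *bo)
{
	bo->refcount++;
}

static void bo_put(struct bo *bo)
{
	assert(bo->refcount > 0);
	if (--bo->refcount == 0) {
		free(bo->data);
		free(bo);
	}
}

/* Phase 1, reset path: no allocation. Copy the registers into the
 * preallocated array and pin the bo so it survives context cleanup. */
static void dump_capture_early(struct crash_dump *dump,
			       const uint32_t *hw_regs, struct bo *bo)
{
	memcpy(dump->regs, hw_regs, sizeof(dump->regs));
	bo_get(bo);
	dump->pending = bo;
}

/* Phase 2, worker: allocating is fine here. Copy the bo contents, then
 * drop the reference taken in phase 1. Best effort: on allocation
 * failure the register state is still kept. */
static int dump_capture_late(struct crash_dump *dump)
{
	struct bo *bo = dump->pending;
	int ret = -1;

	if (!bo)
		return ret;

	dump->bo_copy = malloc(bo->size);
	if (dump->bo_copy) {
		memcpy(dump->bo_copy, bo->data, bo->size);
		dump->bo_copy_size = bo->size;
		ret = 0;
	}
	dump->pending = NULL;
	bo_put(bo);
	return ret;
}
```

The point of the model is only the ordering: the reset path touches nothing
but preallocated storage and a refcount, and everything that needs memory
happens later, outside of it.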

Going to keep that in mind as well.

Thanks,
Christian.

>
> msm probably gets it wrong since embedded drivers have far fewer shrinkers
> and generally no mmu notifiers going on :-)
>
>> I mean it is strictly forbidden to allocate any memory in the GPU reset
>> path.
>>
>>> I would however *strongly* recommend devcoredump support in other GPU
>>> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
>>> to debug and fix a couple obscure issues that I was not able to
>>> reproduce by myself.
>> Yes, completely agree as well.
> +1
>
> Cheers, Daniel



More information about the amd-gfx mailing list