[PATCH v2 1/2] drm: Add GPU reset sysfs event

Tue Mar 15 07:25:48 UTC 2022

Am 15.03.22 um 08:13 schrieb Dave Airlie:
> On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher at gmail.com> wrote:
>> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:
>>> On Thu, 10 Mar 2022 11:56:41 -0800
>>> Rob Clark <robdclark at gmail.com> wrote:
>>>
>>>> For something like just notifying a compositor that a gpu crash
>>>> happened, perhaps drm_event is more suitable.  See
>>>> virtio_gpu_fence_event_create() for an example of adding new event
>>>> types.  Although maybe you want it to be an event which is not device
>>>> specific.  This isn't so much of a debugging use-case as simply
>>>> notification.
>>> Hi,
>>>
>>> for this particular use case, are we now talking about the display
>>> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>>>
>>> If the former, I wasn't aware that display device crashes are a thing.
>>> How should a userspace display server react to those?
>>>
>>> If the latter, don't we have EGL extensions or Vulkan API already to
>>> deliver that?
>>>
>>> The above would be about device crashes that directly affect the
>>> display server. Is that the use case in mind here, or is it instead
>>> about notifying the display server that some application has caused a
>>> driver/hardware crash? If the latter, how should a display server react
>>> to that? Disconnect the application?
>>>
>>> Shashank, what is the actual use case you are developing this for?
>>>
>>> I've read all the emails here so far, and I don't recall seeing it
>>> explained.
>>>
>> The idea is that a support daemon or compositor would listen for GPU
>> reset notifications and do something useful with them (kill the guilty
>> app, restart the desktop environment, etc.).  Today when the GPU
>> resets, most applications just continue assuming nothing is wrong,
>> meanwhile the GPU has stopped accepting work until the apps re-init
>> their context so all of their command submissions just get rejected.
> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

That's a well known problem inherit to the uses of PIDs.

IIRC because of this the kernel only reuses PIDs when 
/proc/sys/kernel/pid_max is reached and then wraps around.

Regards,
Christian.

>
> Dave.