Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

14 Mar 2022


On Mon, 14 Mar 2022 10:23:27 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:
...
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
...
On Thu, 10 Mar 2022 11:56:41 -0800
Rob Clark <robdclark@gmail.com> wrote:
...
For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.
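
As a rough illustration of the drm_event route suggested above: events read
from a DRM file descriptor arrive as a packed stream of records, each
starting with a header that carries a type and a total length. The sketch
below walks such a buffer and counts records of one type. It is only a
sketch under stated assumptions: the header struct is a local mirror of the
uapi struct drm_event so the example builds without kernel headers, and
EXAMPLE_EVENT_GPU_RESET and count_reset_events() are hypothetical names
invented for illustration. No such event type is defined by the patch or by
the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the uapi struct drm_event header (type + length). */
struct drm_event_hdr {
    uint32_t type;
    uint32_t length; /* total size of the event, header included */
};

/*
 * HYPOTHETICAL event type, for illustration only. It is placed in the
 * range drm.h reserves for chipset-specific events (0x80000000 and up).
 */
#define EXAMPLE_EVENT_GPU_RESET 0x80000001u

/*
 * Walk a buffer as returned by read() on a DRM fd and count the events
 * whose type matches the hypothetical reset event. Parsing stops at the
 * first truncated or malformed record.
 */
size_t count_reset_events(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    size_t found = 0;

    while (len - off >= sizeof(struct drm_event_hdr)) {
        struct drm_event_hdr ev;

        /* memcpy avoids unaligned access into the byte buffer. */
        memcpy(&ev, buf + off, sizeof(ev));

        /* Reject lengths shorter than a header or past the buffer end. */
        if (ev.length < sizeof(ev) || ev.length > len - off)
            break;

        if (ev.type == EXAMPLE_EVENT_GPU_RESET)
            found++;

        off += ev.length; /* skip the payload, whatever it contains */
    }
    return found;
}
```

A compositor would presumably call something like this from its existing
DRM fd read loop, next to its vblank and page-flip handling. Whether the
event should be device specific or generic is the open question raised
above.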
Hi,
for this particular use case, are we now talking about the display
device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
If the former, I wasn't aware that display device crashes are a thing.
How should a userspace display server react to those?
If the latter, don't we have EGL extensions or Vulkan API already to
deliver that?
The above would be about device crashes that directly affect the
display server. Is that the use case in mind here, or is it instead
about notifying the display server that some application has caused a
driver/hardware crash? If the latter, how should a display server react
to that? Disconnect the application?
Shashank, what is the actual use case you are developing this for?
I've read all the emails here so far, and I don't recall seeing it
explained.
The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.
...
Btw. somewhat relatedly, there has been work aiming to allow
graceful hot-unplug of DRM devices. There is a kernel doc outlining how
the various APIs should react towards userspace when a DRM device
suddenly disappears. That seems to have some overlap here IMO.
See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
which also has a couple pointers to EGL and Vulkan APIs.
The problem is most applications don't use the GL or VK robustness
APIs.
Hi,
how would this new event help with that?
I mean, yeah, there could be a daemon that kills those GPU users, but
then what? You still lose any unsaved work, and may need to manually
restart them.
Is the idea that it is better to have the app crash and disappear than
to look like it froze while it otherwise still runs?
If some daemon or compositor goes killing apps that trigger GPU resets,
then how do we stop that for an app that actually does use the
appropriate EGL or Vulkan APIs to detect and remedy that situation
itself?
...
You could use something like that in the compositor, but those
APIs tend to be focused more on the application itself rather than the
GPU in general, e.g., "Is my context lost?"  That is fine for
restarting your context, but doesn't really help if you want to try
and do something with another application (i.e., the likely guilty
app).  Also, on dGPU at least, when you reset the GPU, vram is usually
lost (either due to the memory controller being reset, or vram being
zero'd on init due to ECC support), so even if you are not the guilty
process, in that case you'd need to re-init your context anyway.
Why should something like a compositor listen for this and kill apps
that triggered GPU resets, instead of e.g. Mesa noticing that in the app
and killing itself? Mesa in the app would know if robustness API is
being used.
It would be really nice to have the answers to all these questions
collected and reiterated in the next version of this proposal.
Thanks,
pq
