[PATCH v9 1/4] drm: Introduce device wedged event

Aravind Iddamsetty aravind.iddamsetty at linux.intel.com
Mon Nov 25 05:56:45 UTC 2024


On 22/11/24 21:32, Raag Jadav wrote:
> On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
>> Am 22.11.24 um 08:07 schrieb Raag Jadav:
>>> On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
>>>> On 15/11/24 10:37, Raag Jadav wrote:
>>>>> Introduce device wedged event, which notifies userspace of 'wedged'
>>>>> (hung/unusable) state of the DRM device through a uevent. This is
>>>>> useful especially in cases where the device is no longer operating as
>>>>> expected and has become unrecoverable from driver context. Purpose of
>>>>> this implementation is to provide drivers a generic way to recover with
>>>>> the help of userspace intervention without taking any drastic measures
>>>>> in the driver.
>>>>>
>>>>> A 'wedged' device is basically a dead device that needs attention. The
>>>>> uevent is the notification that is sent to userspace along with a hint
>>>>> about what could possibly be attempted to recover the device and bring
>>>>> it back to usable state. Different drivers may have different ideas of
>>>>> a 'wedged' device depending on their hardware implementation, and hence
>>>>> the vendor agnostic nature of the event. It is up to the drivers to
>>>>> decide when they see the need for recovery and how they want to recover
>>>>> from the available methods.
>>>>>
>>>>> Prerequisites
>>>>> -------------
>>>>>
>>>>> The driver, before opting for recovery, needs to make sure that the
>>>>> 'wedged' device doesn't harm the system as a whole by taking care of the
>>>>> prerequisites. Necessary actions must include disabling DMA to system
>>>>> memory as well as any communication channels with other devices. Further,
>>>>> the driver must ensure that all dma_fences are signalled and any device
>>>>> state that the core kernel might depend on is cleaned up. Once the event
>>>>> is sent, the device must be kept in 'wedged' state until the recovery is
>>>>> performed. New accesses to the device (IOCTLs) should be blocked,
>>>>> preferably with an error code that resembles the type of failure the
>>>>> device has encountered. This will signify the reason for wedging which
>>>>> can be reported to the application if needed.
>>>> should we even drop the mmaps we created?
>>> Whatever is required for a clean recovery, yes.
>>>
>>> Although how would this play out? Do we risk losing display?
>>> Or any other possible side-effects?
>> Before sending a wedge event all DMA transfers of the device have to be
>> blocked.
>>
>> So yes, all display, mmap() and file descriptor connections you had with the
>> device would need to be re-created.
> Does it mean we'd have to rely on userspace to munmap()?


I'm not sure about display, but at least all user mappings can be destroyed
using drm_vma_node_unmap().
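
To make the intended ordering concrete, here is a rough userspace model of
the behaviour discussed above: user mappings are torn down when the device
is marked wedged, and new requests then fail with the recorded error. This
is only a sketch and not kernel code. Every name in it (model_dev,
model_dev_set_wedged, model_dev_ioctl, MAX_MAPPINGS) is hypothetical, and
the per-mapping flag merely stands in for what a driver would do with
drm_vma_node_unmap().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MAPPINGS 4	/* hypothetical, arbitrary size for the model */

/* Hypothetical userspace model of a device; not a real DRM structure. */
struct model_dev {
	bool wedged;				/* set once, until recovery */
	int wedge_errno;			/* positive errno describing the failure */
	bool mapping_valid[MAX_MAPPINGS];	/* stand-in for user mappings */
};

static void model_dev_init(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->wedged = false;
	dev->wedge_errno = 0;
	for (size_t i = 0; i < MAX_MAPPINGS; i++)
		dev->mapping_valid[i] = true;
}

/*
 * Mark the device wedged: invalidate every user mapping first, then
 * record the error so later requests can report why the device is gone.
 */
static void model_dev_set_wedged(struct model_dev *dev, int err)
{
	assert(dev != NULL && err > 0);
	for (size_t i = 0; i < MAX_MAPPINGS; i++)
		dev->mapping_valid[i] = false;
	dev->wedge_errno = err;
	dev->wedged = true;
}

/*
 * Entry point for new requests: returns 0 while the device is healthy
 * and the negative recorded errno while it is wedged.
 */
static int model_dev_ioctl(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->wedged)
		return -dev->wedge_errno;
	return 0;
}
```

With this model, a request issued after model_dev_set_wedged(&dev, EIO)
returns -EIO and no mapping remains valid, so userspace has to re-create
its mappings after recovery rather than being relied upon to unmap them.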

Thanks,
Aravind.
>
> Raag

