[PATCH v7] drm/doc: Document DRM device reset expectations

Wed Aug 23 18:07:21 UTC 2023

Hi Rodrigo,

On 23/08/2023 14:31, Rodrigo Vivi wrote:
> On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid at igalia.com>
>>
>> ---
>>
>> v7 changes:
>>   - s/application/graphical API context/ in the robustness part (Michel)
>>   - Grammar fixes (Randy)
>>
>> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>>
>> v6 changes:
>>   - Due to substantial changes in the content, dropped Pekka's Acked-by
>>   - Grammar fixes (Randy)
>>   - Add paragraph about disabling device resets
>>   - Add note about integrating reset tracking in drm/sched
>>   - Add note that KMD should return failure for contexts affected by
>>     resets and UMD should check for this
>>   - Add note about lack of consensus around what to do about non-robust
>>     apps
>>
>> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
>> ---
>>   Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 77 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3694bdb977f5 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>>   
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>> +faulty applications and everything in between the many layers. Some errors
>> +require resetting the device in order to make the device usable again. This
>> +section describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Device resets cannot be disabled without tainting the kernel, which can lead to
>> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace's role in
>> +device resets is to propagate the message to the application and apply any
>> +special policy for blocking guilty applications, if any. The corollary is that
>> +debugging a hung GPU context requires hardware support to be able to preempt such
>> +a GPU context while it's stopped.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and for
>> +performing it as needed. Usually a hang is detected when a job gets stuck
>> +executing. KMD should keep track of resets, because userspace can query the
>> +reset status for a specific context at any time. This is needed to propagate to
>> +the rest of the stack that a reset has happened. Currently, this is implemented
>> +by each driver separately, with no common DRM interface. Ideally this should be
>> +properly integrated into the DRM scheduler to provide a common ground for all
>> +drivers. After a reset, KMD should reject new command submissions for affected contexts.
> 
> Is there any consensus around what exactly 'affected contexts' might mean?
> I see i915 pinpoints only the context that was executing, with the head
> pointing at it, and doesn't blame the queued ones, while on Xe it looks like
> we are blaming all the queued contexts. Not sure what other drivers are doing
> for the 'affected contexts'.
> 

"Affected contexts" is indeed a generic term, given the differences
between drivers that you already pointed out. amdgpu also tends to affect
all queued contexts during a reset. This wording was chosen to fit how
the different drivers work.
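
To make the two interpretations concrete, here is a rough sketch of how
"affected contexts" could be selected under each policy. It is loosely
modelled on the behaviours described above, not on any driver's actual code:
every name in it (gpu_ctx, reset_blame_policy, mark_affected_contexts,
submit) is hypothetical, and the use of -EIO as the failure code is an
assumption, since the document only says that KMD should return a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical and exist only for illustration. */
enum reset_blame_policy {
	BLAME_EXECUTING_ONLY,	/* only the context that was running at the hang */
	BLAME_ALL_QUEUED,	/* the running context plus every queued one */
};

enum ctx_state { CTX_IDLE, CTX_QUEUED, CTX_EXECUTING };

struct gpu_ctx {
	enum ctx_state state;
	bool banned;		/* set once the context counts as affected */
};

/* Mark the contexts that a reset "affects" under the chosen policy. */
void mark_affected_contexts(struct gpu_ctx *ctxs, size_t n,
			    enum reset_blame_policy policy)
{
	assert(ctxs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		bool executing = ctxs[i].state == CTX_EXECUTING;
		bool queued = ctxs[i].state == CTX_QUEUED;

		if (executing || (policy == BLAME_ALL_QUEUED && queued))
			ctxs[i].banned = true;
	}
}

/*
 * Reject new submissions for affected contexts.
 * -EIO is an assumed error code, not one mandated by the document.
 */
int submit(const struct gpu_ctx *ctx)
{
	return ctx->banned ? -EIO : 0;
}
```

With one executing, one queued and one idle context, the first policy only
fails submissions for the executing context, while the second also fails them
for the queued one. The idle context is untouched in both cases.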