[PATCH 0/5] Add work pool to reset domain

Sat Aug 12 17:08:31 UTC 2023

On 8/12/2023 1:53 PM, Christian König wrote:
> Am 11.08.23 um 08:02 schrieb Lijo Lazar:
>> Presently, there are multiple clients of reset like RAS, job timeout, 
>> KFD hang
>> detection and debug method. Instead of each client maintaining a work 
>> item,
>> reset work pool is moved to reset domain. When a client makes a 
>> recovery request,
>> a work item is allocated by the reset domain and queued for execution. 
>> For the
>> case of job timeout, each ring has its own TDR queue to which tdr work is
>> scheduled. From there, it's further queued to a reset domain based on 
>> the device
>> configuration.
>>
>> This allows flexibility to have multiple reset domains. For example, when
>> there are partitions, each partition can maintain its own reset domain 
>> and a job
>> timeout on one partition doesn't affect jobs on the other partition 
>> (when the
>> jobs don't have any interdependency). The reset logic will select the
>> appropriate reset domain based on the current device configuration.
> 
> Well completely NAK to that design.
> 
> We intentionally added the workqueue to serialize *all* reset work and I 
> absolutely don't see any reason to change that.
> 

This is for the case where there are multiple spatial partitions and a 
reset is possible by hardware design on one partition without affecting 
other partitions on the same device. The partition scenario can be 
considered equivalent to a multi-gpu case (not interconnected through 
XGMI) where each gpu gets its own reset domain and can be reset 
independently.

BTW, this design doesn't restrict from keeping only one reset domain as 
in the case of legacy ASICs like Aldebaran. The reset work is always 
serialized within a domain. This allows to have multiple reset domains 
or you could also fall back to reset_domain1 -> reset_domain2 for 
hierarchical resets, if required (though that is not planned now).

Thanks,
Lijo

> Regards,
> Christian.
> 
>>
>> Lijo Lazar (5):
>>    drm/amdgpu: Add work pool to reset domain
>>    drm/amdgpu: Move to reset_schedule_work
>>    drm/amdgpu: Set flags to cancel all pending resets
>>    drm/amdgpu: Add API to queue and do reset work
>>    drm/amdgpu: Add TDR queue for ring
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  32 +++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |   1 -
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  24 +---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  40 +++----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  16 ++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  71 ++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 122 ++++++++++++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  32 +++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |   5 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  38 +++----
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  44 ++++----
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  33 +++---
>>   15 files changed, 285 insertions(+), 177 deletions(-)
>>
>