[PATCH 1/2] drm/amdgpu: Handle xgmi device removal and add reset wq.

Fri Nov 30 15:14:24 UTC 2018

On 11/30/2018 04:03 AM, Christian König wrote:
> Am 29.11.18 um 21:36 schrieb Andrey Grodzovsky:
>> XGMI hive has some resources allocted on device init which
>> needs to be deallocated when the device is unregistered.
>>
>> Add per hive wq to allow all the nodes in hive to run resets
>> concurently - this should speed up the total reset time to avoid
>> breaching the PSP FW timeout.
>
> Do you really want a workqueue? That sounds like the exact opposite of 
> what you describe here.
>
> E.g. a workqueue is there to serialize all work items scheduled to it.
>
> What you most likely rather want is a work item per device which are 
> scheduled, run in parallel and are joined back together before 
> dropping the per hive lock.
>
> Christian.

As you can see in second patch that exactly how I use this wq. I define 
per adev work item all of which I enqueue during GPU reset routine.
I believe the wq I created will processes work items in parallel since 
it's is served by per CPU worker thread (thread pool) while  a 
serializing wq is
created  using the 'create_singlethread_workqueue' interface which 
creates an ordered wq which process at most one item simultaneously.
I did have my doubts about creating a dedicated wq instead of reusing 
existing system wide wq such as system_unbound_wq because this adds new
kernel threads but eventually I opted for my own queue because there is 
a hard requirement from the PSP FW to complete all ASIC resets in  
defined finite amount
of time (1s i believe)   otherwise some nodes in the XGMI hive might 
switch to single GPU mode due to timeout.
If I rely on system wide wq then during stress times this queue might 
fail to process my work items within that time limit because it's 
overloaded by work items from other queue clients.
Maybe I should squash the 2 changes into one to make all of this more 
clear ?

Andrey
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 ++++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |  3 +++
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> index fb37e69..9ac2dc5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> @@ -61,6 +61,8 @@ struct amdgpu_hive_info 
>> *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>>       INIT_LIST_HEAD(&tmp->device_list);
>>       mutex_init(&tmp->hive_lock);
>>   +    tmp->reset_queue = alloc_workqueue("xgmi-hive", WQ_UNBOUND | 
>> WQ_HIGHPRI, 0);
>> +
>>       return tmp;
>>   }
>>   @@ -135,3 +137,25 @@ int amdgpu_xgmi_add_device(struct 
>> amdgpu_device *adev)
>>       mutex_unlock(&xgmi_mutex);
>>       return ret;
>>   }
>> +
>> +void amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
>> +{
>> +    struct amdgpu_hive_info *hive;
>> +
>> +    if ((adev->asic_type < CHIP_VEGA20) || (adev->flags & AMD_IS_APU))
>> +        return;
>> +
>> +    mutex_lock(&xgmi_mutex);
>> +
>> +    hive = amdgpu_get_xgmi_hive(adev);
>> +    if (!hive)
>> +        goto exit;
>> +
>> +    if (!(hive->number_devices--)) {
>> +        mutex_destroy(&hive->hive_lock);
>> +        destroy_workqueue(hive->reset_queue);
>> +    }
>> +
>> +exit:
>> +    mutex_unlock(&xgmi_mutex);
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 6335bfd..285ab93 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -30,10 +30,13 @@ struct amdgpu_hive_info {
>>       struct psp_xgmi_topology_info    topology_info;
>>       int number_devices;
>>       struct mutex hive_lock;
>> +    /* hive members reset wq */
>> +    struct workqueue_struct *reset_queue;
>>   };
>>     struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
>> amdgpu_device *adev);
>>   int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, 
>> struct amdgpu_device *adev);
>>   int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>> +void amdgpu_xgmi_remove_device(struct amdgpu_device *adev);
>>     #endif
>