[PATCH 1/2] drm/amdgpu: Handle xgmi device removal and add reset wq.
Grodzovsky, Andrey
Andrey.Grodzovsky at amd.com
Fri Nov 30 15:14:24 UTC 2018
On 11/30/2018 04:03 AM, Christian König wrote:
> Am 29.11.18 um 21:36 schrieb Andrey Grodzovsky:
>> XGMI hive has some resources allocted on device init which
>> needs to be deallocated when the device is unregistered.
>>
>> Add per hive wq to allow all the nodes in hive to run resets
>> concurently - this should speed up the total reset time to avoid
>> breaching the PSP FW timeout.
>
> Do you really want a workqueue? That sounds like the exact opposite of
> what you describe here.
>
> E.g. a workqueue is there to serialize all work items scheduled to it.
>
> What you most likely rather want is a work item per device which are
> scheduled, run in parallel and are joined back together before
> dropping the per hive lock.
>
> Christian.
As you can see in second patch that exactly how I use this wq. I define
per adev work item all of which I enqueue during GPU reset routine.
I believe the wq I created will processes work items in parallel since
it's is served by per CPU worker thread (thread pool) while a
serializing wq is
created using the 'create_singlethread_workqueue' interface which
creates an ordered wq which process at most one item simultaneously.
I did have my doubts about creating a dedicated wq instead of reusing
existing system wide wq such as system_unbound_wq because this adds new
kernel threads but eventually I opted for my own queue because there is
a hard requirement from the PSP FW to complete all ASIC resets in
defined finite amount
of time (1s i believe) otherwise some nodes in the XGMI hive might
switch to single GPU mode due to timeout.
If I rely on system wide wq then during stress times this queue might
fail to process my work items within that time limit because it's
overloaded by work items from other queue clients.
Maybe I should squash the 2 changes into one to make all of this more
clear ?
Andrey
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 ++++++++++++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +++
>> 2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> index fb37e69..9ac2dc5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> @@ -61,6 +61,8 @@ struct amdgpu_hive_info
>> *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>> INIT_LIST_HEAD(&tmp->device_list);
>> mutex_init(&tmp->hive_lock);
>> + tmp->reset_queue = alloc_workqueue("xgmi-hive", WQ_UNBOUND |
>> WQ_HIGHPRI, 0);
>> +
>> return tmp;
>> }
>> @@ -135,3 +137,25 @@ int amdgpu_xgmi_add_device(struct
>> amdgpu_device *adev)
>> mutex_unlock(&xgmi_mutex);
>> return ret;
>> }
>> +
>> +void amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
>> +{
>> + struct amdgpu_hive_info *hive;
>> +
>> + if ((adev->asic_type < CHIP_VEGA20) || (adev->flags & AMD_IS_APU))
>> + return;
>> +
>> + mutex_lock(&xgmi_mutex);
>> +
>> + hive = amdgpu_get_xgmi_hive(adev);
>> + if (!hive)
>> + goto exit;
>> +
>> + if (!(hive->number_devices--)) {
>> + mutex_destroy(&hive->hive_lock);
>> + destroy_workqueue(hive->reset_queue);
>> + }
>> +
>> +exit:
>> + mutex_unlock(&xgmi_mutex);
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 6335bfd..285ab93 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -30,10 +30,13 @@ struct amdgpu_hive_info {
>> struct psp_xgmi_topology_info topology_info;
>> int number_devices;
>> struct mutex hive_lock;
>> + /* hive members reset wq */
>> + struct workqueue_struct *reset_queue;
>> };
>> struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct
>> amdgpu_device *adev);
>> int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive,
>> struct amdgpu_device *adev);
>> int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>> +void amdgpu_xgmi_remove_device(struct amdgpu_device *adev);
>> #endif
>
More information about the amd-gfx
mailing list