[PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Thu Jan 14 14:40:16 UTC 2021
On 1/14/21 8:37 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings timeout simultaneously.
>
> If 2 rings timed out at the same time, the
> amdgpu_device_gpu_recover will be reentered. Then the
> adev->gmc.xgmi.head will be grabbed by 2 local linked list,
> which may cause wild pointer issue in iterating.
>
> lock the device earily to prevent the node be added to 2
> different lists.
>
> for xgmi there is a hive lock which can promise there won't have
> 2 devices on same hive reenter the interation. So xgmi doesn't
> need to lock the device.
Note that amdgpu_device_lock_adev does bunch of other stuff besides taking
the lock and I don't think we want to skip them for the other devices in case of
XGMI.
Andrey
>
> Signed-off-by: Horace Chen <horace.chen at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..a28e138ac72c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4591,19 +4591,20 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> device_list_handle = &hive->device_list;
> } else {
> + /* if current dev is already in reset, skip adding list to prevent race issue */
> + if (!amdgpu_device_lock_adev(adev, hive)) {
> + dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> + job ? job->base.id : -1);
> + r = 0;
> + goto skip_recovery;
> + }
> +
> list_add_tail(&adev->gmc.xgmi.head, &device_list);
> device_list_handle = &device_list;
> }
>
> /* block all schedulers and reset given job's ring */
> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> - if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> - job ? job->base.id : -1);
> - r = 0;
> - goto skip_recovery;
> - }
> -
> /*
> * Try to put the audio codec into suspend state
> * before gpu reset started.
More information about the amd-gfx
mailing list