[PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Tue Jan 19 14:33:36 UTC 2021
On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings timeout simultaneously.
>
> If two rings time out at the same time, amdgpu_device_gpu_recover
> will be reentered. The adev->gmc.xgmi.head node will then be grabbed
> by two local linked lists, which may cause a wild-pointer issue
> during iteration.
>
> Lock the device early to prevent the node from being added to two
> different lists.
>
> Signed-off-by: Horace Chen <horace.chen at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
> 1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> int i, r = 0;
> bool need_emergency_restart = false;
> bool audio_suspended = false;
> + bool get_dev_lock = false;
>
> /*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list
> * to put adev in the 1st position.
> + *
> + * Lock the device before we operate on the linked list.
> + * If we didn't get the device lock, don't touch the list,
> + * since others may be iterating it.
> */
> INIT_LIST_HEAD(&device_list);
> if (adev->gmc.xgmi.num_physical_nodes > 1) {
> if (!hive)
> return -ENODEV;
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> - device_list_handle = &hive->device_list;
> +
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> + if (!get_dev_lock)
> + break;
What about unlocking all the devices you already locked if the break
happens in the middle of the iteration? Note that at skip_recovery: we
don't do that. BTW, I see this issue is already present in the current
code.
Also, maybe it's now better to separate the actual locking in
amdgpu_device_lock_adev from the other work going on there, since I
don't think you would want to toggle state like adev->mp1_state back
and forth, and the function name is not descriptive of that other work
anyway.
Andrey
> + }
> + if (get_dev_lock) {
> + if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> + device_list_handle = &hive->device_list;
> + }
> } else {
> - list_add_tail(&adev->gmc.xgmi.head, &device_list);
> - device_list_handle = &device_list;
> + get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> + tmp_adev = adev;
> + if (get_dev_lock) {
> + list_add_tail(&adev->gmc.xgmi.head, &device_list);
> + device_list_handle = &device_list;
> + }
> + }
> +
> + if (!get_dev_lock) {
> + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> + job ? job->base.id : -1);
> + r = 0;
> + /* even though we skipped this reset, still need to set the job to guilty */
> + goto skip_recovery;
> }
>
> /* block all schedulers and reset given job's ring */
> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> - if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> - job ? job->base.id : -1);
> - r = 0;
> - goto skip_recovery;
> - }
> -
> /*
> * Try to put the audio codec into suspend state
> * before gpu reset started.