[PATCH v5] drm/amd/amdgpu: Fix compute ring unable to detect hang.

Christian König ckoenig.leichtzumerken at gmail.com
Thu Sep 19 12:12:15 UTC 2019


On 19.09.19 at 12:09, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that when there is a true hardware hang, the compute ring can
> detect it.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
>   2 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>   
>   	amdgpu_device_check_block_size(adev);
>   
> -	ret = amdgpu_device_get_job_timeout_settings(adev);
> -	if (ret) {
> -		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> -		return ret;
> -	}
> -
>   	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>   
>   	return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> +	r = amdgpu_device_get_job_timeout_settings(adev);
> +	if (r) {
> +		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> +		return r;
> +	}
> +

I assume you moved the code because the SR-IOV/passthrough setting was
not yet available at the previous call site?

But even with this change you can still remove the extra SR-IOV check in
amdgpu_fence.c.
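
A minimal sketch of that idea, using hypothetical names rather than the
real amdgpu structures (struct dev_timeouts, resolve_compute_timeout() and
ring_timeout() are all made up for illustration): once the settings code
has resolved the compute default, the per-ring lookup should need no
virtualization check of its own.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's "wait forever" timeout. */
#define NO_TIMEOUT LONG_MAX

/* Hypothetical, trimmed-down stand-in for the device state. */
struct dev_timeouts {
	bool virtualized;      /* SR-IOV VF or passthrough */
	long gfx_timeout;
	long compute_timeout;  /* NO_TIMEOUT unless overridden */
};

enum ring_type { RING_GFX, RING_COMPUTE };

/*
 * Settings stage: decide the compute default once. A value the user
 * set explicitly is never overwritten.
 */
static void resolve_compute_timeout(struct dev_timeouts *d,
				    bool user_set_compute)
{
	if (!user_set_compute && d->virtualized)
		d->compute_timeout = d->gfx_timeout;
}

/* Ring setup stage: a plain lookup, no SR-IOV special case left. */
static long ring_timeout(const struct dev_timeouts *d, enum ring_type type)
{
	return type == RING_COMPUTE ? d->compute_timeout : d->gfx_timeout;
}
```

With this split, a virtualized device without a user-supplied compute
value gets the gfx timeout from ring_timeout(), while bare metal keeps
NO_TIMEOUT.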

Regards,
Christian.

>   	/* doorbell bar mapping and doorbell index init*/
>   	amdgpu_device_doorbell_init(adev);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..1236245 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>   		}
>   		/*
>   		 * There is only one value specified and
> -		 * it should apply to all non-compute jobs.
> +		 * it should apply to all jobs.
>   		 */
>   		if (index == 1)
>   			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> +			if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> +				adev->compute_timeout = adev->gfx_timeout;
>   	}
>   
>   	return ret;
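
Note on the hunk above: as quoted, "if (index == 1)" has no braces, so
only the sdma/video assignment is guarded by it. The added
SR-IOV/passthrough check would then run for any number of values and
could overwrite a compute timeout the user set explicitly. Assuming the
indentation reflects the intent, both statements need to share one block.
A small sketch with hypothetical names (struct timeouts and both helpers
are illustrative, not driver code) showing the two readings:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the four per-engine timeout fields. */
struct timeouts { long gfx, sdma, video, compute; };

/* How the compiler parses the hunk as quoted (no braces). */
static void apply_as_quoted(struct timeouts *t, int index, bool virtualized)
{
	if (index == 1)
		t->sdma = t->video = t->gfx;
	/* Not guarded by index == 1, despite the indentation. */
	if (virtualized)
		t->compute = t->gfx;
}

/* Braced variant: compute inherits gfx only in the single-value case. */
static void apply_braced(struct timeouts *t, int index, bool virtualized)
{
	if (index == 1) {
		t->sdma = t->video = t->gfx;
		if (virtualized)
			t->compute = t->gfx;
	}
}
```

With four values given (index == 4) on a virtualized device,
apply_as_quoted() replaces the compute value with the gfx one, while
apply_braced() leaves all four untouched.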


