[PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Mon Oct 9 08:35:51 UTC 2017

Please be aware that this policy is what the strict mode defined and what customer want, 
And also please check VK spec, it defines that after GPU reset all vk INSTANCE should close/release its resource/device/ctx and all buffers, and call re-initvkinstance after gpu reset

So this whole approach is what just aligned with the spec, and to not influence with current MESA/OGL client that's why I put the whole approach into the strict mode
And by default strict mode is not selected 

BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] 
Sent: 2017年10月9日 16:26
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Am 30.09.2017 um 08:03 schrieb Monk Liu:
> for SRIOV strict mode gpu reset:
>
> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
> return -ENODEV in cs_ioctl or info_ioctl if they found
> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>
> this way we prevent a potential bad process/FD from submitting cmds 
> and notify userspace with -ENODEV.
>
> userspace should close all BO/ctx and re-open dri FD to re-create 
> virtual memory system for this process

The whole aproach is a NAK from my side.

We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.

Regards,
Christian.

>
> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>   3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index de9c164..b40d4ba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>   	struct idr		bo_list_handles;
>   	struct amdgpu_ctx_mgr	ctx_mgr;
>   	u32			vram_lost_counter;
> +	int gpu_reset_counter;
>   };
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 9467cf6..6a1515e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	parser.adev = adev;
>   	parser.filp = filp;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 282f45b..bd389cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	switch (info->query) {
>   	case AMDGPU_INFO_ACCEL_WORKING:
>   		ui32 = adev->accel_working;
> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   		goto out_suspend;
>   	}
>   
> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
> +
>   	r = amdgpu_vm_init(adev, &fpriv->vm,
>   			   AMDGPU_VM_CONTEXT_GFX, 0);
>   	if (r) {