[PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Thu Dec 10 02:25:38 UTC 2020

[AMD Public Use]

Yeah, we discussed this issue again. We think it is better for the UMD to make some changes instead of changing the KMD. If an FLR happened, it is OK for the UMD to create a new context and continue to submit jobs.
However, if a BACO or mode 1 reset happens, the UMD can of course still submit jobs, but these jobs can't use any resource created before the reset, including page tables,
because all contents of VRAM, including the application's buffers, are lost after a BACO or mode 1 reset.
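
To make the UMD-side distinction concrete, here is a minimal sketch of the recovery decision described above. It is an illustration only: the enum names and `umd_recover()` are hypothetical and are not part of libdrm or of any real UMD, and how the UMD learns which kind of reset occurred is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reset kinds, named after the ones discussed in this thread. */
enum reset_kind { RESET_NONE, RESET_FLR, RESET_BACO, RESET_MODE1 };

/* Hypothetical recovery actions a UMD could take after a failed submit. */
enum recovery {
    RECOVER_NOTHING,                 /* submit succeeded */
    RECOVER_NEW_CONTEXT,             /* recreate the context, keep buffers */
    RECOVER_NEW_CONTEXT_AND_BUFFERS, /* recreate context and all VRAM-backed resources */
    RECOVER_FAIL                     /* unrelated error, not handled here */
};

/* Assumption taken from the message above: BACO and mode 1 lose VRAM, FLR does not. */
static bool vram_contents_lost(enum reset_kind kind)
{
    return kind == RESET_BACO || kind == RESET_MODE1;
}

/* Map the submit result and the reset kind to a recovery action. */
static enum recovery umd_recover(int submit_err, enum reset_kind kind)
{
    if (submit_err == 0)
        return RECOVER_NOTHING;
    if (submit_err != -ECANCELED)
        return RECOVER_FAIL;
    return vram_contents_lost(kind) ? RECOVER_NEW_CONTEXT_AND_BUFFERS
                                    : RECOVER_NEW_CONTEXT;
}
```

With this split the kernel keeps returning -ECANCELED in both cases, and the decision to rebuild buffers and page tables stays in the UMD.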

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com> 
Sent: Wednesday, December 9, 2020 6:06 PM
To: Liu, Cheng Zhe <ChengZhe.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Wang, Kevin(Yang) <Kevin1.Wang at amd.com>; Yuan, Xiaojie <Xiaojie.Yuan at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Return -EINVAL when whole gpu reset happened

Am 09.12.20 um 10:46 schrieb Liu ChengZhe:
> If CS init returns -ECANCELED, UMD will free the context and create a new one.
> Jobs in this new context could continue executing. In the case of BACO
> or mode 1, we can't allow this to happen. Because VRAM is lost after a
> whole GPU reset, the job can't be guaranteed to succeed.

NAK, this is intentional.

When ECANCELED is returned, the UMD should create a new context after a GPU reset to get back into a usable state and continue to submit jobs.

Regards,
Christian.

>
> Signed-off-by: Liu ChengZhe <ChengZhe.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 85e48c29a57c..2a98f58134ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -120,6 +120,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   	uint64_t *chunk_array;
>   	unsigned size, num_ibs = 0;
>   	uint32_t uf_offset = 0;
> +	uint32_t vramlost_count = 0;
>   	int i;
>   	int ret;
>   
> @@ -140,7 +141,11 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   
>   	/* skip guilty context job */
>   	if (atomic_read(&p->ctx->guilty) == 1) {
> -		ret = -ECANCELED;
> +		vramlost_count = atomic_read(&p->adev->vram_lost_counter);
> +		if (p->ctx->vram_lost_counter != vramlost_count)
> +			ret = -EINVAL;
> +		else
> +			ret = -ECANCELED;
>   		goto free_chunk;
>   	}
>   
> @@ -246,7 +251,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>   		goto free_all_kdata;
>   
>   	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
> -		ret = -ECANCELED;
> +		ret = -EINVAL;
>   		goto free_all_kdata;
>   	}
>
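
For readers following the diff, the error-code selection proposed in the guilty-context hunk can be modeled as a small standalone function. This is only a sketch of the proposed (and NAKed) behavior: `struct model_ctx` and `proposed_guilty_result()` are simplified, hypothetical stand-ins, not the real kernel structures or functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the context fields the diff reads. */
struct model_ctx {
    bool guilty;                /* models atomic_read(&ctx->guilty) == 1 */
    uint32_t vram_lost_counter; /* counter value recorded in the context */
};

/*
 * Model of the proposed change: a guilty context gets -EINVAL when the
 * device-wide VRAM-lost counter has moved past the context's value,
 * and -ECANCELED otherwise. Non-guilty contexts are not affected here.
 */
static int proposed_guilty_result(const struct model_ctx *ctx,
                                  uint32_t dev_vram_lost_counter)
{
    if (!ctx->guilty)
        return 0;
    if (ctx->vram_lost_counter != dev_vram_lost_counter)
        return -EINVAL;
    return -ECANCELED;
}
```

Under the existing behavior, which the NAK keeps, both guilty cases return -ECANCELED.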