[PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get lost in gpu reset

Mon Sep 30 09:24:52 UTC 2019

Am 30.09.19 um 11:06 schrieb Zhou1, Tao:
> Hi Christian:
>
> So the allocation information of a bo's vram page could be reserved after vram lost?

Yes, exactly. VRAM lost just means that we are returning an error code 
to userspace when they try to use a context created before VRAM was lost.

> In fact, this series could be dropped if the result of amdgpu_bo_create_reserved is guaranteed to be failed after vram lost (protect bad pages from reallocating).

The BOs are not freed in any way, so I think Andrey meant something else.

Regards,
Christian.

>
> Hi Andrey:
>
> A bad page's allocation information will not be lost in gpu reset according to Christian's comments. Do you have any other concern?
>
> Regards,
> Tao
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: 2019年9月30日 16:35
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org;
>> Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun
>> <Guchun.Chen at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>> Subject: Re: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get
>> lost in gpu reset
>>
>> Am 30.09.19 um 05:19 schrieb Zhou1, Tao:
>>> the info of retired page's bo may be lost if vram lost is encountered
>>> in gpu reset (gpu page table in vram may be lost), force to recreate
>>> all bos
>>>
>>> v2: simplify NULL pointer check
>>>       add more comments
>> Repeating on the v2 of the patch set, this is complete nonsense.
>>
>> When VRAM is lost the BOs and their reservation still exits, only the content
>> is lost because the MC is has been resetted.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 48
>> ++++++++++++++++++++++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
>>>    3 files changed, 50 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 62955156653c..a89f6d053b3f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3675,6 +3675,7 @@ static int amdgpu_do_asic_reset(struct
>> amdgpu_hive_info *hive,
>>>    				if (vram_lost) {
>>>    					DRM_INFO("VRAM is lost due to GPU
>> reset!\n");
>>>    					amdgpu_inc_vram_lost(tmp_adev);
>>> +
>> 	amdgpu_ras_recovery_reset(tmp_adev);
>>>    				}
>>>
>>>    				r = amdgpu_gtt_mgr_recover(
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 486568ded6d6..3f2a2f13e4c6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1573,6 +1573,54 @@ static int amdgpu_ras_recovery_fini(struct
>>> amdgpu_device *adev)
>>>
>>>    	return 0;
>>>    }
>>> +
>>> +/*
>>> + * the info of retired page's bo may be lost if vram lost is
>>> +encountered
>>> + * in gpu reset (gpu page table in vram may be lost), force to
>>> +recreate
>>> + * all bos
>>> + */
>>> +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev) {
>>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>> +	struct ras_err_handler_data *data;
>>> +	uint64_t bp;
>>> +	struct amdgpu_bo *bo = NULL;
>>> +	int i;
>>> +
>>> +	if (!con || !con->eh_data)
>>> +		return;
>>> +
>>> +	/* Note: It's called in gpu reset, recovery_lock may be acquired
>> before
>>> +	 * gpu reset. The following code is out of protect of con-
>>> recovery_lock
>>> +	 * in case of deadlock and bps_bo should be recreated (if successfully)
>>> +	 * whether recovery_lock is locked or not.
>>> +	 * We assume there is no other ras recovery operation simultaneous
>> during
>>> +	 * gpu reset.
>>> +	 */
>>> +	data = con->eh_data;
>>> +	/* force to reserve all retired page again */
>>> +	data->last_reserved = 0;
>>> +
>>> +	for (i = data->last_reserved; i < data->count; i++) {
>>> +		bp = data->bps[i].retired_page;
>>> +
>>> +		/* There are three cases of reserve error should be ignored:
>>> +		 * 1) a ras bad page has been allocated (used by someone);
>>> +		 * 2) a ras bad page has been reserved (duplicate error
>> injection
>>> +		 *    for one page);
>>> +		 * 3) bo info isn't lost in gpu reset
>>> +		 */
>>> +		if (amdgpu_bo_create_kernel_at(adev, bp <<
>> AMDGPU_GPU_PAGE_SHIFT,
>>> +					       AMDGPU_GPU_PAGE_SIZE,
>>> +					       AMDGPU_GEM_DOMAIN_VRAM,
>>> +					       &bo, NULL))
>>> +			DRM_NOTE("RAS NOTE: recreate bo for retired page
>> 0x%llx fail\n", bp);
>>> +		else
>>> +			data->bps_bo[i] = bo;
>>> +		data->last_reserved = i + 1;
>>> +		bo = NULL;
>>> +	}
>>> +}
>>>    /* recovery end */
>>>
>>>    /* return 0 if ras will reset gpu and repost.*/ diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index f80fd3428c98..7a606d3be806 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -479,6 +479,7 @@ static inline int amdgpu_ras_is_supported(struct
>> amdgpu_device *adev,
>>>    }
>>>
>>>    int amdgpu_ras_recovery_init(struct amdgpu_device *adev);
>>> +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev);
>>>    int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>>>    		unsigned int block);
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx