[PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover

Thu Oct 26 14:05:47 UTC 2023

Am 26.10.23 um 15:34 schrieb Yifan Zhang:
> gpu tlb flush is skipped if reset sem is held, it makes
> mes_self_test fail since it involves add_hw_queue/remove_hw_queue
> which needs tlb flush functional. Remove mes_self_test in gpu
> recover sequence.

Oh, the TLB issue is actually only the tip of the iceberg. That only 
worked by coincident.

During GPU reset you are not allowed to make any memory allocation or 
otherwise you can run into a deadlock. So doing a MES test is a complete 
no-go in the first place.

>
> This patch is to fix the recover failure in gfx11.
>
> [ 1831.768292] [drm] ring sdma_32769.3.3 was added
> [ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
> [ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
> [ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process  pid 0 thread  pid 0)
> [ 1831.768434] amdgpu 0000:c2:00.0: amdgpu:   in page starting at address 0x0000aec200000000 from client 10
> [ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
> [ 1831.768473] amdgpu 0000:c2:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
> [ 1831.768489] amdgpu 0000:c2:00.0: amdgpu:      MORE_FAULTS: 0x0
> [ 1831.768501] amdgpu 0000:c2:00.0: amdgpu:      WALKER_ERROR: 0x0
> [ 1831.768513] amdgpu 0000:c2:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
> [ 1831.768521] amdgpu 0000:c2:00.0: amdgpu:      MAPPING_ERROR: 0x0
> [ 1831.768529] amdgpu 0000:c2:00.0: amdgpu:      RW: 0x0
> [ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
> [ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
> [ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3
>
> Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
> Reported-by: Li Ma <li.ma at amd.com>
> Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>

Reviewed-by: Christian König <christian.koenig at amd.com>

Thanks for looking into this,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7ec32b44df05..5169e55b7fd2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			drm_sched_start(&ring->sched, true);
>   		}
>   
> -		if (adev->enable_mes &&
> -		    amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
> -			amdgpu_mes_self_test(tmp_adev);
> -
>   		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
>   			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
>