[PATCH 5/5] drm/amdgpu: reduce reset time
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Jul 26 16:57:06 UTC 2022
On 2022-07-26 05:40, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Reply inline.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: Tuesday, July 26, 2022 5:18 AM
> To: Zhao, Victor <Victor.Zhao at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Deng, Emily <Emily.Deng at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time
>
>
> On 2022-07-22 03:34, Victor Zhao wrote:
>> In the multi-container use case, reset time is important, so skip the
>> ring tests and the CP halt wait during IP suspend for reset, as they
>> are going to fail and only add time to the reset
>
> Why are they failing in this case? Skipping the ring tests is not the best idea, as you lose an important indicator of the system's sanity. Is there any way to make them work?
>
> [Victor]: I've seen the gfx ring test fail every time after a gfx engine hang. I thought that should be expected, as gfx is in a bad state. Do you know the reason we run ring tests before reset? We are going to reset the asic anyway.
> Another approach could be to make the skip mode2-only, or to reduce the wait time here.
I dug down in history, and according to commit 'drm/amdgpu:unmap KCQ in
gfx hw_fini(v2)' you need to write to a scratch register for completion
of the queue unmap operation, so you definitely don't want to just skip
it. I agree that in case the ring is hung this has no point, but
remember that GPU reset can happen not only because of a hung ring but
for other reasons as well (RAS, manual reset, etc.), in which case you
probably want to shut down gracefully here.
I see we have the adev->ip_blocks[i].status.hang flag, which you could
maybe use here instead?
>
>
>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 26 +++++++++++++++----------
>> 2 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 222d3d7ea076..f872495ccc3a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>> kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>> RESET_QUEUES, 0, 0);
>>
>> - if (adev->gfx.kiq.ring.sched.ready)
>> + if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
>> r = amdgpu_ring_test_helper(kiq_ring);
>> spin_unlock(&adev->gfx.kiq.ring_lock);
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index fafbad3cf08d..9ae29023e38f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
>> WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>> }
>>
>> - for (i = 0; i < adev->usec_timeout; i++) {
>> - if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> - break;
>> - udelay(1);
>> - }
>> -
>> - if (i >= adev->usec_timeout)
>> - DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
>> + if (!amdgpu_in_reset(adev)) {
>> + for (i = 0; i < adev->usec_timeout; i++) {
>> + if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> + break;
>> + udelay(1);
>> + }
>>
>> + if (i >= adev->usec_timeout)
>> + DRM_ERROR("failed to %s cp gfx\n",
>> + enable ? "unhalt" : "halt");
>> + }
>> return 0;
>> +
>> }
>
> This change has impact beyond the container case, no? We had no issues with this code during regular reset cases, so why would we give up on this code, which confirms the CP is idle? What is the side effect of skipping this during all GPU resets?
>
> Andrey
>
> [Victor]: I see "failed to halt cp gfx" with regular reset cases as well when doing a gfx hang test using quark. I haven't seen a side effect with mode1 reset yet, but maybe shortening the wait time could be better?
Same as above, I guess: it would indeed time out for a hung ring, but
GPU reset happens not only because of hung rings but for other reasons
as well.

Andrey
>
>>
>> static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct amdgpu_device
>> *adev) @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
>> for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>> kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>> PREEMPT_QUEUES, 0, 0);
>> -
>> - return amdgpu_ring_test_helper(kiq_ring);
>> + if (!amdgpu_in_reset(adev))
>> + return amdgpu_ring_test_helper(kiq_ring);
>> + else
>> + return 0;
>> }
>> #endif
>>
>> @@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>>
>> return 0;
>> }
>> +
>> gfx_v10_0_cp_enable(adev, false);
>> gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>>