[PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset

Liu, HaoPing (Alan) HaoPing.Liu at amd.com
Thu Mar 30 09:15:07 UTC 2023


[AMD Official Use Only - General]


Hi Christian,



Thanks for the review. Please see inline.



Best Regards,

Alan



-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com>
Sent: Tuesday, March 28, 2023 7:16 PM
To: Liu, HaoPing (Alan) <HaoPing.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Lakha, Bhawanpreet <Bhawanpreet.Lakha at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset



Am 27.03.23 um 17:20 schrieb Alan Liu:

> [Why]

> After gpu-reset, sometimes the driver would fail to enable vblank irq,

> causing flip_done timed out and the desktop freezed.

>

> During gpu-reset, we will disable and enable vblank irq in

> dm_suspend() and dm_resume(). Later on in

> amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and decide to enable or disable the irqs again.

>

> However, we have 2 sets of API for controling vblank irq, one is

> dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has

> its own refcount and flag to store the state of vblank irq, and they

> are not synchronized.



This is the source of the problem and you should address this instead.

The change you suggested below would break in some use cases.



In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm layer, and in struct amdgpu_irq_src we have enabled_types as refcount for each irq controlled by the dm.

I think the best solution will be to get rid of the refcount in drm and only maintain the dm one, and add a crtc function for the drm layer to get the refcount/state of vblank.

But this may be dangerous since it's a change in drm layer. Do you have any comments?



>

> In drm we use the first API to control vblank irq but in

> amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.

>

> The failure happens when vblank irq was enabled by dm_vblank_get()

> before gpu-reset, we have vblank->enabled true. However, during

> gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state

> checked from

> amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq

> again. After gpu-reset, if there is a cursor plane commit, the driver

> will try to enable vblank irq by calling drm_vblank_enable(), but the

> vblank->enabled is still true, so it fails to turn on vblank irq and

> causes flip_done can't be completed in vblank irq handler and desktop

> become freezed.

>

> [How]

> Combining the 2 vblank control APIs by letting drm's API finally calls

> amdgpu_irq's API, so the irq's refcount and state of both APIs can be

> synchronized. Also add a check to prevent refcount from being less

> then

> 0 in amdgpu_irq_put().

>

> Signed-off-by: Alan Liu <HaoPing.Liu at amd.com<mailto:HaoPing.Liu at amd.com>>

> ---

>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c            |  3 +++

>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++----

>   2 files changed, 13 insertions(+), 4 deletions(-)

>

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c

> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c

> index a6aef488a822..1b66003657e2 100644

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c

> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c

> @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct amdgpu_irq_src *src,

>            if (!src->enabled_types || !src->funcs->set)

>                           return -EINVAL;

>

> +         if (!amdgpu_irq_enabled(adev, src, type))

> +                       return 0;

> +



That is racy and won't work. The intention of amdgpu_irq_update() is to always update the irq state, no matter what the status is.



This is a change to amdgpu_irq_put() to prevent the refcount from being cut to less than 0. Does it break the intention of amdgpu_irq_update()?



Regards,

Christian.



>            if (atomic_dec_and_test(&src->enabled_types[type]))

>                           return amdgpu_irq_update(adev, src, type);

>

> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c

> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c

> index dc4f37240beb..e04f846b0b19 100644

> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c

> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c

> @@ -146,7 +146,7 @@ static void vblank_control_worker(struct

> work_struct *work)

>

>   static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)

>   {

> -          enum dc_irq_source irq_source;

> +         int irq_type;

>            struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);

>            struct amdgpu_device *adev = drm_to_adev(crtc->dev);

>            struct dm_crtc_state *acrtc_state = to_dm_crtc_state(crtc->state);

> @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)

>            if (rc)

>                           return rc;

>

> -          irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst;

> +         irq_type = amdgpu_display_crtc_idx_to_irq_type(adev,

> +acrtc->crtc_id);

> +

> +         if (enable)

> +                       rc = amdgpu_irq_get(adev, &adev->crtc_irq, irq_type);

> +

> +         else

> +                       rc = amdgpu_irq_put(adev, &adev->crtc_irq, irq_type);

>

> -          if (!dc_interrupt_set(adev->dm.dc, irq_source, enable))

> -                        return -EBUSY;

> +         if (rc)

> +                       return rc;

>

>   skip:

>            if (amdgpu_in_reset(adev))


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230330/0727d347/attachment-0001.htm>


More information about the amd-gfx mailing list