[PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset

Luben Tuikov luben.tuikov at amd.com
Thu Mar 30 13:01:31 UTC 2023


Hi Alan,

Inline:

On 2023-03-30 06:48, Christian König wrote:
> Am 30.03.23 um 11:15 schrieb Liu, HaoPing (Alan):
>>
>> [AMD Official Use Only - General]
>>
>>  
>>
>> Hi Christian,
>>
>>  
>>
>> Thanks for the review. Please see inline.
>>
>>  
>>
>> Best Regards,
>>
>> Alan
>>
>>  
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Tuesday, March 28, 2023 7:16 PM
>> To: Liu, HaoPing (Alan) <HaoPing.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Lakha, Bhawanpreet <Bhawanpreet.Lakha at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset
>>
>>  
>>
>> Am 27.03.23 um 17:20 schrieb Alan Liu:
>>
>> > [Why]
>>
>> > After gpu-reset, sometimes the driver would fail to enable vblank irq,
>>
>> > causing flip_done timed out and the desktop freezed.
>>
>>>>
>> > During gpu-reset, we will disable and enable vblank irq in
>>
>> > dm_suspend() and dm_resume(). Later on in
>>
>> > amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and decide to enable or disable the irqs again.
>>
>>>>
>> > However, we have 2 sets of API for controling vblank irq, one is
>>
>> > dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
>>
>> > its own refcount and flag to store the state of vblank irq, and they
>>
>> > are not synchronized.
>>
>>  
>>
>> This is the source of the problem and you should address this instead.
>>
>> The change you suggested below would break in some use cases.
>>
>>  
>>
>> In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm layer, and in struct amdgpu_irq_src we have enabled_types as refcount for each irq controlled by the dm.
>>
>> I think the best solution will be to get rid of the refcount in drm and only maintain the dm one, and add a crtc function for the drm layer to get the refcount/state of vblank.
>>
>> But this may be dangerous since it’s a change in drm layer. Do you have any comments?
>>
> 
> You don't necessarily need to remove it completely, what you can do as well is properly chaining of them.
> 
> E.g. when the DRM counter goes from 0->1 or 1->0 it calls some function to enable/disable the hw irq. In this situation you call amdgpu_irq_get()/amdgpu_irq_put() as appropriate.
> 
> The the code in this patch already looks like it goes into the right direction regarding that. It just seems to be that you have some race issues when you need to add checks that the IRQ counter doesn't goes below 0.

Changing the DRM layer is generally not a good idea, unless
there is a compelling reason to do so, like fixing a bug, or adding
a new feature benefiting all drivers. As there are many drivers using
DRM, any changes in DRM are vetted thoroughly and need a good reason to
take place.

I suggest you follow Christian's advice.

Note that there's already a callback from drm_vblank_get() down
to amdgpu_enable_vblank_kms() which calls amdgpu_irq_get(). Perhaps,
you can leverage that. Similarly for the drm_vblank_put() to
the amdgpu_vblank_put() path.

> 
>>  
>>
>>>>
>> > In drm we use the first API to control vblank irq but in
>>
>> > amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
>>
>>>>
>> > The failure happens when vblank irq was enabled by dm_vblank_get()
>>
>> > before gpu-reset, we have vblank->enabled true. However, during
>>
>> > gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state
>>
>> > checked from
>>
>> > amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
>>
>> > again. After gpu-reset, if there is a cursor plane commit, the driver
>>
>> > will try to enable vblank irq by calling drm_vblank_enable(), but the
>>
>> > vblank->enabled is still true, so it fails to turn on vblank irq and
>>
>> > causes flip_done can't be completed in vblank irq handler and desktop
>>
>> > become freezed.
>>
>>>>
>> > [How]
>>
>> > Combining the 2 vblank control APIs by letting drm's API finally calls
>>
>> > amdgpu_irq's API, so the irq's refcount and state of both APIs can be
>>
>> > synchronized. Also add a check to prevent refcount from being less
>>
>> > then
>>
>> > 0 in amdgpu_irq_put().
>>
>>>>
>> > Signed-off-by: Alan Liu <HaoPing.Liu at amd.com <mailto:HaoPing.Liu at amd.com>>
>>
>> > ---
>>
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c            |  3 +++
>>
>> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++----
>>
>> >   2 files changed, 13 insertions(+), 4 deletions(-)
>>
>>>>
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > index a6aef488a822..1b66003657e2 100644
>>
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct amdgpu_irq_src *src,
>>
>> >            if (!src->enabled_types || !src->funcs->set)
>>
>> >                           return -EINVAL;
>>
>> >  
>>
>> > +         if (!amdgpu_irq_enabled(adev, src, type))
>>
>> > +                       return 0;
>>
>> > +
>>
>>  
>>
>> That is racy and won't work. The intention of amdgpu_irq_update() is to always update the irq state, no matter what the status is.
>>
>>  
>>
>> This is a change to amdgpu_irq_put() to prevent the refcount from being cut to less than 0. Does it break the intention of amdgpu_irq_update()?
>>
> 
> Yes, exactly that. The reference count can *never* be below 0 or you have a bug in the caller.
> 
> What you could do is to add a WARN_ON(!amdgpu_irq_enabled(adev, src, type)), but just ignoring the call is an absolute no-go.
> 

In addition to adding the WARN_ON() as Christian suggested, and noting that
you cannot ignore the amdgpu_irq_put() call, perhaps investigate if you can
allow the decrement to take place and then check for negative--atomic_t is
an "int". If it is negative, then complain, dump the stack, set to 0 and return.
-- 
Regards,
Luben


> Regards,
> Christian.
> 
> PS: Please don't use HTML formating when discussing on public mailing lists.
> 
>>  
>>
>> Regards,
>>
>> Christian.
>>
>>  
>>
>> >            if (atomic_dec_and_test(&src->enabled_types[type]))
>>
>> >                           return amdgpu_irq_update(adev, src, type);
>>
>> >  
>>
>> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > index dc4f37240beb..e04f846b0b19 100644
>>
>> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > @@ -146,7 +146,7 @@ static void vblank_control_worker(struct
>>
>> > work_struct *work)
>>
>> >  
>>
>> >   static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)
>>
>> >   {
>>
>> > -          enum dc_irq_source irq_source;
>>
>> > +         int irq_type;
>>
>> >            struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
>>
>> >            struct amdgpu_device *adev = drm_to_adev(crtc->dev);
>>
>> >            struct dm_crtc_state *acrtc_state = to_dm_crtc_state(crtc->state);
>>
>> > @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)
>>
>> >            if (rc)
>>
>> >                           return rc;
>>
>> >  
>>
>> > -          irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst;
>>
>> > +         irq_type = amdgpu_display_crtc_idx_to_irq_type(adev,
>>
>> > +acrtc->crtc_id);
>>
>> > +
>>
>> > +         if (enable)
>>
>> > +                       rc = amdgpu_irq_get(adev, &adev->crtc_irq, irq_type);
>>
>> > +
>>
>> > +         else
>>
>> > +                       rc = amdgpu_irq_put(adev, &adev->crtc_irq, irq_type);
>>
>> >  
>>
>> > -          if (!dc_interrupt_set(adev->dm.dc, irq_source, enable))
>>
>> > -                        return -EBUSY;
>>
>> > +         if (rc)
>>
>> > +                       return rc;
>>
>> >  
>>
>> >   skip:
>>
>> >            if (amdgpu_in_reset(adev))
>>
>>  
>>
> 



More information about the amd-gfx mailing list