[PATCH] drm/amdgpu: Update irq disable flow during unload

Mon Jan 8 08:32:58 UTC 2024

On 1/8/2024 1:51 PM, Christian König wrote:
> Am 08.01.24 um 09:13 schrieb Kamal, Asad:
>> [AMD Official Use Only - General]
>>
>> Hi Christian,
>>
>> Thank you for the comment.
>>
>> This is not normal reset, it is reset done during unload for smu 
>> v_13_0_2.
> 
> Yeah, but this doesn't explain the rational for this.
> 
> IRQ enable/disable should be balanced in hw_init()/hw_fini(), 
> independent of what else you do.
> 
> I'm not sure what you are trying to solve but this here is a complete 
> no-go.
> 

This is a special reset done during module unload by this commit - 
f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu
v13_0_2"). Without this commit, it seems driver reload doesnt' work.

In this particular case, a the reset is done during unload and only 
resume sequence of only select IPs are done (part of the workaround in 
the patch). For those IPs, irqs are enabled during 
late_init/ras_late_init, and not during hw_init(), that part gets skipped.

The module unload sequence causes a WARN trace during irq_put of those 
IPs during hw_fini(). Those are mix of generic irqs and ras irqs, so 
there is no clean way to untangle it.

One thing that could be done is to add an extra check for 13.0.2 version 
to make it clear that this workaround is done for only for those ASICs.

Thanks,
Lijo

> Regards,
> Christian.
> 
>>
>> Thanks & Regards
>> Asad
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Monday, January 8, 2024 1:33 PM
>> To: Kamal, Asad <Asad.Kamal at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: Update irq disable flow during unload
>>
>> Am 05.01.24 um 16:21 schrieb Asad Kamal:
>>> In certain special cases, e.g device reset before module unload, irq
>>> gets disabled as part of reset sequence and won't get enabled back.
>>> Add special check to cover such scenarios
>> Well complete NAK to that. Resets shouldn't affect the IRQ state at all!
>>
>> If this is an issue then something else is broken.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Asad Kamal <asad.kamal at amd.com>
>>> Suggested-by: Lijo Lazar <lijo.lazar at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 12 ++++++++++--
>>>    drivers/gpu/drm/amd/amdgpu/soc15.c    | 13 +++++++++++--
>>>    2 files changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index 372de9f1ce59..a4e1b9a58679 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -2361,6 +2361,7 @@ static void gmc_v9_0_gart_disable(struct 
>>> amdgpu_device *adev)
>>>    static int gmc_v9_0_hw_fini(void *handle)
>>>    {
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>> +     bool irq_release = true;
>>>
>>>        gmc_v9_0_gart_disable(adev);
>>>
>>> @@ -2378,9 +2379,16 @@ static int gmc_v9_0_hw_fini(void *handle)
>>>        if (adev->mmhub.funcs->update_power_gating)
>>>                adev->mmhub.funcs->update_power_gating(adev, false);
>>>
>>> -     amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>> +     if (adev->shutdown)
>>> +             irq_release = amdgpu_irq_enabled(adev, 
>>> &adev->gmc.vm_fault, 0);
>>>
>>> -     if (adev->gmc.ecc_irq.funcs &&
>>> +     if (irq_release)
>>> +             amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>> +
>>> +     if (adev->shutdown)
>>> +             irq_release = amdgpu_irq_enabled(adev, 
>>> &adev->gmc.ecc_irq, 0);
>>> +
>>> +     if (adev->gmc.ecc_irq.funcs && irq_release &&
>>>                amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
>>>                amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> index 15033efec2ba..7ee835049d57 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> @@ -1266,6 +1266,7 @@ static int soc15_common_hw_init(void *handle)
>>>    static int soc15_common_hw_fini(void *handle)
>>>    {
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>> +     bool irq_release = true;
>>>
>>>        /* Disable the doorbell aperture and selfring doorbell aperture
>>>         * separately in hw_fini because 
>>> soc15_enable_doorbell_aperture @@
>>> -1280,10 +1281,18 @@ static int soc15_common_hw_fini(void *handle)
>>>
>>>        if (adev->nbio.ras_if &&
>>>            amdgpu_ras_is_supported(adev, adev->nbio.ras_if->block)) {
>>> -             if (adev->nbio.ras &&
>>> +             if (adev->shutdown)
>>> +                     irq_release = amdgpu_irq_enabled(adev,
>>> +&adev->nbio.ras_controller_irq, 0);
>>> +
>>> +             if (adev->nbio.ras && irq_release &&
>>>                    adev->nbio.ras->init_ras_controller_interrupt)
>>>                        amdgpu_irq_put(adev, 
>>> &adev->nbio.ras_controller_irq, 0);
>>> -             if (adev->nbio.ras &&
>>> +
>>> +             if (adev->shutdown)
>>> +                     irq_release = amdgpu_irq_enabled(adev,
>>> +                                     
>>> &adev->nbio.ras_err_event_athub_irq, 0);
>>> +
>>> +             if (adev->nbio.ras && irq_release &&
>>>                    adev->nbio.ras->init_ras_err_event_athub_interrupt)
>>>                        amdgpu_irq_put(adev, 
>>> &adev->nbio.ras_err_event_athub_irq, 0);
>>>        }
>