[PATCH] drm/amdgpu: Update irq disable flow during unload

Mon Jan 8 09:01:05 UTC 2024

Am 08.01.24 um 09:32 schrieb Lazar, Lijo:
> On 1/8/2024 1:51 PM, Christian König wrote:
>> Am 08.01.24 um 09:13 schrieb Kamal, Asad:
>>> [AMD Official Use Only - General]
>>>
>>> Hi Christian,
>>>
>>> Thank you for the comment.
>>>
>>> This is not normal reset, it is reset done during unload for smu 
>>> v_13_0_2.
>>
>> Yeah, but this doesn't explain the rational for this.
>>
>> IRQ enable/disable should be balanced in hw_init()/hw_fini(), 
>> independent of what else you do.
>>
>> I'm not sure what you are trying to solve but this here is a complete 
>> no-go.
>>
>
> This is a special reset done during module unload by this commit - 
> f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu
> v13_0_2"). Without this commit, it seems driver reload doesnt' work.
>
> In this particular case, a the reset is done during unload and only 
> resume sequence of only select IPs are done (part of the workaround in 
> the patch). For those IPs, irqs are enabled during 
> late_init/ras_late_init, and not during hw_init(), that part gets 
> skipped.

Please revert that immediately, this whole approach is completely broken 
as far as I can see.

>
> The module unload sequence causes a WARN trace during irq_put of those 
> IPs during hw_fini(). Those are mix of generic irqs and ras irqs, so 
> there is no clean way to untangle it.

This is WARN is just the tip of the iceberg here, the problem is that 
you are not supposed to call amdgpu_device_ip_resume_phase1() as you do 
in f5c7e7797060.

Please sync up with Alex how to do this cleanly.

Regards,
Christian.

>
> One thing that could be done is to add an extra check for 13.0.2 
> version to make it clear that this workaround is done for only for 
> those ASICs.
>
> Thanks,
> Lijo
>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks & Regards
>>> Asad
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Monday, January 8, 2024 1:33 PM
>>> To: Kamal, Asad <Asad.Kamal at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Update irq disable flow during unload
>>>
>>> Am 05.01.24 um 16:21 schrieb Asad Kamal:
>>>> In certain special cases, e.g device reset before module unload, irq
>>>> gets disabled as part of reset sequence and won't get enabled back.
>>>> Add special check to cover such scenarios
>>> Well complete NAK to that. Resets shouldn't affect the IRQ state at 
>>> all!
>>>
>>> If this is an issue then something else is broken.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Asad Kamal <asad.kamal at amd.com>
>>>> Suggested-by: Lijo Lazar <lijo.lazar at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 12 ++++++++++--
>>>>    drivers/gpu/drm/amd/amdgpu/soc15.c    | 13 +++++++++++--
>>>>    2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> index 372de9f1ce59..a4e1b9a58679 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> @@ -2361,6 +2361,7 @@ static void gmc_v9_0_gart_disable(struct 
>>>> amdgpu_device *adev)
>>>>    static int gmc_v9_0_hw_fini(void *handle)
>>>>    {
>>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> +     bool irq_release = true;
>>>>
>>>>        gmc_v9_0_gart_disable(adev);
>>>>
>>>> @@ -2378,9 +2379,16 @@ static int gmc_v9_0_hw_fini(void *handle)
>>>>        if (adev->mmhub.funcs->update_power_gating)
>>>> adev->mmhub.funcs->update_power_gating(adev, false);
>>>>
>>>> -     amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>>> +     if (adev->shutdown)
>>>> +             irq_release = amdgpu_irq_enabled(adev, 
>>>> &adev->gmc.vm_fault, 0);
>>>>
>>>> -     if (adev->gmc.ecc_irq.funcs &&
>>>> +     if (irq_release)
>>>> +             amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>>> +
>>>> +     if (adev->shutdown)
>>>> +             irq_release = amdgpu_irq_enabled(adev, 
>>>> &adev->gmc.ecc_irq, 0);
>>>> +
>>>> +     if (adev->gmc.ecc_irq.funcs && irq_release &&
>>>>                amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
>>>>                amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> index 15033efec2ba..7ee835049d57 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> @@ -1266,6 +1266,7 @@ static int soc15_common_hw_init(void *handle)
>>>>    static int soc15_common_hw_fini(void *handle)
>>>>    {
>>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> +     bool irq_release = true;
>>>>
>>>>        /* Disable the doorbell aperture and selfring doorbell aperture
>>>>         * separately in hw_fini because 
>>>> soc15_enable_doorbell_aperture @@
>>>> -1280,10 +1281,18 @@ static int soc15_common_hw_fini(void *handle)
>>>>
>>>>        if (adev->nbio.ras_if &&
>>>>            amdgpu_ras_is_supported(adev, adev->nbio.ras_if->block)) {
>>>> -             if (adev->nbio.ras &&
>>>> +             if (adev->shutdown)
>>>> +                     irq_release = amdgpu_irq_enabled(adev,
>>>> +&adev->nbio.ras_controller_irq, 0);
>>>> +
>>>> +             if (adev->nbio.ras && irq_release &&
>>>> adev->nbio.ras->init_ras_controller_interrupt)
>>>>                        amdgpu_irq_put(adev, 
>>>> &adev->nbio.ras_controller_irq, 0);
>>>> -             if (adev->nbio.ras &&
>>>> +
>>>> +             if (adev->shutdown)
>>>> +                     irq_release = amdgpu_irq_enabled(adev,
>>>> + &adev->nbio.ras_err_event_athub_irq, 0);
>>>> +
>>>> +             if (adev->nbio.ras && irq_release &&
>>>> adev->nbio.ras->init_ras_err_event_athub_interrupt)
>>>>                        amdgpu_irq_put(adev, 
>>>> &adev->nbio.ras_err_event_athub_irq, 0);
>>>>        }
>>
>