[PATCH] drm/amdgpu: Update irq disable flow during unload
Christian König
christian.koenig at amd.com
Mon Jan 8 09:01:05 UTC 2024
Am 08.01.24 um 09:32 schrieb Lazar, Lijo:
> On 1/8/2024 1:51 PM, Christian König wrote:
>> Am 08.01.24 um 09:13 schrieb Kamal, Asad:
>>> [AMD Official Use Only - General]
>>>
>>> Hi Christian,
>>>
>>> Thank you for the comment.
>>>
>>> This is not normal reset, it is reset done during unload for smu
>>> v_13_0_2.
>>
>> Yeah, but this doesn't explain the rational for this.
>>
>> IRQ enable/disable should be balanced in hw_init()/hw_fini(),
>> independent of what else you do.
>>
>> I'm not sure what you are trying to solve but this here is a complete
>> no-go.
>>
>
> This is a special reset done during module unload by this commit -
> f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu
> v13_0_2"). Without this commit, it seems driver reload doesnt' work.
>
> In this particular case, a the reset is done during unload and only
> resume sequence of only select IPs are done (part of the workaround in
> the patch). For those IPs, irqs are enabled during
> late_init/ras_late_init, and not during hw_init(), that part gets
> skipped.
Please revert that immediately, this whole approach is completely broken
as far as I can see.
>
> The module unload sequence causes a WARN trace during irq_put of those
> IPs during hw_fini(). Those are mix of generic irqs and ras irqs, so
> there is no clean way to untangle it.
This is WARN is just the tip of the iceberg here, the problem is that
you are not supposed to call amdgpu_device_ip_resume_phase1() as you do
in f5c7e7797060.
Please sync up with Alex how to do this cleanly.
Regards,
Christian.
>
> One thing that could be done is to add an extra check for 13.0.2
> version to make it clear that this workaround is done for only for
> those ASICs.
>
> Thanks,
> Lijo
>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks & Regards
>>> Asad
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Monday, January 8, 2024 1:33 PM
>>> To: Kamal, Asad <Asad.Kamal at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Update irq disable flow during unload
>>>
>>> Am 05.01.24 um 16:21 schrieb Asad Kamal:
>>>> In certain special cases, e.g device reset before module unload, irq
>>>> gets disabled as part of reset sequence and won't get enabled back.
>>>> Add special check to cover such scenarios
>>> Well complete NAK to that. Resets shouldn't affect the IRQ state at
>>> all!
>>>
>>> If this is an issue then something else is broken.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Asad Kamal <asad.kamal at amd.com>
>>>> Suggested-by: Lijo Lazar <lijo.lazar at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 12 ++++++++++--
>>>> drivers/gpu/drm/amd/amdgpu/soc15.c | 13 +++++++++++--
>>>> 2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> index 372de9f1ce59..a4e1b9a58679 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>> @@ -2361,6 +2361,7 @@ static void gmc_v9_0_gart_disable(struct
>>>> amdgpu_device *adev)
>>>> static int gmc_v9_0_hw_fini(void *handle)
>>>> {
>>>> struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> + bool irq_release = true;
>>>>
>>>> gmc_v9_0_gart_disable(adev);
>>>>
>>>> @@ -2378,9 +2379,16 @@ static int gmc_v9_0_hw_fini(void *handle)
>>>> if (adev->mmhub.funcs->update_power_gating)
>>>> adev->mmhub.funcs->update_power_gating(adev, false);
>>>>
>>>> - amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>>> + if (adev->shutdown)
>>>> + irq_release = amdgpu_irq_enabled(adev,
>>>> &adev->gmc.vm_fault, 0);
>>>>
>>>> - if (adev->gmc.ecc_irq.funcs &&
>>>> + if (irq_release)
>>>> + amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>>>> +
>>>> + if (adev->shutdown)
>>>> + irq_release = amdgpu_irq_enabled(adev,
>>>> &adev->gmc.ecc_irq, 0);
>>>> +
>>>> + if (adev->gmc.ecc_irq.funcs && irq_release &&
>>>> amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
>>>> amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> index 15033efec2ba..7ee835049d57 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> @@ -1266,6 +1266,7 @@ static int soc15_common_hw_init(void *handle)
>>>> static int soc15_common_hw_fini(void *handle)
>>>> {
>>>> struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> + bool irq_release = true;
>>>>
>>>> /* Disable the doorbell aperture and selfring doorbell aperture
>>>> * separately in hw_fini because
>>>> soc15_enable_doorbell_aperture @@
>>>> -1280,10 +1281,18 @@ static int soc15_common_hw_fini(void *handle)
>>>>
>>>> if (adev->nbio.ras_if &&
>>>> amdgpu_ras_is_supported(adev, adev->nbio.ras_if->block)) {
>>>> - if (adev->nbio.ras &&
>>>> + if (adev->shutdown)
>>>> + irq_release = amdgpu_irq_enabled(adev,
>>>> +&adev->nbio.ras_controller_irq, 0);
>>>> +
>>>> + if (adev->nbio.ras && irq_release &&
>>>> adev->nbio.ras->init_ras_controller_interrupt)
>>>> amdgpu_irq_put(adev,
>>>> &adev->nbio.ras_controller_irq, 0);
>>>> - if (adev->nbio.ras &&
>>>> +
>>>> + if (adev->shutdown)
>>>> + irq_release = amdgpu_irq_enabled(adev,
>>>> + &adev->nbio.ras_err_event_athub_irq, 0);
>>>> +
>>>> + if (adev->nbio.ras && irq_release &&
>>>> adev->nbio.ras->init_ras_err_event_athub_interrupt)
>>>> amdgpu_irq_put(adev,
>>>> &adev->nbio.ras_err_event_athub_irq, 0);
>>>> }
>>
>
More information about the amd-gfx
mailing list