[PATCH] drm/amdgpu: add support to SMU debug option
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Dec 1 16:01:27 UTC 2021
On 2021-12-01 8:11 a.m., Christian König wrote:
> Adding Andrey as well.
>
> Am 01.12.21 um 12:37 schrieb Yu, Lang:
>> [SNIP]
>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>> prevent further data corruption with that.
>>>>>>>
>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>>>> judge here since I don't know the SMU code well enough.
>>>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>>>> fatal)
>>>>> errors", then a hang.
>>>>>> They want to keep system state since the error occurred.
>>>>> Well that is rather problematic.
>>>>>
>>>>> First of all we need to really justify that, crashing the kernel is
>>>>> not something easily done.
>>>>>
>>>>> Then this isn't really effective here. What happens is that you crash
>>>>> the kernel thread of the currently executing process, but it is
>>>>> perfectly possible that another thread still tries to send messages
>>>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>>>> make sure that this really gets the driver stuck in the current
>>>>> state.
>>>> Thanks. I got it. I just thought it is a kernel panic.
>>>> Could we use a panic() here?
>>> Potentially, but that might reboot the system automatically which is
>>> probably not
>>> what you want either.
>>>
>>> How does the SMU firmware team gather the necessary information when a
>>> problem occurs?
>> As far as I know, they usually use an HDT to collect information.
>> And in the ticket they request a hang when an error occurs:
>> "Suggested error responses include pop-up windows (by x86 driver, if
>> this is possible) or simply hanging after logging the error."
>
> In that case I suggest to set the "don't_touch_the_hardware_any_more"
> procedure we also use in case of PCIe hotplug.
>
> Andrey has the details but essentially it stops the driver from
> touching the hardware any more, signals all fences and unblocks
> everything.
>
> It should then be trivial to inspect the hardware state and see what's
> going on, but the system will keep stable at least for SSH access.
>
> Might be a good idea to have that mode for other fault cases like page
> faults and hardware crashes.
>
> Regards,
> Christian.
There is no single function that does all of that. What I think can be
done is to bring the device to a kind of halt state where no one touches
it, as follows:
1) Follow amdgpu_pci_remove:
   - Call drm_dev_unplug to make the device inaccessible to user space
     (IOCTLs etc.); it also clears the MMIO mappings to the device and
     disallows remapping through page faults.
   - No need to call all of amdgpu_driver_unload_kms, but from within it
     call amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini to
     disable interrupts and force-signal all HW fences.
   - Call pci_disable_device and pci_wait_for_pending_transaction to
     flush any in-flight DMA operations from the device.
2) Set adev->no_hw_access so that most places where we access HW skip
   the access (all subsequent register reads/writes and SMU/PSP message
   sending). There will still be a race with accesses already in
   progress, so maybe add some wait.
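
To illustrate the idea, here is a rough userspace model of the sequence
above. It is only a sketch: struct model_dev and every model_* helper are
hypothetical names, not amdgpu functions, and the comments only note which
real call each line stands in for. The PCI steps and the race with accesses
already in flight are not modeled.

```c
/*
 * Userspace model only. The struct and the model_* helpers are hypothetical
 * and are not the amdgpu API; they only mirror the roles of the functions
 * named in the steps above.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NUM_REGS 16

struct model_dev {
	bool unplugged;                 /* set where drm_dev_unplug() would run */
	bool irqs_enabled;              /* cleared where amdgpu_irq_disable_all() would run */
	bool no_hw_access;              /* plays the role of adev->no_hw_access */
	unsigned int pending_fences;    /* fences not yet signalled */
	unsigned int skipped_accesses;  /* accesses dropped after the halt */
	uint32_t regs[MODEL_NUM_REGS];  /* fake register file */
};

void model_dev_init(struct model_dev *dev)
{
	assert(dev != NULL);
	memset(dev, 0, sizeof(*dev));
	dev->irqs_enabled = true;
	dev->pending_fences = 2;        /* pretend two jobs are in flight */
}

/* Register write; returns true only if the write reached the "hardware". */
bool model_wreg(struct model_dev *dev, unsigned int idx, uint32_t val)
{
	if (dev->no_hw_access) {
		dev->skipped_accesses++;
		return false;
	}
	if (idx >= MODEL_NUM_REGS)
		return false;
	dev->regs[idx] = val;
	return true;
}

/* SMU message send; returns 0 on success, -1 if skipped after the halt. */
int model_send_smu_msg(struct model_dev *dev, uint32_t msg)
{
	if (dev->no_hw_access) {
		dev->skipped_accesses++;
		return -1;
	}
	dev->regs[0] = msg;             /* register 0 stands in for a mailbox */
	return 0;
}

/* Bring the device to the halt state, in the order listed above. */
void model_halt(struct model_dev *dev)
{
	dev->unplugged = true;          /* step 1: user space can no longer reach us */
	dev->irqs_enabled = false;      /* step 1: interrupts off */
	dev->pending_fences = 0;        /* step 1: force-signal all HW fences */
	dev->no_hw_access = true;       /* step 2: every later access is skipped */
}
```

After model_halt() the register file keeps whatever it held when the error
occurred, which is the point of the exercise: the state stays available for
inspection while later writes and SMU messages are dropped.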
Andrey
>
>>
>> Regards,
>> Lang
>>
>