[PATCH] drm/amdgpu: add support to SMU debug option

Wed Dec 1 16:01:27 UTC 2021

On 2021-12-01 8:11 a.m., Christian König wrote:
> Adding Andrey as well.
>
> Am 01.12.21 um 12:37 schrieb Yu, Lang:
>> [SNIP]
>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>> prevent further data corruption with that.
>>>>>>>
>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>>>> judge here since I don't know the SMU code well enough.
>>>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>>>> fatal)
>>>>> errors", then a hang.
>>>>>> They want to keep system state since the error occurred.
>>>>> Well that is rather problematic.
>>>>>
>>>>> First of all we need to really justify that, crashing the kernel is
>>>>> not something easily done.
>>>>>
>>>>> Then this isn't really effective here. What happens is that you crash
>>>>> the kernel thread of the currently executing process, but it is
>>>>> perfectly possible that another thread still tries to send messages
>>>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>>>> make sure that this really gets the driver stuck in the current 
>>>>> state.
>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>> Could we use a panic() here?
>>> Potentially, but that might reboot the system automatically which is 
>>> probably not
>>> what you want either.
>>>
>>> How does the SMU firmware team gather the necessary information when a
>>> problem occurs?
>> As far as I know, they usually use a HDT to collect information.
>> And they request a hang when error occurred in ticket.
>> "Suggested error responses include pop-up windows (by x86 driver, if 
>> this is possible) or simply hanging after logging the error."
>
> In that case I suggest to set the "don't_touch_the_hardware_any_more" 
> procedure we also use in case of PCIe hotplug.
>
> Andrey has the details but essentially it stops the driver from 
> touching the hardware any more, signals all fences and unblocks 
> everything.
>
> It should then be trivial to inspect the hardware state and see what's 
> going on, but the system will keep stable at least for SSH access.
>
> Might be a good idea to have that mode for other fault cases like page 
> faults and hardware crashes.
>
> Regards,
> Christian.

There is no one specific function that does all of that, what I think 
can be done is to bring the device to kind of halt state where no one 
touches it - as following -

1) Follow amdpgu_pci_remove -

     drm_dev_unplug to make device inaccessible to user space (IOCTLs 
e.t.c.) and clears MMIO mappings to device and disallows remappings 
through page faults

     No need to call all of amdgpu_driver_unload_kms but, within it call 
amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable 
interrupts and force signall all HW fences.

     pci_disable_device and pci_wait_for_pending_transaction to flush 
any in flight DMA operations from device

2) set adev->no_hw_access so that most of places we access HW (all 
subsequent registers reads/writes and SMU/PSP message sending is 
skipped, but some race will be with those already in progress so maybe 
adding some wait)

Andrey

>
>>
>> Regards,
>> Lang
>>
>