[RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler

Thu Aug 13 18:18:33 UTC 2020

On 8/13/20 11:06 AM, Nirmoy wrote:
>
> On 8/13/20 3:38 PM, Andrey Grodzovsky wrote:
>>
>> On 8/13/20 7:09 AM, Nirmoy wrote:
>>>
>>> On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:
>>>>
>>>> On 8/11/20 9:30 AM, Nirmoy Das wrote:
>>>>> This patch will ignore non-fatal errors and try to
>>>>> stop amdgpu's sw stack on fatal errors.
>>>>>
>>>>> Signed-off-by: Nirmoy Das <nirmoy.das at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
>>>>>   1 file changed, 54 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index c1219af2e7d6..2b9ede3000ee 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -35,6 +35,7 @@
>>>>>   #include <linux/pm_runtime.h>
>>>>>   #include <linux/vga_switcheroo.h>
>>>>>   #include <drm/drm_probe_helper.h>
>>>>> +#include <drm/drm_atomic_helper.h>
>>>>>   #include <linux/mmu_notifier.h>
>>>>>     #include "amdgpu.h"
>>>>> @@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
>>>>>       .patchlevel = KMS_DRIVER_PATCHLEVEL,
>>>>>   };
>>>>>   +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
>>>>> +                        pci_channel_state_t state)
>>>>> +{
>>>>> +    struct drm_device *dev = pci_get_drvdata(pdev);
>>>>> +    struct amdgpu_device *adev = dev->dev_private;
>>>>> +    int i;
>>>>> +    int ret = PCI_ERS_RESULT_DISCONNECT;
>>>>> +
>>>>> +    switch (state) {
>>>>> +    case pci_channel_io_normal:
>>>>> +        ret = PCI_ERS_RESULT_CAN_RECOVER;
>>>>> +        break;
>>>>> +    default:
>>>>> +        /* Disable power management */
>>>>> +        adev->runpm = 0;
>>>>> +        /* Suspend all IO operations */
>>>>> +        amdgpu_fbdev_set_suspend(adev, 1);
>>>>> + cancel_delayed_work_sync(&adev->delayed_init_work);
>>>>> +        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>> +            struct amdgpu_ring *ring = adev->rings[i];
>>>>> +
>>>>> +            if (!ring || !ring->sched.thread)
>>>>> +                continue;
>>>>> +
>>>>> + amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>>>>
>>>>
>>>> You need to call drm_sched_stop first before calling this
>>>>
>>>>> +        }
>>>>> +
>>>>> +        if (adev->mode_info.mode_config_initialized) {
>>>>> +            if (!amdgpu_device_has_dc_support(adev))
>>>>> + drm_helper_force_disable_all(adev->ddev);
>>>>> +            else
>>>>> + drm_atomic_helper_shutdown(adev->ddev);
>>>>> +        }
>>>>> +
>>>>> +        amdgpu_fence_driver_fini(adev);
>>>>> +        amdgpu_fbdev_fini(adev);
>>>>> +        /* Try to close drm device to stop applications
>>>>> +         * from opening dri files for further IO operations.
>>>>> +         * TODO: This will throw warning as ttm is not
>>>>> +         * cleaned perperly */
>>>>> +        drm_dev_fini(dev);
>>>>
>>>>
>>>> I think user mode applications might still hold reference to the drm device 
>>>> through through drm_dev_get either by directly opening
>>>> the device file or indirectly through importing DMA buff, if so when the 
>>>> last of them terminate drm_dev_put->drm_dev_release->...->drm_dev_fini
>>>> might get called again causing use after free e.t.c issues. Maybe better to 
>>>> call here drm_dev_put then and so drm_dev_fini will get called when this
>>>> last user client releases his reference.
>>>
>>>
>>> drm_dev_fini() seems to be cleaner. Problem is  window manager(sway) never 
>>> gets terminated after the AER error and drm files remains active. Simple cat 
>>> on dri files
>>>
>>> goes though amdgpu and spits out more errors.
>>
>>
>> What happens if you kill the window manager after you closed drm device with 
>> your original code applied ? I would expect drm_dev_fini to be called again
>> for the reason i explained above and this would obviously would be wrong to 
>> happen.
>
> Hi Andrey,
>
>
> hmm I quickly tried that, Kernel crashed and later rebooted after sometime. I 
> don't have a serial console to check logs and there was no logs afterwards in 
> journalctl.
>
> drm_dev_put() had similar behavior, kernel/machine was inaccessible over ssh.
>
>
> Did you face same behavior while testing gpu hotplug ?
>
>
> Nirmoy

Yea, in my case device sysfs structure was removed on pci_remove while when last 
user client dropped reference and this led
to drm_dev_fini to be called there were more sysfs entries removal there which 
lead to a crash. But here i don't think the sysfs for drm_device
is removed because the device is not extracted...

Andrey

>
>
>>
>> Andrey
>>
>>
>>>
>>>
>>>>
>>>> Also a general question - in my work on DPC recovery feature which tries to 
>>>> recover after PCIe error - once the PCI error has happened MMIO registers 
>>>> become
>>>> unaccessible for r/w as the PCI link is dead until after the PCI link is 
>>>> reset by the DPC driver (see 
>>>> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html section 
>>>> 6.1.4).
>>>> Your case is to try and gracefully to close the drm device once fatal error 
>>>> happened, didn't you encounter errors or warnings when accessing HW 
>>>> registers during any of the operations
>>>> above ?
>>>
>>>
>>> As discussed over chat, it seems aer generated with aer-inject tool just 
>>> triggers kernel PCI error APIs but the device is still active so I didn't 
>>> encounter any errors when accessing HW registers.
>>>
>>>
>>> Nirmoy
>>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>> +        break;
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static const struct pci_error_handlers amdgpu_err_handler = {
>>>>> +       .error_detected = amdgpu_pci_err_detected,
>>>>> +};
>>>>> +
>>>>> +
>>>>>   static struct pci_driver amdgpu_kms_pci_driver = {
>>>>>       .name = DRIVER_NAME,
>>>>>       .id_table = pciidlist,
>>>>> @@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>>>>>       .remove = amdgpu_pci_remove,
>>>>>       .shutdown = amdgpu_pci_shutdown,
>>>>>       .driver.pm = &amdgpu_pm_ops,
>>>>> +    .err_handler = &amdgpu_err_handler,
>>>>>   };
>>>>>   -
>>>>> -
>>>>>   static int __init amdgpu_init(void)
>>>>>   {
>>>>>       int r;