[RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Thu Aug 13 13:38:26 UTC 2020
On 8/13/20 7:09 AM, Nirmoy wrote:
>
> On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:
>>
>> On 8/11/20 9:30 AM, Nirmoy Das wrote:
>>> This patch will ignore non-fatal errors and try to
>>> stop amdgpu's sw stack on fatal errors.
>>>
>>> Signed-off-by: Nirmoy Das <nirmoy.das at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
>>> 1 file changed, 54 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index c1219af2e7d6..2b9ede3000ee 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -35,6 +35,7 @@
>>> #include <linux/pm_runtime.h>
>>> #include <linux/vga_switcheroo.h>
>>> #include <drm/drm_probe_helper.h>
>>> +#include <drm/drm_atomic_helper.h>
>>> #include <linux/mmu_notifier.h>
>>> #include "amdgpu.h"
>>> @@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
>>> .patchlevel = KMS_DRIVER_PATCHLEVEL,
>>> };
>>> +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
>>> + pci_channel_state_t state)
>>> +{
>>> + struct drm_device *dev = pci_get_drvdata(pdev);
>>> + struct amdgpu_device *adev = dev->dev_private;
>>> + int i;
>>> + int ret = PCI_ERS_RESULT_DISCONNECT;
>>> +
>>> + switch (state) {
>>> + case pci_channel_io_normal:
>>> + ret = PCI_ERS_RESULT_CAN_RECOVER;
>>> + break;
>>> + default:
>>> + /* Disable power management */
>>> + adev->runpm = 0;
>>> + /* Suspend all IO operations */
>>> + amdgpu_fbdev_set_suspend(adev, 1);
>>> + cancel_delayed_work_sync(&adev->delayed_init_work);
>>> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> + struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> + if (!ring || !ring->sched.thread)
>>> + continue;
>>> +
>>> + amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>>
>>
>> You need to call drm_sched_stop first before calling this
>>
>>> + }
>>> +
>>> + if (adev->mode_info.mode_config_initialized) {
>>> + if (!amdgpu_device_has_dc_support(adev))
>>> + drm_helper_force_disable_all(adev->ddev);
>>> + else
>>> + drm_atomic_helper_shutdown(adev->ddev);
>>> + }
>>> +
>>> + amdgpu_fence_driver_fini(adev);
>>> + amdgpu_fbdev_fini(adev);
>>> + /* Try to close drm device to stop applications
>>> + * from opening dri files for further IO operations.
>>> + * TODO: This will throw warning as ttm is not
>>> + * cleaned perperly */
>>> + drm_dev_fini(dev);
>>
>>
>> I think user mode applications might still hold reference to the drm device
>> through through drm_dev_get either by directly opening
>> the device file or indirectly through importing DMA buff, if so when the last
>> of them terminate drm_dev_put->drm_dev_release->...->drm_dev_fini
>> might get called again causing use after free e.t.c issues. Maybe better to
>> call here drm_dev_put then and so drm_dev_fini will get called when this
>> last user client releases his reference.
>
>
> drm_dev_fini() seems to be cleaner. Problem is window manager(sway) never
> gets terminated after the AER error and drm files remains active. Simple cat
> on dri files
>
> goes though amdgpu and spits out more errors.
What happens if you kill the window manager after you closed drm device with
your original code applied ? I would expect drm_dev_fini to be called again
for the reason i explained above and this would obviously would be wrong to happen.
Andrey
>
>
>>
>> Also a general question - in my work on DPC recovery feature which tries to
>> recover after PCIe error - once the PCI error has happened MMIO registers become
>> unaccessible for r/w as the PCI link is dead until after the PCI link is
>> reset by the DPC driver (see
>> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html section
>> 6.1.4).
>> Your case is to try and gracefully to close the drm device once fatal error
>> happened, didn't you encounter errors or warnings when accessing HW registers
>> during any of the operations
>> above ?
>
>
> As discussed over chat, it seems aer generated with aer-inject tool just
> triggers kernel PCI error APIs but the device is still active so I didn't
> encounter any errors when accessing HW registers.
>
>
> Nirmoy
>
>
>>
>> Andrey
>>
>>
>>> + break;
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static const struct pci_error_handlers amdgpu_err_handler = {
>>> + .error_detected = amdgpu_pci_err_detected,
>>> +};
>>> +
>>> +
>>> static struct pci_driver amdgpu_kms_pci_driver = {
>>> .name = DRIVER_NAME,
>>> .id_table = pciidlist,
>>> @@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>>> .remove = amdgpu_pci_remove,
>>> .shutdown = amdgpu_pci_shutdown,
>>> .driver.pm = &amdgpu_pm_ops,
>>> + .err_handler = &amdgpu_err_handler,
>>> };
>>> -
>>> -
>>> static int __init amdgpu_init(void)
>>> {
>>> int r;
More information about the amd-gfx
mailing list