[RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler
Alex Deucher
alexdeucher at gmail.com
Thu Aug 13 13:36:53 UTC 2020
On Thu, Aug 13, 2020 at 7:06 AM Nirmoy <nirmodas at amd.com> wrote:
>
>
> On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:
> >
> > On 8/11/20 9:30 AM, Nirmoy Das wrote:
> >> This patch will ignore non-fatal errors and try to
> >> stop amdgpu's sw stack on fatal errors.
> >>
> >> Signed-off-by: Nirmoy Das <nirmoy.das at amd.com>
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
> >> 1 file changed, 54 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index c1219af2e7d6..2b9ede3000ee 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -35,6 +35,7 @@
> >> #include <linux/pm_runtime.h>
> >> #include <linux/vga_switcheroo.h>
> >> #include <drm/drm_probe_helper.h>
> >> +#include <drm/drm_atomic_helper.h>
> >> #include <linux/mmu_notifier.h>
> >> #include "amdgpu.h"
> >> @@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
> >> .patchlevel = KMS_DRIVER_PATCHLEVEL,
> >> };
> >> +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
> >> + pci_channel_state_t state)
> >> +{
> >> + struct drm_device *dev = pci_get_drvdata(pdev);
> >> + struct amdgpu_device *adev = dev->dev_private;
> >> + int i;
> >> + int ret = PCI_ERS_RESULT_DISCONNECT;
> >> +
> >> + switch (state) {
> >> + case pci_channel_io_normal:
> >> + ret = PCI_ERS_RESULT_CAN_RECOVER;
> >> + break;
> >> + default:
> >> + /* Disable power management */
> >> + adev->runpm = 0;
> >> + /* Suspend all IO operations */
> >> + amdgpu_fbdev_set_suspend(adev, 1);
> >> + cancel_delayed_work_sync(&adev->delayed_init_work);
> >> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >> + struct amdgpu_ring *ring = adev->rings[i];
> >> +
> >> + if (!ring || !ring->sched.thread)
> >> + continue;
> >> +
> >> + amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
> >
> >
> > You need to call drm_sched_stop first before calling this
> >
> >> + }
> >> +
> >> + if (adev->mode_info.mode_config_initialized) {
> >> + if (!amdgpu_device_has_dc_support(adev))
> >> + drm_helper_force_disable_all(adev->ddev);
> >> + else
> >> + drm_atomic_helper_shutdown(adev->ddev);
> >> + }
> >> +
> >> + amdgpu_fence_driver_fini(adev);
> >> + amdgpu_fbdev_fini(adev);
> >> + /* Try to close drm device to stop applications
> >> + * from opening dri files for further IO operations.
> >> + * TODO: This will throw warning as ttm is not
> >> + * cleaned perperly */
> >> + drm_dev_fini(dev);
> >
> >
> > I think user mode applications might still hold reference to the drm
> > device through through drm_dev_get either by directly opening
> > the device file or indirectly through importing DMA buff, if so when
> > the last of them terminate
> > drm_dev_put->drm_dev_release->...->drm_dev_fini
> > might get called again causing use after free e.t.c issues. Maybe
> > better to call here drm_dev_put then and so drm_dev_fini will get
> > called when this
> > last user client releases his reference.
>
>
> drm_dev_fini() seems to be cleaner. Problem is window manager(sway)
> never gets terminated after the AER error and drm files remains active.
> Simple cat on dri files
>
> goes though amdgpu and spits out more errors.
>
>
> >
> > Also a general question - in my work on DPC recovery feature which
> > tries to recover after PCIe error - once the PCI error has happened
> > MMIO registers become
> > unaccessible for r/w as the PCI link is dead until after the PCI link
> > is reset by the DPC driver (see
> > https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
> > section 6.1.4).
> > Your case is to try and gracefully to close the drm device once fatal
> > error happened, didn't you encounter errors or warnings when accessing
> > HW registers during any of the operations
> > above ?
>
>
> As discussed over chat, it seems aer generated with aer-inject tool just
> triggers kernel PCI error APIs but the device is still active so I
> didn't encounter any errors when accessing HW registers.
I think DPC is special because it triggers a special hardware mode
which resets the device. Most other PCI errors do not.
Alex
>
>
> Nirmoy
>
>
> >
> > Andrey
> >
> >
> >> + break;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static const struct pci_error_handlers amdgpu_err_handler = {
> >> + .error_detected = amdgpu_pci_err_detected,
> >> +};
> >> +
> >> +
> >> static struct pci_driver amdgpu_kms_pci_driver = {
> >> .name = DRIVER_NAME,
> >> .id_table = pciidlist,
> >> @@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver
> >> = {
> >> .remove = amdgpu_pci_remove,
> >> .shutdown = amdgpu_pci_shutdown,
> >> .driver.pm = &amdgpu_pm_ops,
> >> + .err_handler = &amdgpu_err_handler,
> >> };
> >> -
> >> -
> >> static int __init amdgpu_init(void)
> >> {
> >> int r;
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list