[Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset

Wed Oct 4 17:19:23 UTC 2023

On Mon, Oct 02, 2023 at 12:22:19PM +0000, Ofir Bitton wrote:
> On 02/10/2023 11:11, Andrzej Hajda wrote:
> > 
> > 
> > On 01.10.2023 10:50, Ofir Bitton wrote:
> >> On 27/09/2023 10:46, Andrzej Hajda wrote:
> >>> Driver initiated functional reset (FLR) is the highest level of reset
> >>> that we can trigger from within the driver. In contrast to PCI FLR it
> >>> doesn't require re-enumeration of PCI BAR. It can be useful in case
> >>> GT fails to reset. It is also the only way to trigger GSC reset from
> >>> the driver and can be used in future addition of GSC support.
> >>>
> >>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
> >>> ---
> >>> Hi all,
> >>>
> >>> This is initial attempt to implement Driver-FLR in xe. Implementation
> >>> is copied from i915. I've skipped comments, but I can re-add them if
> >>> requested.
> >>> In i915 Driver-FLR is performed on driver unload if GSC was used.
> >>> In xe GSC is not yet supported, but I have added trigger to perform FLR
> >>> in case GT fails to reset, I guess this could be proper use-case and
> >>> of course this is prerequisite to GSC support.
> >>>
> >>> Since this is my 1st patch to xe please be merciful in comments :)
> >>>
> >>> Regards
> >>> Andrzej
> >>> ---
> >>>    drivers/gpu/drm/xe/regs/xe_gt_regs.h |  6 +++++
> >>>    drivers/gpu/drm/xe/xe_gt.c           |  2 ++
> >>>    drivers/gpu/drm/xe/xe_gt_types.h     |  3 +++
> >>>    drivers/gpu/drm/xe/xe_mmio.c         | 38 
> >>> ++++++++++++++++++++++++++++
> >>>    4 files changed, 49 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h 
> >>> b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> index e13fbbdf6929ea..769a8fd005931a 100644
> >>> --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> @@ -359,6 +359,12 @@
> >>>    #define RCU_MODE                            XE_REG(0x14800, 
> >>> XE_REG_OPTION_MASKED)
> >>>    #define   RCU_MODE_CCS_ENABLE                       REG_BIT(0)
> >>>
> >>> +#define GU_CNTL                                      XE_REG(0x101010)
> >>> +#define   LMEM_INIT                          REG_BIT(7)
> >>> +#define   DRIVERFLR                          REG_BIT(31)
> >>> +#define GU_DEBUG                             XE_REG(0x101018)
> >>> +#define   DRIVERFLR_STATUS                   REG_BIT(31)
> >>> +
> >>>    #define FORCEWAKE_ACK_GT                    XE_REG(0x130044)
> >>>    #define   FORCEWAKE_KERNEL                  BIT(0)
> >>>    #define   FORCEWAKE_USER                    BIT(1)
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> >>> index 1aa44d4f9ac1c8..1e0b76000ed46b 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt.c
> >>> +++ b/drivers/gpu/drm/xe/xe_gt.c
> >>> @@ -604,6 +604,8 @@ static int gt_reset(struct xe_gt *gt)
> >>>        xe_uevent_gt_reset_failure(to_pci_dev(gt_to_xe(gt)->drm.dev),
> >>>                                   gt_to_tile(gt)->id, gt->info.id);
> >>>
> >>> +     gt->needs_flr_on_fini = true;
> >>> +
> >>>        return err;
> >>>    }
> >> I think this should be platform dependent and not set to true by default.
> > 
> > Why? Do we have xe supported hw without FLR?
> > This is kind-of bigger hammer solution - if GT reset fails, lets try FLR.
> 
> In future platforms this might not be straight forward, this is why I
> think it is preferable to be able to choose whether this feature is
> valid or not. We can always have bypass flags in innersource.

we could check if this is PVC case already and make the case for adding
a device flag already. Or maybe just a if () return; in the beggining
of the function when/where that is needed.

> 
> > 
> >>
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h 
> >>> b/drivers/gpu/drm/xe/xe_gt_types.h
> >>> index d4310be3e1e7c8..dc194e9656abd5 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> >>> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> >>> @@ -347,6 +347,9 @@ struct xe_gt {
> >>>                /** @oob: bitmap with active OOB workaroudns */
> >>>                unsigned long *oob;
> >>>        } wa_active;
> >>> +
> >>> +     /** @needs_flr_on_fini: requests functional reset on fini */
> >>> +     bool needs_flr_on_fini;
> >>>    };
> >>>
> >>>    #endif
> >>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> >>> index 3ccc0af4430be4..593ecc98cdfe8a 100644
> >>> --- a/drivers/gpu/drm/xe/xe_mmio.c
> >>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> >>> @@ -4,6 +4,7 @@
> >>>     */
> >>>
> >>>    #include <linux/minmax.h>
> >>> +#include <linux/units.h>
> >>>
> >>>    #include "xe_mmio.h"
> >>>
> >>> @@ -364,9 +365,46 @@ static void xe_mmio_probe_tiles(struct xe_device 
> >>> *xe)
> >>>        }
> >>>    }
> >>>
> >>> +static void driver_initiated_flr(struct xe_gt *gt)
> >>> +{
> >>> +     const unsigned int flr_timeout = 3 * MICRO; /* specs recommend 
> >>> a 3s wait */
> >>> +     struct xe_device *xe = gt->tile->xe;
> >>> +     int ret;
> >>> +
> >>> +     drm_dbg(&xe->drm, "Triggering Driver-FLR\n");
> >>> +
> >>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout, 
> >>> NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-prepare wait for ready 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> >>> +     xe_mmio_rmw32(gt, GU_CNTL, 0, DRIVERFLR);
> >>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout, 
> >>> NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-teardown wait completion 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +     ret = xe_mmio_wait32(gt, GU_DEBUG, DRIVERFLR_STATUS, 
> >>> DRIVERFLR_STATUS,
> >>> +                          flr_timeout, NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-reinit wait completion 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     /* Clear sticky completion status */
> >>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> >>> +}
> >>> +
> >>>    static void mmio_fini(struct drm_device *drm, void *arg)
> >>>    {
> >>>        struct xe_device *xe = arg;
> >>> +     struct xe_gt *gt;
> >>> +     u8 id;
> >>> +
> >>> +     for_each_gt(gt, xe, id)
> >>> +             if (gt->needs_flr_on_fini)
> >>> +                     driver_initiated_flr(gt);
> >>>
> >>>        pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> >>>        if (xe->mem.vram.mapping)
> >> Not sure mmio_fini is the best place to put this, as this is not mmio
> >> related.
> > 
> > 
> > The reason I have put it there, is that it should be the last action 
> > before unmapping mmio. What place would you suggest then?
> 
> maybe add something like:
> 'drmm_add_action_or_reset(&xe->drm, flr, xe);' under 'xe_device_probe'
> after 'mmio_init'.

yeap, I also had my doubts about the placement. Why not simply calling this
function or a work-queue calling this function from the place where the
gt_reset failed?

hmmm... and maybe not even a work-queue but a direct call is better, since
on that point we are surely serialized and no execution attempt happening
in parallel.

> 
> --
> Ofir
> 
> > 
> > Regards
> > Andrzej
> > 
> > 
> >>
> >> -- 
> >> Ofir
> >>
> >>
> > 
>