[Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset
Rodrigo Vivi
rodrigo.vivi at intel.com
Thu Oct 5 14:38:17 UTC 2023
On Thu, Oct 05, 2023 at 06:50:47AM -0700, Daniele Ceraolo Spurio wrote:
>
>
> On 10/4/2023 10:19 AM, Rodrigo Vivi wrote:
> > On Mon, Oct 02, 2023 at 12:22:19PM +0000, Ofir Bitton wrote:
> > > On 02/10/2023 11:11, Andrzej Hajda wrote:
> > > >
> > > > On 01.10.2023 10:50, Ofir Bitton wrote:
> > > > > On 27/09/2023 10:46, Andrzej Hajda wrote:
> > > > > > Driver initiated functional reset (FLR) is the highest level of reset
> > > > > > that we can trigger from within the driver. In contrast to PCI FLR it
> > > > > > doesn't require re-enumeration of PCI BAR. It can be useful in case
> > > > > > GT fails to reset. It is also the only way to trigger GSC reset from
> > > > > > the driver and can be used in future addition of GSC support.
> > > > > >
> > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
> > > > > > ---
> > > > > > Hi all,
> > > > > >
> > > > > > This is initial attempt to implement Driver-FLR in xe. Implementation
> > > > > > is copied from i915. I've skipped comments, but I can re-add them if
> > > > > > requested.
> > > > > > In i915 Driver-FLR is performed on driver unload if GSC was used.
> > > > > > In xe GSC is not yet supported, but I have added trigger to perform FLR
> > > > > > in case GT fails to reset, I guess this could be proper use-case and
> > > > > > of course this is prerequisite to GSC support.
> > > > > >
> > > > > > Since this is my 1st patch to xe please be merciful in comments :)
> > > > > >
> > > > > > Regards
> > > > > > Andrzej
> > > > > > ---
> > > > > > drivers/gpu/drm/xe/regs/xe_gt_regs.h | 6 +++++
> > > > > > drivers/gpu/drm/xe/xe_gt.c | 2 ++
> > > > > > drivers/gpu/drm/xe/xe_gt_types.h | 3 +++
> > > > > > drivers/gpu/drm/xe/xe_mmio.c | 38
> > > > > > ++++++++++++++++++++++++++++
> > > > > > 4 files changed, 49 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> > > > > > b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> > > > > > index e13fbbdf6929ea..769a8fd005931a 100644
> > > > > > --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> > > > > > +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> > > > > > @@ -359,6 +359,12 @@
> > > > > > #define RCU_MODE XE_REG(0x14800,
> > > > > > XE_REG_OPTION_MASKED)
> > > > > > #define RCU_MODE_CCS_ENABLE REG_BIT(0)
> > > > > >
> > > > > > +#define GU_CNTL XE_REG(0x101010)
> > > > > > +#define LMEM_INIT REG_BIT(7)
> > > > > > +#define DRIVERFLR REG_BIT(31)
> > > > > > +#define GU_DEBUG XE_REG(0x101018)
> > > > > > +#define DRIVERFLR_STATUS REG_BIT(31)
> > > > > > +
> > > > > > #define FORCEWAKE_ACK_GT XE_REG(0x130044)
> > > > > > #define FORCEWAKE_KERNEL BIT(0)
> > > > > > #define FORCEWAKE_USER BIT(1)
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > > > > index 1aa44d4f9ac1c8..1e0b76000ed46b 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > > > @@ -604,6 +604,8 @@ static int gt_reset(struct xe_gt *gt)
> > > > > > xe_uevent_gt_reset_failure(to_pci_dev(gt_to_xe(gt)->drm.dev),
> > > > > > gt_to_tile(gt)->id, gt->info.id);
> > > > > >
> > > > > > + gt->needs_flr_on_fini = true;
> > > > > > +
> > > > > > return err;
> > > > > > }
> > > > > I think this should be platform dependent and not set to true by default.
> > > > Why? Do we have xe supported hw without FLR?
> > > > This is kind-of bigger hammer solution - if GT reset fails, lets try FLR.
> > > In future platforms this might not be straight forward, this is why I
> > > think it is preferable to be able to choose whether this feature is
> > > valid or not. We can always have bypass flags in innersource.
> > we could check if this is PVC case already and make the case for adding
> > a device flag already. Or maybe just a if () return; in the beggining
> > of the function when/where that is needed.
> >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> > > > > > b/drivers/gpu/drm/xe/xe_gt_types.h
> > > > > > index d4310be3e1e7c8..dc194e9656abd5 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > > > > > @@ -347,6 +347,9 @@ struct xe_gt {
> > > > > > /** @oob: bitmap with active OOB workaroudns */
> > > > > > unsigned long *oob;
> > > > > > } wa_active;
> > > > > > +
> > > > > > + /** @needs_flr_on_fini: requests functional reset on fini */
> > > > > > + bool needs_flr_on_fini;
> > > > > > };
> > > > > >
> > > > > > #endif
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > index 3ccc0af4430be4..593ecc98cdfe8a 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > @@ -4,6 +4,7 @@
> > > > > > */
> > > > > >
> > > > > > #include <linux/minmax.h>
> > > > > > +#include <linux/units.h>
> > > > > >
> > > > > > #include "xe_mmio.h"
> > > > > >
> > > > > > @@ -364,9 +365,46 @@ static void xe_mmio_probe_tiles(struct xe_device
> > > > > > *xe)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +static void driver_initiated_flr(struct xe_gt *gt)
> > > > > > +{
> > > > > > + const unsigned int flr_timeout = 3 * MICRO; /* specs recommend
> > > > > > a 3s wait */
> > > > > > + struct xe_device *xe = gt->tile->xe;
> > > > > > + int ret;
> > > > > > +
> > > > > > + drm_dbg(&xe->drm, "Triggering Driver-FLR\n");
> > > > > > +
> > > > > > + ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout,
> > > > > > NULL, false);
> > > > > > + if (ret) {
> > > > > > + drm_err(&xe->drm, "Driver-FLR-prepare wait for ready
> > > > > > failed! %d\n", ret);
> > > > > > + return;
> > > > > > + }
> > > > > > + xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> > > > > > + xe_mmio_rmw32(gt, GU_CNTL, 0, DRIVERFLR);
> > > > > > + ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout,
> > > > > > NULL, false);
> > > > > > + if (ret) {
> > > > > > + drm_err(&xe->drm, "Driver-FLR-teardown wait completion
> > > > > > failed! %d\n", ret);
> > > > > > + return;
> > > > > > + }
> > > > > > + ret = xe_mmio_wait32(gt, GU_DEBUG, DRIVERFLR_STATUS,
> > > > > > DRIVERFLR_STATUS,
> > > > > > + flr_timeout, NULL, false);
> > > > > > + if (ret) {
> > > > > > + drm_err(&xe->drm, "Driver-FLR-reinit wait completion
> > > > > > failed! %d\n", ret);
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + /* Clear sticky completion status */
> > > > > > + xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> > > > > > +}
> > > > > > +
> > > > > > static void mmio_fini(struct drm_device *drm, void *arg)
> > > > > > {
> > > > > > struct xe_device *xe = arg;
> > > > > > + struct xe_gt *gt;
> > > > > > + u8 id;
> > > > > > +
> > > > > > + for_each_gt(gt, xe, id)
> > > > > > + if (gt->needs_flr_on_fini)
> > > > > > + driver_initiated_flr(gt);
> > > > > >
> > > > > > pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > > > > if (xe->mem.vram.mapping)
> > > > > Not sure mmio_fini is the best place to put this, as this is not mmio
> > > > > related.
> > > >
> > > > The reason I have put it there, is that it should be the last action
> > > > before unmapping mmio. What place would you suggest then?
> > > maybe add something like:
> > > 'drmm_add_action_or_reset(&xe->drm, flr, xe);' under 'xe_device_probe'
> > > after 'mmio_init'.
> > yeap, I also had my doubts about the placement. Why not simply calling this
> > function or a work-queue calling this function from the place where the
> > gt_reset failed?
> >
> > hmmm... and maybe not even a work-queue but a direct call is better, since
> > on that point we are surely serialized and no execution attempt happening
> > in parallel.
>
> FLR is extremely destructive and impacts domains outside of GT, including
> wiping LMEM and resetting display and gunit. To trigger an FLR at runtime
> from the GT reset code we'd have to save/restore LMEM (potentially via the
> CPU, because if we're doing an FLR the copy engine might be dead) and do a
> full HW re-init afterwards. By adding it at driver remove time we can ensure
> that we leave the HW in a clean working state (GSC in particular can only be
> stopped by FLR, so if we don't trigger one it'll keep executing and
> potentially accessing memory after the driver has been removed), while at
> the same time we won't have to worry about re-initializing the HW because
> that'll be taken care of on driver reload.
>
> Just to be clear, I have nothing against implementing runtime FLR, but given
> its complexity it should probably be a follow up step after we're added the
> much simpler FLR on driver unload.
very good point.
But I still agree with Ofir on the placement of the call.
>
> Daniele
>
> >
> > > --
> > > Ofir
> > >
> > > > Regards
> > > > Andrzej
> > > >
> > > >
> > > > > --
> > > > > Ofir
> > > > >
> > > > >
>
More information about the Intel-xe
mailing list