[Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset

Daniele Ceraolo Spurio daniele.ceraolospurio at intel.com
Thu Oct 5 13:50:47 UTC 2023



On 10/4/2023 10:19 AM, Rodrigo Vivi wrote:
> On Mon, Oct 02, 2023 at 12:22:19PM +0000, Ofir Bitton wrote:
>> On 02/10/2023 11:11, Andrzej Hajda wrote:
>>>
>>> On 01.10.2023 10:50, Ofir Bitton wrote:
>>>> On 27/09/2023 10:46, Andrzej Hajda wrote:
>>>>> Driver initiated functional reset (FLR) is the highest level of reset
>>>>> that we can trigger from within the driver. In contrast to PCI FLR it
>>>>> doesn't require re-enumeration of PCI BAR. It can be useful in case
>>>>> GT fails to reset. It is also the only way to trigger GSC reset from
>>>>> the driver and can be used in future addition of GSC support.
>>>>>
>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>>> ---
>>>>> Hi all,
>>>>>
>>>>> This is initial attempt to implement Driver-FLR in xe. Implementation
>>>>> is copied from i915. I've skipped comments, but I can re-add them if
>>>>> requested.
>>>>> In i915 Driver-FLR is performed on driver unload if GSC was used.
>>>>> In xe GSC is not yet supported, but I have added trigger to perform FLR
>>>>> in case GT fails to reset, I guess this could be proper use-case and
>>>>> of course this is prerequisite to GSC support.
>>>>>
>>>>> Since this is my 1st patch to xe please be merciful in comments :)
>>>>>
>>>>> Regards
>>>>> Andrzej
>>>>> ---
>>>>>     drivers/gpu/drm/xe/regs/xe_gt_regs.h |  6 +++++
>>>>>     drivers/gpu/drm/xe/xe_gt.c           |  2 ++
>>>>>     drivers/gpu/drm/xe/xe_gt_types.h     |  3 +++
>>>>>     drivers/gpu/drm/xe/xe_mmio.c         | 38
>>>>> ++++++++++++++++++++++++++++
>>>>>     4 files changed, 49 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
>>>>> b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
>>>>> index e13fbbdf6929ea..769a8fd005931a 100644
>>>>> --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
>>>>> +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
>>>>> @@ -359,6 +359,12 @@
>>>>>     #define RCU_MODE                            XE_REG(0x14800,
>>>>> XE_REG_OPTION_MASKED)
>>>>>     #define   RCU_MODE_CCS_ENABLE                       REG_BIT(0)
>>>>>
>>>>> +#define GU_CNTL                                      XE_REG(0x101010)
>>>>> +#define   LMEM_INIT                          REG_BIT(7)
>>>>> +#define   DRIVERFLR                          REG_BIT(31)
>>>>> +#define GU_DEBUG                             XE_REG(0x101018)
>>>>> +#define   DRIVERFLR_STATUS                   REG_BIT(31)
>>>>> +
>>>>>     #define FORCEWAKE_ACK_GT                    XE_REG(0x130044)
>>>>>     #define   FORCEWAKE_KERNEL                  BIT(0)
>>>>>     #define   FORCEWAKE_USER                    BIT(1)
>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>>>>> index 1aa44d4f9ac1c8..1e0b76000ed46b 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_gt.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c
>>>>> @@ -604,6 +604,8 @@ static int gt_reset(struct xe_gt *gt)
>>>>>         xe_uevent_gt_reset_failure(to_pci_dev(gt_to_xe(gt)->drm.dev),
>>>>>                                    gt_to_tile(gt)->id, gt->info.id);
>>>>>
>>>>> +     gt->needs_flr_on_fini = true;
>>>>> +
>>>>>         return err;
>>>>>     }
>>>> I think this should be platform dependent and not set to true by default.
>>> Why? Do we have xe supported hw without FLR?
>>> This is kind-of bigger hammer solution - if GT reset fails, lets try FLR.
>> In future platforms this might not be straight forward, this is why I
>> think it is preferable to be able to choose whether this feature is
>> valid or not. We can always have bypass flags in innersource.
> we could check if this is PVC case already and make the case for adding
> a device flag already. Or maybe just a if () return; in the beggining
> of the function when/where that is needed.
>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
>>>>> b/drivers/gpu/drm/xe/xe_gt_types.h
>>>>> index d4310be3e1e7c8..dc194e9656abd5 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_gt_types.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
>>>>> @@ -347,6 +347,9 @@ struct xe_gt {
>>>>>                 /** @oob: bitmap with active OOB workaroudns */
>>>>>                 unsigned long *oob;
>>>>>         } wa_active;
>>>>> +
>>>>> +     /** @needs_flr_on_fini: requests functional reset on fini */
>>>>> +     bool needs_flr_on_fini;
>>>>>     };
>>>>>
>>>>>     #endif
>>>>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
>>>>> index 3ccc0af4430be4..593ecc98cdfe8a 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_mmio.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
>>>>> @@ -4,6 +4,7 @@
>>>>>      */
>>>>>
>>>>>     #include <linux/minmax.h>
>>>>> +#include <linux/units.h>
>>>>>
>>>>>     #include "xe_mmio.h"
>>>>>
>>>>> @@ -364,9 +365,46 @@ static void xe_mmio_probe_tiles(struct xe_device
>>>>> *xe)
>>>>>         }
>>>>>     }
>>>>>
>>>>> +static void driver_initiated_flr(struct xe_gt *gt)
>>>>> +{
>>>>> +     const unsigned int flr_timeout = 3 * MICRO; /* specs recommend
>>>>> a 3s wait */
>>>>> +     struct xe_device *xe = gt->tile->xe;
>>>>> +     int ret;
>>>>> +
>>>>> +     drm_dbg(&xe->drm, "Triggering Driver-FLR\n");
>>>>> +
>>>>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout,
>>>>> NULL, false);
>>>>> +     if (ret) {
>>>>> +             drm_err(&xe->drm, "Driver-FLR-prepare wait for ready
>>>>> failed! %d\n", ret);
>>>>> +             return;
>>>>> +     }
>>>>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
>>>>> +     xe_mmio_rmw32(gt, GU_CNTL, 0, DRIVERFLR);
>>>>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout,
>>>>> NULL, false);
>>>>> +     if (ret) {
>>>>> +             drm_err(&xe->drm, "Driver-FLR-teardown wait completion
>>>>> failed! %d\n", ret);
>>>>> +             return;
>>>>> +     }
>>>>> +     ret = xe_mmio_wait32(gt, GU_DEBUG, DRIVERFLR_STATUS,
>>>>> DRIVERFLR_STATUS,
>>>>> +                          flr_timeout, NULL, false);
>>>>> +     if (ret) {
>>>>> +             drm_err(&xe->drm, "Driver-FLR-reinit wait completion
>>>>> failed! %d\n", ret);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>> +     /* Clear sticky completion status */
>>>>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
>>>>> +}
>>>>> +
>>>>>     static void mmio_fini(struct drm_device *drm, void *arg)
>>>>>     {
>>>>>         struct xe_device *xe = arg;
>>>>> +     struct xe_gt *gt;
>>>>> +     u8 id;
>>>>> +
>>>>> +     for_each_gt(gt, xe, id)
>>>>> +             if (gt->needs_flr_on_fini)
>>>>> +                     driver_initiated_flr(gt);
>>>>>
>>>>>         pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>>>>>         if (xe->mem.vram.mapping)
>>>> Not sure mmio_fini is the best place to put this, as this is not mmio
>>>> related.
>>>
>>> The reason I have put it there, is that it should be the last action
>>> before unmapping mmio. What place would you suggest then?
>> maybe add something like:
>> 'drmm_add_action_or_reset(&xe->drm, flr, xe);' under 'xe_device_probe'
>> after 'mmio_init'.
> yeap, I also had my doubts about the placement. Why not simply calling this
> function or a work-queue calling this function from the place where the
> gt_reset failed?
>
> hmmm... and maybe not even a work-queue but a direct call is better, since
> on that point we are surely serialized and no execution attempt happening
> in parallel.

FLR is extremely destructive and impacts domains outside of GT, 
including wiping LMEM and resetting display and gunit. To trigger an FLR 
at runtime from the GT reset code we'd have to save/restore LMEM 
(potentially via the CPU, because if we're doing an FLR the copy engine 
might be dead) and do a full HW re-init afterwards. By adding it at 
driver remove time we can ensure that we leave the HW in a clean working 
state (GSC in particular can only be stopped by FLR, so if we don't 
trigger one it'll keep executing and potentially accessing memory after 
the driver has been removed), while at the same time we won't have to 
worry about re-initializing the HW because that'll be taken care of on 
driver reload.

Just to be clear, I have nothing against implementing runtime FLR, but 
given its complexity it should probably be a follow up step after we're 
added the much simpler FLR on driver unload.

Daniele

>
>> --
>> Ofir
>>
>>> Regards
>>> Andrzej
>>>
>>>
>>>> -- 
>>>> Ofir
>>>>
>>>>



More information about the Intel-xe mailing list