[PATCH v2 1/2] drm/xe: Move survivability back to xe
Lucas De Marchi
lucas.demarchi at intel.com
Tue Mar 11 21:09:44 UTC 2025
On Tue, Mar 11, 2025 at 04:55:16PM -0400, Rodrigo Vivi wrote:
>On Tue, Mar 11, 2025 at 11:34:55AM -0700, Lucas De Marchi wrote:
>> Commit d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
>> moved the survivability handling to be done entirely in the xe_pci
>> layer. However there are some issues with that approach:
>>
>> 1) Survivability mode needs at least the mmio initialized, otherwise it
>> can't really read a register to decide if it should enter that state
>> 2) SR-IOV mode should be initialized, otherwise it's not possible to
>> check if it's VF
>>
>> Besides, as pointed by Riana the check for
>> xe_survivability_mode_enable() was wrong in xe_pci_probe() since it's
>> not a bool return.
>>
>> Fix that by moving the initialization to be entirely in the xe_device
>> layer, with the correct dependencies handled. The xe_pci now only checks
>> for "is it enabled?", like it's doing in
>> xe_pci_suspend()/xe_pci_remove(), etc.
>>
>> Cc: Riana Tauro <riana.tauro at intel.com>
>> Fixes: d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci")
>> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_device.c | 14 +++++++++++++-
>> drivers/gpu/drm/xe/xe_pci.c | 16 +++++++---------
>> drivers/gpu/drm/xe/xe_survivability_mode.c | 14 +++++++++-----
>> 3 files changed, 29 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 5d79b439dd625..023290e5be392 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -53,6 +53,7 @@
>> #include "xe_pxp.h"
>> #include "xe_query.h"
>> #include "xe_shrinker.h"
>> +#include "xe_survivability_mode.h"
>> #include "xe_sriov.h"
>> #include "xe_tile.h"
>> #include "xe_ttm_stolen_mgr.h"
>> @@ -705,8 +706,19 @@ int xe_device_probe_early(struct xe_device *xe)
>> sriov_update_device_info(xe);
>>
>> err = xe_pcode_probe_early(xe);
>> - if (err)
>> + if (err) {
>> + int save_err = err;
>> +
>> + /*
>> + * Try to leave device in survivability mode if device is
>> + * capable
>> + */
>> + err = xe_survivability_mode_enable(xe);
>> + if (!err || err == -ENOTRECOVERABLE)
>> + return save_err;
>> +
>> return err;
>> + }
>>
>> err = wait_for_lmem_ready(xe);
>
>I'm not 100% sure if we can skip this...
>Before this patch if lmem_ready failed we would also check for the
>survivability bit...
which was not the case before the breakage in
commit d40f275d96e8 ("drm/xe: Move survivability entirely to xe_pci").
This restores that behavior, without the odd split of "init in the inner
layer (xe_device) / enable in the outer layer (xe_pci)".
>
>> if (err)
>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>> index 4d982a5a4ffd9..6fea3091e2348 100644
>> --- a/drivers/gpu/drm/xe/xe_pci.c
>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>> @@ -808,16 +808,14 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>> return err;
>>
>> err = xe_device_probe_early(xe);
>> -
>> - /*
>> - * In Boot Survivability mode, no drm card is exposed and driver is
>> - * loaded with bare minimum to allow for firmware to be flashed through
>> - * mei. If early probe fails, check if survivability mode is flagged by
>> - * HW to be enabled. In that case enable it and return success.
>> - */
>> if (err) {
>> - if (xe_survivability_mode_required(xe) &&
>> - xe_survivability_mode_enable(xe))
>> + /*
>> + * In Boot Survivability mode, no drm card is exposed and driver
>> + * is loaded with bare minimum to allow for firmware to be
>> + * flashed through mei. If early probe failed, but it managed to
>> + * enable survivability mode, return success.
>> + */
>> + if (xe_survivability_mode_is_enabled(xe))
>> return 0;
>>
>> return err;
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index d939ce70e6fa8..153b8d598a270 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -178,15 +178,16 @@ bool xe_survivability_mode_is_enabled(struct xe_device *xe)
>> return xe->survivability.mode;
>> }
>>
>> -/**
>> - * xe_survivability_mode_required - checks if survivability mode is required
>> +/*
>> + * xe_survivability_mode_capable - checks if it's possible to enable
>> + * survivability mode
>
>that bit doesn't indicate capabitility, but if mode is required by FW
>after identifying if something went wrong like done checks and lmem
>link trainings...
You are saying the HW **requires** to enter survivability mode and yet
there are multiple checks for "is the HW even capable of doing that" by
means of platform and IP version. Combining all of them to me means
more a capability: platform supports it, and during initialization it
**signaled that capability** rather than required anything.
maybe neither capable() nor required()... but requested()?
>
>> * @xe: xe device instance
>> *
>> - * This function reads the boot status from Pcode
>> + * This function reads the boot status from Pcode.
>> *
>> - * Return: true if boot status indicates failure, false otherwise
>> + * Return: true if boot status indicates failure, false otherwise.
>> */
>> -bool xe_survivability_mode_required(struct xe_device *xe)
>> +static bool xe_survivability_mode_capable(struct xe_device *xe)
>> {
>> struct xe_survivability *survivability = &xe->survivability;
>> struct xe_mmio *mmio = xe_root_tile_mmio(xe);
>> @@ -216,6 +217,9 @@ int xe_survivability_mode_enable(struct xe_device *xe)
>> struct xe_survivability_info *info;
>> struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>
>> + if (!xe_survivability_mode_capable(xe))
>> + return -ENOTRECOVERABLE;
>
>We might want to enter survivability regardless of that bit as well.
>With that configfs work that Riana is doing for instance, you set
>that, but you disregard if that was requested by PCODE in that
>'required' bit.
we'd add the check inside xe_survivability_mode_capable() to account for
that, still checking for platform. For my tests I was just doing a
return .... || true;
in that function, to be replaced by the configfs check.
Lucas De Marchi
>
>> +
>> survivability->size = MAX_SCRATCH_MMIO;
>>
>> info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>>
>> --
>> 2.48.1
>>
More information about the Intel-xe
mailing list