[PATCH 1/2] RFC drm/xe: Add configfs to enable survivability mode

Riana Tauro riana.tauro at intel.com
Fri Mar 14 07:24:21 UTC 2025



On 3/14/2025 1:22 AM, Lucas De Marchi wrote:
> On Thu, Mar 13, 2025 at 10:52:36AM -0400, Rodrigo Vivi wrote:
>> On Thu, Mar 13, 2025 at 11:48:41AM +0530, Riana Tauro wrote:
>>> Hi Aravind
>>>
>>> On 3/10/2025 12:44 PM, Aravind Iddamsetty wrote:
>>> >
>>> > On 07-03-2025 19:54, Riana Tauro wrote:
>>> >
>>> > Hi Riana,
>>> >
>>> > I do think we can achieve the same functionality with module param and
>>> > we needn't reload the driver
>>> > if we are doing unbind, Since the driver will be loaded event after
>>> > unbind we can modify the module param
>>>
>>> tried this.  We can modify the module param using sysfs and bind 
>>> similar to
>>> configfs.
>>>
>>> If we want to configure any other attributes or move existing module 
>>> params
>>> to device specific then adding configfs seems better.
>>>
>>> @Rodrigo thoughts?
>>
>> Yeap, if we make the param write-able and document that it is only 
>> checked
>> at probe we could indeed use the flow
>>
>> echo <bdf> > /sys/../xe/unbind
>> echo <bdf> > /sys/../xe/param/survivability_mode
>> echo <bdf> > /sys/../xe/bind
>>
>> and accomplish the same.
>>
>> on the good side:
>>
>> So, this is the future prof case. Because if we start seeing cases where
>> the device fails at boot without the FW request for the survivability 
>> mode
>> we might be forced to have a parameter anyway. :/
> 
> With parameters we have these possibilities:
> 
> 1) driver is already loaded:
> 
>      echo <bdf> > /sys/../xe/unbind
>      echo <bdf> > /sys/module/xe/parameters/survivability_mode
>      echo <bdf> > /sys/../xe/bind
> 
> 2) driver is not loaded
> 
>      # put all device in survivability mode
>      modprobe xe survivability_mode=
> 
> 3) driver not loaded, and only one device in this mode
> 
>      echo 0 > /sys/bus/pci/drivers_autoprobe
>      modprobe xe
>      echo <bdf> > /sys/module/xe/parameters/survivability_mode
>      echo <bdf> > /sys/../xe/bind
> 
> With configfs we have these possibilities:
> 
> a) driver is already loaded:
> 
>      echo <bdf> > /sys/../xe/unbind
>      mkdir /sys/kernel/config/xe/0000:03:00.0
>      echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode
>      echo <bdf> > /sys/../xe/bind
> 
> b) driver is not loaded
> 
>      # no equivalent option for "all devices in survivability mode"
>      # other than repeating (c) for each since the option is per
>      # device not per module
> 
> c) driver not loaded, and only one device in this mode
> 
>      echo 0 > /sys/bus/pci/drivers_autoprobe
>      modprobe xe
>      mkdir /sys/kernel/config/xe/0000:03:00.0
>      echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode
>      echo <bdf> > /sys/../xe/bind
> 
>>
>> on the bad side:
>>
>> But we were actually aiming to reduce the parameters that we have
>> because that was getting indiscriminately used and even recommended by 
>> some
>> distros' doc and wikis.
>>
>> Another problem with the write-able param is that people might expect to
>>
>> echo <bdf> > /sys/../xe/param/survivability_mode
>>
>> and immediately get the device in the survivability_mode for that device.
>> Then we are going to get questions and bug reports that this is not 
>> working.
>>
>> But well, perhaps documenting the flow properly in the param description
>> might solve this concern.
>>
>> Lucas, Thomas, thoughts?
> 
> Let me add here that we will have to extend configuration to more
> things. The one I will work once this configfs lands (or even before it)
> is to allow developers to add WAs (or we could say
> register-save-restore) dynamically so it's much easier to
> try/error/recover. For that my plan is something like this:
> 
>      echo 0 > /sys/bus/pci/drivers_autoprobe
>      modprobe xe
>      mkdir /sys/kernel/config/xe/0000:03:00.0
>      cat gt-wa.txt > /sys/kernel/config/xe/0000:03:00.0/gt_wa
>      cat engine-wa.txt > /sys/kernel/config/xe/0000:03:00.0/engine_wa
>      echo <bdf> > /sys/../xe/bind
> 
> And cleanup my kmod "frontend" protoype to make it simpler[1]:
> 
> i) no additional configuration:
> 
>      kmod bind --device 0000:03:00.0 xe
> 
> ii) with configuration stored somewhere:
> 
>      kmod bind --device 0000:03:00.0 \
>          --config <path-to-config-dir|path-to-config-tarball> \
>          xe
> 
> iii) with configuration in the command line:
> 
>      kmod bind --device 0000:03:00.0 \
>          --config-kv survivability_mode:1 \
>          xe
> 
> So, I think the configfs approach is more future proof. For a simple
> switch/panic-button like survivability_mode I wouldn't oppose to have it
> as a module parameter though. Maybe make the module param read-only
> and if per-device support is desired, then handle only via configfs?
yeah survivability mode is per device. Since there is a plan to extend 
configfs for other attributes will go ahead with configfs.

Thank you all for the inputs.

Thanks
Riana

> 
> Lucas De Marchi
> 
> [1] Typing here without much thought on the actual interface - my
>      prototype hardcodes it for pci, but the kmod command would
>      probably have to consider other buses too.
> 
>>
>>>
>>> Thanks
>>> Riana
>>> > and once we bind the device back it can check if the BDF belongs to 
>>> this
>>> > driver instance and configure the mode accordingly.
>>> >
>>> > Thanks,
>>> > Aravind.
>>> > > Registers a configfs subsystem called 'xe' to userspace. The user 
>>> can
>>> > > use this to modify exposed attributes.
>>> > >
>>> > > Add survivability mode attribute (config/xe/survivability_mode) 
>>> to the
>>> > > subsystem to allow the user to specify the card that should enter
>>> > > survivability mode.
>>> > >
>>> > > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>>> > > ---
>>> > >   drivers/gpu/drm/xe/Makefile      |  1 +
>>> > >   drivers/gpu/drm/xe/xe_configfs.c | 95 +++++++++++++++++++++++++ 
>>> +++++++
>>> > >   drivers/gpu/drm/xe/xe_configfs.h | 12 ++++
>>> > >   drivers/gpu/drm/xe/xe_module.c   |  5 ++
>>> > >   drivers/gpu/drm/xe/xe_module.h   |  1 +
>>> > >   5 files changed, 114 insertions(+)
>>> > >   create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
>>> > >   create mode 100644 drivers/gpu/drm/xe/xe_configfs.h
>>> > >
>>> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/ 
>>> Makefile
>>> > > index 9699b08585f7..3f8c87292cee 100644
>>> > > --- a/drivers/gpu/drm/xe/Makefile
>>> > > +++ b/drivers/gpu/drm/xe/Makefile
>>> > > @@ -28,6 +28,7 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/ 
>>> %_wa_oob.h: $(obj)/xe_gen_wa_oob \
>>> > >   xe-y += xe_bb.o \
>>> > >       xe_bo.o \
>>> > >       xe_bo_evict.o \
>>> > > +    xe_configfs.o \
>>> > >       xe_devcoredump.o \
>>> > >       xe_device.o \
>>> > >       xe_device_sysfs.o \
>>> > > diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/ 
>>> xe/xe_configfs.c
>>> > > new file mode 100644
>>> > > index 000000000000..8c5f248e466d
>>> > > --- /dev/null
>>> > > +++ b/drivers/gpu/drm/xe/xe_configfs.c
>>> > > @@ -0,0 +1,95 @@
>>> > > +// SPDX-License-Identifier: MIT
>>> > > +/*
>>> > > + * Copyright © 2025 Intel Corporation
>>> > > + */
>>> > > +
>>> > > +#include <linux/configfs.h>
>>> > > +#include <linux/init.h>
>>> > > +#include <linux/module.h>
>>> > > +
>>> > > +#include "xe_configfs.h"
>>> > > +#include "xe_module.h"
>>> > > +
>>> > > +/**
>>> > > + * DOC: Xe Configfs
>>> > > + *
>>> > > + * XE KMD registers a configfs subsystem called 'xe'to userspace 
>>> that allows users to modify
>>> > > + * the exposed attributes.
>>> > > + *
>>> > > + * Attributes:
>>> > > + *
>>> > > + * config/xe/survivability_mode : Write only attribute that 
>>> allows user to specify the PCI address
>>> > > + *                  of the card that has to enter survivability 
>>> mode
>>> > > + */
>>> > > +
>>> > > +void xe_configfs_clear_survivability_mode(void)
>>> > > +{
>>> > > +    kfree(xe_modparam.survivability_mode);
>>> > > +    xe_modparam.survivability_mode = NULL;
>>> > > +}
>>> > > +
>>> > > +static ssize_t survivability_mode_store(struct config_item 
>>> *item, const char *page, size_t len)
>>> > > +{
>>> > > +    char *survivability_mode;
>>> > > +    int ret;
>>> > > +    unsigned int domain, bus, slot, function;
>>> > > +
>>> > > +    ret = sscanf(page, "%04x:%02x:%02x.%x", &domain, &bus, 
>>> &slot, &function);
>>> > > +    if (ret != 4)
>>> > > +        return -EINVAL;
>>> > > +
>>> > > +    survivability_mode = kstrdup(page, GFP_KERNEL);
>>> > > +    if (!survivability_mode)
>>> > > +        return -ENOMEM;
>>> > > +
>>> > > +    xe_configfs_clear_survivability_mode();
>>> > > +    xe_modparam.survivability_mode = survivability_mode;
>>> > > +
>>> > > +    return len;
>>> > > +}
>>> > > +
>>> > > +CONFIGFS_ATTR_WO(, survivability_mode);
>>> > > +
>>> > > +static struct configfs_attribute *xe_configfs_attrs[] = {
>>> > > +    &attr_survivability_mode,
>>> > > +    NULL,
>>> > > +};
>>> > > +
>>> > > +static const struct config_item_type xe_config_type = {
>>> > > +    .ct_attrs    = xe_configfs_attrs,
>>> > > +    .ct_owner    = THIS_MODULE,
>>> > > +};
>>> > > +
>>> > > +static struct configfs_subsystem xe_config_subsys = {
>>> > > +    .su_group = {
>>> > > +        .cg_item = {
>>> > > +            .ci_namebuf = "xe",
>>> > > +            .ci_type = &xe_config_type,
>>> > > +        },
>>> > > +    },
>>> > > +};
>>> > > +
>>> > > +int __init xe_configfs_init(void)
>>> > > +{
>>> > > +    int ret;
>>> > > +
>>> > > +    config_group_init(&xe_config_subsys.su_group);
>>> > > +    mutex_init(&xe_config_subsys.su_mutex);
>>> > > +    ret = configfs_register_subsystem(&xe_config_subsys);
>>> > > +    if (ret) {
>>> > > +        pr_err("Error %d while registering subsystem %s\n",
>>> > > +               ret, xe_config_subsys.su_group.cg_item.ci_namebuf);
>>> > > +        mutex_destroy(&xe_config_subsys.su_mutex);
>>> > > +        return ret;
>>> > > +    }
>>> > > +
>>> > > +    return 0;
>>> > > +}
>>> > > +
>>> > > +void __exit xe_configfs_exit(void)
>>> > > +{
>>> > > +    xe_configfs_clear_survivability_mode();
>>> > > +    configfs_unregister_subsystem(&xe_config_subsys);
>>> > > +    mutex_destroy(&xe_config_subsys.su_mutex);
>>> > > +}
>>> > > +
>>> > > diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/ 
>>> xe/xe_configfs.h
>>> > > new file mode 100644
>>> > > index 000000000000..491629a2ca53
>>> > > --- /dev/null
>>> > > +++ b/drivers/gpu/drm/xe/xe_configfs.h
>>> > > @@ -0,0 +1,12 @@
>>> > > +/* SPDX-License-Identifier: MIT */
>>> > > +/*
>>> > > + * Copyright © 2025 Intel Corporation
>>> > > + */
>>> > > +#ifndef _XE_CONFIGFS_H_
>>> > > +#define _XE_CONFIGFS_H_
>>> > > +
>>> > > +int xe_configfs_init(void);
>>> > > +void xe_configfs_exit(void);
>>> > > +void xe_configfs_clear_survivability_mode(void);
>>> > > +
>>> > > +#endif
>>> > > diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/ 
>>> xe_module.c
>>> > > index 475acdba2b55..15b3cf22193c 100644
>>> > > --- a/drivers/gpu/drm/xe/xe_module.c
>>> > > +++ b/drivers/gpu/drm/xe/xe_module.c
>>> > > @@ -11,6 +11,7 @@
>>> > >   #include <drm/drm_module.h>
>>> > >   #include "xe_drv.h"
>>> > > +#include "xe_configfs.h"
>>> > >   #include "xe_hw_fence.h"
>>> > >   #include "xe_pci.h"
>>> > >   #include "xe_pm.h"
>>> > > @@ -91,6 +92,10 @@ static const struct init_funcs init_funcs[] = {
>>> > >       {
>>> > >           .init = xe_check_nomodeset,
>>> > >       },
>>> > > +    {
>>> > > +        .init = xe_configfs_init,
>>> > > +        .exit = xe_configfs_exit,
>>> > > +    },
>>> > >       {
>>> > >           .init = xe_hw_fence_module_init,
>>> > >           .exit = xe_hw_fence_module_exit,
>>> > > diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/ 
>>> xe_module.h
>>> > > index 84339e509c80..c238dbee6bc7 100644
>>> > > --- a/drivers/gpu/drm/xe/xe_module.h
>>> > > +++ b/drivers/gpu/drm/xe/xe_module.h
>>> > > @@ -24,6 +24,7 @@ struct xe_modparam {
>>> > >   #endif
>>> > >       int wedged_mode;
>>> > >       u32 svm_notifier_size;
>>> > > +    char *survivability_mode;
>>> > >   };
>>> > >   extern struct xe_modparam xe_modparam;
>>>



More information about the Intel-xe mailing list