[PATCH 1/2] RFC drm/xe: Add configfs to enable survivability mode

Lucas De Marchi lucas.demarchi at intel.com
Mon Mar 10 17:01:55 UTC 2025


On Mon, Mar 10, 2025 at 12:40:39PM -0400, Rodrigo Vivi wrote:
>On Mon, Mar 10, 2025 at 07:53:27PM +0530, Riana Tauro wrote:
>> Hi Lucas
>>
>> On 3/10/2025 7:01 PM, Lucas De Marchi wrote:
>> > On Mon, Mar 10, 2025 at 11:01:50AM +0530, Riana Tauro wrote:
>> > > Hi Lucas
>> > >
>> > > On 3/7/2025 8:46 PM, Lucas De Marchi wrote:
>> > > > On Fri, Mar 07, 2025 at 07:54:44PM +0530, Riana Tauro wrote:
>> > > > > Registers a configfs subsystem called 'xe' to userspace. The user can
>> > > > > use this to modify exposed attributes.
>> > > > >
>> > > > > Add survivability mode attribute (config/xe/survivability_mode) to the
>> > > > > subsystem to allow the user to specify the card that should enter
>> > > > > survivability mode.
>> > > > >
>> > > > > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>> > > > > ---
>> > > > > drivers/gpu/drm/xe/Makefile      |  1 +
>> > > > > drivers/gpu/drm/xe/xe_configfs.c | 95 ++++++++++++++++++++++++++++++++
>> > > > > drivers/gpu/drm/xe/xe_configfs.h | 12 ++++
>> > > > > drivers/gpu/drm/xe/xe_module.c   |  5 ++
>> > > > > drivers/gpu/drm/xe/xe_module.h   |  1 +
>> > > > > 5 files changed, 114 insertions(+)
>> > > > > create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
>> > > > > create mode 100644 drivers/gpu/drm/xe/xe_configfs.h
>> > > > >
>> > > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> > > > > index 9699b08585f7..3f8c87292cee 100644
>> > > > > --- a/drivers/gpu/drm/xe/Makefile
>> > > > > +++ b/drivers/gpu/drm/xe/Makefile
>> > > > > @@ -28,6 +28,7 @@ $(obj)/generated/%_wa_oob.c
>> > > > > $(obj)/generated/ %_wa_oob.h: $(obj)/xe_gen_wa_oob \
>> > > > > xe-y += xe_bb.o \
>> > > > >     xe_bo.o \
>> > > > >     xe_bo_evict.o \
>> > > > > +    xe_configfs.o \
>> > > > >     xe_devcoredump.o \
>> > > > >     xe_device.o \
>> > > > >     xe_device_sysfs.o \
>> > > > > diff --git a/drivers/gpu/drm/xe/xe_configfs.c
>> > > > > b/drivers/gpu/drm/xe/ xe_configfs.c
>> > > > > new file mode 100644
>> > > > > index 000000000000..8c5f248e466d
>> > > > > --- /dev/null
>> > > > > +++ b/drivers/gpu/drm/xe/xe_configfs.c
>> > > > > @@ -0,0 +1,95 @@
>> > > > > +// SPDX-License-Identifier: MIT
>> > > > > +/*
>> > > > > + * Copyright © 2025 Intel Corporation
>> > > > > + */
>> > > > > +
>> > > > > +#include <linux/configfs.h>
>> > > > > +#include <linux/init.h>
>> > > > > +#include <linux/module.h>
>> > > > > +
>> > > > > +#include "xe_configfs.h"
>> > > > > +#include "xe_module.h"
>> > > > > +
>> > > > > +/**
>> > > > > + * DOC: Xe Configfs
>> > > > > + *
>> > > > > + * XE KMD registers a configfs subsystem called 'xe'to
>> > > > > userspace that allows users to modify
>> > > > > + * the exposed attributes.
>> > > > > + *
>> > > > > + * Attributes:
>> > > > > + *
>> > > > > + * config/xe/survivability_mode : Write only attribute that
>> > > > > allows user to specify the PCI address
>> > > > > + *                  of the card that has to enter survivability mode
>> > > >
>> > > > I think the desired interface is actually that the user mkdir the bdf in
>> > > > <configfs>/xe/. This populates the available config entries that
>> > > > the user
>> > > > writes to.
>> > >
>> > > Initial thought was mkdir bdf, but since it was a single attribute
>> > > and after a offline discussion with Rodrigo, did a simpler version
>> > > to get comments on the RFC patch and if configfs is okay to use
>> > > for survivability mode
>> >
>> > I disagree on the conclusion as then you are moving away from how
>> > configfs is commonly used and also making it harder to add new
>> > attributes in a uniform way.
>
>My bad, I'm sorry.
>I might have created the confusion here.
>
>> >
>> > >
>> > > For survivability mode, below procedure needs to be followed
>> > >
>> > > echo "0000:03:00.0" > /sys/bus/pci/drivers/xe/unbind
>> > > add bdf for survivability mode
>> > > echo "0000:03:00.0" > /sys/bus/pci/drivers/xe/bind
>> > >
>> > > After the device is removed directory has to be created, the bdf has
>> > > to be checked if it belongs to xe driver and then attrs populated.
>> > > Even i think mkdir is better in case other attrs have to be added in
>> > > future, but for unbind case the check of the driver might be tricky
>> > > .
>> >
>> > I think you are talking about the mkdir in the wrong place. It isn't
>> > related to unbind at all. The sequence you mentioned is just because we
>> > lazy and let pci auto probe the driver. Under the hood you are doing:
>> >
>> > 1) load the module
>> > 2) bind the driver
>> > 3) unbind the driver
>> > 4) bind the driver in survivability mode
>> >
>> > The sequence we should actually have is:
>> >
>> > 1) load the module
>> > 2) bind the driver in survivability mode
>
>No, this doesn't work for the use case we want it.
>
>The goal is to be able to only move one single device to the survivability
>mode at runtime without disrupting the work that is going on on other devices.
>So we cannot unload the module and remove the autoprobe.

where are you seeing the need to remove the module in the above
sequence?

See that step 1 is about loading  the module. *If* the module
was already loaded, then unbinding it would make sense, but it's not a
required step.

Lucas De Marchi

>
>>
>> This is what i meant.
>>
>> 1) load the module
>> 2) bind the driver
>> 3) unbind the driver
>>
>> 4)  mkdir /sys/kernel/config/xe/0000:03:00.0
>>    echo .... > /sys/kernel/kernel/config/xe/0000:03:00.0/...
>> 5) bind the driver in survivability mode
>>
>>
>> At step 4, when creating directory. The bdf needs to be validated, ie (it
>> should be pci_dev that xe will probe). i checked the struct pci_dev but
>> didn't find anything.
>> Will have to match the device id against supported pci device ids.
>
>I don't believe we need to run any validation. If admin creates a file/directory
>with invalid address it will be just ignored, no?!
>
>On the probe we only get the directory, file for the pci device we are probing.
>
>>
>> Otherwise, there might be wrong non functional bdf directories present based
>> on what user provides.
>>
>> Thanks
>> Riana
>> >
>> > You shouldn't create any directory when unbinding. It's the user who
>> > should create it. When **binding** the driver we should check what are
>> > the extra configuration available and adapt the probe according to that.
>> >
>> > For that we have to disable pci driver autoprobe and tweak the
>> > configfs settings:
>> >
>> > 1) echo 0 > /sys/bus/pci/drivers_autoprobe
>> >     modprobe xe
>> >
>> > 2) mkdir /sys/kernel/config/xe/0000:03:00.0
>> >     echo .... > /sys/kernel/kernel/config/xe/0000:03:00.0/...
>> >     echo "0000:03:00.0" > /sys/bus/pci/drivers/xe/bind
>> >
>> >
>> > Lucas De Marchi
>> >
>> > >
>> > > The attr is WO and any value can be written, only if the correct bdf
>> > > is added in the attr then it'll be used on probe to enter
>> > > survivability mode.The current patch only checks the format and does
>> > > not check if bdf belongs to xe.
>> > > >
>> > > > > + */
>> > > > > +
>> > > > > +void xe_configfs_clear_survivability_mode(void)
>> > > > > +{
>> > > > > +    kfree(xe_modparam.survivability_mode);
>> > > > > +    xe_modparam.survivability_mode = NULL;
>> > > > > +}
>> > > > > +
>> > > > > +static ssize_t survivability_mode_store(struct config_item
>> > > > > *item, const char *page, size_t len)
>> > > >
>> > > > once you handle the mkdir mentioned above, this should probably be
>> > > > a boolean attr like this:
>> > > >
>> > > >     drivers/nvme/target/configfs.c:CONFIGFS_ATTR(nvmet_,
>> > > > param_pi_enable);
>> > > >
>> > > > > +{
>> > > > > +    char *survivability_mode;
>> > > > > +    int ret;
>> > > > > +    unsigned int domain, bus, slot, function;
>> > > > > +
>> > > > > +    ret = sscanf(page, "%04x:%02x:%02x.%x", &domain, &bus,
>> > > > > &slot, &function);
>> > > > > +    if (ret != 4)
>> > > > > +        return -EINVAL;
>> > > > > +
>> > > > > +    survivability_mode = kstrdup(page, GFP_KERNEL);
>> > > > > +    if (!survivability_mode)
>> > > > > +        return -ENOMEM;
>> > > > > +
>> > > > > +    xe_configfs_clear_survivability_mode();
>> > > > > +    xe_modparam.survivability_mode = survivability_mode;
>> > > > > +
>> > > > > +    return len;
>> > > > > +}
>> > > > > +
>> > > > > +CONFIGFS_ATTR_WO(, survivability_mode);
>> > > > > +
>> > > > > +static struct configfs_attribute *xe_configfs_attrs[] = {
>> > > > > +    &attr_survivability_mode,
>> > > > > +    NULL,
>> > > > > +};
>> > > > > +
>> > > > > +static const struct config_item_type xe_config_type = {
>> > > > > +    .ct_attrs    = xe_configfs_attrs,
>> > > > > +    .ct_owner    = THIS_MODULE,
>> > > > > +};
>> > > > > +
>> > > > > +static struct configfs_subsystem xe_config_subsys = {
>> > > > > +    .su_group = {
>> > > > > +        .cg_item = {
>> > > > > +            .ci_namebuf = "xe",
>> > > > > +            .ci_type = &xe_config_type,
>> > > > > +        },
>> > > > > +    },
>> > > > > +};
>> > > > >
>> > > >
>> > > > so... it seems you are missing a configfs_group_operations.make_group.
>> > > >
>> > > > > +int __init xe_configfs_init(void)
>> > > > > +{
>> > > > > +    int ret;
>> > > > > +
>> > > > > +    config_group_init(&xe_config_subsys.su_group);
>> > > > > +    mutex_init(&xe_config_subsys.su_mutex);
>> > > >
>> > > > this mutex_init seems odd, but inline with other drivers
>> > > yeah it is not used anywhere but the sample and other drivers also
>> > > initialize it
>> > > >
>> > > > > +    ret = configfs_register_subsystem(&xe_config_subsys);
>> > > > > +    if (ret) {
>> > > > > +        pr_err("Error %d while registering subsystem %s\n",
>> > > > > +               ret, xe_config_subsys.su_group.cg_item.ci_namebuf);
>> > > > > +        mutex_destroy(&xe_config_subsys.su_mutex);
>> > > > > +        return ret;
>> > > > > +    }
>> > > > > +
>> > > > > +    return 0;
>> > > > > +}
>> > > > > +
>> > > > > +void __exit xe_configfs_exit(void)
>> > > > > +{
>> > > > > +    xe_configfs_clear_survivability_mode();
>> > > > > +    configfs_unregister_subsystem(&xe_config_subsys);
>> > > > > +    mutex_destroy(&xe_config_subsys.su_mutex);
>> > > > > +}
>> > > > > +
>> > > > > diff --git a/drivers/gpu/drm/xe/xe_configfs.h
>> > > > > b/drivers/gpu/drm/xe/ xe_configfs.h
>> > > > > new file mode 100644
>> > > > > index 000000000000..491629a2ca53
>> > > > > --- /dev/null
>> > > > > +++ b/drivers/gpu/drm/xe/xe_configfs.h
>> > > > > @@ -0,0 +1,12 @@
>> > > > > +/* SPDX-License-Identifier: MIT */
>> > > > > +/*
>> > > > > + * Copyright © 2025 Intel Corporation
>> > > > > + */
>> > > > > +#ifndef _XE_CONFIGFS_H_
>> > > > > +#define _XE_CONFIGFS_H_
>> > > > > +
>> > > > > +int xe_configfs_init(void);
>> > > > > +void xe_configfs_exit(void);
>> > > > > +void xe_configfs_clear_survivability_mode(void);
>> > > > > +
>> > > > > +#endif
>> > > > > diff --git a/drivers/gpu/drm/xe/xe_module.c
>> > > > > b/drivers/gpu/drm/xe/ xe_module.c
>> > > > > index 475acdba2b55..15b3cf22193c 100644
>> > > > > --- a/drivers/gpu/drm/xe/xe_module.c
>> > > > > +++ b/drivers/gpu/drm/xe/xe_module.c
>> > > > > @@ -11,6 +11,7 @@
>> > > > > #include <drm/drm_module.h>
>> > > > >
>> > > > > #include "xe_drv.h"
>> > > > > +#include "xe_configfs.h"
>> > > > > #include "xe_hw_fence.h"
>> > > > > #include "xe_pci.h"
>> > > > > #include "xe_pm.h"
>> > > > > @@ -91,6 +92,10 @@ static const struct init_funcs init_funcs[] = {
>> > > > >     {
>> > > > >         .init = xe_check_nomodeset,
>> > > > >     },
>> > > > > +    {
>> > > > > +        .init = xe_configfs_init,
>> > > > > +        .exit = xe_configfs_exit,
>> > > > > +    },
>> > > > >     {
>> > > > >         .init = xe_hw_fence_module_init,
>> > > > >         .exit = xe_hw_fence_module_exit,
>> > > > > diff --git a/drivers/gpu/drm/xe/xe_module.h
>> > > > > b/drivers/gpu/drm/xe/ xe_module.h
>> > > > > index 84339e509c80..c238dbee6bc7 100644
>> > > > > --- a/drivers/gpu/drm/xe/xe_module.h
>> > > > > +++ b/drivers/gpu/drm/xe/xe_module.h
>> > > > > @@ -24,6 +24,7 @@ struct xe_modparam {
>> > > > > #endif
>> > > > >     int wedged_mode;
>> > > > >     u32 svm_notifier_size;
>> > > > > +    char *survivability_mode;
>> > > >
>> > > > drop this.. We want a  config struct in the xe_device. It's actually
>> > > > allocated by the mkdir in the configs and when we are probing, we should
>> > > > find the config instace based on pdev and set the pointer xe->configfs
>> > > > or something like that.
>> > >
>> > > Will add a new struct
>> > >
>> > > Thanks
>> > > Riana
>> > > >
>> > > > thanks
>> > > > Lucas De Marchi
>> > > >
>> > > > > };
>> > > > >
>> > > > > extern struct xe_modparam xe_modparam;
>> > > > > --
>> > > > > 2.47.1
>> > > > >
>> > >
>>


More information about the Intel-xe mailing list