[PATCH 0/2] Add configfs support for survivability mode

Lucas De Marchi lucas.demarchi at intel.com
Mon Mar 31 21:45:06 UTC 2025


On Mon, Mar 31, 2025 at 04:19:28PM -0400, Rodrigo Vivi wrote:
>On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>> > This series proposes to expose attributes via xe configfs
>> > subsystem. Xe registers a configfs subsystem named 'xe'.
>> > Userspace can then create directories for the devices they
>> > want to configure and set appropriate attributes
>> >
>> > This is done by
>> >
>> > mount -t configfs none /config
>> > mkdir /config/xe/0000:03:00.0
>> >
>>
>> If we need a new version or to document anywhere in our docs, I'd add a
>> comment here:
>>
>> # If driver is already bound, unbind it as this configuration
>> # applies only when probing it
>>
>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>> > echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode
>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>> >
>> > This is an alternative to introducing module param that causes
>> > all the connected and supported GPU cards to enter survivability mode.
>> > Manually entering survivability mode is useful when pcode does not
>> > report failure, in field repairs and validation
>> >
>> > Rev2: use config_groups (Lucas)
>>
>> Awesome. I have some other work pending that will make use of
>> it. I will play with these patches soon.
>
>I really liked this new flow and I was giving it a try here right now.
>
>However it didn't work. It didn't take me to the survivability mode,
>but also, I cannot unload the xe after creating this configfs file:
>
>sudo remove /sys/kernel/config/xe/0000\:0*
>rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
>rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted

Hmm... testing on a BMG, it works:

	# # first of all, make sure autoprobe doesn't do what we don't
	# # want:
	# echo 0 > /sys/bus/pci/drivers_autoprobe

	# # load module and set the configuration
	# modprobe xe
	# mkdir /sys/kernel/config/xe/0000:03:00.0
	# echo 1 > /sys/kernel/config/xe/0000\:03\:00.0/survivability_mode

	# # bind the driver and check it enters survivability mode
	# echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/bind
	# dmesg | tail -n1
	[ 1994.807063] xe 0000:03:00.0: In Survivability Mode
	# cat /sys/bus/pci/drivers/xe/0000\:03\:00.0/survivability_mode
	Capability Info: 0x138320 - 0x2001ae06
	Postcode Info: 0x138324 - 0x0
	Overflow Info: 0x138328 - 0x0
	Auxiliary Info 0: 0x13832c - 0x0
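
The steps above, plus the "unbind first if already bound" note from
earlier in the thread, could be wrapped in a small helper. This is only
a rough sketch: the function name and the optional root arguments are
made up for illustration (they make it possible to try the logic against
a scratch directory), and the default paths are the ones used in this
thread.

```shell
# Sketch only: xe_enter_survivability and its optional arguments are
# hypothetical, not part of the xe driver or any existing tool.
xe_enter_survivability() {
	bdf="$1"
	cfg="${2:-/sys/kernel/config/xe}"
	drv="${3:-/sys/bus/pci/drivers/xe}"

	# The setting is only read at probe time, so unbind a bound device first
	if [ -e "$drv/$bdf" ]; then
		echo "$bdf" > "$drv/unbind" || return 1
	fi

	# Create the per-device configfs directory and set the attribute
	[ -d "$cfg/$bdf" ] || mkdir "$cfg/$bdf" || return 1
	echo 1 > "$cfg/$bdf/survivability_mode" || return 1

	# Probe the device again so the driver picks up the configuration
	echo "$bdf" > "$drv/bind"
}
```

Usage would be something like `xe_enter_survivability 0000:03:00.0` as
root, with the xe module loaded and autoprobe disabled as shown above.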

Unbind first and test we can remove the configuration for next bind:

	# echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/unbind
	# tree /sys/kernel/config/xe
	/sys/kernel/config/xe
	└── 0000:03:00.0
	    └── survivability_mode
	# rmdir /sys/kernel/config/xe/0000:03:00.0/
	# tree /sys/kernel/config/xe
	/sys/kernel/config/xe

Remove module:
	# modprobe -r xe


What doesn't work:

	1) Remove the module without unbinding first. This is already the
	   case
	2) Remove the module without removing the configfs directories first

For (2) it's basically this: when you create a configfs connection (a
directory under /sys/kernel/config/xe), configfs increments the module's
refcount. You have to remove those directories first.
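
One way to watch the refcount described above is to read the module's
refcnt file in sysfs. A minimal sketch, assuming the kernel exposes
/sys/module/xe/refcnt (it is only there when module unloading is
enabled); the helper name and the optional root argument are made up
for illustration.

```shell
# Sketch only: xe_module_refcnt is a hypothetical helper name.
# Prints the current reference count of the xe module.
xe_module_refcnt() {
	modroot="${1:-/sys/module}"
	cat "$modroot/xe/refcnt"
}
```

Running it before and after the mkdir under /sys/kernel/config/xe
should show whether the count changes with each configfs directory.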

To my surprise, it's possible to remove the config after binding:
there's no error. I need to double-check whether this could create a
use-after-free (UAF) depending on how we use the config, but for
survivability purposes I don't think there's an issue with that single
bool.

Lucas De Marchi

>
>Tried to unbind and had the same failure.
>
>then with the configfs there we cannot remove the module:
>$ sudo rmmod xe
>rmmod: ERROR: Module xe is in use

What's the `lsmod | grep xe` output?

You should always be able to unbind. There's nothing the driver can do
that would block the unbind. After unbinding, you should rmdir every dir
in /sys/kernel/config/xe/*. Note that it has to be rmdir, not rm -r,
since you can't remove the inner configuration files, only the directory
that is the "connection" between configfs and the driver. You also can't
remove the xe dir (as it's owned by the module, not the connection to
the device), only the dirs under it.
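
The cleanup described above could look roughly like the loop below. It
is a sketch, not a tested tool: the function name is made up, and the
optional argument only exists so the loop can be tried on a scratch
directory instead of the real configfs mount.

```shell
# Sketch only: xe_configfs_cleanup is a hypothetical helper name.
# Removes every per-device directory under the xe configfs root.
xe_configfs_cleanup() {
	root="${1:-/sys/kernel/config/xe}"
	for dir in "$root"/*/; do
		# Skip the unexpanded glob when there are no device directories
		[ -d "$dir" ] || continue
		# rmdir, not rm -r: the attribute files inside can't be removed
		rmdir "$dir" || return 1
	done
}
```

After unbinding, `xe_configfs_cleanup && modprobe -r xe` would then be
the expected order, assuming nothing else holds a reference on the
module.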

Lucas De Marchi


>
>
>So, it looks like we have some stuff to adjust here before we can move further,
>but so far things are looking promising indeed
>
>>
>> thanks
>> Lucas De Marchi

