[PATCH 0/2] Add configfs support for survivability mode
Riana Tauro
riana.tauro at intel.com
Tue Apr 1 06:18:54 UTC 2025
Hi Lucas
On 4/1/2025 3:15 AM, Lucas De Marchi wrote:
> On Mon, Mar 31, 2025 at 04:19:28PM -0400, Rodrigo Vivi wrote:
>> On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>>> > This series proposes to expose attributes via xe configfs
>>> > subsystem. Xe registers a configfs subsystem named 'xe'.
>>> > Userspace can then create directories for the devices they
>>> > want to configure and set appropriate attributes
>>> >
>>> > This is done by
>>> >
>>> > mount -t configfs none /config
>>> > mkdir /config/xe/0000:03:00.0
>>> >
>>>
>>> If we need a new version or to document anywhere in our docs, I'd add a
>>> comment here:
>>>
>>> # If driver is already bound, unbind it as this configuration
>>> # applies only when probing it
>>>
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>>> > echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>>> >
>>> > This is an alternative to introducing a module param, which would cause
>>> > all the connected and supported GPU cards to enter survivability mode.
>>> > Manually entering survivability mode is useful when pcode does not
>>> > report a failure, for in-field repairs, and for validation.
>>> >
>>> > Rev2: use config_groups (Lucas)
>>>
>>> Awesome. I have some other work pending that will make use of
>>> it. I will play with these patches soon.
>>
>> I really liked this new flow and I was giving it a try here right now.
>>
>> However it didn't work. It didn't take me to the survivability mode,
>> but also, I cannot unload the xe after creating this configfs file:
>>
>> sudo rm -r /sys/kernel/config/xe/0000\:0*
>> rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
>> rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted
>
> hmm... testing on a bmg, it works:
>
> # # first of all, make sure autoprobe doesn't do what we don't
> # # want:
> # echo 0 > /sys/bus/pci/drivers_autoprobe
>
> # # load module and set the configuration
> # modprobe xe
> # mkdir /sys/kernel/config/xe/0000:03:00.0
> # echo 1 > /sys/kernel/config/xe/0000\:03\:00.0/survivability_mode
>
> # # bind the driver and check it enters survivability mode
> # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/bind
> # dmesg | tail -n1
> [ 1994.807063] xe 0000:03:00.0: In Survivability Mode
> # cat /sys/bus/pci/drivers/xe/0000\:03\:00.0/survivability_mode
> Capability Info: 0x138320 - 0x2001ae06
> Postcode Info: 0x138324 - 0x0
> Overflow Info: 0x138328 - 0x0
> Auxiliary Info 0: 0x13832c - 0x0
>
> Unbind first and test we can remove the configuration for next bind:
>
> # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/unbind
> # tree /sys/kernel/config/xe
> /sys/kernel/config/xe
> └── 0000:03:00.0
>     └── survivability_mode
> # rmdir /sys/kernel/config/xe/0000:03:00.0/
> # tree /sys/kernel/config/xe
> /sys/kernel/config/xe
>
> Remove module:
> # modprobe -r xe
>
>
> What doesn't work:
>
> 1) Remove the module without unbinding. This is already the case
> 2) Remove the module without unbinding first
Do you mean removing the directory?
>
> For (2) it's basically: when you create a configfs connection, configfs
> increments the module's refcount. You have to remove them first.
Yeah, it takes a refcount, and I think the subsystem register fails if
that's not taken. I haven't tried it.
>
> To my surprise, it's possible to remove the config after binding -
> there's no error.
When I was trying, rmdir was allowed even with a config_item reference
taken. drop_item is void and cannot return an error, but I didn't
search further as survivability mode didn't require a reference.
Thanks
Riana
> I need to double check if this wouldn't create some
> UAF depending on how we use the config, but for survivability purposes,
> I don't think it has an issue with that single bool.
>
> Lucas De Marchi
>
>>
>> Tried to unbind and had the same failure.
>>
>> then with the configfs there we cannot remove the module:
>> $ sudo rmmod xe
>> rmmod: ERROR: Module xe is in use
>
> what's the `lsmod | grep xe` output?
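
To see what might be keeping the module busy, something like the following could help. It is a sketch under assumptions: xe_in_use_report is a hypothetical helper name, /sys/module/xe/refcnt is only present on kernels built with module unloading support, and listing the configfs directories follows the explanation in this thread that each one holds a module reference.

```shell
#!/bin/sh
# Sketch only: xe_in_use_report is a hypothetical helper name.
# Usage: xe_in_use_report [module-sysfs-dir] [configfs-root]
xe_in_use_report() {
    mod="${1:-/sys/module/xe}"
    cfg="${2:-/sys/kernel/config/xe}"

    # Module reference count, if the kernel exposes it.
    if [ -r "$mod/refcnt" ]; then
        echo "refcnt: $(cat "$mod/refcnt")"
    fi

    # Each per-device configfs directory is expected to pin the module.
    for dev in "$cfg"/*/; do
        # Skip the unexpanded glob when no directories exist.
        [ -d "$dev" ] || continue
        echo "configfs dir: $(basename "$dev")"
    done
}
```

Any "configfs dir" line in the output names a directory that would have to be removed with rmdir before the module can be unloaded.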
>
> You should always be able to unbind. There's nothing the driver can do
> that would block the unbind. After unbind, you should rmdir every dir in
> /sys/kernel/config/xe/*. Note that it's not an rm -r since you can't
> remove the inner configuration files, only the directory that is the
> "connection" between configfs and the driver. You also can't remove the
> xe dir (as it's owned by the module, not the connection to the device),
> only the dirs under it.
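
The cleanup described above could be sketched as follows. The helper name xe_configfs_cleanup and its optional root argument are hypothetical. It uses rmdir on each per-device directory and never touches the attribute files or the xe directory itself. Unbind the devices first.

```shell
#!/bin/sh
# Sketch only: xe_configfs_cleanup is a hypothetical helper name.
# Usage: xe_configfs_cleanup [configfs-root]
xe_configfs_cleanup() {
    root="${1:-/sys/kernel/config/xe}"
    rc=0

    for dev in "$root"/*/; do
        # Skip the unexpanded glob when there are no device directories.
        [ -d "$dev" ] || continue
        # rmdir, not rm -r: the inner attribute files cannot be removed,
        # only the directory that connects the device to the driver.
        rmdir "$dev" || rc=1
    done

    # Non-zero if any directory could not be removed.
    return "$rc"
}
```

After this returns successfully, modprobe -r xe should no longer be blocked by configfs references.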
>
> Lucas De Marchi
>
>
>>
>>
>> So, it looks like we have some stuff to adjust here before we can move
>> further, but so far things are looking promising indeed.
>>
>>>
>>> thanks
>>> Lucas De Marchi