[PATCH 0/2] Add configfs support for survivability mode

Riana Tauro riana.tauro at intel.com
Tue Apr 1 06:18:54 UTC 2025


Hi Lucas

On 4/1/2025 3:15 AM, Lucas De Marchi wrote:
> On Mon, Mar 31, 2025 at 04:19:28PM -0400, Rodrigo Vivi wrote:
>> On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>>> > This series proposes to expose attributes via xe configfs
>>> > subsystem. Xe registers a configfs subsystem named 'xe'.
>>> > Userspace can then create directories for the devices they
>>> > want to configure and set appropriate attributes
>>> >
>>> > This is done by
>>> >
>>> > mount -t configfs none /config
>>> > mkdir /config/xe/0000:03:00.0
>>> >
>>>
>>> If we need a new version or to document anywhere in our docs, I'd add a
>>> comment here:
>>>
>>> # If driver is already bound, unbind it as this configuration
>>> # applies only when probing it
>>>
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>>> > echo 1 > /sys/kernel/config/xe/0000:03:00.0/survivability_mode
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>>> >
>>> > This is an alternative to introducing a module param that causes
>>> > all the connected and supported GPU cards to enter survivability mode.
>>> > Manually entering survivability mode is useful when pcode does not
>>> > report failure, in field repairs and validation
>>> >
>>> > Rev2: use config_groups (Lucas)
>>>
>>> Awesome. I have some other work pending that will make use of
>>> it. I will play with these patches soon.
>>
>> I really liked this new flow and I was giving it a try here right now.
>>
>> However it didn't work. It didn't take me to the survivability mode,
>> but also, I cannot unload the xe after creating this configfs file:
>>
>> sudo remove /sys/kernel/config/xe/0000\:0*
>> rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
>> rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted
> 
> humn... testing on a bmg, it works:
> 
>      # # first of all, make sure autoprobe doesn't do what we don't
>      # # want:
>      # echo 0  > /sys/bus/pci/drivers_autoprobe
> 
>      # # load module and set the configuration
>      # modprobe xe
>      # mkdir /sys/kernel/config/xe/0000:03:00.0
>      # echo 1 > /sys/kernel/config/xe/0000\:03\:00.0/survivability_mode
> 
>      # # bind the driver and check it enters survivability mode
>      # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/bind
>      # dmesg | tail -n1
>      [ 1994.807063] xe 0000:03:00.0: In Survivability Mode
>      # cat  /sys/bus/pci/drivers/xe/0000\:03\:00.0/survivability_mode 
>      Capability Info: 0x138320 - 0x2001ae06
>      Postcode Info: 0x138324 - 0x0
>      Overflow Info: 0x138328 - 0x0
>      Auxiliary Info 0: 0x13832c - 0x0
> 
> Unbind first and test we can remove the configuration for next bind:
> 
>      # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/unbind
>      # tree /sys/kernel/config/xe
>      /sys/kernel/config/xe
>      └── 0000:03:00.0
>          └── survivability_mode
>      # rmdir /sys/kernel/config/xe/0000:03:00.0/
>      # tree /sys/kernel/config/xe
>      /sys/kernel/config/xe
> 
> Remove module:
>      # modprobe -r xe
> 
> 
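For reference, the sequence above could be wrapped in a small script. This is
only a sketch: the function name enable_survivability and the CONFIGFS_XE /
XE_DRIVER variables are made up here and are not part of the series. The
default paths are the ones used in the commands in this thread, and the
module is assumed to be loaded with the device still unbound.

```shell
#!/bin/sh
# Sketch only: names below are illustrative, not from the patch series.
# Both paths can be overridden, e.g. for a dry run against a scratch directory.
CONFIGFS_XE="${CONFIGFS_XE:-/sys/kernel/config/xe}"
XE_DRIVER="${XE_DRIVER:-/sys/bus/pci/drivers/xe}"

# enable_survivability BDF
# Creates the configfs directory for the device, sets survivability_mode,
# then binds the driver. The setting only applies at probe time, so the
# device must not be bound yet when this runs.
enable_survivability() {
    bdf="$1"
    [ -n "$bdf" ] || { echo "usage: enable_survivability <bdf>" >&2; return 2; }

    mkdir "$CONFIGFS_XE/$bdf" || return 1
    echo 1 > "$CONFIGFS_XE/$bdf/survivability_mode" || return 1
    echo "$bdf" > "$XE_DRIVER/bind" || return 1
}
```

It would be called as root, e.g. `enable_survivability 0000:03:00.0`, after
`modprobe xe` with autoprobe disabled as shown above.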
> What doesn't work:
> 
>      1) Remove the module without unbinding. This is already the case
>      2) Remove the module without unbinding first
Do you mean removing the directory?
>
> For (2) it's basically: when you create a configfs connection, configfs
> increments the module's refcount. You have to remove them first.
Yeah, it takes a refcount, and I think subsystem registration fails if
that's not taken. I haven't tried it.
>
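One way to watch the refcount mentioned here is to read the module's refcnt
attribute in sysfs before and after the mkdir. A sketch, assuming the kernel
exposes /sys/module/xe/refcnt (it is present when module unloading is
enabled); the helper name module_refcnt and the SYSFS_ROOT override are made
up for illustration:

```shell
# Sketch only: SYSFS_ROOT and module_refcnt are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# module_refcnt MODULE
# Prints the current reference count of a loaded module, or fails if the
# refcnt attribute is not readable (module not loaded, or not exposed).
module_refcnt() {
    f="$SYSFS_ROOT/module/$1/refcnt"
    [ -r "$f" ] || { echo "no refcnt for module '$1'" >&2; return 1; }
    cat "$f"
}
```

Running `module_refcnt xe` before and after
`mkdir /sys/kernel/config/xe/0000:03:00.0` should show whether the count
changes as described; I have not recorded actual numbers here.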
> To my surprise, it's possible to remove the config after binding -
> there's no error. 
When I tried it, rmdir was allowed even with a config_item reference
taken. drop_item returns void and cannot return an error, but I didn't
look further since survivability mode doesn't require a reference.

Thanks
Riana
> I need to double check if this wouldn't create some
> UAF depending on how we use the config, but for survivability purposes,
> I don't think it has an issue with that single bool.
> 
> Lucas De Marchi
> 
>>
>> Tried to unbind and had the same failure.
>>
>> then with the configfs there we cannot remove the module:
>> $ sudo rmmod xe
>> rmmod: ERROR: Module xe is in use
> 
> what's the `lsmod | grep xe` output?
> 
> You should always be able to unbind. There's nothing the driver can do
> that would block the unbind. After unbind, you should rmdir every dir in
> /sys/kernel/config/xe/*. Note that it's not an rm -r since you can't
> remove the inner configuration files, only the directory that is the
> "connection" between configfs and the driver. You also can't remove the
> xe dir (as it's owned by the module, not the connection to the device),
> only  the dirs under it.
> 
> Lucas De Marchi
> 
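The teardown order described above (unbind, rmdir each device directory, then
unload) could look roughly like the sketch below. The function name and the
CONFIGFS_XE override are made up for illustration. It uses rmdir on each
directory rather than rm -r, since the attribute files inside cannot be
removed individually.

```shell
# Sketch only: names below are illustrative, not from the patch series.
CONFIGFS_XE="${CONFIGFS_XE:-/sys/kernel/config/xe}"

# remove_xe_config_dirs
# rmdir every per-device directory under the xe configfs subsystem.
# The xe directory itself is owned by the module and is left alone.
remove_xe_config_dirs() {
    rc=0
    for d in "$CONFIGFS_XE"/*/; do
        [ -d "$d" ] || continue      # glob matched nothing
        rmdir "$d" || rc=1           # keep going, but report the failure
    done
    return "$rc"
}
```

After unbinding each device, something like
`remove_xe_config_dirs && modprobe -r xe` would then follow the order given
above.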
> 
>>
>>
>> So, it looks like we have some stuff to adjust here before we can move
>> further,
>> but so far things are looking promising indeed
>>
>>>
>>> thanks
>>> Lucas De Marchi


