[PATCH i-g-t 0/4] Device scan fixes

Lucas De Marchi lucas.demarchi at intel.com
Wed Dec 18 22:04:26 UTC 2024


On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>
>
>On 18.12.2024 06:13, Lucas De Marchi wrote:
>> This started with the goal of fixing xe_wedged (besides the fix to the
>> kernel that is in flight), however the issue of device "disappearing
>> from bus" looks similar to several other issues we occasionally have -
>> it may be the same issue. Try to fix it by forcing scans.
>
>Is the hypothesis that while an IGT test is running something may
>break the association between GPU and device node? I am asking because
>the drm device is always closed after a test ends.

It depends on what the test is doing, whether the fd is closed, etc. Note
that multiple subtests are executed without running a new program. Even
some under-the-hood behavior that we have with e.g. i915 (it keeps a
dup'ed fd open) may change the behavior if you load/unload or
bind/unbind. Example:

RPL (00:02.0) + BMG (03:00.0)
	# ./build/tools/lsgpu                                   
	card1                    Intel Battlemage (Gen20)          drm:/dev/dri/card1       
	└─renderD129                                               drm:/dev/dri/renderD129  
	card0                    Intel Raptorlake_s (Gen12)        drm:/dev/dri/card0       
	└─renderD128                                               drm:/dev/dri/renderD128  

	# ls -l /dev/dri/by-path/                           
	total 0                                                                         
	lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
	lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
	lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
	lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129 

Rebind RPL:
	# echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
	# echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind

	# ls -l /dev/dri/by-path/
	total 0
	lrwxrwxrwx 1 root root  8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
	lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
	lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
	lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129

Great, nothing breaks, right?

Rebind both in a different order (shouldn't happen if both cards are
backed by the same module, but can perfectly well happen in an i915 + xe
scenario, as the modules may become ready in a different order):

	# echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind 
	# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind 
	# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind 
	# echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind 
	# ls -l /dev/dri/by-path/
	total 0
	lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
	lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
	lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
	lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128

Simulate one thing that may happen in igt, a leaked fd:

	# exec 3<>  /dev/dri/by-path/pci-0000:03:00.0-card
	# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind 
	# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind 
	# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
	# ls -l /dev/dri/by-path/
	total 0
	lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
	lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
	lrwxrwxrwx 1 root root  8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
	lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130

So... because we had an fd open, now it became card2 rather than 1.
Also note that simply calling close(fd) in igt doesn't work as the
cached fds will throw igt through the wrong path.
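
To illustrate the point about stale paths, here is a minimal sketch (not
igt code) that keys on the PCI slot and resolves the by-path symlink at
the time of use, instead of remembering a cardN name. The names
resolve_card and BY_PATH_DIR are made up for this example; BY_PATH_DIR
exists only so the sketch can be tried without a GPU.

```shell
#!/bin/sh
# Hypothetical helper: look up the card node for a PCI slot every time it
# is needed, rather than caching "cardN" across bind/unbind cycles.
# BY_PATH_DIR is an assumption of this sketch, not an igt or udev variable.
resolve_card() {
	slot=$1
	dir=${BY_PATH_DIR:-/dev/dri/by-path}
	link="$dir/pci-$slot-card"
	# -e follows the symlink, so a missing or dangling link (device
	# currently unbound) makes the lookup fail instead of returning a
	# stale node.
	[ -e "$link" ] || return 1
	# Print the node the symlink points to right now.
	readlink -f "$link"
}
```

With the last listing above, "resolve_card 0000:03:00.0" would print
/dev/dri/card2, while a name cached before the rebind would still say
card0.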

So... instead of adding band-aids everywhere in igt, I think we should
fix the design to stop caching the wrong thing. IMO a first step would
be "disable the cache and see how much it impacts".

Lucas De Marchi

>
>I tested this by printing the value of _opened_fds_count from
>lib/drmtest.c before each test, and the value is always 0.
>
>_opened_fds_count being zero means that the cache is empty, right? So
>if the cache is empty before each test starts, under which conditions
>can the potential problem manifest?

