[PATCH i-g-t 0/4] Device scan fixes
Lucas De Marchi
lucas.demarchi at intel.com
Wed Dec 18 22:04:26 UTC 2024
On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>
>
>On 18.12.2024 06:13, Lucas De Marchi wrote:
>> This started with the goal of fixing xe_wedged (besides the fix to the
>> kernel that is in flight), however the issue of device "disappearing
>> from bus" looks similar to several other issues we occasionally have -
>> it may be the same issue. Try to fix it by forcing scans.
>
>Is the hypothesis that while an IGT test is running something may
>break the association between GPU and device node? I am asking because
>the drm device is always closed after a test ends.
it depends on what the test is doing, whether the fd is closed, etc. Note that
multiple subtests are executed without running a new program. Even some
under-the-hood behavior that we have with e.g. i915 (it keeps a dup fd
open) may change the behavior if you load/unload or
bind/unbind. Example:
RPL (00:02.0) + BMG (03:00.0)
# ./build/tools/lsgpu
card1 Intel Battlemage (Gen20) drm:/dev/dri/card1
└─renderD129 drm:/dev/dri/renderD129
card0 Intel Raptorlake_s (Gen12) drm:/dev/dri/card0
└─renderD128 drm:/dev/dri/renderD128
# ls -l /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
Rebind RPL:
# echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
# echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
# ls -l /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root 8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
Great, nothing breaks, right?
Rebind both in a different order (shouldn't happen if both cards are
backed by the same module, but can perfectly happen in an i915 + xe
scenario as the modules may be ready in a different order):
# echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
# echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
# ls -l /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
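The listing above shows that the cardN number is not a stable name for a
GPU, while the by-path link follows the PCI address. A minimal sketch of
looking the node up by PCI address every time instead of remembering
cardN. This is not igt code: resolve_card and the DRI_BY_PATH override
are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print the card node (e.g. "card0") that currently
# backs a PCI address. DRI_BY_PATH is an assumed override so the function
# can be exercised against a scratch directory instead of /dev/dri.
resolve_card() {
    pci_addr=$1
    by_path=${DRI_BY_PATH:-/dev/dri/by-path}
    link="$by_path/pci-$pci_addr-card"

    # No symlink means the device is not bound to a DRM driver right now.
    [ -L "$link" ] || return 1

    # Follow the symlink and keep only the node name.
    basename "$(readlink -f "$link")"
}
```

Calling resolve_card 0000:03:00.0 before and after the rebind above would
print card1 and then card0, matching the two listings.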
Simulate one thing that may happen in igt: rebind, but leak an fd:
# exec 3<> /dev/dri/by-path/pci-0000:03:00.0-card
# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
# echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
# ls -l /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
lrwxrwxrwx 1 root root 8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
So... because we had an fd open, now it became card2 rather than 1.
Also note that simply calling close(fd) in igt doesn't work as the
cached fds will throw igt through the wrong path.
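To illustrate what a stale cached fd looks like from userspace, here is a
rough sketch that compares what an open descriptor holds against what the
by-path link resolves to now. It is not how igt does it: fd_is_stale is a
hypothetical helper, it relies on GNU stat and /dev/fd, and it assumes the
device node is recreated with a new inode when the driver is bound again.

```shell
# Hypothetical helper: succeed (return 0) if open descriptor $1 no longer
# refers to the node that by-path link $2 currently points at.
fd_is_stale() {
    fd=$1
    link=$2

    # A missing or dangling link means the device is unbound: stale.
    [ -e "$link" ] || return 0

    # Compare filesystem device and inode of the file held by the fd with
    # those of the link target. An unlinked node can still be stat'ed
    # through /dev/fd, so this also covers a node removed by unbind.
    held=$(stat -L -c '%d:%i' "/dev/fd/$fd") || return 0
    current=$(stat -L -c '%d:%i' "$link") || return 0

    [ "$held" != "$current" ]
}
```

With the leaked fd 3 from the example above, fd_is_stale 3
/dev/dri/by-path/pci-0000:03:00.0-card would be expected to succeed after
the rebind, since fd 3 still holds the old card0 node.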
So... instead of adding band-aids everywhere in igt, I think we should
fix the design to stop caching the wrong thing. IMO a first step would
be "disable the cache and see how much it impacts".
Lucas De Marchi
>
>I tested this by printing the value of _opened_fds_count from
>lib/drmtest.c before each test, and the value is always 0.
>
>_opened_fds_count being zero means that the cache is empty, right? So
>if the cache is empty before each test starts, under which conditions
>can the potential problem manifest?