[PATCH i-g-t 0/4] Device scan fixes
Peter Senna Tschudin
peter.senna at linux.intel.com
Thu Dec 19 08:44:31 UTC 2024
Hi Lucas,
On 18.12.2024 23:04, Lucas De Marchi wrote:
> On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 18.12.2024 06:13, Lucas De Marchi wrote:
>>> This started with the goal of fixing xe_wedged (besides the fix to the
>>> kernel that is in flight), however the issue of device "disappearing
>>> from bus" looks similar to several other issues we occasionally have -
>>> it may be the same issue. Try to fix it by forcing scans.
>>
>> Is the hypothesis that while an IGT test is running something may
>> break the association between GPU and device node? I am asking because
>> the drm device is always closed after a test ends.
>
> it depends on what the test is doing, whether the fd is closed, etc. Note that
> multiple subtests are executed without running a new program. Even some
> under-the-hood behavior that we have with e.g. i915 (it keeps a dup fd
> open) may change the behavior if you load/unload or
> bind/unbind. Example:
>
> RPL (00:02.0) + BMG (03:00.0)
> # ./build/tools/lsgpu
> card1                    Intel Battlemage (Gen20)          drm:/dev/dri/card1
> └─renderD129                                               drm:/dev/dri/renderD129
> card0                    Intel Raptorlake_s (Gen12)        drm:/dev/dri/card0
> └─renderD128                                               drm:/dev/dri/renderD128
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
>
> Rebind RPL:
> # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
> # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
> lrwxrwxrwx 1 root root 8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
>
> Great, nothing breaks, right?
>
> Rebind both in a different order (shouldn't happen if both cards are
> backed by the same module, but can perfectly happen in an i915 + xe
> scenario as the modules may be ready in different order):
>
> # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
> # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
>
> Simulate one thing that may happen in igt: rebind while leaking an fd:
>
> # exec 3<> /dev/dri/by-path/pci-0000:03:00.0-card
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>
> # ls -l /dev/dri/by-path/
> total 0
> lrwxrwxrwx 1 root root 8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
> lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
> lrwxrwxrwx 1 root root 8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
> lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
>
> So... because we had an fd open, now it became card2 rather than 1.
> Also note that simply calling close(fd) in igt doesn't work as the
> cached fds will throw igt through the wrong path.
>
> So.... instead of adding band-aids everywhere in igt, I think we should
> fix the design to stop caching the wrong thing. IMO a first step would
> be "disable the cache and see how much it impacts".
>
> Lucas De Marchi
>
>>
>> I tested this by printing the value of _opened_fds_count from
>> lib/drmtest.c before each test, and the value is always 0.
>>
>> _opened_fds_count being zero means that the cache is empty, right? So
>> if the cache is empty before each test starts, under which conditions
>> can the potential problem manifest?
I could finally detect the issue by doing the following (the patch should be
applied on top of my facts series):
- Creating peter_drm_stats(): prints drm cache entries with fd and card number
- Instrumenting igt@core_hotunplug@hotrebind with calls to igt_facts() and peter_drm_stats()

My goal is to create a test that we can use to evaluate possible fixes. Can you
help me create a test that is more meaningful for us to use as a benchmark? See
the output:
Starting subtest: hotrebind
...
[5147.892783] [FACT Before it starts] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
[5147.897991] [DRM_CACHE] Before it starts: fd 5, card: 1
[5148.968237] [DRM_CACHE] After local_drm_open_driver(): fd 5, card: 1
Unloaded audio driver snd_hda_intel
Realoading snd_hda_intel
[5150.025062] [FACT After driver_unbind() and driver_bind()] changed: hardware.pci.drm_card_at_addr.0000:03:00.0: card1 -> card0
// Here the cache is wrong
[5150.030141] [DRM_CACHE] After driver_unbind() and driver_bind(): fd 5, card: 1
Opened device: \/dev\/dri\/card0
Opened device: \/dev\/dri\/renderD128
// healthcheck() fixes the cache by calling igt_devices_scan(true)
[5151.100086] [DRM_CACHE] After healthcheck(): fd 8, card: 0
Subtest hotrebind: SUCCESS (3.227s)
-------------- next part --------------
diff --git a/lib/drmtest.c b/lib/drmtest.c
index 2dd4540b8..6b7c35e20 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -35,6 +35,7 @@
#include <sys/ioctl.h>
#include <string.h>
#include <sys/mman.h>
+#include <sys/sysmacros.h>
#include <signal.h>
#include <pciaccess.h>
#include <stdlib.h>
@@ -418,6 +419,33 @@ static bool _is_already_opened(const char *path, int as_idx)
return false;
}
+/* Debug helper: log every cached DRM fd together with its card number. */
+void peter_drm_stats(const char *msg)
+{
+	struct timespec uptime_ts;
+	char *uptime = NULL;
+
+	if (clock_gettime(CLOCK_BOOTTIME, &uptime_ts) != 0)
+		return;
+
+	if (asprintf(&uptime, "%ld.%06ld", (long)uptime_ts.tv_sec,
+		     uptime_ts.tv_nsec / 1000) < 0)
+		return;
+
+	for (int i = 0; i < _opened_fds_count; i++) {
+		unsigned long int st_rdev;
+		int fd, card;
+		fd = _opened_fds[i].fd;
+		st_rdev = _opened_fds[i].stat.st_rdev;
+		/* For card nodes the minor number equals the card index */
+		card = gnu_dev_minor(st_rdev);
+		igt_info("[%s] [DRM_CACHE] %s: fd %d, card: %d\n",
+			 uptime, msg ? msg : "", fd, card);
+	}
+	/* asprintf() allocated the string, release it */
+	free(uptime);
+}
+
static int __search_and_open(const char *base, int offset, unsigned int chipset, int as_idx)
{
const char *forced;
diff --git a/lib/drmtest.h b/lib/drmtest.h
index 27e5a18e2..d3713e9b9 100644
--- a/lib/drmtest.h
+++ b/lib/drmtest.h
@@ -145,6 +145,7 @@ bool is_vc4_device(int fd);
bool is_xe_device(int fd);
bool is_intel_device(int fd);
enum intel_driver get_intel_driver(int fd);
+extern void peter_drm_stats(const char *msg);
/**
* do_or_die:
diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
index 7f17f4423..838404d8b 100644
--- a/tests/core_hotunplug.c
+++ b/tests/core_hotunplug.c
@@ -39,6 +39,8 @@
#include "igt_kmod.h"
#include "igt_sysfs.h"
#include "sw_sync.h"
+#include "igt_facts.h"
+
/**
* TEST: core hotunplug
* Description: Examine behavior of a driver on device hot unplug
@@ -614,15 +616,29 @@ static void hotunplug_rescan(struct hotunplug *priv)
static void hotrebind(struct hotunplug *priv)
{
+ igt_facts_lists_init();
+
+ igt_facts("Before it starts");
+ peter_drm_stats("Before it starts");
+
pre_check(priv);
priv->fd.drm = local_drm_open_driver(false, "", " for hot rebind");
+ igt_facts("After local_drm_open_driver()");
+ peter_drm_stats("After local_drm_open_driver()");
+
driver_unbind(priv, "hot ", 60);
driver_bind(priv, 0);
+ igt_facts("After driver_unbind() and driver_bind()");
+ peter_drm_stats("After driver_unbind() and driver_bind()");
+
igt_assert_f(healthcheck(priv, false), "%s\n", priv->failure);
+
+ igt_facts("After healthcheck()");
+ peter_drm_stats("After healthcheck()");
}
static void hotreplug(struct hotunplug *priv)