[PATCH i-g-t 0/4] Device scan fixes

Peter Senna Tschudin peter.senna at linux.intel.com
Thu Dec 19 08:44:31 UTC 2024


Hi Lucas,

On 18.12.2024 23:04, Lucas De Marchi wrote:
> On Wed, Dec 18, 2024 at 09:16:48AM +0100, Peter Senna Tschudin wrote:
>>
>>
>> On 18.12.2024 06:13, Lucas De Marchi wrote:
>>> This started with the goal of fixing xe_wedged (besides the fix to the
>>> kernel that is in flight), however the issue of device "disappearing
>>> from bus" looks similar to several other issues we occasionally have -
>>> it may be the same issue. Try to fix it by forcing scans.
>>
>> Is the hypothesis that while an IGT test is running something may
>> break the association between GPU and device node? I am asking because
>> the drm device is always closed after a test ends.
> 
> it depends on what the test is doing, if the fd is closed etc... note that
> multiple subtests are executed without running a new program. Even some
> under-the-hood behavior that we have with e.g. i915 (it keeps a dup fd
> open) may change the behavior if you load/unload or
> bind/unbind. Example:
> 
> RPL (00:02.0) + BMG (03:00.0)
>     # ./build/tools/lsgpu
>     card1                    Intel Battlemage (Gen20)          drm:/dev/dri/card1
>     └─renderD129                                               drm:/dev/dri/renderD129
>     card0                    Intel Raptorlake_s (Gen12)        drm:/dev/dri/card0
>     └─renderD128                                               drm:/dev/dri/renderD128
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
> Rebind RPL:
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:00:02.0 | sudo tee /sys/bus/pci/drivers/xe/bind
> 
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:40 pci-0000:00:02.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:40 pci-0000:00:02.0-render -> ../renderD128
>     lrwxrwxrwx 1 root root  8 Dec 18 19:39 pci-0000:03:00.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:39 pci-0000:03:00.0-render -> ../renderD129
> 
> Great, nothing breaks, right?
> 
> Rebind both in a different order (shouldn't happen if both cards are
> backed by the same module, but can perfectly happen in a i915 + xe
> scenario as the modules may be ready in different order):
> 
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # echo 0000:00:02.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:03:00.0-card -> ../card0
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:03:00.0-render -> ../renderD128
> 
> Simulate one thing that may happen in igt, a leaked fd:
> 
>     # exec 3<>  /dev/dri/by-path/pci-0000:03:00.0-card
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>     # echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>     # ls -l /dev/dri/by-path/
>     total 0
>     lrwxrwxrwx 1 root root  8 Dec 18 19:41 pci-0000:00:02.0-card -> ../card1
>     lrwxrwxrwx 1 root root 13 Dec 18 19:41 pci-0000:00:02.0-render -> ../renderD129
>     lrwxrwxrwx 1 root root  8 Dec 18 19:42 pci-0000:03:00.0-card -> ../card2
>     lrwxrwxrwx 1 root root 13 Dec 18 19:42 pci-0000:03:00.0-render -> ../renderD130
> 
> So... because we had an fd open, now it became card2 rather than card0.
> Also note that simply calling close(fd) in igt doesn't work as the
> cached fds will throw igt through the wrong path.
> 
> So.... instead of adding bandaid everywhere in igt, I think we should
> fix the design to stop caching the wrong thing. IMO a first step would
> be "disable the cache and see how much it impacts".
> 
> Lucas De Marchi
> 
>>
>> I tested this by printing the value of _opened_fds_count from
>> lib/drmtest.c before each test, and the value is always 0.
>>
>> _opened_fds_count being zero means that the cache is empty, right? So
>> if the cache is empty before each test starts, under which conditions
>> can the potential problem manifest?


I could finally detect the issue (the attached patch should be applied on top of
my facts series) by:
 - Creating peter_drm_stats(): prints drm cache entries with fd and card number
 - Instrumenting igt@core_hotunplug@hotrebind with calls to igt_facts() and peter_drm_stats()
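
For reference, the card number printed by peter_drm_stats() is simply the
minor number of the cached st_rdev. A minimal standalone sketch of that
mapping, assuming Linux with glibc; the helper name card_minor_of_path() is
hypothetical and is not part of igt:

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>

/*
 * Hypothetical helper, not part of igt: return the minor number of the
 * character device at @path, or -1 if @path cannot be stat'ed or is not
 * a character device. stat() follows symlinks, so a by-path link resolves
 * to whatever node it currently points at.
 */
static int card_minor_of_path(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	if (!S_ISCHR(st.st_mode))
		return -1;

	return (int)minor(st.st_rdev);
}
```

Calling it on /dev/dri/by-path/pci-0000:03:00.0-card before and after the
rebind should show the same renumbering that the [FACT] lines report.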

My goal is to create a test that we can use to validate possible fixes. Can you
help me create a test that is more meaningful for us to use as a benchmark? See
the output:

Starting subtest: hotrebind
...
[5147.892783] [FACT Before it starts] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
[5147.897991] [DRM_CACHE] Before it starts: fd 5, card: 1
[5148.968237] [DRM_CACHE] After local_drm_open_driver(): fd 5, card: 1
Unloaded audio driver snd_hda_intel
Realoading snd_hda_intel
[5150.025062] [FACT After driver_unbind() and driver_bind()] changed: hardware.pci.drm_card_at_addr.0000:03:00.0: card1 -> card0

// Here the cache is wrong
[5150.030141] [DRM_CACHE] After driver_unbind() and driver_bind(): fd 5, card: 1

Opened device: /dev/dri/card0
Opened device: /dev/dri/renderD128

// healthcheck() fixes the cache by calling igt_devices_scan(true)
[5151.100086] [DRM_CACHE] After healthcheck(): fd 8, card: 0

Subtest hotrebind: SUCCESS (3.227s)
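
A stale entry like the one above could in principle be detected by comparing
the cached st_rdev with a fresh stat() of the node the device currently
resolves to. A rough sketch, assuming the cache keeps a struct stat per fd as
_opened_fds[] does; cache_entry_is_stale() is a hypothetical name, not an
existing igt function:

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical check, not part of igt: @cached is the struct stat stored
 * when the fd was opened, @path is where the device is expected now
 * (for example a /dev/dri/by-path/ symlink, which stat() follows).
 * Returns true if the node is gone, is not a character device, or has a
 * different device number than the cached one.
 */
static bool cache_entry_is_stale(const struct stat *cached, const char *path)
{
	struct stat now;

	if (stat(path, &now) != 0)
		return true;

	if (!S_ISCHR(now.st_mode))
		return true;

	return cached->st_rdev != now.st_rdev;
}
```

This only catches renumbering. An equal st_rdev does not prove that the cached
fd still refers to a bound device, so it is not a replacement for rescanning.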
-------------- next part --------------
diff --git a/lib/drmtest.c b/lib/drmtest.c
index 2dd4540b8..6b7c35e20 100644
--- a/lib/drmtest.c
+++ b/lib/drmtest.c
@@ -35,6 +35,7 @@
 #include <sys/ioctl.h>
 #include <string.h>
 #include <sys/mman.h>
+#include <sys/sysmacros.h>
 #include <signal.h>
 #include <pciaccess.h>
 #include <stdlib.h>
@@ -418,6 +419,33 @@ static bool _is_already_opened(const char *path, int as_idx)
 	return false;
 }
 
+void peter_drm_stats(const char *msg)
+{
+	struct timespec uptime_ts;
+	char uptime[32];
+
+	if (clock_gettime(CLOCK_BOOTTIME, &uptime_ts) != 0)
+		return;
+
+	snprintf(uptime, sizeof(uptime),
+		 "%ld.%06ld",
+		 uptime_ts.tv_sec,
+		 uptime_ts.tv_nsec / 1000);
+
+	for (int i = 0; i < _opened_fds_count; i++) {
+		unsigned long int st_rdev;
+		int fd, card;
+		fd = _opened_fds[i].fd;
+		st_rdev = _opened_fds[i].stat.st_rdev;
+		card = minor(st_rdev);
+		igt_info("[%s] [DRM_CACHE] %s: fd %d, card: %d\n",
+			 uptime,
+			 msg ? msg : "",
+			 fd,
+			 card);
+	}
+}
+
 static int __search_and_open(const char *base, int offset, unsigned int chipset, int as_idx)
 {
 	const char *forced;
diff --git a/lib/drmtest.h b/lib/drmtest.h
index 27e5a18e2..d3713e9b9 100644
--- a/lib/drmtest.h
+++ b/lib/drmtest.h
@@ -145,6 +145,7 @@ bool is_vc4_device(int fd);
 bool is_xe_device(int fd);
 bool is_intel_device(int fd);
 enum intel_driver get_intel_driver(int fd);
+extern void peter_drm_stats(const char *msg);
 
 /**
  * do_or_die:
diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
index 7f17f4423..838404d8b 100644
--- a/tests/core_hotunplug.c
+++ b/tests/core_hotunplug.c
@@ -39,6 +39,8 @@
 #include "igt_kmod.h"
 #include "igt_sysfs.h"
 #include "sw_sync.h"
+#include "igt_facts.h"
+
 /**
  * TEST: core hotunplug
  * Description: Examine behavior of a driver on device hot unplug
@@ -614,15 +616,29 @@ static void hotunplug_rescan(struct hotunplug *priv)
 
 static void hotrebind(struct hotunplug *priv)
 {
+	igt_facts_lists_init();
+
+	igt_facts("Before it starts");
+	peter_drm_stats("Before it starts");
+
 	pre_check(priv);
 
 	priv->fd.drm = local_drm_open_driver(false, "", " for hot rebind");
 
+	igt_facts("After local_drm_open_driver()");
+	peter_drm_stats("After local_drm_open_driver()");
+
 	driver_unbind(priv, "hot ", 60);
 
 	driver_bind(priv, 0);
 
+	igt_facts("After driver_unbind() and driver_bind()");
+	peter_drm_stats("After driver_unbind() and driver_bind()");
+
 	igt_assert_f(healthcheck(priv, false), "%s\n", priv->failure);
+
+	igt_facts("After healthcheck()");
+	peter_drm_stats("After healthcheck()");
 }
 
 static void hotreplug(struct hotunplug *priv)

