[Intel-gfx] [igt-dev] [PATCH i-g-t v3] tests/core_hotunplug: Restore i915 debugfs health check

Janusz Krzysztofik janusz.krzysztofik at linux.intel.com
Thu Oct 15 12:25:21 UTC 2020


On Thu, 2020-10-15 at 09:15 +0200, Marcin Bernatowicz wrote:
> On Tue, 2020-10-13 at 13:02 +0200, Janusz Krzysztofik wrote:
> > Removal of igt_fork_hang_detector() from local_i915_healthcheck() by
> > commit 1fbd127bd4e1 ("core_hotplug: Teach the healthcheck how to
> > check
> > execution status") resulted in unintentional removal of an important
> > though implicit test feature of detecting, reporting as failures and
> > recovering from potential misses of debugfs subdirs of hot rebound
> > i915
> > devices.  As a consequence, unexpected failures or skips of other
> > unrelated but subsequently run tests have been observed on CI.
> > 
> > On the other hand, removal of the debugfs issue detection and subtest
> > failures from right after hot rebinding the driver enabled the better
> > version of the i915 GPU health check fixed by the same commit to
> > detect
> > and report other issues potentially triggered by device late close.
> > 
> > Restore the missing test feature by introducing an explicit sysfs
> > health check, not limited to i915,  that verifies existence of device
> > sysfs and debugfs areas.  Also, split hotrebind/hotreplug scenarios
> > into a pair of each, one that performs the health check right after
> > hot
> > rebind/replug and delegates the device late close step to a follow up
> > recovery phase, while the other one checks device health only after
> > late closing it.
> > 
> > v2: Give GPU health check a better chance to detect issues - run it
> >     before sysfs health checks.
> > v3: Run sysfs health check on any hardware, not only i915.
> > 
> > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> > Even if the root cause has occurred to be sitting on the IGT lib side
> > and has been already fixed by commit 937526629344 ("lib: Don't fail
> > debugfs lookup on an expected absent drm device"), I think we should
> > restore the debugfs health check just in case new issues with similar
> > symptoms appear in the future and start affecting subsequent tests
> > silently.
> > 
> > Thanks,
> > Janusz
> > 
> >  tests/core_hotunplug.c | 68 ++++++++++++++++++++++++++++++++++++++
> > ----
> >  1 file changed, 62 insertions(+), 6 deletions(-)
> > 
> > diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
> > index 70669c590..cdc07c85d 100644
> > --- a/tests/core_hotunplug.c
> > +++ b/tests/core_hotunplug.c
> > @@ -308,7 +308,7 @@ static void node_healthcheck(struct hotunplug
> > *priv, unsigned flags)
> >  		priv->failure = "Unrecoverable test failure";
> >  		if (local_i915_healthcheck(fd_drm, "") &&
> >  		    (!(flags & FLAG_RECOVER) ||
> > local_i915_recover(fd_drm)))
> > -			priv->failure = "Healthcheck failure!";
> > +			priv->failure = "GPU healthcheck failure!";
> >  		else
> >  			priv->failure = NULL;
> >  
> > @@ -317,6 +317,16 @@ static void node_healthcheck(struct hotunplug
> > *priv, unsigned flags)
> >  		priv->failure = NULL;
> >  	}
> >  
> > +	if (!priv->failure) {
> > +		char path[200];
> > +
> > +		priv->failure = "Device sysfs healthckeck failure!";
> > +		local_debug("%s\n", "running device sysfs
> > healthcheck");
> > +		igt_assert(igt_sysfs_path(fd_drm, path, sizeof(path)));
> > +		igt_assert(igt_debugfs_path(fd_drm, path,
> > sizeof(path)));
> > +		priv->failure = NULL;
> > +	}
> > +
> 
> LGTM,
> Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz at linux.intel.com>

Thank you Marcin, pushed.

Janusz

> 
> >  	fd_drm = close_device(fd_drm, "", "health checked ");
> >  	if (closed || fd_drm < -1)	/* update status for
> > post_healthcheck */
> >  		priv->fd.drm_hc = fd_drm;
> > @@ -437,7 +447,7 @@ static void hotunplug_rescan(struct hotunplug
> > *priv)
> >  	healthcheck(priv, false);
> >  }
> >  
> > -static void hotrebind_lateclose(struct hotunplug *priv)
> > +static void hotrebind(struct hotunplug *priv)
> >  {
> >  	igt_assert_eq(priv->fd.drm, -1);
> >  	igt_assert_eq(priv->fd.drm_hc, -1);
> > @@ -448,6 +458,30 @@ static void hotrebind_lateclose(struct hotunplug
> > *priv)
> >  	driver_bind(priv, 0);
> >  
> >  	healthcheck(priv, false);
> > +}
> > +
> > +static void hotreplug(struct hotunplug *priv)
> > +{
> > +	igt_assert_eq(priv->fd.drm, -1);
> > +	igt_assert_eq(priv->fd.drm_hc, -1);
> > +	priv->fd.drm = local_drm_open_driver(false, "", " for hot
> > replug");
> > +
> > +	device_unplug(priv, "hot ", 60);
> > +
> > +	bus_rescan(priv, 0);
> > +
> > +	healthcheck(priv, false);
> > +}
> > +
> > +static void hotrebind_lateclose(struct hotunplug *priv)
> > +{
> > +	igt_assert_eq(priv->fd.drm, -1);
> > +	igt_assert_eq(priv->fd.drm_hc, -1);
> > +	priv->fd.drm = local_drm_open_driver(false, "", " for hot
> > rebind");
> > +
> > +	driver_unbind(priv, "hot ", 60);
> > +
> > +	driver_bind(priv, 0);
> >  
> >  	priv->fd.drm = close_device(priv->fd.drm, "late ", "unbound ");
> >  	igt_assert_eq(priv->fd.drm, -1);
> > @@ -465,8 +499,6 @@ static void hotreplug_lateclose(struct hotunplug
> > *priv)
> >  
> >  	bus_rescan(priv, 0);
> >  
> > -	healthcheck(priv, false);
> > -
> >  	priv->fd.drm = close_device(priv->fd.drm, "late ", "removed ");
> >  	igt_assert_eq(priv->fd.drm, -1);
> >  
> > @@ -570,7 +602,31 @@ igt_main
> >  		post_healthcheck(&priv);
> >  
> >  	igt_subtest_group {
> > -		igt_describe("Check if the driver hot unbound from a
> > still open device can be cleanly rebound, then the old instance
> > released");
> > +		igt_describe("Check if the driver can be cleanly
> > rebound to a device with a still open hot unbound driver instance");
> > +		igt_subtest("hotrebind")
> > +			hotrebind(&priv);
> > +
> > +		igt_fixture
> > +			recover(&priv);
> > +	}
> > +
> > +	igt_fixture
> > +		post_healthcheck(&priv);
> > +
> > +	igt_subtest_group {
> > +		igt_describe("Check if a hot unplugged and still open
> > device can be cleanly restored");
> > +		igt_subtest("hotreplug")
> > +			hotreplug(&priv);
> > +
> > +		igt_fixture
> > +			recover(&priv);
> > +	}
> > +
> > +	igt_fixture
> > +		post_healthcheck(&priv);
> > +
> > +	igt_subtest_group {
> > +		igt_describe("Check if a hot unbound driver instance
> > still open after hot rebind can be cleanly released");
> >  		igt_subtest("hotrebind-lateclose")
> >  			hotrebind_lateclose(&priv);
> >  
> > @@ -582,7 +638,7 @@ igt_main
> >  		post_healthcheck(&priv);
> >  
> >  	igt_subtest_group {
> > -		igt_describe("Check if a still open while hot unplugged
> > device can be cleanly restored, then the old instance released");
> > +		igt_describe("Check if an instance of a still open
> > while hot replugged device can be cleanly released");
> >  		igt_subtest("hotreplug-lateclose")
> >  			hotreplug_lateclose(&priv);
> >  



More information about the Intel-gfx mailing list