[igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload

Tue May 7 06:24:30 UTC 2019

On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > Hi Daniel,
> > 
> > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > From: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > > 
> > > > Put some workload on a device, then try to either remove (unplug) the
> > > > device from its bus, or unbind the device's driver from it, possibly
> > > > followed by module unload, depending on which specific subtest has been
> > > > selected.  If that succeeds, rescan the device's bus if needed and perform
> > > > health checks on the device with the driver possibly loaded back.
> > > > 
> > > > If module unload is requested, the workload is run in a sub-process,
> > > > not directly from the test, as it is expected to crash while still
> > > > keeping the device open for as long as its process has not exited.
> > > > 
> > > > The driver hot unbind / device hot unplug operation is expected to
> > > > succeed and the background workload sub-process to crash in a
> > > > reasonable time, however long timeouts are used to let kernel level
> > > > timeouts pop up first if hit by a bug.
> > > > 
> > > > The driver is ready to be extended with arbitrary workload
> > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > other hardware, unless they provide their implementation of
> > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > 
> > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > 
> > > High level comments and apologies that I didn't look at v2-v7 in between.

v1-v3 were submitted internally, so you actually joined and first commented on 
the first public submission, which I should have marked as v4 (I hadn't, 
sorry).

> > > 
> > > This all seems extremely complex for a simple batch spinner subtest ...
> > 
> > My initial intention was to build a simple hot unplug/unbind only test. I 
> > proposed to use an arbitrary external command as a workload.  Then, on 
> > Antonio's advice, I switched to the spinner based internal workload and I 
> > agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> > the scope of the test with device recovery and health checking.  Also, a few 
> > people, including you, Daniel, requested availability of more workload type 
> > options.  As a result, I've decided to build a *framework* for testing driver 
> > unbind + rebind / device unplug + bus rescan behavior under different workload 
> > types, easily extendable with more workload options as needed, with one 
> > example workload type - dummy load or spin batch - initially implemented.  
> > That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> > spinner subtest any longer.
> 
> Maybe my review got wrong, but I just meant that there's more tests to
> write here.

That was clear to me; however, I probably misunderstood your intentions with 
regard to device/driver recovery after successful unplug/unbind.

> Generally I think having the framework/generic solution before
> you have all the applications is the wrong way to build something. Usually
> it results in something which is generic in all the wrong ways, but not in
> the ones you will actually need. So complexity with no gain. Better to
> - add a few tests first with copypasting/minimal changes
> - refactor helpers once you see the real patterns
> - no framework, that's the midlayer mistake, see
>   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
>   all the articles linked from there.

OK, thanks for your recommendations and the references.

> > > do we really need all that complexity with 2nd process 
> > 
> > If we drop module unload option then no, we don't need 2nd process.  
> 
> Why does module unload require a 2nd process? We don't need a 2nd process
> in our other module unload tests either.

That's no longer a problem as we're going to drop the module unload step, 
but just to explain my approach:
In case of the spin workload, references are held after the workload crashes 
and it's not possible to unload the module unless we put them.   Since those 
references are internal to IGT libraries and not exposed to a user, putting 
them is only possible with functions provided by IGT.  Those functions are 
full of checks affecting subtest results and using them to clean up resources 
related to a no longer existing device would result in a subtest failure or 
skip at least.  The simplest way to get rid of those issues is to enclose 
those references in a subprocess and wait for their automatic release on its 
completion.
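
To illustrate the idea outside of IGT, here is a minimal sketch using plain 
POSIX calls only.  The names run_enclosed() and leaky_workload() are 
hypothetical and not part of the patch or of IGT; the workload simply opens 
/dev/null as a stand-in for a device file and never closes it.  The point is 
only that whatever the child leaves open is released by the kernel when the 
child exits, so the parent needs no cleanup calls of its own.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical workload: opens a file and deliberately never closes it. */
static int leaky_workload(void)
{
	int fd = open("/dev/null", O_RDONLY);

	/* fd is leaked on purpose; the kernel closes it when the child exits */
	return fd >= 0 ? 0 : 1;
}

/*
 * Run workload() in a child process and wait for it to finish.
 * Returns the child's exit status, or -1 if fork/wait failed or the
 * child was terminated by a signal.
 */
static int run_enclosed(int (*workload)(void))
{
	int wstatus;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0)
		_exit(workload());	/* child: references die with the process */

	if (waitpid(pid, &wstatus, 0) != pid)
		return -1;

	return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Calling run_enclosed(leaky_workload) from the parent returns only after the 
child is gone, at which point the descriptor it leaked no longer exists.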

> > > and watchers 
> > 
> > That was primarily needed for successful module unload.  If we drop that 
> > option and you think driver rebind / bus rescan operations can be performed 
> > blindly, without checking for completion of background workload, then I can 
> > drop the watchers.
> 
> Well we _have_ to do unbinds without checking the background workload has
> completed. That's the entire point of testing hotunplug. 

I agree, and the test performs all unbinds that way, i.e., without checking 
the background workload has completed.  Waiting for background workload 
completion applies only to what I'm considering a device recovery phase, and 
not to the "main" unbind/unplug test phase in any way.

> It's also why
> there's lots of work to do here, because the kernel is totally not ready
> for this.
> 
> First stopping everything and then unloading isn't an interesting test,

From its introduction, the module unload step was intended as part of a 
post-subtest device recovery phase, not as the substance of the subtest.  I added that step 
because I thought that would be the most reliable way of satisfying the CI 
requirement on restoring the device to the state ready for next tests without 
reboot or real device power-on reset on real hardware bus replug.

> that's more or less exactly what our various module unload tests are
> doing already.

Yes, and in v5-v7 I was even using the existing i915_module_load test as an 
external helper command performing device recovery and healthcheck phases in 
order to avoid reimplementing its code here.

> > > and a bunch of callbacks and everything, just to do a hotremove testcase?
> > 
> > I can still drop the framework and switch back to the initial simple structure 
> > with one or two fixed subtests if you don't like my structural approach.
> 
> See above for why, I think that will result in better code in the end.

OK.

> > > Very first patch looked much more reasonable, aside from that it broke CI
> > > since it didn't rebind the driver. 
> > 
> > Sorry, my understanding of your and Petri's comments was a bit different, I 
> > thought that by more than best effort you meant doing everything possible to 
> > restore the device to be ready for next test without reboot, and module unload 
> > and reload seemed the most reliable option to me.  Now I can see that there 
> > were probably two different requirements.  You were considering the test 
> > incomplete because it was performing only the unbind/unplug part and not 
> > rebind/rescan, while Petri was probably interested mostly in the device being 
> > ready for next tests without reboot, no matter which way.
> 
> Well it's the same request, and rebind/rescan /should/ result in a working
> device again. If not, then I guess we also have a bug on our hotreplug
> code. Which again is worth testing for.
> 
> > > We can always add complexity later on
> > > once we have dma-buf/dma-fence/kms/whatever else subtests here.
> > 
> > OK, as you wish.
> > 
> > > Also, I think we should have at least one hotremove-only-nothing-special
> > > subtest here, i.e. without even the busy batch.
> > 
> > It seems trivial to adjust the framework so that it accepts a NULL workload, if 
> > the framework survives. Anyway, I'll do that.  Should I put it in a separate 
> > NULL workload subtest function to be called from igt_main?  Or add it to the 
> > spin workload subtest function specifically as an option?
> 
> Separate test as the first subtest.

OK.

> Maybe even include the "shut
> everything off first" logic from module unload,

Do you mean i915_module_load.c?

> to have the most baseline test possible.
> 
> > > I'm also not sure why we also put module unload tests in there. 
> > 
> > As I tried to explain above, I introduced module unload in order to satisfy 
> > the CI requirement on the device being ready for next test without reboot as 
> > much as possible.
> 
> Hm, but why? What does module reload help in this regard that a rebind
> can't do? Aside from testing module reload, which is a developer feature
> and already tested elsewhere.

As I said, I decided to use module unload as I thought it would be the most 
reliable way of device recovery from simulated unplug in case no real power-on 
reset is performed.  I don't insist on keeping it there at all, I only tried 
to explain why I did that.  Since it can't help in any way to recover the 
device so that it is ready for the next tests, as CI requested, I'll be happy to 
remove it and stick to pure driver rebind / bus rescan operations.

> I'm also not seeing much interactions between hotunplug and module unload.
> 
> The one interesting testcase I see is trying to unload the module after we
> hotunplugged, while the driver is still in use somewhere (open drm fd,
> open dma-buf fd, open dma-fence fd). That should result in a failure, and
> it's useful to validate that the kernel is handling the module refcounting
> correctly in all these cases. But that's a specific negative testcase (and
> actually being able to unload would be a failure and likely result in a
> kernel oops), I'm not seeing the benefit of reloading the module.

That's perfectly clear to me; the optional module unload step will not be 
there in the next iteration.

> > > Compared
> > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > or whatever), which is something users can do, module unload is explicitly a 
> > > developer only feature.
> > 
> > My approach was to be able to test driver behavior under any hot unload 
> > operation available to a user, no matter if developer oriented or not, so we 
> > can make the driver resistant to users performing potentially dangerous hot 
> > unbind/unplug operations available to them, intentionally or not.
> 
> Yes I agree with that, we need to test hotunplug.
> 
> btw the real fun isn't the unbind in sysfs, but physically unplugging a
> pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> and the best way to test that hotunplug is through the sysfs unbind
> support (it's not exactly the same since this way we'll never see failing
> pci transactions, which are an entirely different kind of fun).

I fully agree; however, please note that what I'm calling device hot unplug is 
probably still an interesting sysfs option alongside driver hot unbind.

> > > We do not expect module unload to work under all
> > > possible conditions (it doesn't). 
> > 
> > Do you think that driver rebind operation has more chances to succeed, 
> > especially on a device on which a bus unplug operation was not actually 
> > performed but only simulated via sysfs, on a device which then has been left 
> > in an unpredictable state and hasn't undergone a hardware power-on reset on 
> > physical bus re-plug?
> 
> There's definitely potential for bugs, but I don't see how module reload
> helps. Module reload is essentially:
> 
> - unbind devices
> - unload module
> - reload module
> - rebind all devices
> 
> The only additional magic that module unload can paper over is that it's
> disallowed while anyone is still using any devices (assuming the module
> refcount code is correct). That's not the case for unbind/hotunplug. But
> that's it, there's no additional magic code being run when you unload the
> module. Hence why I don't understand why you want to do that.

Not any longer :-)

> > > I'd drop that part and focus completely
> > > on the hotremove/unbind testcase here.
> > 
> > Driver unbind / device unplug via sysfs can also be considered developer only 
> > features. Do you think we should drop driver unbind option, leaving only 
> > device unplug via sysfs for which we may have no good non-developer 
> > alternative?
> 
> Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> user action. That's why we care.
> 
> We didn't care for unbind (I wontfix closed all the bugs myself) while
> intel only created built-in gpus because it's indeed fairly pointless to
> unbind these.
> 
> Other bit I don't quite get: What's the difference between unbind and
> unplug?

I'm not sure what information you are missing.  What the test is doing is:

driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
vs.
virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove

Panic call traces look a bit different, you may want to compare the following two:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html

rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
vs.
replug: echo 1 >/sys/bus/<bus>/rescan
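
For reference, the four sysfs writes above can be sketched in plain C as 
follows.  This is only an illustration, not the code from the patch (which 
uses igt_sysfs_set() on open directory descriptors); the names hot_op_path(), 
hot_op_value() and sysfs_write() are hypothetical, and the bus, driver and 
address strings are whatever the caller supplies.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum hot_op { HOT_UNBIND, HOT_BIND, HOT_REMOVE, HOT_RESCAN };

/* Build the sysfs attribute path; returns 0, or -1 if truncated or bad op. */
static int hot_op_path(char *buf, size_t size, enum hot_op op,
		       const char *bus, const char *driver, const char *addr)
{
	int n;

	switch (op) {
	case HOT_UNBIND:
		n = snprintf(buf, size, "/sys/bus/%s/drivers/%s/unbind", bus, driver);
		break;
	case HOT_BIND:
		n = snprintf(buf, size, "/sys/bus/%s/drivers/%s/bind", bus, driver);
		break;
	case HOT_REMOVE:
		n = snprintf(buf, size, "/sys/bus/%s/devices/%s/remove", bus, addr);
		break;
	case HOT_RESCAN:
		n = snprintf(buf, size, "/sys/bus/%s/rescan", bus);
		break;
	default:
		n = -1;
		break;
	}

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* bind/unbind take the device's bus address, remove/rescan take "1". */
static const char *hot_op_value(enum hot_op op, const char *addr)
{
	return (op == HOT_UNBIND || op == HOT_BIND) ? addr : "1";
}

/* Write value to a sysfs attribute; returns 0 on success, -1 on error. */
static int sysfs_write(const char *path, const char *value)
{
	ssize_t len = (ssize_t)strlen(value);
	int fd = open(path, O_WRONLY);
	int ok;

	if (fd < 0)
		return -1;

	ok = write(fd, value, (size_t)len) == len;
	close(fd);

	return ok ? 0 : -1;
}
```

The sketch makes the asymmetry visible: unbind and bind go through the 
driver's directory and name the device, while remove goes through the device's 
own directory and rescan through the bus, with no driver named at all.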

With no panics accompanying driver unbind / device unplug under active spin 
workload on older hardware, the recovery phase is however still giving a 
different result for each of those two methods:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html

BTW, trybot results confirm that module unload really doesn't help:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html

Thanks,
Janusz

> -Daniel
> 
> > 
> > Thanks,
> > Janusz
> > 
> > 
> > > -Daniel
> > > 
> > > > ---
> > > >  tests/Makefile.sources  |   1 +
> > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > >  tests/meson.build       |   1 +
> > > >  3 files changed, 410 insertions(+)
> > > >  create mode 100644 tests/core_hot_reload.c
> > > > 
> > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > index 7f921f6c..452d8ed7 100644
> > > > --- a/tests/Makefile.sources
> > > > +++ b/tests/Makefile.sources
> > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > >  	core_getclient \
> > > >  	core_getstats \
> > > >  	core_getversion \
> > > > +	core_hot_reload \
> > > >  	core_setmaster_vs_auth \
> > > >  	debugfs_test \
> > > >  	drm_import_export \
> > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > new file mode 100644
> > > > index 00000000..6673f55c
> > > > --- /dev/null
> > > > +++ b/tests/core_hot_reload.c
> > > > @@ -0,0 +1,408 @@
> > > > +/*
> > > > + * Copyright © 2019 Intel Corporation
> > > > + *
> > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > + * copy of this software and associated documentation files (the "Software"),
> > > > + * to deal in the Software without restriction, including without limitation
> > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > + * Software is furnished to do so, subject to the following conditions:
> > > > + *
> > > > + * The above copyright notice and this permission notice (including the next
> > > > + * paragraph) shall be included in all copies or substantial portions of the
> > > > + * Software.
> > > > + *
> > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > > > + * IN THE SOFTWARE.
> > > > + */
> > > > +
> > > > +#include "igt.h"
> > > > +#include "igt_device.h"
> > > > +#include "igt_dummyload.h"
> > > > +#include "igt_kmod.h"
> > > > +#include "igt_sysfs.h"
> > > > +
> > > > +#include <getopt.h>
> > > > +#include <limits.h>
> > > > +#include <string.h>
> > > > +#include <unistd.h>
> > > > +
> > > > +#include <sys/types.h>
> > > > +#include <sys/wait.h>
> > > > +
> > > > +/**
> > > > + * A post-action device recovery function:
> > > > + * @priv: a pointer to private data required for device recovery
> > > > + *
> > > > + * Make the device re-appear
> > > > + */
> > > > +typedef void (*recover_t)(const void *priv);
> > > > +
> > > > +/**
> > > > + * A test action function:
> > > > + * @dir: file descriptor of an open device sysfs directory
> > > > + * @module: module name, non-NULL indicates post-action module unload requested
> > > > + * @recover: for returning a pointer to a post-action device recovery function
> > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > + *
> > > > + * Make the device disappear
> > > > + */
> > > > +typedef void (*action_t)(int device, const char *module,
> > > > +			 recover_t *recover, const void **priv);
> > > > +
> > > > +/**
> > > > + * A workload completion wait function:
> > > > + * @device: open device file descriptor
> > > > + * @priv: a pointer to private data required by the wait function
> > > > + *
> > > > + * Wait for completion of background workload
> > > > + */
> > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > +
> > > > +/**
> > > > + * A workload function:
> > > > + * @device: open device file descriptor
> > > > + * @arg: an optional string argument passed to the workload function
> > > > + * @workload_wait: for returning a pointer to workload completion wait function
> > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > + *
> > > > + * Put some long lasting load on the device
> > > > + */
> > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > +			   workload_wait_t *workload_wait, void **priv);
> > > > +
> > > > +/**
> > > > + * Pairs of test action / device recovery functions
> > > > + */
> > > > +
> > > > +/* Unbind / re-bind */
> > > > +
> > > > +struct rebind_data {
> > > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > > +	char *device;	/* bus specific device address as string */
> > > > +};
> > > > +
> > > > +/* Re-bind the driver to the device */
> > > > +static void driver_bind(const void *priv)
> > > > +{
> > > > +	const struct rebind_data *data = priv;
> > > > +
> > > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > > +
> > > > +	close(data->driver);
> > > > +}
> > > > +
> > > > +/* Unbind the driver from the device */
> > > > +static void driver_unbind(int device, const char *module,
> > > > +			  recover_t *recover, const void **priv)
> > > > +{
> > > > +	static char path[PATH_MAX];
> > > > +	static struct rebind_data data;
> > > > +	int len;
> > > > +
> > > > +	/* collect information required for driver bind/unbind */
> > > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > +	igt_assert(data.driver >= 0);
> > > > +
> > > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > +	path[len] = '\0';
> > > > +	data.device = strrchr(path, '/') + 1;
> > > > +
> > > > +	/* unbind the driver */
> > > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > > +
> > > > +	/* pass back info on how to recover the device */
> > > > +	if (module) {
> > > > +		/* don't try to rebind if module will be unloaded */
> > > > +		*recover = NULL;
> > > > +	} else {
> > > > +		*recover = driver_bind;
> > > > +		*priv = &data;
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Unplug / re-plug */
> > > > +
> > > > +/* Re-discover the device by rescanning its bus */
> > > > +static void bus_rescan(const void *priv)
> > > > +{
> > > > +	const int *bus = priv;
> > > > +
> > > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > > +
> > > > +	close(*bus);
> > > > +}
> > > > +
> > > > +/* Remove (virtually unplug) the device from its bus */
> > > > +static void device_unplug(int device, const char *module,
> > > > +			  recover_t *recover, const void **priv)
> > > > +{
> > > > +	static int bus;
> > > > +
> > > > +	/* collect information required for bus rescan */
> > > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > +	igt_assert(bus >= 0);
> > > > +
> > > > +	/* remove the device */
> > > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > > +	igt_sysfs_set(device, "device/remove", "1");
> > > > +
> > > > +	/* pass back info on how to recover the device */
> > > > +	*recover = bus_rescan;
> > > > +	*priv = &bus;
> > > > +}
> > > > +
> > > > +/* Each test action function must be registered in the following table */
> > > > +static const struct {
> > > > +	const char *name;	/* unique test action name used in test names */
> > > > +	action_t function;	/* test action function pointer */
> > > > +} actions[] = {
> > > > +	{ "unbind", driver_unbind, },
> > > > +	{ "unplug", device_unplug, },
> > > > +};
> > > > +
> > > > +/**
> > > > + * Pairs of workload / wait completion functions
> > > > + */
> > > > +
> > > > +/* A workload using igt_spin_new() */
> > > > +
> > > > +/* Wait for completion of dummy load */
> > > > +static void dummy_wait(int device, void *priv)
> > > > +{
> > > > +	igt_spin_t *spin = priv;
> > > > +
> > > > +	/* wait until the spin no longer runs, don't fail on error */
> > > > +	if (gem_wait(device, spin->handle, NULL))
> > > > +		__gem_set_domain(device, spin->handle,
> > > > +				 I915_GEM_DOMAIN_GTT, I915_GEM_DOMAIN_GTT);
> > > > +}
> > > > +
> > > > +/* Run dummy load */
> > > > +static void dummy_load(int device, const char *arg,
> > > > +		       workload_wait_t *workload_wait, void **priv)
> > > > +{
> > > > +	igt_spin_t *spin;
> > > > +
> > > > +	/* submit a job */
> > > > +	spin = igt_spin_new(device);
> > > > +
> > > > +	*workload_wait = dummy_wait;
> > > > +	*priv = spin;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Each workload function must be registered in the following table.
> > > > + * A function may be registered more than once under different workload names,
> > > > + * that makes sense as long as a different argument is specified for each name.
> > > > + */
> > > > +static const struct {
> > > > +	const char *name;	/* unique workload name used in test names */
> > > > +	workload_t function;	/* workload function pointer */
> > > > +	const char *arg;	/* optional constant string argument */
> > > > +} workloads[] = {
> > > > +	{ "spin", dummy_load, NULL, },
> > > > +};
> > > > +
> > > > +/**
> > > > + * Framework
> > > > + */
> > > > +
> > > > +static void healthcheck(int chipset)
> > > > +{
> > > > +	int device;
> > > > +
> > > > +	device = __drm_open_driver(chipset);
> > > > +	igt_assert(device >= 0);
> > > > +
> > > > +	if (chipset == DRIVER_INTEL)
> > > > +		gem_test_engine(device, ALL_ENGINES);
> > > > +
> > > > +	close(device);
> > > > +}
> > > > +
> > > > +static void module_unload(int chipset, const char *module)
> > > > +{
> > > > +	if (chipset == DRIVER_INTEL)
> > > > +		igt_assert(igt_i915_driver_unload() == IGT_EXIT_SUCCESS);
> > > > +	else
> > > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > +}
> > > > +
> > > > +static void run_action(int device, action_t action, const char *module,
> > > > +		      recover_t *recover, const void **priv)
> > > > +{
> > > > +	int dir;
> > > > +
> > > > +	dir = igt_sysfs_open(device);
> > > > +	igt_assert(dir >= 0);
> > > > +
> > > > +	action(dir, module, recover, priv);
> > > > +
> > > > +	close(dir);
> > > > +}
> > > > +
> > > > +static void wait_helper(int device, void *priv)
> > > > +{
> > > > +	struct igt_helper_process *proc = priv;
> > > > +
> > > > +	/* wait until the workload subprocess has completed */
> > > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > > +}
> > > > +
> > > > +static void run_workload(int device, workload_t workload, const char *arg,
> > > > +			 const char *module, workload_wait_t *workload_wait,
> > > > +			 void **priv)
> > > > +{
> > > > +	if (module) {
> > > > +		/* run workload in a subprocess so the module is put on crash */
> > > > +		static struct igt_helper_process proc;
> > > > +		int wstatus, ret;
> > > > +
> > > > +		bzero(&proc, sizeof(proc));
> > > > +
> > > > +		igt_fork_helper(&proc) {
> > > > +			/* suppress igt_log messages */
> > > > +			igt_log_level = IGT_LOG_NONE;
> > > > +
> > > > +			/* intercept igt_fail/skip() long jumps */
> > > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) {
> > > > +				workload(device, arg, workload_wait, priv);
> > > > +
> > > > +				(*workload_wait)(device, *priv);
> > > > +
> > > > +				/* success if not diverted by igt_fail/skip() */
> > > > +				igt_success();
> > > > +			}
> > > > +
> > > > +			/* pass exit code back to the caller */
> > > > +			igt_exit();
> > > > +		}
> > > > +		/* let the background process start doing its job or fail */
> > > > +		sleep(2);
> > > > +		/* fail or skip on workload premature completion */
> > > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > +		if (ret < 0)
> > > > +			igt_fail(IGT_EXIT_INVALID);
> > > > +		if (ret) {
> > > > +			if (!WIFEXITED(wstatus))
> > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SUCCESS)
> > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > +				igt_skip(NULL);
> > > > +			igt_fail(WEXITSTATUS(wstatus));
> > > > +		}
> > > > +
> > > > +		/* pass back info on how to wait for helper completion */
> > > > +		*workload_wait = wait_helper;
> > > > +		*priv = &proc;
> > > > +	} else {
> > > > +		/* run the requested workload directly */
> > > > +		workload(device, arg, workload_wait, priv);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void run_subtest(int chipset, int workload, int action,
> > > > +			const char *module)
> > > > +{
> > > > +	workload_wait_t workload_wait;
> > > > +	void *workload_priv;
> > > > +	recover_t recover;
> > > > +	const void *recover_priv;
> > > > +	int device;
> > > > +
> > > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, actions[action].name,
> > > > +		      module ? "-unload" : "") {
> > > > +		device = __drm_open_driver(chipset);
> > > > +		igt_assert(device >= 0);
> > > > +
> > > > +		/* spawn the requested workload */
> > > > +		igt_debug("spawning background workload\n");
> > > > +		run_workload(device, workloads[workload].function,
> > > > +			     workloads[workload].arg, module,
> > > > +			     &workload_wait, &workload_priv);
> > > > +
> > > > +		/* run the requested test action */
> > > > +		igt_debug("running test action\n");
> > > > +		run_action(device, actions[action].function, module,
> > > > +			   &recover, &recover_priv);
> > > > +
> > > > +		if (workload_wait) {
> > > > +			igt_debug("waiting for workload completion\n");
> > > > +			workload_wait(device, workload_priv);
> > > > +		}
> > > > +
> > > > +		close(device);
> > > > +
> > > > +		if (module) {
> > > > +			igt_debug("unloading %s\n", module);
> > > > +			module_unload(chipset, module);
> > > > +		}
> > > > +
> > > > +		if (recover) {
> > > > +			igt_debug("recovering device\n");
> > > > +			recover(recover_priv);
> > > > +			igt_reset_timeout();
> > > > +		}
> > > > +
> > > > +		igt_debug("running healthcheck\n");
> > > > +		healthcheck(chipset);
> > > > +	}
> > > > +}
> > > > +
> > > > +igt_main {
> > > > +	int device, chipset;
> > > > +	char *module;
> > > > +	int i, j;
> > > > +
> > > > +	igt_fixture {
> > > > +		char path[PATH_MAX];
> > > > +		int dir, len;
> > > > +
> > > > +		/**
> > > > +		 * Since some subtests depend on successful unload of a driver
> > > > +		 * module, don't use drm_open_driver() as it keeps a device file
> > > > +		 * descriptor open for exit handler use and that effectively
> > > > +		 * prevents the module from being unloaded.
> > > > +		 */
> > > > +		device = __drm_open_driver(DRIVER_ANY);
> > > > +		igt_assert(device >= 0);
> > > > +
> > > > +		if (is_i915_device(device)) {
> > > > +			chipset = DRIVER_INTEL;
> > > > +			module = strdup("i915");
> > > > +		} else {
> > > > +			chipset = DRIVER_ANY;
> > > > +
> > > > +			/* Capture module name to be unloaded */
> > > > +			dir = igt_sysfs_open(device);
> > > > +			len = readlinkat(dir, "device/driver/module", path,
> > > > +					 sizeof(path) - 1);
> > > > +			close(dir);
> > > > +			path[len] = '\0';
> > > > +			module = strdup(strrchr(path, '/') + 1);
> > > > +		}
> > > > +		close(device);
> > > > +
> > > > +		igt_info("Running the test on driver \"%s\", chipset mask %#0x\n",
> > > > +			 module, chipset);
> > > > +	}
> > > > +
> > > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j++) {
> > > > +			/* with module unload */
> > > > +			run_subtest(chipset, i, j, module);
> > > > +			/* without module unload */
> > > > +			run_subtest(chipset, i, j, NULL);
> > > > +		}
> > > > +	}
> > > > +}
> > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > index 711979b4..0d418035 100644
> > > > --- a/tests/meson.build
> > > > +++ b/tests/meson.build
> > > > @@ -3,6 +3,7 @@ test_progs = [
> > > >  	'core_getclient',
> > > >  	'core_getstats',
> > > >  	'core_getversion',
> > > > +	'core_hot_reload',
> > > >  	'core_setmaster_vs_auth',
> > > >  	'debugfs_test',
> > > >  	'drm_import_export',
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
>