[igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload

Janusz Krzysztofik janusz.krzysztofik at linux.intel.com
Tue May 7 10:44:19 UTC 2019


On Tuesday, May 7, 2019 11:14:20 AM CEST Daniel Vetter wrote:
> On Tue, May 07, 2019 at 08:24:30AM +0200, Janusz Krzysztofik wrote:
> > On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> > > On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > > > Hi Daniel,
> > > > 
> > > > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > > > From: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > > > > 
> > > > > > Put some workload on a device, then try to either remove (unplug) the
> > > > > > device from its bus, or unbind the device's driver from it, possibly
> > > > > > followed by module unload, depending on which specific subtest has been
> > > > > > > selected.  If that succeeds, rescan the device's bus if needed and perform
> > > > > > health checks on the device with the driver possibly loaded back.
> > > > > > 
> > > > > > If module unload is requested, the workload is run in a sub-process,
> > > > > > not directly from the test, as it is expected to crash while still
> > > > > > keeping the device open for as long as its process has not exited.
> > > > > > 
> > > > > > The driver hot unbind / device hot unplug operation is expected to
> > > > > > succeed and the background workload sub-process to crash in a
> > > > > > reasonable time, however long timeouts are used to let kernel level
> > > > > > timeouts pop up first if hit by a bug.
> > > > > > 
> > > > > > > The test is ready to be extended with arbitrary workload
> > > > > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > > > other hardware, unless they provide their implementation of
> > > > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > > > 
> > > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > > > 
> > > > > High level comments and apologies that I didn't look at v2-v7 in between.
> > 
> > v1-v3 were submitted internally, so you actually joined and first commented on
> > the first public submission, which I should have marked as v4 (I didn't,
> > sorry).
> > 
> > > > > 
> > > > > This all seems extremely complex for a simple batch spinner subtest ...
> > > > 
> > > > My initial intention was to build a simple hot unplug/unbind only test. I 
> > > > proposed to use an arbitrary external command as a workload.  Then, on 
> > > > Antonio's advice, I switched to the spinner based internal workload and I 
> > > > agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> > > > the scope of the test with device recovery and health checking.  Also, a few 
> > > > people, including you, Daniel, requested availability of more workload type 
> > > > options.  As a result, I've decided to build a *framework* for testing driver 
> > > > unbind + rebind / device unplug + bus rescan behavior under different workload 
> > > > types, easily extendable with more workload options as needed, with one 
> > > > example workload type - dummy load or spin batch - initially implemented.  
> > > > That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> > > > spinner subtest any longer.
> > > 
> > > Maybe my review came across wrong, but I just meant that there's more tests to
> > > write here.
> > 
> > That was clear to me, however I probably misunderstood your intentions in 
> > regard to device/driver recovery after successful unplug/unbind.
> > 
> > > Generally I think having the framework/generic solution before
> > > you have all the applications is the wrong way to build something. Usually
> > > it results in something which is generic in all the wrong ways, but not in
> > > the ones you will actually need. So complexity with no gain. Better to
> > > - add a few tests first with copypasting/minimal changes
> > > - refactor helpers once you see the real patterns
> > > - no framework, that's the midlayer mistake, see
> > >   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
> > >   all the articles linked from there.
> > 
> > OK, thanks for your recommendations and the references.
> > 
> > > > > do we really need all that complexity with 2nd process 
> > > > 
> > > > If we drop module unload option then no, we don't need 2nd process.  
> > > 
> > > Why does module unload require a 2nd process? We don't need a 2nd process
> > > in our other module unload tests either.
> > 
> > That's no longer the problem as we're going to drop the module unload step, 
> > but just to provide you with an explanation of my approach:
> > In case of the spin workload, references are held after the workload crashes 
> > and it's not possible to unload the module unless we put them.   Since those 
> > references are internal to IGT libraries and not exposed to a user, putting 
> > them is only possible with functions provided by IGT.  Those functions are 
> > full of checks affecting subtest results and using them to clean up resources 
> > related to a no longer existing device would result in a subtest failure or 
> > skip at least.  The simplest way to get rid of those issues is to enclose 
> > those references in a subprocess and wait for their automatic release on its 
> > completion.
> 
> 
> Hm which references? Closing the file descriptors is all we should need to
> be doing to make the module unloadable. 

That's exactly what I meant.  Unfortunately some of those file descriptors are 
private to IGT lib.
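
To make "unloadable" concrete: on kernels built with module unloading
support, the module use count is visible in /sys/module/<module>/refcnt, and
it has to drop to zero before a regular unload can succeed.  Below is a
minimal sketch of reading it.  Note that read_refcnt() is a hypothetical
helper written only for illustration, it is not an existing IGT function.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a module use count from a refcnt-style file.
 * The caller supplies the full path, e.g. "/sys/module/i915/refcnt".
 * Returns the count, or -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, not part of the IGT library.)
 */
long read_refcnt(const char *path)
{
	FILE *f;
	long count;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* the attribute holds a single decimal number */
	if (fscanf(f, "%ld", &count) != 1)
		count = -1;

	fclose(f);
	return count;
}
```

A test could poll this value after closing its own file descriptors to see
whether something else, like a descriptor private to the library, still pins
the module.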

> I think an explicit helper
> function to do that (exported from core lib) is much better than killing a
> process (or waiting for that process to die). It's more explicit code at
> least (and that's generally better for testcases).

Do you think it's worth the effort to extend core lib with less assertive 
variants of existing functions, useful specifically for the hotunplug test and 
maybe no others?  I have identified quite a few such functions; however, with 
the approach you suggest of not doing too much cleanup before recovery, I'm 
not sure they are still needed, maybe only for a subtest with module unload.

> > > > > and watchers 
> > > > 
> > > > That was primarily needed for successful module unload.  If we drop that 
> > > > option and you think driver rebind / bus rescan operations can be performed 
> > > > blindly, without checking for completion of background workload, then I can 
> > > > drop the watchers.
> > > 
> > > Well we _have_ to do unbinds without checking the background workload has
> > > completed. That's the entire point of testing hotunplug. 
> > 
> > I agree, and the test performs all unbinds that way, i.e., without checking 
> > the background workload has completed.  Waiting for background workload 
> > completion applies only to what I'm considering a device recovery phase, and 
> > not to the "main" unbind/unplug test phase in any way.
> 
> Ah ok. At least for rescan I think it would also make sense to not wait,
> that's another interesting (and even more evil) testcase. This would check
> for issues around assigning device node minor numbers. We'd only need one
> such case, and all it needs to do really is keep the drm device fd open.

OK.

> > > It's also why
> > > there's lots of work to do here, because the kernel is totally not ready
> > > for this.
> > > 
> > > First stopping everything and then unloading isn't an interesting test,
> > 
> > Since its introduction, the module unload step was intended as a part of a 
> > post-subtest device recovery phase, not the subtest merit.  I added that step 
> > because I thought that would be the most reliable way of satisfying the CI 
> > requirement on restoring the device to the state ready for next tests without 
> > reboot or real device power-on reset on real hardware bus replug.
> 
> Yes I understand that. But what are you trying to recover from with a
> module reload? Just code sharing as you explain below, or other reasons?

Nothing specific.  Aiming at successful recovery of the device so it's ready 
for the next tests without a reboot, I just intuitively tried to avoid 
rediscovering it, possibly in a completely unpredictable state after the fake 
unplug, and that intuition, probably mixed with my ignorance, suggested that I 
use module unload before bus rescan.

> > > that's more or less exactly what our various module unload tests are
> > > doing already.
> > 
> > Yes, and in v5-v7 I was even using the existing i915_module_load test as an 
> > external helper command performing device recovery and healthcheck phases in 
> > order to avoid reimplementing its code here.
> > 
> > > > > and a bunch of callbacks and everything, just to do a hotremove testcase?
> > > > 
> > > > I can still drop the framework and switch back to the initial simple structure 
> > > > with one or two fixed subtests if you don't like my structural approach.
> > > 
> > > See above for why, I think that will result in better code in the end.
> > 
> > OK.
> > 
> > > > > Very first patch looked much more reasonable, aside from that it broke CI
> > > > > since it didn't rebind the driver. 
> > > > 
> > > > Sorry, my understanding of your and Petri's comments was a bit different, I 
> > > > thought that by more than best effort you meant doing everything possible to 
> > > > restore the device to be ready for next test without reboot, and module unload 
> > > > and reload seemed the most reliable option to me.  Now I can see that there 
> > > > were probably two different requirements.  You were considering the test 
> > > > incomplete because it was performing only the unbind/unplug part and not 
> > > > rebind/rescan, while Petri was probably interested mostly in the device being 
> > > > ready for next tests without reboot, no matter which way.
> > > 
> > > Well it's the same request, and rebind/rescan /should/ result in a working
> > > device again. If not, then I guess we also have a bug on our hotreplug
> > > code. Which again is worth testing for.
> > > 
> > > > > We can always add complexity later on
> > > > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > > > 
> > > > OK, as you wish.
> > > > 
> > > > > Also, I think we should have at least one hotremove-only-nothing-special
> > > > > subtest here, i.e. without even the busy batch.
> > > > 
> > > > It seems trivial to adjust the framework so that it accepts a NULL workload, if 
> > > > the framework survives. Anyway, I'll do that.  Should I put it in a separate 
> > > > NULL workload subtest function to be called from igt_main?  Or add it to the 
> > > > spin workload subtest function specifically as an option?
> > > 
> > > Separate test as the first subtest.
> > 
> > OK.
> > 
> > > Maybe even include the "shut
> > > everything off first" logic from module unload,
> > 
> > Do you mean i915_module_load.c?
> 
> On 2nd thought, for a hotunplug we shouldn't need any of that. Those "shut
> everything off first" steps are just to lower the module use count, so
> we're allowed to unload the module. The unplug code we run should take
> care of all that already for us for a (fake) hotunplug. So module reload
> should just work. But then I still don't understand where you see the
> benefit in unloading/reloading the module.

Not any longer ;-)

> > > to have the most baseline test possible.
> > > 
> > > > > I'm also not sure why we also put module unload tests in there. 
> > > > 
> > > > As I tried to explain above, I introduced module unload in order to satisfy 
> > > > the CI requirement on the device being ready for next test without reboot as 
> > > > much as possible.
> > > 
> > > Hm, but why? What does module reload help in this regard that a rebind
> > > can't do? Aside from testing module reload, which is a developer feature
> > > and already tested elsewhere.
> > 
> > As I said, I decided to use module unload as I thought it would be the most 
> > reliable way of device recovery from simulated unplug in case no real power-on 
> > reset is performed.  I didn't insist on keeping it there at all, I only tried 
> > to explain why I did that.  Since it can't help in any way to recover the 
> > device so that it is ready for the next tests, as CI requested, I'll be happy 
> > to remove it and stick to pure driver rebind / bus rescan operations.
> > 
> > > I'm also not seeing much interactions between hotunplug and module unload.
> > > 
> > > The one interesting testcase I see is trying to unload the module after we
> > > hotunplugged, while the driver is still in use somewhere (open drm fd,
> > > open dma-buf fd, open dma-fence fd). That should result in a failure, and
> > > it's useful to validate that the kernel is handling the module refcounting
> > > correctly in all these cases. But that's a specific negative testcase (and
> > > actually being able to unload would be a failure and likely result in a
> > > kernel oops), I'm not seeing the benefit of reloading the module.
> > 
> > That's perfectly clear to me, the optional module unload step will not be 
> > there in the next iteration.
> 
> That might be overshooting slightly. There is at least one interesting
> hotunplug testcase involving module unload. But it's more a special case,
> not something we need to do for all subtests.

OK, I can try to implement it once I'm sure what exactly you think it should do.

> > > > > Compared
> > > > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > > > or whatever), which is something users can do, module unload is explicitly a 
> > > > > developer only feature.
> > > > 
> > > > My approach was to be able to test driver behavior under any hot unload 
> > > > operation available to a user, no matter if developer oriented or not, so we 
> > > > can make the driver resistant to users performing potentially dangerous hot 
> > > > unbind/unplug operations available to them, intentionally or not.
> > > 
> > > Yes I agree with that, we need to test hotunplug.
> > > 
> > > btw the real fun isn't the unbind in sysfs, but physically unplugging a
> > > pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> > > and the best way to test that hotunplug is through the sysfs unbind
> > > support (it's not exactly the same since this way we'll never see failing
> > > pci transactions, which are an entirely different kind of fun).
> > 
> > I fully agree, however please note that what I'm calling device hot unplug is 
> > probably still an interesting sysfs option alongside driver hot unbind.
> > 
> > > > > We do not expect module unload to work under all
> > > > > possible conditions (it doesn't). 
> > > > 
> > > > Do you think that driver rebind operation has more chances to succeed, 
> > > > especially on a device on which a bus unplug operation was not actually 
> > > > performed but only simulated via sysfs, on a device which then has been left 
> > > > in an unpredictable state and hasn't undergone a hardware power-on reset on 
> > > > physical bus re-plug?
> > > 
> > > There's definitely potential for bugs, but I don't see how module reload
> > > helps. Module reload is essentially:
> > > 
> > > - unbind devices
> > > - unload module
> > > - reload module
> > > - rebind all devices
> > > 
> > > The only additional magic that module unload can paper over is that it's
> > > disallowed while anyone is still using any devices (assuming the module
> > > refcount code is correct). That's not the case for unbind/hotunplug. But
> > > that's it, there's no additional magic code being run when you unload the
> > > module. Hence why I don't understand why you want to do that.
> > 
> > Not any longer :-)
> > 
> > > > > I'd drop that part and focus completely
> > > > > on the hotremove/unbind testcase here.
> > > > 
> > > > Driver unbind / device unplug via sysfs can also be considered developer only 
> > > > features. Do you think we should drop driver unbind option, leaving only 
> > > > device unplug via sysfs for which we may have no good non-developer 
> > > > alternative?
> > > 
> > > Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> > > user action. That's why we care.
> > > 
> > > We didn't care for unbind (I wontfix closed all the bugs myself) while
> > > intel only created built-in gpus because it's indeed fairly pointless to
> > > unbind these.
> > > 
> > > Other bit I don't quite get: What's the difference between unbind and
> > > unplug?
> > 
> > I'm not sure what information you are missing.  What the test is doing is:
> > 
> > driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
> > vs.
> > virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove
> 
> I was missing the above I guess.
> 
> So looking at kernel code the difference is that when we unbind the entire
> driver, we loop over all currently bound devices and do the same as the
> remove sysfs file. So kinda redundant, I'd drop the driver unbind
> testcase.

OK.
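
For reference, the two sysfs operations compared above boil down to writing a
short string to one of two attribute files.  A minimal sketch in plain C
follows; write_attr(), unbind_path() and remove_path() are hypothetical names
used only for illustration (the test itself goes through igt_sysfs_set()),
and the bus, driver and device strings are supplied by the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a string to an existing file; return 0 on success, -1 on error. */
int write_attr(const char *path, const char *value)
{
	size_t len;
	ssize_t ret;
	int fd;

	assert(path != NULL && value != NULL);

	/* sysfs attributes already exist, so no O_CREAT here */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = strlen(value);
	ret = write(fd, value, len);
	close(fd);

	return (ret == (ssize_t)len) ? 0 : -1;
}

/* Build "/sys/bus/<bus>/drivers/<driver>/unbind"; -1 if buf is too small. */
int unbind_path(char *buf, size_t size, const char *bus, const char *driver)
{
	int n = snprintf(buf, size, "/sys/bus/%s/drivers/%s/unbind",
			 bus, driver);

	return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Build "/sys/bus/<bus>/devices/<device>/remove"; -1 if buf is too small. */
int remove_path(char *buf, size_t size, const char *bus, const char *device)
{
	int n = snprintf(buf, size, "/sys/bus/%s/devices/%s/remove",
			 bus, device);

	return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

With these, driver unbind is write_attr(path, "<device bus' address>") on the
unbind path, and virtual unplug is write_attr(path, "1") on the remove path.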

> Also, are you digging around in the kernel already and trying to
> understand what's going on and how it all ties together? And have you
> started to look at the bugs this uncovers in the kernel, or who's supposed
> to work on that side of this effort?

I'm trying to.  A first step was:
https://github.com/freedesktop/drm-intel/commit/d69990e0c399e4f7f9b50505d3285e5de991148a

I've already tested two other patches:
https://patchwork.freedesktop.org/series/60053/
https://patchwork.freedesktop.org/series/60051/

Now I'm trying to resolve the GEM_BUG_ON(vma->obj != obj) issue which popped 
up with both above patches applied. I'm not yet sure to what extent it has 
been simply uncovered vs. just triggered by my second patch. 

> Bonus points if we unbind the same device as we'd pick for the drm fd
> (there's some selection logic, and if you go through /sys/class/drm you
> should find the right device). This is relevant for discrete/multi-gpu
> systems.

I think I took that into account seriously enough while planning the subtest 
actions.  Sysfs operations are performed on nodes resolved from the device 
file descriptor.
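
To illustrate what "resolved" means here: the bus address of a device is the
last path component of the target of its "device" symlink in sysfs.  Below is
a minimal sketch of that lookup with error checks; link_basename() is a
hypothetical helper name used only for illustration, and the directory file
descriptor is assumed to come from something like igt_sysfs_open().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Read the symlink "name" relative to the directory "dirfd" and copy the
 * last path component of its target into "buf".
 * Returns 0 on success, -1 on error (unreadable link, empty component,
 * or buffer too small).  Hypothetical helper, not part of IGT.
 */
int link_basename(int dirfd, const char *name, char *buf, size_t size)
{
	char target[PATH_MAX];
	const char *base;
	ssize_t len;

	assert(name != NULL && buf != NULL);

	/* readlinkat() does not NUL-terminate, leave room for it */
	len = readlinkat(dirfd, name, target, sizeof(target) - 1);
	if (len <= 0)
		return -1;
	target[len] = '\0';

	/* take what follows the last '/', or the whole target if none */
	base = strrchr(target, '/');
	base = base ? base + 1 : target;

	if (*base == '\0' || strlen(base) >= size)
		return -1;

	strcpy(buf, base);
	return 0;
}
```

Called as link_basename(dir, "device", addr, sizeof(addr)), it would yield the
string to be written to the driver's unbind or bind attribute.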

> > Panic call traces look a bit different, you may want to compare the following two:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html
> 
> Hm I think we also need a hotunplug testcase that does absolutely nothing
> first, i.e. no spin batches, no open drm files, nothing else.

OK.

> > rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
> > vs.
> > replug: echo 1 >/sys/bus/<bus>/rescan
> > 
> > With no panics accompanying driver unbind / device unplug under active spin 
> > workload on older hardware, the recovery phase is however still giving a 
> > different result for each of those two methods:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html
> 
> Shouldn't really be a difference, but maybe there's timing changes that
> slightly influence the outcome.

I rather thought of bus operations still being available for the driver to 
shut down the device (more) cleanly on unbind, but again, that's an intuitive 
guess rather than real knowledge.

Thanks,
Janusz


> > BTW, trybot results confirm that module unload really doesn't help:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html
> 
> Yeah worst case we have an additional module refcount bug and then module
> unload will make things worse. I can't come up with a scenario where
> module unload would help (there's reasons it's a developer-only thing,
> it's really hard to get right).
> 
> Cheers, Daniel
> 
> 
> > 
> > Thanks,
> > Janusz
> > 
> > 
> > > -Daniel
> > > 
> > > > 
> > > > Thanks,
> > > > Janusz
> > > > 
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > ---
> > > > > >  tests/Makefile.sources  |   1 +
> > > > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > > > >  tests/meson.build       |   1 +
> > > > > >  3 files changed, 410 insertions(+)
> > > > > >  create mode 100644 tests/core_hot_reload.c
> > > > > > 
> > > > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > > > index 7f921f6c..452d8ed7 100644
> > > > > > --- a/tests/Makefile.sources
> > > > > > +++ b/tests/Makefile.sources
> > > > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > > > >  	core_getclient \
> > > > > >  	core_getstats \
> > > > > >  	core_getversion \
> > > > > > +	core_hot_reload \
> > > > > >  	core_setmaster_vs_auth \
> > > > > >  	debugfs_test \
> > > > > >  	drm_import_export \
> > > > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > > > new file mode 100644
> > > > > > index 00000000..6673f55c
> > > > > > --- /dev/null
> > > > > > +++ b/tests/core_hot_reload.c
> > > > > > @@ -0,0 +1,408 @@
> > > > > > +/*
> > > > > > + * Copyright © 2019 Intel Corporation
> > > > > > + *
> > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > + *
> > > > > > > + * The above copyright notice and this permission notice (including the next
> > > > > > > + * paragraph) shall be included in all copies or substantial portions of the
> > > > > > > + * Software.
> > > > > > > + *
> > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > > > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > > > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > > > > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > > > > > > + * IN THE SOFTWARE.
> > > > > > + */
> > > > > > +
> > > > > > +#include "igt.h"
> > > > > > +#include "igt_device.h"
> > > > > > +#include "igt_dummyload.h"
> > > > > > +#include "igt_kmod.h"
> > > > > > +#include "igt_sysfs.h"
> > > > > > +
> > > > > > +#include <getopt.h>
> > > > > > +#include <limits.h>
> > > > > > +#include <string.h>
> > > > > > +#include <unistd.h>
> > > > > > +
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/wait.h>
> > > > > > +
> > > > > > +/**
> > > > > > + * A post-action device recovery function:
> > > > > > + * @priv: a pointer to private data required for device recovery
> > > > > > + *
> > > > > > + * Make the device re-appear
> > > > > > + */
> > > > > > +typedef void (*recover_t)(const void *priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A test action function:
> > > > > > + * @dir: file descriptor of an open device sysfs directory
> > > > > > > + * @module: module name, non-NULL indicates post-action module unload requested
> > > > > > > + * @recover: for returning a pointer to a post-action device recovery function
> > > > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > > > + *
> > > > > > + * Make the device disappear
> > > > > > + */
> > > > > > +typedef void (*action_t)(int device, const char *module,
> > > > > > +			 recover_t *recover, const void **priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A workload completion wait function:
> > > > > > + * @device: open device file descriptor
> > > > > > + * @priv: a pointer to private data required by the wait function
> > > > > > + *
> > > > > > + * Wait for completion of background workload
> > > > > > + */
> > > > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A workload function:
> > > > > > + * @device: open device file descriptor
> > > > > > > + * @arg: an optional string argument passed to the workload function
> > > > > > > + * @workload_wait: for returning a pointer to workload completion wait function
> > > > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > > > + *
> > > > > > + * Put some long lasting load on the device
> > > > > > + */
> > > > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > > > > +			   workload_wait_t *workload_wait, void **priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * Pairs of test action / device recovery functions
> > > > > > + */
> > > > > > +
> > > > > > +/* Unbind / re-bind */
> > > > > > +
> > > > > > +struct rebind_data {
> > > > > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > > > > +	char *device;	/* bus specific device address as string */
> > > > > > +};
> > > > > > +
> > > > > > +/* Re-bind the driver to the device */
> > > > > > +static void driver_bind(const void *priv)
> > > > > > +{
> > > > > > +	const struct rebind_data *data = priv;
> > > > > > +
> > > > > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > > > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > > > > +
> > > > > > +	close(data->driver);
> > > > > > +}
> > > > > > +
> > > > > > +/* Unbind the driver from the device */
> > > > > > +static void driver_unbind(int device, const char *module,
> > > > > > +			  recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	static char path[PATH_MAX];
> > > > > > +	static struct rebind_data data;
> > > > > > +	int len;
> > > > > > +
> > > > > > +	/* collect information required for driver bind/unbind */
> > > > > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > > > +	igt_assert(data.driver >= 0);
> > > > > > +
> > > > > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > > > +	path[len] = '\0';
> > > > > > +	data.device = strrchr(path, '/') + 1;
> > > > > > +
> > > > > > +	/* unbind the driver */
> > > > > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > > > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > > > > +
> > > > > > +	/* pass back info on how to recover the device */
> > > > > > +	if (module) {
> > > > > > +		/* don't try to rebind if module will be unloaded */
> > > > > > +		*recover = NULL;
> > > > > > +	} else {
> > > > > > +		*recover = driver_bind;
> > > > > > +		*priv = &data;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/* Unplug / re-plug */
> > > > > > +
> > > > > > +/* Re-discover the device by rescanning its bus */
> > > > > > +static void bus_rescan(const void *priv)
> > > > > > +{
> > > > > > +	const int *bus = priv;
> > > > > > +
> > > > > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > > > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > > > > +
> > > > > > +	close(*bus);
> > > > > > +}
> > > > > > +
> > > > > > +/* Remove (virtually unplug) the device from its bus */
> > > > > > +static void device_unplug(int device, const char *module,
> > > > > > +			  recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	static int bus;
> > > > > > +
> > > > > > +	/* collect information required for bus rescan */
> > > > > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > > > +	igt_assert(bus >= 0);
> > > > > > +
> > > > > > +	/* remove the device */
> > > > > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > > > > +	igt_sysfs_set(device, "device/remove", "1");
> > > > > > +
> > > > > > +	/* pass back info on how to recover the device */
> > > > > > +	*recover = bus_rescan;
> > > > > > +	*priv = &bus;
> > > > > > +}
> > > > > > +
> > > > > > +/* Each test action function must be registered in the following table */
> > > > > > +static const struct {
> > > > > > > +	const char *name;	/* unique test action name used in test names */
> > > > > > +	action_t function;	/* test action function pointer */
> > > > > > +} actions[] = {
> > > > > > +	{ "unbind", driver_unbind, },
> > > > > > +	{ "unplug", device_unplug, },
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * Pairs of workload / wait completion functions
> > > > > > + */
> > > > > > +
> > > > > > > +/* A workload using igt_spin_new() */
> > > > > > > +
> > > > > > > +/* Wait for completion of dummy load */
> > > > > > +static void dummy_wait(int device, void *priv)
> > > > > > +{
> > > > > > +	igt_spin_t *spin = priv;
> > > > > > +
> > > > > > +	/* wait until the spin no longer runs, don't fail on error */
> > > > > > +	if (gem_wait(device, spin->handle, NULL))
> > > > > > +		__gem_set_domain(device, spin->handle,
> > > > > > > +				 I915_GEM_DOMAIN_GTT, I915_GEM_DOMAIN_GTT);
> > > > > > +}
> > > > > > +
> > > > > > +/* Run dummy load */
> > > > > > +static void dummy_load(int device, const char *arg,
> > > > > > +		       workload_wait_t *workload_wait, void **priv)
> > > > > > +{
> > > > > > +	igt_spin_t *spin;
> > > > > > +
> > > > > > +	/* submit a job */
> > > > > > +	spin = igt_spin_new(device);
> > > > > > +
> > > > > > +	*workload_wait = dummy_wait;
> > > > > > +	*priv = spin;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Each workload function must be registered in the following table.
> > > > > > > + * A function may be registered more than once under different workload names,
> > > > > > > + * that makes sense as long as a different argument is specified for each name.
> > > > > > + */
> > > > > > +static const struct {
> > > > > > > +	const char *name;	/* unique workload name used in test names */
> > > > > > +	workload_t function;	/* workload function pointer */
> > > > > > +	const char *arg;	/* optional constant string argument */
> > > > > > +} workloads[] = {
> > > > > > +	{ "spin", dummy_load, NULL, },
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * Framework
> > > > > > + */
> > > > > > +
> > > > > > +static void healthcheck(int chipset)
> > > > > > +{
> > > > > > +	int device;
> > > > > > +
> > > > > > +	device = __drm_open_driver(chipset);
> > > > > > +	igt_assert(device >= 0);
> > > > > > +
> > > > > > +	if (chipset == DRIVER_INTEL)
> > > > > > +		gem_test_engine(device, ALL_ENGINES);
> > > > > > +
> > > > > > +	close(device);
> > > > > > +}
> > > > > > +
> > > > > > +static void module_unload(int chipset, const char *module)
> > > > > > +{
> > > > > > +	if (chipset == DRIVER_INTEL)
> > > > > > > +		igt_assert(igt_i915_driver_unload() == IGT_EXIT_SUCCESS);
> > > > > > +	else
> > > > > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > > > +}
> > > > > > +
> > > > > > +static void run_action(int device, action_t action, const char *module,
> > > > > > +		      recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	int dir;
> > > > > > +
> > > > > > +	dir = igt_sysfs_open(device);
> > > > > > +	igt_assert(dir >= 0);
> > > > > > +
> > > > > > +	action(dir, module, recover, priv);
> > > > > > +
> > > > > > +	close(dir);
> > > > > > +}
> > > > > > +
> > > > > > +static void wait_helper(int device, void *priv)
> > > > > > +{
> > > > > > +	struct igt_helper_process *proc = priv;
> > > > > > +
> > > > > > +	/* wait until the workload subprocess has completed */
> > > > > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > > > > +}
> > > > > > +
> > > > > > > +static void run_workload(int device, workload_t workload, const char *arg,
> > > > > > > +			 const char *module, workload_wait_t *workload_wait,
> > > > > > > +			 void **priv)
> > > > > > +{
> > > > > > +	if (module) {
> > > > > > +		/* run workload in a subprocess so the module is put on 
> > > > crash */
> > > > > > +		static struct igt_helper_process proc;
> > > > > > +		int wstatus, ret;
> > > > > > +
> > > > > > +		bzero(&proc, sizeof(proc));
> > > > > > +
> > > > > > +		igt_fork_helper(&proc) {
> > > > > > +			/* suppress igt_log messages */
> > > > > > +			igt_log_level = IGT_LOG_NONE;
> > > > > > +
> > > > > > +			/* intercept igt_fail/skip() long jumps */
> > > > > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
> > > > {
> > > > > > +				workload(device, arg, 
> > > > workload_wait, priv);
> > > > > > +
> > > > > > +				(*workload_wait)(device, 
> > > > *priv);
> > > > > > +
> > > > > > +				/* success if not diverted by 
> > > > igt_fail/skip() */
> > > > > > +				igt_success();
> > > > > > +			}
> > > > > > +
> > > > > > +			/* pass exit code back to the caller */
> > > > > > +			igt_exit();
> > > > > > +		}
> > > > > > +		/* let the background process start doing its job or 
> > > > fail */
> > > > > > +		sleep(2);
> > > > > > +		/* fail or skip on workload premature completion */
> > > > > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > > > +		if (ret < 0)
> > > > > > +			igt_fail(IGT_EXIT_INVALID);
> > > > > > +		if (ret) {
> > > > > > +			if (!WIFEXITED(wstatus))
> > > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > > +			if (WEXITSTATUS(wstatus) == 
> > > > IGT_EXIT_SUCCESS)
> > > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > > > +				igt_skip(NULL);
> > > > > > +			igt_fail(WEXITSTATUS(wstatus));
> > > > > > +		}
> > > > > > +
> > > > > > +		/* pass back info on how to wait for helper completion 
> > > > */
> > > > > > +		*workload_wait = wait_helper;
> > > > > > +		*priv = &proc;
> > > > > > +	} else {
> > > > > > +		/* run the requested workload directly */
> > > > > > +		workload(device, arg, workload_wait, priv);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void run_subtest(int chipset, int workload, int action,
> > > > > > +			const char *module)
> > > > > > +{
> > > > > > +	workload_wait_t workload_wait;
> > > > > > +	void *workload_priv;
> > > > > > +	recover_t recover;
> > > > > > +	const void *recover_priv;
> > > > > > +	int device;
> > > > > > +
> > > > > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
> > > > actions[action].name,
> > > > > > +		      module ? "-unload" : "") {
> > > > > > +		device = __drm_open_driver(chipset);
> > > > > > +		igt_assert(device >= 0);
> > > > > > +
> > > > > > +		/* spawn the requested workload */
> > > > > > +		igt_debug("spawning background workload\n");
> > > > > > +		run_workload(device, workloads[workload].function,
> > > > > > +			     workloads[workload].arg, module,
> > > > > > +			     &workload_wait, &workload_priv);
> > > > > > +
> > > > > > +		/* run the requested test action */
> > > > > > +		igt_debug("running test action\n");
> > > > > > +		run_action(device, actions[action].function, module,
> > > > > > +			   &recover, &recover_priv);
> > > > > > +
> > > > > > +		if (workload_wait) {
> > > > > > +			igt_debug("waiting for workload 
> > > > completion\n");
> > > > > > +			workload_wait(device, workload_priv);
> > > > > > +		}
> > > > > > +
> > > > > > +		close(device);
> > > > > > +
> > > > > > +		if (module) {
> > > > > > +			igt_debug("unloading %s\n", module);
> > > > > > +			module_unload(chipset, module);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (recover) {
> > > > > > +			igt_debug("recovering device\n");
> > > > > > +			recover(recover_priv);
> > > > > > +			igt_reset_timeout();
> > > > > > +		}
> > > > > > +
> > > > > > +		igt_debug("running healthcheck\n");
> > > > > > +		healthcheck(chipset);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +igt_main {
> > > > > > +	int device, chipset;
> > > > > > +	char *module;
> > > > > > +	int i, j;
> > > > > > +
> > > > > > +	igt_fixture {
> > > > > > +		char path[PATH_MAX];
> > > > > > +		int dir, len;
> > > > > > +
> > > > > > +		/**
> > > > > > +		 * Since some subtests depend on successful unload of a 
> > > > driver
> > > > > > +		 * module, don't use drm_open_driver() as it keeps a 
> > > > device file
> > > > > > +		 * descriptor open for exit handler use and that 
> > > > effectively
> > > > > > +		 * prevents the module from being unloaded.
> > > > > > +		 */
> > > > > > +		device = __drm_open_driver(DRIVER_ANY);
> > > > > > +		igt_assert(device >= 0);
> > > > > > +
> > > > > > +		if (is_i915_device(device)) {
> > > > > > +			chipset = DRIVER_INTEL;
> > > > > > +			module = strdup("i915");
> > > > > > +		} else {
> > > > > > +			chipset = DRIVER_ANY;
> > > > > > +
> > > > > > +			/* Capture module name to be unloaded */
> > > > > > +			dir = igt_sysfs_open(device);
> > > > > > +			len = readlinkat(dir, "device/driver/
> > > > module", path,
> > > > > > +					 sizeof(path) - 1);
> > > > > > +			close(dir);
> > > > > > +			path[len] = '\0';
> > > > > > +			module = strdup(strrchr(path, '/') + 1);
> > > > > > +		}
> > > > > > +		close(device);
> > > > > > +
> > > > > > +		igt_info("Running the test on driver \"%s\", chipset 
> > > > mask %#0x\n",
> > > > > > +			 module, chipset);
> > > > > > +	}
> > > > > > +
> > > > > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> > > > +) {
> > > > > > +			/* with module unload */
> > > > > > +			run_subtest(chipset, i, j, module);
> > > > > > +			/* without module unload */
> > > > > > +			run_subtest(chipset, i, j, NULL);
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > > > index 711979b4..0d418035 100644
> > > > > > --- a/tests/meson.build
> > > > > > +++ b/tests/meson.build
> > > > > > @@ -3,6 +3,7 @@ test_progs = [
> > > > > >  	'core_getclient',
> > > > > >  	'core_getstats',
> > > > > >  	'core_getversion',
> > > > > > +	'core_hot_reload',
> > > > > >  	'core_setmaster_vs_auth',
> > > > > >  	'debugfs_test',
> > > > > >  	'drm_import_export',
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> > 