[igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload

Daniel Vetter daniel at ffwll.ch
Tue May 7 13:32:10 UTC 2019


On Tue, May 7, 2019 at 12:44 PM Janusz Krzysztofik
<janusz.krzysztofik at linux.intel.com> wrote:
>
> On Tuesday, May 7, 2019 11:14:20 AM CEST Daniel Vetter wrote:
> > On Tue, May 07, 2019 at 08:24:30AM +0200, Janusz Krzysztofik wrote:
> > > On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> > > > On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > > > > Hi Daniel,
> > > > >
> > > > > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > > > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > > > > From: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > > > > >
> > > > > > > Put some workload on a device, then try to either remove (unplug) the
> > > > > > > device from its bus, or unbind the device's driver from it, possibly
> > > > > > > followed by module unload, depending on which specific subtest has been
> > > > > > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > > > > > health checks on the device with the driver possibly loaded back.
> > > > > > >
> > > > > > > If module unload is requested, the workload is run in a sub-process,
> > > > > > > not directly from the test, as it is expected to crash while still
> > > > > > > keeping the device open for as long as its process has not exited.
> > > > > > >
> > > > > > > The driver hot unbind / device hot unplug operation is expected to
> > > > > > > succeed and the background workload sub-process to crash in a
> > > > > > > reasonable time, however long timeouts are used to let kernel level
> > > > > > > timeouts pop up first if hit by a bug.
> > > > > > >
> > > > > > > The driver is ready for extending it with an arbitrary workload
> > > > > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > > > > other hardware, unless they provide their implementation of
> > > > > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > > > >
> > > > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at intel.com>
> > > > > >
> > > > > > High level comments and apologies that I didn't look at v2-v7 in between.
> > >
> > > v1-v3 were submitted internally, so you actually joined and commented first on
> > > first public submission which I should have had marked as v4 (I hadn't,
> > > sorry).
> > >
> > > > > >
> > > > > > This all seems extremely complex for a simple batch spinner subtest ...
> > > > >
> > > > > My initial intention was to build a simple hot unplug/unbind only test. I
> > > > > proposed to use an arbitrary external command as a workload.  Then, on
> > > > > Antonio's advice, I switched to the spinner based internal workload and I
> > > > > agree that was a good move.  Then, Petri and you, Daniel, requested to extend
> > > > > the scope of the test with device recovery and health checking.  Also, a few
> > > > > people, including you, Daniel, requested availability of more workload type
> > > > > options.  As a result, I've decided to build a *framework* for testing driver
> > > > > unbind + rebind / device unplug + bus rescan behavior under different workload
> > > > > types, easily extendable with more workload options as needed, with one
> > > > > example workload type - dummy load or spin batch - initially implemented.
> > > > > That was at least my intention for v6-8.  I wouldn't call it a simple batch
> > > > > spinner subtest any longer.
> > > >
> > > > Maybe my review got wrong, but I just meant that there's more tests to
> > > > write here.
> > >
> > > That was clear for me, however I probably misunderstood your intentions in
> > > regard to device/driver recovery after successful unplug/unbind.
> > >
> > > > Generally I think having the framework/generic solution before
> > > > you have all the applications is the wrong way to build something. Usually
> > > > it results in something which is generic in all the wrong ways, but not in
> > > > the ones you will actually need. So complexity with no gain. Better to
> > > > - add a few tests first with copypasting/minimal changes
> > > > - refactor helpers once you see the real patterns
> > > > - no framework, that's the midlayer mistake, see
> > > >   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
> > > >   all the articles linked from there.
> > >
> > > OK, thanks for your recommendations and the references.
> > >
> > > > > > do we really need all that complexity with 2nd process
> > > > >
> > > > > If we drop module unload option then no, we don't need 2nd process.
> > > >
> > > > Why does module unload require a 2nd process? We don't need a 2nd process
> > > > in our other module unload tests either.
> > >
> > > That's no longer the problem as we're going to drop the module unload step,
> > > but just to provide you with an explanation of my approach:
> > > In case of the spin workload, references are held after the workload crashes
> > > and it's not possible to unload the module unless we put them.   Since those
> > > references are internal to IGT libraries and not exposed to a user, putting
> > > them is only possible with functions provided by IGT.  Those functions are
> > > full of checks affecting subtest results and using them to clean up resources
> > > related to a no longer existing device would result in a subtest failure or
> > > skip at least.  The most simple way to get rid of those issues is to enclose
> > > those references in a subprocess and wait for their automatic release on its
> > > completion.
> >
> >
> > Hm which references? Closing the file descriptors is all we should need to
> > be doing to make the module unloadable.
>
> That's exactly what I meant.  Unfortunately some of those file descriptors are
> private to IGT lib.
>
> > I think an explicit helper
> > function to do that (exported from core lib) is much better than killing a
> > process (or waiting for that process to die). It's more explicit code at
> > least (and that's generally better for testcases).
>
> Do you think it's worth the effort to extend core lib with less assertive
> variants of existing functions, useful specifically for the hotunplug test and
> maybe no others?  I have identified quite a few such functions, however with
> the approach of not doing too much cleanup before recovery that you suggest,
> I'm not sure if they are still needed, maybe only for a subtest with module
> unload.

Yes, we have lots of these already. The usual approach is that the
normal function has all the checks, and the one with a __ prefix has a
return value indicating whether things worked or not, leaving checking
to the caller. Especially with low-level ioctl wrappers this is a
very common pattern.
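
To illustrate the pattern, here is a minimal sketch. The names
__attr_write() and attr_write() are hypothetical and are not actual IGT
library functions; the sketch only assumes a POSIX system. The unchecked
variant reports failure through its return value, and the checked
variant wraps it and asserts.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helpers for illustration only, not IGT API.
 *
 * Unchecked variant: write @value to the file @attr found relative to
 * the directory file descriptor @dir.  Returns 0 on success or a
 * negative errno value, so the caller decides what a failure means.
 * (The leading double underscore follows kernel/IGT naming habits;
 * strictly speaking such names are reserved by the C standard.)
 */
int __attr_write(int dir, const char *attr, const char *value)
{
	size_t len = strlen(value);
	ssize_t ret;
	int fd, err = 0;

	fd = openat(dir, attr, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, value, len);
	if (ret < 0)
		err = -errno;
	else if ((size_t)ret != len)
		err = -EIO;	/* short write, treated as an error here */

	close(fd);
	return err;
}

/* Checked variant: same operation, but any failure aborts the caller. */
void attr_write(int dir, const char *attr, const char *value)
{
	int err = __attr_write(dir, attr, value);

	assert(err == 0);
}
```

A cleanup path that runs after the device is gone would call the
unchecked variant and ignore or log the error, while ordinary test code
keeps using the checked one.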

> > > > > > and watchers
> > > > >
> > > > > That was primarily needed for successful module unload.  If we drop that
> > > > > option and you think driver rebind / bus rescan operations can be performed
> > > > > blindly, without checking for completion of background workload, then I can
> > > > > drop the watchers.
> > > >
> > > > Well we _have_ to do unbinds without checking the background workload has
> > > > completed. That's the entire point of testing hotunplug.
> > >
> > > I agree, and the test performs all unbinds that way, i.e., without checking
> > > the background workload has completed.  Waiting for background workload
> > > completion applies only to what I'm considering a device recovery phase, and
> > > not to the "main" unbind/unplug test phase in any way.
> >
> > Ah ok. At least for rescan I think would also make sense to not wait,
> > that's another interesting (and even more evil) testcase. This would check
> > for issues around assigning device node minor numbers. We'd only need one
> > such case, and all it needs to do really is keep the drm device fd open.
>
> OK.
>
> > > > It's also why
> > > > there's lots of work to do here, because the kernel is totally not ready
> > > > for this.
> > > >
> > > > First stopping everything and then unloading isn't an interesting test,
> > >
> > > Since its introduction, the module unload step was intended as a part of a
> > > post-subtest device recovery phase, not the subtest merit.  I added that step
> > > because I thought that would be the most reliable way of satisfying the CI
> > > requirement on restoring the device to the state ready for next tests without
> > > reboot or real device power-on reset on real hardware bus replug.
> >
> > Yes I understand that. But what are you trying to recover from with a
> > module reload? Just code sharing as you explain below, or other reasons?
>
> Nothing specific.  Oriented on successful recovery of the device so it's ready
> for next tests without reboot, I just intuitively tried to avoid rediscovering
> it, possibly in a completely unpredictable state after the fake unplug, and
> that intuition, probably mixed with my ignorance, suggested me to use module
> unload before bus rescan.
>
> > > > that's more or less exactly what our various module unload tests are
> > > > doing already.
> > >
> > > Yes, and in v5-v7 I was even using the existing i915_module_load test as an
> > > external helper command performing device recovery and healthcheck phases in
> > > order to avoid reimplementing its code here.
> > >
> > > > > > and a bunch of callbacks and everything, just do to a hotremove testcase?
> > > > >
> > > > > I can still drop the framework and switch back to the initial simple structure
> > > > > with one or two fixed subtests if you don't like my structural approach.
> > > >
> > > > See above for why, I think that will result in better code in the end.
> > >
> > > OK.
> > >
> > > > > > Very first patch looked much more reasonable, aside from that it broke CI
> > > > > > since it didn't rebind the driver.
> > > > >
> > > > > Sorry, my understanding of your and Petri's comments was a bit different, I
> > > > > thought that by more than best effort you meant doing everything possible to
> > > > > restore the device to be ready for next test without reboot, and module unload
> > > > > and reload seemed the most reliable option to me.  Now I can see that there
> > > > > were probably two different requirements.  You were considering the test
> > > > > incomplete because it was performing only the unbind/unplug part and not
> > > > > rebind/rescan, while Petri was probably interested mostly in the device being
> > > > > ready for next tests without reboot, no matter which way.
> > > >
> > > > Well it's the same request, and rebind/rescan /should/ result in a working
> > > > device again. If not, then I guess we also have a bug on our hotreplug
> > > > code. Which again is worth testing for.
> > > >
> > > > > > We can always add complexity later on
> > > > > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > > > >
> > > > > OK, as you wish.
> > > > >
> > > > > > Also, I think we should have at least one hotremove-only-nothing-special
> > > > > > subtest here, i.e. without even the busy batch.
> > > > >
> > > > > That seems trivial to adjust the framework so it accepted NULL workload, if
> > > > > the framework survived. Anyway, I'll do that.  Should I put it in a separate
> > > > > NULL workload subtest function to be called from igt_main?  Or add it to the
> > > > > spin workload subtest function specifically as an option?
> > > >
> > > > Separate test as the first subtest.
> > >
> > > OK.
> > >
> > > > Maybe even include the "shut
> > > > everything off first" logic from module unload,
> > >
> > > Do you mean i915_module_load.c?
> >
> > On 2nd thought, for a hotunplug we shouldn't need any of that. Those "shut
> > everything off first" steps are just to lower the module use count, so
> > we're allowed to unload the module. The unplug code we run should take
> > care of all that already for us for a (fake) hotunplug. So module reload
> > should just work. But then I still don't understand where you see the
> > benefit in unloading/reloading the module.
>
> Not any longer ;-)
>
> > > > to have the most baseline test possible.
> > > >
> > > > > > I'm also not sure why we also put module unload tests in there.
> > > > >
> > > > > As I tried to explain above, I introduced module unload in order to satisfy
> > > > > the CI requirement on the device being ready for next test without reboot as
> > > > > much as possible.
> > > >
> > > > Hm, but why? What does module reload help in this regard that a rebind
> > > > can't do? Aside from testing module reload, which is a developer feature
> > > > and already tested elsewhere.
> > >
> > > As I said, I decided to use module unload as I thought it would be the most
> > > reliable way of device recovery from simulated unplug in case no real power-on
> > > reset is performed.  I didn't insist on keeping it there at all, I only tried
> > > to explain why I did that.  As that can't help in any way to recover the
> > > device so it is ready for next tests as CI requested then I'll be happy to
> > > remove it and stick to pure driver rebind / bus rescan operations.
> > >
> > > > I'm also not seeing much interactions between hotunplug and module unload.
> > > >
> > > > The one interesting testcase I see is trying to unload the module after we
> > > > hotunplugged, while the driver is still in use somewhere (open drm fd,
> > > > open dma-buf fd, open dma-fence fd). That should result in a failure, and
> > > > it's useful to validate that the kernel is handling the module refcounting
> > > > correctly in all these cases. But that's a specific negative testcase (and
> > > > actually being able to unload would be a failure and likely result in a
> > > > kernel oops), I'm not seeing the benefit of reloading the module.
> > >
> > > That's perfectly clear for me, the optional module unload step will not be
> > > there on next iteration.
> >
> > That might be overshooting slightly. There is at least one interesting
> > hotunplug testcase involving module unload. But it's more a special case,
> > not something we need to do for all subtests.
>
> OK, I can try to implement it if I'm sure what you think it should do exactly.

I forgot to explain what it was :-) Or at least I can't find it
anymore in this huge thread.

- hotunplug device
- keep one userspace reference alive (with open drm device fd or similar)
- try to unload, check that this fails
- drop the reference (by closing the fd)
- make sure unload/reload succeed now

Very basic, but makes sure that dangling references keep the module
locked - if that's not the case then userspace could issue an IOCTL,
which would call into that unloaded module and result in a kernel
Oops.

There are more variants we could do here, but since module unload is a
developer feature I think there are better places to invest all that
effort. But one basic sanity check like this can't hurt.
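
The expected behaviour can be modelled in plain C. This is only a
user-space model of the steps listed above, not kernel or IGT code:
struct fake_module and all fake_*() helpers are invented for the
illustration, and -EBUSY is simply the error value this model picks for
a refused unload.

```c
#include <assert.h>
#include <errno.h>

/* Model of a module whose use count is pinned by open file descriptors. */
struct fake_module {
	int users;	/* references held by userspace (open fds) */
	int loaded;	/* non-zero while the module is loaded */
};

/* Opening the device takes a reference on the module. */
int fake_open(struct fake_module *m)
{
	if (!m->loaded)
		return -ENODEV;
	m->users++;
	return 0;
}

/* Closing the fd drops that reference. */
void fake_close(struct fake_module *m)
{
	assert(m->users > 0);
	m->users--;
}

/* Unload must be refused while anybody still holds a reference. */
int fake_unload(struct fake_module *m)
{
	if (m->users > 0)
		return -EBUSY;
	m->loaded = 0;
	return 0;
}

int fake_load(struct fake_module *m)
{
	m->loaded = 1;
	return 0;
}

/*
 * The sanity check itself.  Returns 0 if the module behaved as
 * expected, or a negative step number identifying what went wrong.
 */
int check_unload_blocked_by_user(struct fake_module *m)
{
	if (fake_open(m))			/* keep one reference alive */
		return -1;
	/* the hot unplug would happen here; it must not drop the reference */
	if (fake_unload(m) != -EBUSY)		/* unload has to fail */
		return -2;
	fake_close(m);				/* drop the reference */
	if (fake_unload(m))			/* now unload has to work */
		return -3;
	if (fake_load(m))			/* and so does reload */
		return -4;
	return 0;
}
```

In a real subtest the fake_*() calls would be replaced by opening the
drm device, the sysfs remove write, and the module unload/load helpers.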

> > > > > > Compared
> > > > > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > > > > or whatever), which is something users can do, module unload is explicitly a
> > > > > > developer only feature.
> > > > >
> > > > > My approach was to be able to test driver behavior under any hot unload
> > > > > operation available to a user, no matter if developer oriented or not, so we
> > > > > can make the driver resistant to users performing potentially dangerous hot
> > > > > unbind/unplug operations available to them, intentionally or not.
> > > >
> > > > Yes I agree with that, we need to test hotunplug.
> > > >
> > > > btw the real fun isn't the unbind in sysfs, but physically unplugging a
> > > > pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> > > > and the best way to test that hotunplug is through the sysfs unbind
> > > > support (it's not exactly the same since this way we'll never see failing
> > > > pci transactions, which are an entirely different kind of fun).
> > >
> > > I fully agree, however please note that what I'm calling device hot unplug is
> > > probably still an interesting sysfs option aside driver hot unbind.
> > >
> > > > > > We do not expect module unload to work under all
> > > > > > possible conditions (it doesn't).
> > > > >
> > > > > Do you think that driver rebind operation has more chances to succeed,
> > > > > especially on a device on which a bus unplug operation was not actually
> > > > > performed but only simulated via sysfs, on a device which then has been left
> > > > > in an unpredictable state and hasn't undergone a hardware power-on reset on
> > > > > physical bus re-plug?
> > > >
> > > > There's definitely potential for bugs, but I don't see how module reload
> > > > helps. Module reload is essentially:
> > > >
> > > > - unbind devices
> > > > - unload module
> > > > - reload module
> > > > - rebind all devices
> > > >
> > > > The only additional magic that module unload can paper over is that it's
> > > > disallowed while anyone is still using any devices (assuming the module
> > > > refcount code is correct). That's not the case for unbind/hotunplug. But
> > > > that's it, there's no additional magic code being run when you unload the
> > > > module. Hence why I don't understand why you want to do that.
> > >
> > > Not any longer :-)
> > >
> > > > > > I'd drop that part and focus completely
> > > > > > on the hotremove/unbind testcase here.
> > > > >
> > > > > Driver unbind / device unplug via sysfs can also be considered developer only
> > > > > features. Do you think we should drop driver unbind option, leaving only
> > > > > device unplug via sysfs for which we may have no good non-developer
> > > > > alternative?
> > > >
> > > > Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> > > > user action. That's why we care.
> > > >
> > > > We didn't care for unbind (I wontfix closed all the bugs myself) while
> > > > intel only created built-in gpus because it's indeed fairly pointless to
> > > > unbind these.
> > > >
> > > > Other bit I don't quite get: What's the difference between unbind and
> > > > unplug?
> > >
> > > I'm not sure what information you are missing.  What the test is doing is:
> > >
> > > driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
> > > vs.
> > > virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove
> >
> > I was missing the above I guess.
> >
> > So looking at kernel code the difference is that when we unbind the entire
> > driver, we loop over all currently bound devices and do the same as the
> > remove sysfs file. So kinda redundant, I'd drop the driver unbind
> > testcase.
>
> OK.
>
> > Also, are you digging around in the kernel already and trying to
> > understand what's going on and how it all ties together? And have you
> > started to look at the bugs this uncovers in the kernel, or who's supposed
> > to work on that side of this effort?
>
> I'm trying to.  A first step was:
> https://github.com/freedesktop/drm-intel/commit/d69990e0c399e4f7f9b50505d3285e5de991148a

Yup, that's about 1/n patches we need.

> I've already tested two other patches:
> https://patchwork.freedesktop.org/series/60053/
> https://patchwork.freedesktop.org/series/60051/

Usually we solve this stuff by reference counting structures, to make
sure they don't disappear before the last user is finished cleaning
stuff up. Forced cleanup tends to be really hard to understand,
validate and often is impossible to implement because of locking
inversions.
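
As a rough illustration of the reference counting approach, here is a
minimal single-threaded sketch. struct dev_state and its helpers are
made up for this example and are not i915 code; the kernel would use an
atomic counter such as struct kref rather than a plain int. The point is
that unplug only marks the state as gone and drops the driver's own
reference, while the memory stays valid until the last user has put
theirs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-device state shared by the driver and its users. */
struct dev_state {
	int refcount;	/* plain int: this sketch is not thread safe */
	bool unplugged;	/* set once the device has gone away */
	int *released;	/* optional counter bumped on final release */
};

/* Create the state with one reference, owned by the driver. */
struct dev_state *dev_state_create(int *released)
{
	struct dev_state *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->refcount = 1;
	s->released = released;
	return s;
}

/* A user (e.g. an open file) takes its own reference. */
struct dev_state *dev_state_get(struct dev_state *s)
{
	assert(s->refcount > 0);
	s->refcount++;
	return s;
}

/* Drop a reference; the last one out frees the structure. */
void dev_state_put(struct dev_state *s)
{
	assert(s->refcount > 0);
	if (--s->refcount > 0)
		return;
	if (s->released)
		(*s->released)++;
	free(s);
}

/* Unplug: no forced cleanup, just mark it and drop the driver's ref. */
void dev_state_unplug(struct dev_state *s)
{
	s->unplugged = true;
	dev_state_put(s);
}

/* Any later operation from a remaining user fails cleanly. */
int dev_state_ioctl(const struct dev_state *s)
{
	return s->unplugged ? -ENODEV : 0;
}
```

A user that still holds a reference after unplug gets -ENODEV from
dev_state_ioctl() instead of touching freed memory, and the structure is
released exactly once, when that user calls dev_state_put().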

> Now I'm trying to resolve the GEM_BUG_ON(vma->obj != obj) issue which popped
> up with both above patches applied. I'm not yet sure to what extent it has
> been simply uncovered vs. just triggered by my second patch.

No idea, could very well just be that you now blow up later with the
next oops. Hotunplug is so untested I'd be surprised if it's not full
of bugs.

Another option would be to first have a more minimal hotunplug test
without any busy batches pending, and make sure that works reliably
(i.e. no buffer, no context, no nothing, no open fd, absolutely
nothing really allocated from userspace). Then extend from there
step-by-step, e.g. first just open a drm device fd, then allocate a
buffer, then mmap, then maybe try to write to that buffer after
unplug, try to do an ioctl after unplug (go through all of them).

Only once you've gone through all the more basic things go to the much
more evil stuff like having a batch in-flight, or a kms request
in-flight. That way it should be easier to make forward progress on
these bugs without risking the world. Plus you can start merging the
igt side already, to make sure no one regresses anymore.

> > Bonus points if we unbind the same device as we'd pick for the drm fd
> > (there's some selection logic, and if you go through /sys/class/drm you
> > should find the right device). This is relevant for discrete/multi-gpu
> > systems.
>
> I think I took that into account seriously enough while planning the subtest
> actions.  Sysfs operations are performed on nodes resolved from the device
> file descriptor.

Ah great, tbh I didn't check that in your patch.

> > > Panic call traces look a bit different, you may want to compare the following two:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html
> >
> > Hm I think we also need a hotunplug testcase that does absolutely nothing
> > first, i.e. no spin batches, no open drm files, nothing else.
>
> OK.
>
> > > rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
> > > vs.
> > > replug: echo 1 >/sys/bus/<bus>/rescan
> > >
> > > With no panics accompanying driver unbind / device unplug under active spin
> > > workload on older hardware, the recovery phase is however still giving a
> > > different result for each of those two methods:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html
> >
> > Shouldn't really be a difference, but maybe there's timing changes that
> > slightly influence the outcome.
>
> I rather thought of bus operations still being available for the driver to
> shut down the device (more) cleanly on unbind, but again, that's an intuitive
> guess rather than real knowledge.

All driver unbind does is also remove the driver from the linux device
model. All that does is:
- force unbind on all devices
- prevent binding to new devices.

Not really providing anything beyond unbind/rebind to a specific
device (or what we test already with module reload). The actual
unbinding from a specific device is exactly the same logic (plus/minus
bugs in driver core sysfs files, I just recently fixed one of these,
but that's not our primary concern with validating i915 code).

Cheers, Daniel
>
> Thanks,
> Janusz
>
>
> > > BTW, trybot results confirm that module unload really doesn't help:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html
> >
> > Yeah worst case we have an additional module refcount bug and then module
> > unload will make things worse. I can't come up with a scenario where
> > module unload would help (there's reasons it's a developer-only thing,
> > it's really hard to get right).
> >
> > Cheers, Daniel
> >
> >
> > >
> > > Thanks,
> > > Janusz
> > >
> > >
> > > > -Daniel
> > > >
> > > > >
> > > > > Thanks,
> > > > > Janusz
> > > > >
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > > > ---
> > > > > > >  tests/Makefile.sources  |   1 +
> > > > > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > > > > >  tests/meson.build       |   1 +
> > > > > > >  3 files changed, 410 insertions(+)
> > > > > > >  create mode 100644 tests/core_hot_reload.c
> > > > > > >
> > > > > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > > > > index 7f921f6c..452d8ed7 100644
> > > > > > > --- a/tests/Makefile.sources
> > > > > > > +++ b/tests/Makefile.sources
> > > > > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > > > > >     core_getclient \
> > > > > > >     core_getstats \
> > > > > > >     core_getversion \
> > > > > > > +   core_hot_reload \
> > > > > > >     core_setmaster_vs_auth \
> > > > > > >     debugfs_test \
> > > > > > >     drm_import_export \
> > > > > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000..6673f55c
> > > > > > > --- /dev/null
> > > > > > > +++ b/tests/core_hot_reload.c
> > > > > > > @@ -0,0 +1,408 @@
> > > > > > > +/*
> > > > > > > + * Copyright © 2019 Intel Corporation
> > > > > > > + *
> > > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > > > > > > + * copy of this software and associated documentation files (the "Software"),
> > > > > > > > + * to deal in the Software without restriction, including without limitation
> > > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > > + *
> > > > > > > > + * The above copyright notice and this permission notice (including the next
> > > > > > > > + * paragraph) shall be included in all copies or substantial portions of the
> > > > > > > > + * Software.
> > > > > > > > + *
> > > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > > > > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > > > > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > > > > > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > > > > > > > + * IN THE SOFTWARE.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include "igt.h"
> > > > > > > +#include "igt_device.h"
> > > > > > > +#include "igt_dummyload.h"
> > > > > > > +#include "igt_kmod.h"
> > > > > > > +#include "igt_sysfs.h"
> > > > > > > +
> > > > > > > +#include <getopt.h>
> > > > > > > +#include <limits.h>
> > > > > > > +#include <string.h>
> > > > > > > +#include <unistd.h>
> > > > > > > +
> > > > > > > +#include <sys/types.h>
> > > > > > > +#include <sys/wait.h>
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A post-action device recovery function:
> > > > > > > + * @priv: a pointer to private data required for device recovery
> > > > > > > + *
> > > > > > > + * Make the device re-appear
> > > > > > > + */
> > > > > > > +typedef void (*recover_t)(const void *priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A test action function:
> > > > > > > + * @dir: file descriptor of an open device sysfs directory
> > > > > > > > + * @module: module name, non-NULL indicates post-action module unload requested
> > > > > > > > + * @recover: for returning a pointer to a post-action device recovery function
> > > > > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > > > > + *
> > > > > > > + * Make the device disappear
> > > > > > > + */
> > > > > > > +typedef void (*action_t)(int device, const char *module,
> > > > > > > +                    recover_t *recover, const void **priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A workload completion wait function:
> > > > > > > + * @device: open device file descriptor
> > > > > > > + * @priv: a pointer to private data required by the wait function
> > > > > > > + *
> > > > > > > + * Wait for completion of background workload
> > > > > > > + */
> > > > > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A workload function:
> > > > > > > + * @device: open device file descriptor
> > > > > > > > + * @arg: an optional string argument passed to the workload function
> > > > > > > > + * @workload_wait: for returning a pointer to workload completion wait function
> > > > > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > > > > + *
> > > > > > > + * Put some long lasting load on the device
> > > > > > > + */
> > > > > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > > > > > +                      workload_wait_t *workload_wait, void **priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Pairs of test action / device recovery functions
> > > > > > > + */
> > > > > > > +
> > > > > > > +/* Unbind / re-bind */
> > > > > > > +
> > > > > > > +struct rebind_data {
> > > > > > > +   int driver;     /* open file descriptor of driver sysfs directory */
> > > > > > > +   char *device;   /* bus specific device address as string */
> > > > > > > +};
> > > > > > > +
> > > > > > > +/* Re-bind the driver to the device */
> > > > > > > +static void driver_bind(const void *priv)
> > > > > > > +{
> > > > > > > +   const struct rebind_data *data = priv;
> > > > > > > +
> > > > > > > +   igt_set_timeout(60, "Driver re-bind timeout!");
> > > > > > > +   igt_sysfs_set(data->driver, "bind", data->device);
> > > > > > > +
> > > > > > > +   close(data->driver);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Unbind the driver from the device */
> > > > > > > +static void driver_unbind(int device, const char *module,
> > > > > > > +                     recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   static char path[PATH_MAX];
> > > > > > > +   static struct rebind_data data;
> > > > > > > +   int len;
> > > > > > > +
> > > > > > > +   /* collect information required for driver bind/unbind */
> > > > > > > +   data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > > > > +   igt_assert(data.driver >= 0);
> > > > > > > +
> > > > > > > +   len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > > > > +   path[len] = '\0';
> > > > > > > +   data.device = strrchr(path, '/') + 1;
> > > > > > > +
> > > > > > > +   /* unbind the driver */
> > > > > > > +   igt_set_timeout(60, "Driver unbind timeout!");
> > > > > > > +   igt_sysfs_set(data.driver, "unbind", data.device);
> > > > > > > +
> > > > > > > +   /* pass back info on how to recover the device */
> > > > > > > +   if (module) {
> > > > > > > +           /* don't try to rebind if module will be unloaded */
> > > > > > > +           *recover = NULL;
> > > > > > > +   } else {
> > > > > > > +           *recover = driver_bind;
> > > > > > > +           *priv = &data;
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Unplug / re-plug */
> > > > > > > +
> > > > > > > +/* Re-discover the device by rescanning its bus */
> > > > > > > +static void bus_rescan(const void *priv)
> > > > > > > +{
> > > > > > > +   const int *bus = priv;
> > > > > > > +
> > > > > > > +   igt_set_timeout(60, "Bus rescan timeout!");
> > > > > > > +   igt_sysfs_set(*bus, "rescan", "1");
> > > > > > > +
> > > > > > > +   close(*bus);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Remove (virtually unplug) the device from its bus */
> > > > > > > +static void device_unplug(int device, const char *module,
> > > > > > > +                     recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   static int bus;
> > > > > > > +
> > > > > > > +   /* collect information required for bus rescan */
> > > > > > > +   bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > > > > +   igt_assert(bus >= 0);
> > > > > > > +
> > > > > > > +   /* remove the device */
> > > > > > > +   igt_set_timeout(60, "Device unplug timeout!");
> > > > > > > +   igt_sysfs_set(device, "device/remove", "1");
> > > > > > > +
> > > > > > > +   /* pass back info on how to recover the device */
> > > > > > > +   *recover = bus_rescan;
> > > > > > > +   *priv = &bus;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Each test action function must be registered in the following table */
> > > > > > > +static const struct {
> > > > > > > +   const char *name;       /* unique test action name used in test names */
> > > > > > > +   action_t function;      /* test action function pointer */
> > > > > > > +} actions[] = {
> > > > > > > +   { "unbind", driver_unbind, },
> > > > > > > +   { "unplug", device_unplug, },
> > > > > > > +};
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Pairs of workload / wait completion functions
> > > > > > > + */
> > > > > > > +
> > > > > > > +/* A workload using igt_spin_run() */
> > > > > > > +
> > > > > > > +/* Wait for completion of dummy load */
> > > > > > > +static void dummy_wait(int device, void *priv)
> > > > > > > +{
> > > > > > > +   igt_spin_t *spin = priv;
> > > > > > > +
> > > > > > > +   /* wait until the spin no longer runs, don't fail on error */
> > > > > > > +   if (gem_wait(device, spin->handle, NULL))
> > > > > > > +           __gem_set_domain(device, spin->handle,
> > > > > > > +                            I915_GEM_DOMAIN_GTT, I915_GEM_DOMAIN_GTT);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Run dummy load */
> > > > > > > +static void dummy_load(int device, const char *arg,
> > > > > > > +                  workload_wait_t *workload_wait, void **priv)
> > > > > > > +{
> > > > > > > +   igt_spin_t *spin;
> > > > > > > +
> > > > > > > +   /* submit a job */
> > > > > > > +   spin = igt_spin_new(device);
> > > > > > > +
> > > > > > > +   *workload_wait = dummy_wait;
> > > > > > > +   *priv = spin;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Each workload function must be registered in the following table.
> > > > > > > + * A function may be registered more than once under different workload names,
> > > > > > > + * that makes sense as long as a different argument is specified for each name.
> > > > > > > + */
> > > > > > > +static const struct {
> > > > > > > +   const char *name;       /* unique workload name used in test names */
> > > > > > > +   workload_t function;    /* workload function pointer */
> > > > > > > +   const char *arg;        /* optional constant string argument */
> > > > > > > +} workloads[] = {
> > > > > > > +   { "spin", dummy_load, NULL, },
> > > > > > > +};
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Framework
> > > > > > > + */
> > > > > > > +
> > > > > > > +static void healthcheck(int chipset)
> > > > > > > +{
> > > > > > > +   int device;
> > > > > > > +
> > > > > > > +   device = __drm_open_driver(chipset);
> > > > > > > +   igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +   if (chipset == DRIVER_INTEL)
> > > > > > > +           gem_test_engine(device, ALL_ENGINES);
> > > > > > > +
> > > > > > > +   close(device);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void module_unload(int chipset, const char *module)
> > > > > > > +{
> > > > > > > +   if (chipset == DRIVER_INTEL)
> > > > > > > +           igt_assert(igt_i915_driver_unload() == IGT_EXIT_SUCCESS);
> > > > > > > +   else
> > > > > > > +           igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_action(int device, action_t action, const char *module,
> > > > > > > +                 recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   int dir;
> > > > > > > +
> > > > > > > +   dir = igt_sysfs_open(device);
> > > > > > > +   igt_assert(dir >= 0);
> > > > > > > +
> > > > > > > +   action(dir, module, recover, priv);
> > > > > > > +
> > > > > > > +   close(dir);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void wait_helper(int device, void *priv)
> > > > > > > +{
> > > > > > > +   struct igt_helper_process *proc = priv;
> > > > > > > +
> > > > > > > +   /* wait until the workload subprocess has completed */
> > > > > > > +   igt_ignore_warn(igt_wait_helper(proc));
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_workload(int device, workload_t workload, const char *arg,
> > > > > > > +                    const char *module, workload_wait_t *workload_wait,
> > > > > > > +                    void **priv)
> > > > > > > +{
> > > > > > > +   if (module) {
> > > > > > > +           /* run workload in a subprocess so the module is put on crash */
> > > > > > > +           static struct igt_helper_process proc;
> > > > > > > +           int wstatus, ret;
> > > > > > > +
> > > > > > > +           bzero(&proc, sizeof(proc));
> > > > > > > +
> > > > > > > +           igt_fork_helper(&proc) {
> > > > > > > +                   /* suppress igt_log messages */
> > > > > > > +                   igt_log_level = IGT_LOG_NONE;
> > > > > > > +
> > > > > > > +                   /* intercept igt_fail/skip() long jumps */
> > > > > > > +                   if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) {
> > > > > > > +                           workload(device, arg, workload_wait, priv);
> > > > > > > +
> > > > > > > +                           (*workload_wait)(device, *priv);
> > > > > > > +
> > > > > > > +                           /* success if not diverted by igt_fail/skip() */
> > > > > > > +                           igt_success();
> > > > > > > +                   }
> > > > > > > +
> > > > > > > +                   /* pass exit code back to the caller */
> > > > > > > +                   igt_exit();
> > > > > > > +           }
> > > > > > > +           /* let the background process start doing its job or fail */
> > > > > > > +           sleep(2);
> > > > > > > +           /* fail or skip on workload premature completion */
> > > > > > > +           ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > > > > +           if (ret < 0)
> > > > > > > +                   igt_fail(IGT_EXIT_INVALID);
> > > > > > > +           if (ret) {
> > > > > > > +                   if (!WIFEXITED(wstatus))
> > > > > > > +                           igt_fail(IGT_EXIT_INVALID);
> > > > > > > +                   if (WEXITSTATUS(wstatus) == IGT_EXIT_SUCCESS)
> > > > > > > +                           igt_fail(IGT_EXIT_INVALID);
> > > > > > > +                   if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > > > > +                           igt_skip(NULL);
> > > > > > > +                   igt_fail(WEXITSTATUS(wstatus));
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           /* pass back info on how to wait for helper completion */
> > > > > > > +           *workload_wait = wait_helper;
> > > > > > > +           *priv = &proc;
> > > > > > > +   } else {
> > > > > > > +           /* run the requested workload directly */
> > > > > > > +           workload(device, arg, workload_wait, priv);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_subtest(int chipset, int workload, int action,
> > > > > > > +                   const char *module)
> > > > > > > +{
> > > > > > > +   workload_wait_t workload_wait;
> > > > > > > +   void *workload_priv;
> > > > > > > +   recover_t recover;
> > > > > > > +   const void *recover_priv;
> > > > > > > +   int device;
> > > > > > > +
> > > > > > > +   igt_subtest_f("%s-%s%s", workloads[workload].name, actions[action].name,
> > > > > > > +                 module ? "-unload" : "") {
> > > > > > > +           device = __drm_open_driver(chipset);
> > > > > > > +           igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +           /* spawn the requested workload */
> > > > > > > +           igt_debug("spawning background workload\n");
> > > > > > > +           run_workload(device, workloads[workload].function,
> > > > > > > +                        workloads[workload].arg, module,
> > > > > > > +                        &workload_wait, &workload_priv);
> > > > > > > +
> > > > > > > +           /* run the requested test action */
> > > > > > > +           igt_debug("running test action\n");
> > > > > > > +           run_action(device, actions[action].function, module,
> > > > > > > +                      &recover, &recover_priv);
> > > > > > > +
> > > > > > > +           if (workload_wait) {
> > > > > > > +                   igt_debug("waiting for workload completion\n");
> > > > > > > +                   workload_wait(device, workload_priv);
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           close(device);
> > > > > > > +
> > > > > > > +           if (module) {
> > > > > > > +                   igt_debug("unloading %s\n", module);
> > > > > > > +                   module_unload(chipset, module);
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           if (recover) {
> > > > > > > +                   igt_debug("recovering device\n");
> > > > > > > +                   recover(recover_priv);
> > > > > > > +                   igt_reset_timeout();
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           igt_debug("running healthcheck\n");
> > > > > > > +           healthcheck(chipset);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +igt_main {
> > > > > > > +   int device, chipset;
> > > > > > > +   char *module;
> > > > > > > +   int i, j;
> > > > > > > +
> > > > > > > +   igt_fixture {
> > > > > > > +           char path[PATH_MAX];
> > > > > > > +           int dir, len;
> > > > > > > +
> > > > > > > +           /**
> > > > > > > +            * Since some subtests depend on successful unload of a driver
> > > > > > > +            * module, don't use drm_open_driver() as it keeps a device file
> > > > > > > +            * descriptor open for exit handler use and that effectively
> > > > > > > +            * prevents the module from being unloaded.
> > > > > > > +            */
> > > > > > > +           device = __drm_open_driver(DRIVER_ANY);
> > > > > > > +           igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +           if (is_i915_device(device)) {
> > > > > > > +                   chipset = DRIVER_INTEL;
> > > > > > > +                   module = strdup("i915");
> > > > > > > +           } else {
> > > > > > > +                   chipset = DRIVER_ANY;
> > > > > > > +
> > > > > > > +                   /* Capture module name to be unloaded */
> > > > > > > +                   dir = igt_sysfs_open(device);
> > > > > > > +                   len = readlinkat(dir, "device/driver/module", path,
> > > > > > > +                                    sizeof(path) - 1);
> > > > > > > +                   close(dir);
> > > > > > > +                   path[len] = '\0';
> > > > > > > +                   module = strdup(strrchr(path, '/') + 1);
> > > > > > > +           }
> > > > > > > +           close(device);
> > > > > > > +
> > > > > > > +           igt_info("Running the test on driver \"%s\", chipset mask %#0x\n",
> > > > > > > +                    module, chipset);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > > > > +           for (j = 0; j < sizeof(actions) / sizeof(*actions); j++) {
> > > > > > > +                   /* with module unload */
> > > > > > > +                   run_subtest(chipset, i, j, module);
> > > > > > > +                   /* without module unload */
> > > > > > > +                   run_subtest(chipset, i, j, NULL);
> > > > > > > +           }
> > > > > > > +   }
> > > > > > > +}
> > > > > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > > > > index 711979b4..0d418035 100644
> > > > > > > --- a/tests/meson.build
> > > > > > > +++ b/tests/meson.build
> > > > > > > @@ -3,6 +3,7 @@ test_progs = [
> > > > > > >     'core_getclient',
> > > > > > >     'core_getstats',
> > > > > > >     'core_getversion',
> > > > > > > +   'core_hot_reload',
> > > > > > >     'core_setmaster_vs_auth',
> > > > > > >     'debugfs_test',
> > > > > > >     'drm_import_export',

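The premature-completion check in run_workload() boils down to a non-blocking waitpid() followed by decoding the wait status. A standalone sketch of that pattern, outside of the IGT helpers, might look like the following. The names poll_child() and enum child_state are hypothetical and used only for illustration; the mapping of states onto igt_fail() / igt_skip() is left to the caller.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes, not taken from the patch */
enum child_state {
	CHILD_RUNNING,		/* child has not changed state yet */
	CHILD_EXITED,		/* child called exit(), code is valid */
	CHILD_SIGNALED,		/* child was terminated by a signal */
	CHILD_ERROR,		/* waitpid() itself failed */
};

/* Check on a child process without blocking the caller */
static enum child_state poll_child(pid_t pid, int *exit_code)
{
	int wstatus;
	pid_t ret;

	assert(exit_code);

	ret = waitpid(pid, &wstatus, WNOHANG);
	if (ret < 0)
		return CHILD_ERROR;
	if (ret == 0)
		return CHILD_RUNNING;	/* WNOHANG: nothing to report yet */

	if (WIFEXITED(wstatus)) {
		*exit_code = WEXITSTATUS(wstatus);
		return CHILD_EXITED;
	}

	/* without WUNTRACED/WCONTINUED only termination by signal is left */
	return CHILD_SIGNALED;
}
```

With such a helper, the "workload finished too early" case corresponds to any result other than CHILD_RUNNING after the initial sleep.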

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

