[igt-dev] [PATCH] tests: Add a new test for driver/device hot remove

Daniel Vetter daniel at ffwll.ch
Tue Apr 2 08:47:20 UTC 2019


On Mon, Apr 01, 2019 at 08:55:37AM +0000, Krzysztofik, Janusz wrote:
> On Monday, April 1, 2019 9:22:34 AM CEST Daniel Vetter wrote:
> > On Fri, Mar 29, 2019 at 12:47:32PM +0200, Petri Latvala wrote:
> > > On Thu, Mar 28, 2019 at 05:47:19PM +0100, Janusz Krzysztofik wrote:
> > > > Run a dummy load in background to put some workload on a device, then
> > > > try
> > > > to either remove (unplug) the device from its bus, or unbind the
> > > > device's
> > > > driver from it, depending on which subtest has been selected.
> > > > 
> > > > The driver hot unbind / device hot unplug operation is expected to
> > > > succeed
> > > > in a reasonable time, however long timeouts are used to allow kernel
> > > > level timeouts pop up first if any.
> > > > 
> > > > Please note that if running both subtests consecutively, the second one
> > > > is always skipped as the device is no longer available as soon as the
> > > > first subtest is completed.  The most reliable way to run another
> > > > subtest
> > > > is to reboot the system first, then select next subtest to be run.
> > > 
> > > This is a requirement that won't fly for CI use. Is the
> > > rebinding/whatever of the device not possible to do? By the test?
> > > 
> > > Does it also apply to running other test binaries after running the
> > > first subtest? As in, is it just a timing issue?
> > 
> > Yeah like the module reload testcase this testcase needs to restore the
> > driver to working state afterwards. I think there's a corresponding rebind
> > sysfs file too.
> 
> I see your point, however I don't see how that could be done.
> 
> I think we can't say that the module reload test restores the driver to 
> working state.  It only TRIES to to that, and that's the merit of that test to 
> check if module reload succeeds or not.  I think there is no way to implement 
> "restore the driver to working state" that would work under all circumstances, 
> even if there is something wrong (a bug?) that causes it failing.  In other 
> words, I think a user should never assume the driver is in working state after 
> the i915_module_load test is run as that operation may simply fail.  The user 
> should look at the test result to learn about the driver state.
> 
> How the unplug/unbind test should proceed if it occurs there is something 
> wrong after module reload?  FAIL? SKIP? Or still SUCCESS, assuming unplug 
> itself succeeds? How the test should pass the information on module reload 
> results to a user? If we start asserting results of module reload to ensure 
> the driver is in working state, that will be a different test, not the 
> intended one, I believe.
> 
> The best scenario for CI I can see is to schedule i915_module_load just 
> following drm_hot_remove.
> 
> Am I missing something?

We have higher quality standards than "best effort, probably blows up". If
stuff does blow up we have mitigation measures in CI, as Petri explains,
but fundamentally we just need to fix the bugs.

And there's going to be tons of huge bugs in hotunplug of the entire
driver, that's why I joked you need an entire team of engineers helping
you fix stuff. This happened when we started seriously testing module
reload, or gpu reset, or atomic modesetting or any other big feature.

Wrt implementation: i915_hot_remove (it's not really a drm test) needs to
do all that on its own. And if the driver isn't restored (good sanity
check is to issue a batch with an MI write and check it still completes
and does something), then the test should fail. Because in reality if the
user unplugs and replugs, and the gpu/driver don't work, they're not going
to be happy. CI doesn't impose anything stricter here.

> > btw since you started with this, bunch more subtests we need to validate
> > this:
> > - unbind while an ioctl is called. Especially fun with all the ones that
> >   can block. We need a subtest for each ioctl to make sure stuff works
> >   correctly. At least, complex ioctl probably need more than that.
> > 
> > - unplug while atomic commit pending
> > 
> > - unplug with dma-buf exported, with various external users doing stuff to
> >   that dma-buf
> > 
> > - unplug with dma-fence exported (either as syncpt, drm_syncobj or
> >   implicitly attached to a dma-buf), this one probably needs lots of
> >   a few variants to cover everything.
> > 
> > There's probably a lot more (you should discover endless amounts of oops,
> > this area is full of bugs), but this should at least get you started.
> 
> Many thanks for this list. As an i915 beginner, I was really lost while trying 
> to define workloads which are worth of testing.  I'll probably ask some more 
> question later if I'm in a position to continue this effort (see below ;-) ).

btw, can you pls update the JIRA for this (or make one if we don't have
one yet). This is going to be a huge effort, would be good to make sure we
don't forget about any of the corner cases and validation ideas.
-Daniel

> 
> > And you probably want an entire team of engineers fixing the kernel bugs
> > you uncover :-/
> 
> :-) I think that's still worth of effort, anyway ;-)
> 
> Thanks,
> Janusz
> 
> > 
> > Cheers, Daniel
> > 
> > > > +	igt_subtest("unplug") {
> > > > +		igt_spin_t *spin;
> > > > +
> > > > +		/* Verify if a suitable DRM device/driver exists */
> > > > +		igt_skip_on(device < 0);
> > > > +
> > > > +		/* Run background workload */
> > > > +		spin = igt_spin_batch_new(device, .flags = 
> IGT_SPIN_POLL_RUN);
> > > 
> > > igt_spin_batch_new requires DRIVER_INTEL, doesn't it?
> > > 
> > > > +		igt_spin_busywait_until_running(spin);
> > > > +
> > > > +		/* Make the device disappear */
> > > > +		igt_set_timeout(60, "Device unplug timeout!");
> > > > +		device_unplug(device);
> > > > +		igt_reset_timeout();
> > > > +
> > > > +		close(device);
> > > > +		device = -1;
> > > > +	}
> > > > +
> > > > +	igt_subtest("unbind") {
> > > > +		igt_spin_t *spin;
> > > > +
> > > > +		/* Verify if a suitable DRM device/driver exists */
> > > > +		igt_skip_on(device < 0);
> > > 
> > > Ah, I see. The skip if running consecutively is because the previous
> > > subtest closed the fd.
> > > 
> > > Another fixture before this subtest, get the device plugged back,
> > > reopen driver?
> 
> --------------------------------------------------------------------
> 
> Intel Technology Poland sp. z o.o.
> ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
> 
> Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek
> przegladanie lub rozpowszechnianie jest zabronione.
> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by
> others is strictly prohibited.
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


More information about the igt-dev mailing list