[PATCH v5] drm/xe: Add driver load error injection

Wed Sep 11 21:45:06 UTC 2024

On Wed, Sep 11, 2024 at 04:48:11PM GMT, Rodrigo Vivi wrote:
>On Wed, Sep 11, 2024 at 12:40:04PM +0200, Francois Dugast wrote:
>> On Tue, Sep 10, 2024 at 04:33:21PM -0500, Lucas De Marchi wrote:
>> > On Tue, Sep 10, 2024 at 05:11:34PM GMT, Rodrigo Vivi wrote:
>> > > On Tue, Sep 10, 2024 at 05:22:41PM +0200, Francois Dugast wrote:
>> > > > Those new macros inject errors by overriding return codes. They must
>> > > > manually be called, preferably at the very beginning of the function
>> > > > that will fault, otherwise if not possible by turning this pattern:
>> > > >
>> > > >     err = foo();
>> > > >     if (err)
>> > > >         return err;
>> > > >
>> > > > into:
>> > > >
>> > > >     err = foo();
>> > > >     err = xe_device_inject_driver_probe_error(xe, err);
>> > > >     if (err)
>> > > >         return err;
>> > > >
>> > > > When CONFIG_DRM_XE_DEBUG is not set, this has no effect.
>> > > >
>> > > > When CONFIG_DRM_XE_DEBUG is set, the error code at checkpoint X will
>> > > > be overridden when the module argument inject_driver_load_error is
>> > > > set to value X. By doing so, it is possible to test proper error
>> > > > handling and improve robustness for current and future code. A few
>> > > > injection points are added in this patch but more need to be added.
>> > > > One way to use this error injection at driver probe is:
>> > > >
>> > > >     for i in {1..200}; do
>> > > >         echo "Run $i"
>> > > >         modprobe xe inject_driver_probe_error=$i;
>> > > >         rmmod xe;
>> > > >     done
>> > >
>> > > can we have an IGT test so we ensure that CI is tracking and we are working
>> > > to close the existing issues?
>> >
>> > yeah.. that would be great. I think it would make more sense to use
>> > bind/unbind in igt.
>
>Hmm... but that would require a deferred_probe and then the bind to force the reprobe...
>kind of complicate things here...

I don't get it. The only thing required is to set a param, try bind, rinse
and repeat until module loads or fails with an extraneous error.

typing something here without really compiling and giving much thought

	device = arg_or_scan(); // 0000:00:02.0, 0000:03:00.0, etc...
	d = open("/sys/bus/pci", O_RDONLY);
	igt_sysfs_set(d, "drivers_autoprobe", "0");
	close(d);

	igt_assert_eq(modprobe("xe"), 0);

	d = open("/sys/module/xe/parameters/", O_RDONLY);

	for (n = 1, ret = INJECTION_FAULT_CODE; ret; n++) {
		igt_sysfs_printf(d, "inject_driver_probe_error", "%d", n);
		ret = bind(device);
		if (!ret) {
			igt_debug("Tested %d fault injection points, probe completed\n", n);
		}
		igt_assert(ret == 0 || ret == INJECTION_FAULT_CODE);
	}

	unbind(device);
	close(d);
	// reset it back so it doesn't affect other tests
	igt_sysfs_set(d, "drivers_autoprobe", "1");

when migrating to real fault-injection, I think we'd basically list the
functions where the fault injection can be added and replace the loop
above by writing the function name in the proper debugfs file, together
with other needed settings.

>
>> >
>> > >
>> > > >
>> > > > In the future this is expected to be replaced by the infrastructure
>> > > > provided by fault-inject.h
>> > >
>> > > I was taking a look at the fault-inject again. It could easily be a
>> > > global fault_attr with a module sysfs entry, then during the test
>> > > you load the module, then unbind the device, then change the fault-inject
>> > > probability and time and then bind it back what will reprobe, but now
>> > > with the fault-injected.
>> > >
>> > > The only problem with the fault-inject idea is that it would require
>> > > a very granular thing with multiple fault_attr, one per failure.
>> >
>> > when going with a real fault-injection, I'd actually try to cover it per
>> > function as described here:
>> >
>> > https://docs.kernel.org/fault-injection/fault-injection.html
>> > 	/sys/kernel/debug/fail_function/inject:
>> >
>> > 	Format: { ‘function-name’ | ‘!function-name’ | ‘’ }
>> >
>> > 	specifies the target function of error injection by name. If the
>> > 	function name leads ‘!’ prefix, given function is removed from injection
>> > 	list. If nothing specified (‘’) injection list is cleared.
>> >
>> > Integration via ALLOW_ERROR_INJECTION() is similar to the
>> > KUNIT_STATIC_STUB_REDIRECT() we already use.
>> >
>> > In my review I didn't bother to go with fault-inject directly because we
>> > will probably need to refactor the code so the failure points are in
>> > their own functions. Something we don't have today. Short term it's
>> > important to fix the current/unknown problems. Mid term we can convert
>> > things piece meal.
>> >
>> > Are we on the same page?
>>
>> It is also my intention with this patch, get something in with minimal risk
>> and changes so we can soon focus on solving potential issues it highlights.
>>
>> In parallel I am preparing a RFC based on fault-inject with a proposal how
>> we can use fail_function with a few real examples from our code that we can
>> take more time to discuss thoroughly.
>
>I'm also on the same page. Let's do it.
>
>But we need to at least:
>1. fix the documentation return statement
>2. fix checkpatch on module_param_named_unsafe huge line

and maybe make it already rw so we can try the bind approach above.

>3. IGT ?!

if we are running it manually, I think it should be enough.
Because running this in CI for every patch series will be a
pain until the test becomes an always-passing thing.

Lucas De Marchi