[PATCH 1/2] drm/xe: Introduce flag to indicate possible fault injection

Thu May 15 10:25:02 UTC 2025


On 14.05.2025 23:05, John Harrison wrote:
> On 5/12/2025 1:20 PM, Michal Wajdeczko wrote:
>> On 12.05.2025 20:36, John Harrison wrote:
>>> On 5/12/2025 9:19 AM, Michal Wajdeczko wrote:
>>>> When running some fault injection tests the driver might generate
>>>> a lot of error logs which might unnecessary stress our CI systems.
>>>>
>>>> Introduce a flag exposed in debugfs that can be used by the fault
>>>> injection tests to give the driver a hint to suppress non-essential
>>>> error logs or dumps that might be otherwise generated.
>>> Why use debugfs?
>> because IMO it's more tailored solution, as actually we just want some
>> flag to control level of diagnostics generated by the driver on error,
>> and as you already mentioned, modparam likely would be not accepted
> Tailored to what? It is certainly not tailored to testing module load
> because debugfs does not exist until module load is completed. And a lot
> of the fault injection tests are testing the error paths during load.
> debugfs will simply not work at all for those.

ouch, my bad, I guess I was (too tired and) too much focused on the
scenarios that injects errors at runtime that I fully missed the
limitations of the probe time

> 
>>
>>> The whole point of the fault injection test is that it
>>> replaces an existing function with something that just returns an error
>>> code. Seems the perfect way to have a function which is simply "return
>>> 0;" in normal usage but modified by the injection test to return an
>>> error code when a test is running. Which is what the original patch was
>>> doing. That way everything is self contained, there are no extraneous
>>> interfaces in unrelated subsystems.
>> while at the first look this idea might solve the problem of controlling
>> the diagnostics level, the issue is that it would require exposing from
>> the driver either a dummy state function or promoting an existing
>> diagnostic function to be a fault-injectable. in both cases such
> As opposed to requiring an entire new debugfs interface? A single
> exposed function seems simpler to me.

it's still one new entry, regardless it's added to the debugfs or
fault-injection framework

and IMO using/controlling the flag using debugfs is much simpler and
cleaner than whole configuration using fault-injection framework

> 
>> functions would then require configuration steps where any of advanced
>> fault-injection parameters will never be fully explored (like
>> probability, space) so this again sounds like overshooting
> I have no idea what you mean by this. What configuration do you need for
> a function which is literally just 'return true/false'? You can do
> whatever fancy fault injection test you like. That is totally
> independent of nobbling one 'is_fault_injection_acitve()' function to
> prevent excess debug spam during any and every fault injection test.

compare configuration steps using fault injection framework:

$ FAILTYPE=fail_function
$ FAILFUNC=is_fault_injection_active

$ echo > /sys/kernel/debug/$FAILTYPE/inject
$ echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
$ printf %#x -19 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval

$ echo N > /sys/kernel/debug/$FAILTYPE/task-filter
$ echo 10 > /sys/kernel/debug/$FAILTYPE/probability
$ echo 0 > /sys/kernel/debug/$FAILTYPE/interval
$ echo -1 > /sys/kernel/debug/$FAILTYPE/times
$ echo 0 > /sys/kernel/debug/$FAILTYPE/space
$ echo 1 > /sys/kernel/debug/$FAILTYPE/verbose

with the same done over debugfs:

$ echo Y > /sys/kernel/debug/dri/0000:00:02.0/is_fault_injection_active

and while we know this will not work during the probe time, since on the
driver side it's still a single bool flag, we can try to control its
initial value using configfs, and while it would require few more steps
it still be simpler/shorter than above fault-injection steps

> 
>>
>> IMO instead of adding more and more functions as "fault-injectable" and
>> marking them then as 100% fail, we should try to find more atomic points
>> of failure (like no memory, no GGTT space, no VRAM, dead FW, broken FW
>> comm, lost MMIO) and then take advantage of the framework that would
>> inject faults all over with some randomness or deterministic and
>> potentially overlapping with each other. this would allow to write more
>> generic test that will list existing "injectable" functions, without a
>> need to hardcoding them in the test, where clearly our dummy state
>> function or diagnostic function wont fit and will require again special
>> handling
> I'm not sure if you are arguing for a re-write of the fault injection
> framework in the kernel as a whole or just complaining that our IGT

our IGT approach and usage

> implementation is sub-optimal? But if you look at the corresponding IGT
> patch, the code to nobble the 'is_injection_active()' function is
> trivial - it just re-uses code that already exits for nobbling the test
> target functions. So it is just one extra call at the start of each
> test. There is no complexity or fragile hard-coding. And it does not
> matter what fault injection test is going to be run - how complex or
> trivial.

but even if we say that we just reuse the existing code, since it was
primary designed for different purposes, IMO we still are using wrong
tools as there are more appropriate solutions to control a flag inside
the driver (like debugfs and configfs)

> 
>>> Unless you are wanting a more generic 'test_in_progress' function that
>>> is not specific to fault injection, then using debugfs seems like an
>>> unnecessary complication.
>> yes, more generic approach is always better, and with debugfs/flag
>> approach it would likely enable us to suppress some dumps while doing
>> some CAT tests or live kunit negative GuC tests
> Except when generic means unusable because it is not available yet (see
> above about module load testing). And kunit tests already have an
> infrastructure and API all of their own for controlling how executing is
> done during the test. Not sure what you mean by CAT tests?

something like igt at xe_exec_reset@cat-error

> 
> John.
> 
>>
>>> John.
>>>
>>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>> Cc: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>>> Cc: John Harrison <john.c.harrison at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_debugfs.c      |  5 +++++
>>>>    drivers/gpu/drm/xe/xe_device.h       | 12 ++++++++++++
>>>>    drivers/gpu/drm/xe/xe_device_types.h |  9 +++++++++
>>>>    3 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/
>>>> xe_debugfs.c
>>>> index d0503959a8ed..0567a57597d3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>>>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>>>> @@ -235,4 +235,9 @@ void xe_debugfs_register(struct xe_device *xe)
>>>>        xe_pxp_debugfs_register(xe->pxp);
>>>>          fault_create_debugfs_attr("fail_gt_reset", root,
>>>> &gt_reset_failure);
>>>> +
>>>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>> +    debugfs_create_bool("fault_injection_in_progress", 0600, root,
>>>> +                &xe->fault_injection_in_progress);
>>>> +#endif
>>>>    }
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/
>>>> xe_device.h
>>>> index 0bc3bc8e6803..ea25d8161050 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>> @@ -209,4 +209,16 @@ void xe_file_put(struct xe_file *xef);
>>>>    #define LNL_FLUSH_WORK(wrk__) \
>>>>        flush_work(wrk__)
>>>>    +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>> +static inline bool xe_fault_injection_in_progress(struct xe_device
>>>> *xe)
>>>> +{
>>>> +    return xe->fault_injection_in_progress;
>>>> +}
>>>> +#else
>>>> +static inline bool xe_fault_injection_in_progress(struct xe_device
>>>> *xe)
>>>> +{
>>>> +    return false;
>>>> +}
>>>> +#endif
>>>> +
>>>>    #endif
>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/
>>>> xe/xe_device_types.h
>>>> index 06c65dace026..513a811a3121 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>> @@ -578,6 +578,15 @@ struct xe_device {
>>>>        u8 vm_inject_error_position;
>>>>    #endif
>>>>    +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>> +    /**
>>>> +     * @fault_injection_in_progress: flag used by the fault injection
>>>> +     * tests to allow the driver to suppress non-essential error dumps
>>>> +     * that might be otherwise generated due to an injected fault.
>>>> +     */
>>>> +    bool fault_injection_in_progress;
>>>> +#endif
>>>> +
>>>>        /* private: */
>>>>      #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>