[PATCH 1/2] drm/xe: Introduce flag to indicate possible fault injection

Wed May 21 21:15:06 UTC 2025

On 5/15/2025 3:25 AM, Michal Wajdeczko wrote:
> On 14.05.2025 23:05, John Harrison wrote:
>> On 5/12/2025 1:20 PM, Michal Wajdeczko wrote:
>>> On 12.05.2025 20:36, John Harrison wrote:
>>>> On 5/12/2025 9:19 AM, Michal Wajdeczko wrote:
>>>>> When running some fault injection tests the driver might generate
>>>>> a lot of error logs which might unnecessarily stress our CI systems.
>>>>>
>>>>> Introduce a flag exposed in debugfs that can be used by the fault
>>>>> injection tests to give the driver a hint to suppress non-essential
>>>>> error logs or dumps that might be otherwise generated.
>>>> Why use debugfs?
>>> because IMO it's a more tailored solution, as we actually just want some
>>> flag to control the level of diagnostics generated by the driver on error,
>>> and as you already mentioned, a modparam would likely not be accepted
>> Tailored to what? It is certainly not tailored to testing module load
>> because debugfs does not exist until module load is completed. And a lot
>> of the fault injection tests are testing the error paths during load.
>> debugfs will simply not work at all for those.
> ouch, my bad, I guess I was (too tired and) too focused on the
> scenarios that inject errors at runtime, so I completely missed the
> limitations at probe time
>
>>>> The whole point of the fault injection test is that it
>>>> replaces an existing function with something that just returns an error
>>>> code. Seems the perfect way to have a function which is simply "return
>>>> 0;" in normal usage but modified by the injection test to return an
>>>> error code when a test is running. Which is what the original patch was
>>>> doing. That way everything is self contained, there are no extraneous
>>>> interfaces in unrelated subsystems.
>>> while at first glance this idea might solve the problem of controlling
>>> the diagnostics level, the issue is that it would require exposing from
>>> the driver either a dummy state function or promoting an existing
>>> diagnostic function to be fault-injectable. in both cases such
>> As opposed to requiring an entire new debugfs interface? A single
>> exposed function seems simpler to me.
> it's still one new entry, regardless of whether it's added to debugfs or
> to the fault-injection framework
>
> and IMO controlling the flag using debugfs is much simpler and
> cleaner than the whole configuration using the fault-injection framework
>
>>> functions would then require configuration steps where the advanced
>>> fault-injection parameters (like probability, space) would never be
>>> meaningfully used, so this again sounds like overshooting
>> I have no idea what you mean by this. What configuration do you need for
>> a function which is literally just 'return true/false'? You can do
>> whatever fancy fault injection test you like. That is totally
>> independent of nobbling one 'is_fault_injection_active()' function to
>> prevent excess debug spam during any and every fault injection test.
> compare the configuration steps using the fault injection framework:
>
> $ FAILTYPE=fail_function
> $ FAILFUNC=is_fault_injection_active
>
> $ echo > /sys/kernel/debug/$FAILTYPE/inject
> $ echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
> $ printf %#x -19 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
>
> $ echo N > /sys/kernel/debug/$FAILTYPE/task-filter
> $ echo 10 > /sys/kernel/debug/$FAILTYPE/probability
> $ echo 0 > /sys/kernel/debug/$FAILTYPE/interval
> $ echo -1 > /sys/kernel/debug/$FAILTYPE/times
> $ echo 0 > /sys/kernel/debug/$FAILTYPE/space
> $ echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
>
> with the same done over debugfs:
>
> $ echo Y > /sys/kernel/debug/dri/0000:00:02.0/is_fault_injection_active
Except that the interface is only used by fault injection tests. It is
never driven by a user manually poking at debugfs files. It is
done by the IGT fault injection test, which already has everything coded
up as a helper function. So it is just a one-line change in the test,
which is the only place that ever needs to do it.

Moving the control to debugfs/configfs is splitting the functionality 
across multiple unrelated subsystems. Doing it via the existing 
injection interface keeps everything together within a single subsystem.
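
For illustration only, the manual steps quoted above could be wrapped
once in a helper along these lines. This is a minimal sketch, not the IGT
helper: the function name inject_fault, the DEBUGFS override variable and
the default probability of 100 are assumptions made for this example (the
quoted steps used 10), and the per-function retval directory is expected
to be created by the kernel when the function is written to "inject".

```shell
# Hypothetical helper (not the IGT implementation): marks one function as
# failing through the kernel's fail_function debugfs interface.
# Usage: inject_fault <function> <retval> [probability]
# DEBUGFS may be overridden to point at a scratch directory for a dry run.
inject_fault() {
    fi_func=$1
    fi_retval=$2
    fi_prob=${3:-100}   # assumption: always fire; the quoted steps used 10
    fi_dir="${DEBUGFS:-/sys/kernel/debug}/fail_function"

    # Clear the current injection list, then register the target function.
    echo > "$fi_dir/inject" || return 1
    echo "$fi_func" > "$fi_dir/inject" || return 1

    # Value the overridden function should return, written as hex.
    printf '%#x' "$fi_retval" > "$fi_dir/$fi_func/retval" || return 1

    # Generic fault-injection knobs, same set as in the quoted steps.
    echo N > "$fi_dir/task-filter"
    echo "$fi_prob" > "$fi_dir/probability"
    echo 0 > "$fi_dir/interval"
    echo -1 > "$fi_dir/times"
    echo 0 > "$fi_dir/space"
    echo 1 > "$fi_dir/verbose"
}
```

With something like that in place, each test would only need one extra
call, e.g. "inject_fault is_fault_injection_active -19", before it sets
up the faults it actually wants to exercise.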

>
> and while we know this will not work during probe time, since on the
> driver side it's still a single bool flag, we can try to control its
> initial value using configfs, and while it would require a few more steps
> it would still be simpler/shorter than the above fault-injection steps
Which is just adding more and more complication to something which does 
not need it.

>
>>> IMO instead of adding more and more functions as "fault-injectable" and
>>> then marking them as 100% fail, we should try to find more atomic points
>>> of failure (like no memory, no GGTT space, no VRAM, dead FW, broken FW
>>> comm, lost MMIO) and then take advantage of the framework, which would
>>> inject faults all over, either randomly or deterministically, and
>>> potentially overlapping with each other. this would allow writing a more
>>> generic test that lists the existing "injectable" functions, without
>>> needing to hardcode them in the test, where clearly our dummy state
>>> function or diagnostic function won't fit and will again require special
>>> handling
>> I'm not sure if you are arguing for a re-write of the fault injection
>> framework in the kernel as a whole or just complaining that our IGT
> our IGT approach and usage
>
>> implementation is sub-optimal? But if you look at the corresponding IGT
>> patch, the code to nobble the 'is_injection_active()' function is
>> trivial - it just re-uses code that already exists for nobbling the test
>> target functions. So it is just one extra call at the start of each
>> test. There is no complexity or fragile hard-coding. And it does not
>> matter what fault injection test is going to be run - how complex or
>> trivial.
> but even if we say that we just reuse the existing code, since it was
> primarily designed for different purposes, IMO we are still using the
> wrong tools, as there are more appropriate solutions to control a flag
> inside the driver (like debugfs and configfs)
They are appropriate for controlling general purpose settings that are 
of generic use. They are not appropriate for controlling something that 
is specific to another subsystem which already has a perfectly usable 
and trivial mechanism for adding that control.

>
>>>> Unless you are wanting a more generic 'test_in_progress' function that
>>>> is not specific to fault injection, then using debugfs seems like an
>>>> unnecessary complication.
>>> yes, a more generic approach is always better, and the debugfs/flag
>>> approach would likely enable us to suppress some dumps while doing
>>> some CAT tests or live kunit negative GuC tests
>> Except when generic means unusable because it is not available yet (see
>> above about module load testing). And kunit tests already have an
>> infrastructure and API all of their own for controlling how execution is
>> done during the test. Not sure what you mean by CAT tests?
> something like igt@xe_exec_reset@cat-error
What does that have to do with this? It should not be possible to 
generate a CT_DEAD failure in the cat-error test. It should not be 
possible to generate a CT_DEAD failure with anything other than a 
selftest/injection test that deliberately breaks the internal workings 
of the driver. A cat-error is a user generatable error. As such it 
should never hit an internal error path in the KMD.

John.

>
>> John.
>>
>>>> John.
>>>>
>>>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>>> Cc: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>>>> Cc: John Harrison <john.c.harrison at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/xe/xe_debugfs.c      |  5 +++++
>>>>>     drivers/gpu/drm/xe/xe_device.h       | 12 ++++++++++++
>>>>>     drivers/gpu/drm/xe/xe_device_types.h |  9 +++++++++
>>>>>     3 files changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
>>>>> index d0503959a8ed..0567a57597d3 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>>>>> @@ -235,4 +235,9 @@ void xe_debugfs_register(struct xe_device *xe)
>>>>>         xe_pxp_debugfs_register(xe->pxp);
>>>>>           fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
>>>>> +
>>>>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>>> +    debugfs_create_bool("fault_injection_in_progress", 0600, root,
>>>>> +                &xe->fault_injection_in_progress);
>>>>> +#endif
>>>>>     }
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>> index 0bc3bc8e6803..ea25d8161050 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>> @@ -209,4 +209,16 @@ void xe_file_put(struct xe_file *xef);
>>>>>     #define LNL_FLUSH_WORK(wrk__) \
>>>>>         flush_work(wrk__)
>>>>>     +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>>>>> +{
>>>>> +    return xe->fault_injection_in_progress;
>>>>> +}
>>>>> +#else
>>>>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>>>>> +{
>>>>> +    return false;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>     #endif
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>>>> index 06c65dace026..513a811a3121 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>>> @@ -578,6 +578,15 @@ struct xe_device {
>>>>>         u8 vm_inject_error_position;
>>>>>     #endif
>>>>>     +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>>>> +    /**
>>>>> +     * @fault_injection_in_progress: flag used by the fault injection
>>>>> +     * tests to allow the driver to suppress non-essential error dumps
>>>>> +     * that might be otherwise generated due to an injected fault.
>>>>> +     */
>>>>> +    bool fault_injection_in_progress;
>>>>> +#endif
>>>>> +
>>>>>         /* private: */
>>>>>       #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)