[PATCH 1/2] drm/xe: Introduce flag to indicate possible fault injection

Wed May 14 21:05:49 UTC 2025

On 5/12/2025 1:20 PM, Michal Wajdeczko wrote:
> On 12.05.2025 20:36, John Harrison wrote:
>> On 5/12/2025 9:19 AM, Michal Wajdeczko wrote:
>>> When running some fault injection tests the driver might generate
>>> a lot of error logs which might unnecessary stress our CI systems.
>>>
>>> Introduce a flag exposed in debugfs that can be used by the fault
>>> injection tests to give the driver a hint to suppress non-essential
>>> error logs or dumps that might be otherwise generated.
>> Why use debugfs?
> because IMO it's more tailored solution, as actually we just want some
> flag to control level of diagnostics generated by the driver on error,
> and as you already mentioned, modparam likely would be not accepted
Tailored to what? It is certainly not tailored to testing module load 
because debugfs does not exist until module load is completed. And a lot 
of the fault injection tests are testing the error paths during load. 
debugfs will simply not work at all for those.

>
>> The whole point of the fault injection test is that it
>> replaces an existing function with something that just returns an error
>> code. Seems the perfect way to have a function which is simply "return
>> 0;" in normal usage but modified by the injection test to return an
>> error code when a test is running. Which is what the original patch was
>> doing. That way everything is self contained, there are no extraneous
>> interfaces in unrelated subsystems.
> while at the first look this idea might solve the problem of controlling
> the diagnostics level, the issue is that it would require exposing from
> the driver either a dummy state function or promoting an existing
> diagnostic function to be a fault-injectable. in both cases such
As opposed to requiring an entire new debugfs interface? A single 
exposed function seems simpler to me.

> functions would then require configuration steps where any of advanced
> fault-injection parameters will never be fully explored (like
> probability, space) so this again sounds like overshooting
I have no idea what you mean by this. What configuration do you need for 
a function which is literally just 'return true/false'? You can do 
whatever fancy fault injection test you like. That is totally 
independent of nobbling one 'is_fault_injection_acitve()' function to 
prevent excess debug spam during any and every fault injection test.

>
> IMO instead of adding more and more functions as "fault-injectable" and
> marking them then as 100% fail, we should try to find more atomic points
> of failure (like no memory, no GGTT space, no VRAM, dead FW, broken FW
> comm, lost MMIO) and then take advantage of the framework that would
> inject faults all over with some randomness or deterministic and
> potentially overlapping with each other. this would allow to write more
> generic test that will list existing "injectable" functions, without a
> need to hardcoding them in the test, where clearly our dummy state
> function or diagnostic function wont fit and will require again special
> handling
I'm not sure if you are arguing for a re-write of the fault injection 
framework in the kernel as a whole or just complaining that our IGT 
implementation is sub-optimal? But if you look at the corresponding IGT 
patch, the code to nobble the 'is_injection_active()' function is 
trivial - it just re-uses code that already exits for nobbling the test 
target functions. So it is just one extra call at the start of each 
test. There is no complexity or fragile hard-coding. And it does not 
matter what fault injection test is going to be run - how complex or 
trivial.

>> Unless you are wanting a more generic 'test_in_progress' function that
>> is not specific to fault injection, then using debugfs seems like an
>> unnecessary complication.
> yes, more generic approach is always better, and with debugfs/flag
> approach it would likely enable us to suppress some dumps while doing
> some CAT tests or live kunit negative GuC tests
Except when generic means unusable because it is not available yet (see 
above about module load testing). And kunit tests already have an 
infrastructure and API all of their own for controlling how executing is 
done during the test. Not sure what you mean by CAT tests?

John.

>
>> John.
>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>> Cc: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Cc: John Harrison <john.c.harrison at intel.com>
>>> ---
>>>    drivers/gpu/drm/xe/xe_debugfs.c      |  5 +++++
>>>    drivers/gpu/drm/xe/xe_device.h       | 12 ++++++++++++
>>>    drivers/gpu/drm/xe/xe_device_types.h |  9 +++++++++
>>>    3 files changed, 26 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/
>>> xe_debugfs.c
>>> index d0503959a8ed..0567a57597d3 100644
>>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>>> @@ -235,4 +235,9 @@ void xe_debugfs_register(struct xe_device *xe)
>>>        xe_pxp_debugfs_register(xe->pxp);
>>>          fault_create_debugfs_attr("fail_gt_reset", root,
>>> &gt_reset_failure);
>>> +
>>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>> +    debugfs_create_bool("fault_injection_in_progress", 0600, root,
>>> +                &xe->fault_injection_in_progress);
>>> +#endif
>>>    }
>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/
>>> xe_device.h
>>> index 0bc3bc8e6803..ea25d8161050 100644
>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>> @@ -209,4 +209,16 @@ void xe_file_put(struct xe_file *xef);
>>>    #define LNL_FLUSH_WORK(wrk__) \
>>>        flush_work(wrk__)
>>>    +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>>> +{
>>> +    return xe->fault_injection_in_progress;
>>> +}
>>> +#else
>>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>>> +{
>>> +    return false;
>>> +}
>>> +#endif
>>> +
>>>    #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/
>>> xe/xe_device_types.h
>>> index 06c65dace026..513a811a3121 100644
>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>> @@ -578,6 +578,15 @@ struct xe_device {
>>>        u8 vm_inject_error_position;
>>>    #endif
>>>    +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>>> +    /**
>>> +     * @fault_injection_in_progress: flag used by the fault injection
>>> +     * tests to allow the driver to suppress non-essential error dumps
>>> +     * that might be otherwise generated due to an injected fault.
>>> +     */
>>> +    bool fault_injection_in_progress;
>>> +#endif
>>> +
>>>        /* private: */
>>>      #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)