[PATCH 1/2] drm/xe: Introduce flag to indicate possible fault injection
Michal Wajdeczko
michal.wajdeczko at intel.com
Mon May 12 20:20:28 UTC 2025
On 12.05.2025 20:36, John Harrison wrote:
> On 5/12/2025 9:19 AM, Michal Wajdeczko wrote:
>> When running some fault injection tests the driver might generate
>> a lot of error logs which might unnecessary stress our CI systems.
>>
>> Introduce a flag exposed in debugfs that can be used by the fault
>> injection tests to give the driver a hint to suppress non-essential
>> error logs or dumps that might be otherwise generated.
> Why use debugfs?
because IMO it's more tailored solution, as actually we just want some
flag to control level of diagnostics generated by the driver on error,
and as you already mentioned, modparam likely would be not accepted
> The whole point of the fault injection test is that it
> replaces an existing function with something that just returns an error
> code. Seems the perfect way to have a function which is simply "return
> 0;" in normal usage but modified by the injection test to return an
> error code when a test is running. Which is what the original patch was
> doing. That way everything is self contained, there are no extraneous
> interfaces in unrelated subsystems.
while at the first look this idea might solve the problem of controlling
the diagnostics level, the issue is that it would require exposing from
the driver either a dummy state function or promoting an existing
diagnostic function to be a fault-injectable. in both cases such
functions would then require configuration steps where any of advanced
fault-injection parameters will never be fully explored (like
probability, space) so this again sounds like overshooting
IMO instead of adding more and more functions as "fault-injectable" and
marking them then as 100% fail, we should try to find more atomic points
of failure (like no memory, no GGTT space, no VRAM, dead FW, broken FW
comm, lost MMIO) and then take advantage of the framework that would
inject faults all over with some randomness or deterministic and
potentially overlapping with each other. this would allow to write more
generic test that will list existing "injectable" functions, without a
need to hardcoding them in the test, where clearly our dummy state
function or diagnostic function wont fit and will require again special
handling
>
> Unless you are wanting a more generic 'test_in_progress' function that
> is not specific to fault injection, then using debugfs seems like an
> unnecessary complication.
yes, more generic approach is always better, and with debugfs/flag
approach it would likely enable us to suppress some dumps while doing
some CAT tests or live kunit negative GuC tests
>
> John.
>
>>
>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> Cc: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>> Cc: John Harrison <john.c.harrison at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_debugfs.c | 5 +++++
>> drivers/gpu/drm/xe/xe_device.h | 12 ++++++++++++
>> drivers/gpu/drm/xe/xe_device_types.h | 9 +++++++++
>> 3 files changed, 26 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/
>> xe_debugfs.c
>> index d0503959a8ed..0567a57597d3 100644
>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>> @@ -235,4 +235,9 @@ void xe_debugfs_register(struct xe_device *xe)
>> xe_pxp_debugfs_register(xe->pxp);
>> fault_create_debugfs_attr("fail_gt_reset", root,
>> >_reset_failure);
>> +
>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>> + debugfs_create_bool("fault_injection_in_progress", 0600, root,
>> + &xe->fault_injection_in_progress);
>> +#endif
>> }
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/
>> xe_device.h
>> index 0bc3bc8e6803..ea25d8161050 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -209,4 +209,16 @@ void xe_file_put(struct xe_file *xef);
>> #define LNL_FLUSH_WORK(wrk__) \
>> flush_work(wrk__)
>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>> +{
>> + return xe->fault_injection_in_progress;
>> +}
>> +#else
>> +static inline bool xe_fault_injection_in_progress(struct xe_device *xe)
>> +{
>> + return false;
>> +}
>> +#endif
>> +
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/
>> xe/xe_device_types.h
>> index 06c65dace026..513a811a3121 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -578,6 +578,15 @@ struct xe_device {
>> u8 vm_inject_error_position;
>> #endif
>> +#if IS_ENABLED(CONFIG_FAULT_INJECTION)
>> + /**
>> + * @fault_injection_in_progress: flag used by the fault injection
>> + * tests to allow the driver to suppress non-essential error dumps
>> + * that might be otherwise generated due to an injected fault.
>> + */
>> + bool fault_injection_in_progress;
>> +#endif
>> +
>> /* private: */
>> #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
>
More information about the Intel-xe
mailing list