[PATCH v3] drm/xe: Add helper function to inject fault into ct_dead_capture()

Fri May 9 00:37:51 UTC 2025

On 5/7/2025 8:32 AM, Michal Wajdeczko wrote:
> On 07.05.2025 15:15, Satyanarayana K V P wrote:
>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
>> functions, the CI test systems are going out of space and crashing. To
>> avoid this issue, a new helper function is created and when fault is
>> injected into this xe_should_fail_ct_dead_capture() helper function,
>> ct dead capture is avoided which suppresses ct dumps in the log.
>>
>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>> Suggested-by: John Harrison <John.C.Harrison at Intel.com>
>> Tested-by: Aditya Chauhan <aditya.chauhan at intel.com>
>>
>> ---
>> Cc: Jani Nikula <jani.nikula at intel.com>
>>
>> V2 -> V3:
>> - Added inline function to avoid compilation error in the absence of
>> CONFIG_FUNCTION_ERROR_INJECTION.
>>
>> V1 -> V2:
>> - Fixed review comments.
>> ---
>>   drivers/gpu/drm/xe/xe_guc_ct.c | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index 2447de0ebedf..d959cc2e7b40 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -1770,6 +1770,24 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb)
>>   }
>>   
>>   #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +/**
>> + * xe_should_fail_ct_dead_capture - Helper function to inject fault.
> this is a static function, likely we don't want full kernel-doc for it
Meaning just make it a regular comment? We definitely want documentation 
of some sort.

>
>> + *
>> + * This is a helper function to inject fault into ct_dead_capture().
>> + * As fault is injected using this function, need to make sure that
>> + * the compiler does not optimize and make it as a inline function.
>> + * To prevent compile optimization, "noinline" is added.
>> + */
>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
>> +static noinline int xe_should_fail_ct_dead_capture(void)
>> +{
>> +	return 0;
>> +}
>> +ALLOW_ERROR_INJECTION(xe_should_fail_ct_dead_capture, ERRNO);
> hmm, but do we really need to abuse the error-injection framework to
> suppress GuC log dump? maybe cleaner option would be to just add module
> parameter (maybe under CONFIG_XE_DEBUG) that we can setup in the test
> and check here?
If we cannot under any circumstances have a module option for doing 
something useful like adjusting the GuC log size then I don't see how 
once can be accepted for something that is purely a hack for reducing CI 
spam.

As per my earlier comments, having a generic 'error injection test in 
progress' function would seem a cleaner way to do this. The CT code (or 
any other similar code) can check that function and early exit as 
appropriate, but the function itself is not specific to any particular 
subsystem.

>
>> +#else
>> +static inline int xe_should_fail_ct_dead_capture(void) { return 0; }
>> +#endif
>> +
>>   static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reason_code)
>>   {
>>   	struct xe_guc_log_snapshot *snapshot_log;
>> @@ -1778,6 +1796,13 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso
>>   	unsigned long flags;
>>   	bool have_capture;
>>   
>> +	/*
>> +	 * Huge dump is getting generated when injecting error for guc CT/MMIO
>> +	 * functions. So, let us suppress the dump when fault is injected.
>> +	 */
>> +	if (xe_should_fail_ct_dead_capture())
>> +		return;
> shouldn't we exit *after* the below statement?
>
> we want to skip just a dump, not to skip 'broken' markup
This is true. The 'broken' flag still needs to be set - that even 
happens when not in an XE_DEBUG build, and is a required part of the 
error handling to prevent further cascading of errors.

John.

>
>> +
>>   	if (ctb)
>>   		ctb->info.broken = true;
>>