[PATCH v2] drm/xe: Add helper function to inject fault into ct_dead_capture()

Thu May 8 23:00:17 UTC 2025

On 5/6/2025 10:13 PM, K V P, Satyanarayana wrote:
>> -----Original Message-----
>> From: Harrison, John C <john.c.harrison at intel.com>
>> Sent: Wednesday, May 7, 2025 4:50 AM
>> To: K V P, Satyanarayana <satyanarayana.k.v.p at intel.com>;
>> intel-xe at lists.freedesktop.org
>> Cc: Chauhan, Aditya <aditya.chauhan at intel.com>; Nikula, Jani
>> <jani.nikula at intel.com>
>> Subject: Re: [PATCH v2] drm/xe: Add helper function to inject fault into
>> ct_dead_capture()
>>
>> On 4/30/2025 6:17 AM, Satyanarayana K V P wrote:
>>> When faults are injected into the xe_guc_ct_send_recv() and
>>> xe_guc_mmio_send_recv() functions, the CI test systems run out of
>>> space and crash. To avoid this, add a new helper function,
>>> xe_should_fail_ct_dead_capture(). When a fault is injected into this
>>> helper, the CT dead capture is skipped, which suppresses CT dumps in the log.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Suggested-by: John Harrison <John.C.Harrison at Intel.com>
>>> Tested-by: Aditya Chauhan <aditya.chauhan at intel.com>
>>>
>>> ---
>>> Cc: Jani Nikula <jani.nikula at intel.com>
>>>
>>> V1 -> V2:
>>> - Fixed review comments.
>>> ---
>>>    drivers/gpu/drm/xe/xe_guc_ct.c | 21 +++++++++++++++++++++
>>>    1 file changed, 21 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> index 2447de0ebedf..d6e7a8b80d8c 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> @@ -1770,6 +1770,20 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb)
>>>    }
>>>
>>>    #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>>> +/**
>>> + * xe_should_fail_ct_dead_capture - Helper function to inject fault.
>>> + *
>>> + * This is a helper function to inject a fault into ct_dead_capture().
>>> + * Because the fault is injected through this function, the compiler
>>> + * must not optimize it into an inline function.
>>> + * "noinline" is added to prevent that optimization.
>>> + */
>>> +static noinline int xe_should_fail_ct_dead_capture(void)
>>> +{
>>> +	return 0;
>>> +}
>>> +ALLOW_ERROR_INJECTION(xe_should_fail_ct_dead_capture, ERRNO);
>>> +
>>>    static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reason_code)
>>>    {
>>>    	struct xe_guc_log_snapshot *snapshot_log;
>>> @@ -1778,6 +1792,13 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso
>>>    	unsigned long flags;
>>>    	bool have_capture;
>>>
>>> +	/*
>>> +	 * A huge dump is generated when injecting errors into the GuC CT/MMIO
>>> +	 * functions. So, suppress the dump when a fault is injected.
>>> +	 */
>>> +	if (xe_should_fail_ct_dead_capture())
>> Is it worth making this a more generic 'is_error_fault_injected()'? Then
>> it can be used by random other bits of code if/when necessary. And maybe
> I do not think we can have a generic error injection function. If the generic function is called in multiple places
> (maybe in the future), the error may not be injected at the point we intend, because the first call will inject the error.
Not following. The point of 'xe_should_fail_ct_dead_capture' is really 
just to say whether a fault injection is in progress or not. If so, then 
don't do any of the dumping because this is not a real error. It is not 
actually any part of the error injection process itself. It doesn't 
matter how many bits of code call that function, the behaviour won't 
change other than to skip the code we don't want to run. E.g. if there 
are other 'dump on catastrophic failure' type code paths either now or 
in the future, they can all use the same 
'is_this_an_error_injection_test' function to skip when the error is fake.
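
A rough userspace sketch of that idea, not the actual xe code. The name
is_error_fault_injected() is the one suggested earlier in this thread;
SKETCH_ERROR_INJECTION, sketch_set_injected_errno() and the two *_sketch()
dump paths are hypothetical stand-ins for CONFIG_FUNCTION_ERROR_INJECTION,
the fault-injection framework overriding the return value, and real dump
code. In the kernel the helper would presumably just return 0 and be tagged
with ALLOW_ERROR_INJECTION(..., ERRNO), as in the patch above.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_FUNCTION_ERROR_INJECTION. */
#ifndef SKETCH_ERROR_INJECTION
#define SKETCH_ERROR_INJECTION 1
#endif

#if SKETCH_ERROR_INJECTION
/*
 * Stand-in for the framework overriding the helper's return value.
 * This variable does not exist in the kernel; it only simulates the
 * effect of an injected error for this sketch.
 */
static int sketch_injected_errno;

static void sketch_set_injected_errno(int err)
{
	assert(err <= 0);	/* ERRNO-style injection: 0 or a negative errno */
	sketch_injected_errno = err;
}

/* Reports whether an injection test is in progress; it injects nothing itself. */
static int is_error_fault_injected(void)
{
	return sketch_injected_errno;
}
#else
/* Without injection support the check collapses to a constant 0. */
static inline void sketch_set_injected_errno(int err) { (void)err; }
static inline int is_error_fault_injected(void) { return 0; }
#endif

static int ct_dumps;	/* counts dumps that were actually produced */
static int other_dumps;

/* Models ct_dead_capture(): skip the dump when the error is fake. */
static void ct_dead_capture_sketch(void)
{
	if (is_error_fault_injected())
		return;
	ct_dumps++;
}

/* A second, unrelated 'dump on catastrophic failure' path using the same check. */
static void other_dump_sketch(void)
{
	if (is_error_fault_injected())
		return;
	other_dumps++;
}
```

In this sketch both dump paths see the same answer from the helper no matter
how many of them call it, and with SKETCH_ERROR_INJECTION set to 0 the check
costs nothing.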

John.

>> also have an inline/#define version for when
>> CONFIG_FUNCTION_ERROR_INJECTION is not defined?
>>
> Will do and send new patch.
>> John.
>>
>>> +		return;
>>> +
>>>    	if (ctb)
>>>    		ctb->info.broken = true;
>>>