[PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()

Mon May 26 11:20:47 UTC 2025

On Fri, 23 May 2025, John Harrison <john.c.harrison at intel.com> wrote:
> On 5/22/2025 10:33 PM, K V P, Satyanarayana wrote:
>> Hi
>>> -----Original Message-----
>>> From: Harrison, John C <john.c.harrison at intel.com>
>>> Sent: Friday, May 23, 2025 1:09 AM
>>> To: K V P, Satyanarayana <satyanarayana.k.v.p at intel.com>; intel-
>>> xe at lists.freedesktop.org
>>> Cc: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>; Nikula, Jani
>>> <jani.nikula at intel.com>
>>> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into
>>> ct_dead_capture()
>>>
>>> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote:
>>>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
>>>> functions, the CI test systems are going out of space and crashing. To
>>>> avoid this issue, a new helper function is created and when fault is
>>>> injected into this xe_inject_fault() helper function, ct dead capture
>>>> is avoided which suppresses ct dumps in the log.
>>>>
>>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>>> Suggested-by: John Harrison <John.C.Harrison at Intel.com>
>>>> ---
>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>> Cc: Jani Nikula <jani.nikula at intel.com>
>>>>
>>>> V4 -> V5:
>>>> - Fixed review comments.
>>>>
>>>> V3 -> V4:
>>>> - Updated the name of helper function and moved to xe_device.h file.
>>>>
>>>> V2 -> V3:
>>>> - Added inline function to avoid compilation error in the absence of
>>>> CONFIG_FUNCTION_ERROR_INJECTION.
>>>>
>>>> V1 -> V2:
>>>> - Fixed review comments.
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
>>>>    drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
>>>>    drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
>>>>    3 files changed, 38 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>> b/drivers/gpu/drm/xe/xe_device.c
>>>> index 660b0c5126dc..e1b4087a2974 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct
>>> drm_device *drm, void *arg)
>>>>    	xe_pm_runtime_put(xe);
>>>>    }
>>>>
>>>> +/**
>>>> + * xe_is_injection_active() - Helper function to test whether a fault
>>>> + * inject test is running.
>>>> + *
>>>> + * This is a helper function that the driver can use to detect whether
>>>> + * a fault injection test is running in order to suppress excessive debug
>>>> + * output. By default, the return value is fixed as zero but it can be modified
>>>> + * by the fault injection framework to return an error.
>>>> + *
>>>> + * Returns:
>>>> + * 0 if no fault is injected.
>>>> + * Injected error if fault is injected.
>>>> + */
>>>> +int xe_is_injection_active(void)
>>>> +{
>>>> +	return xe_inject_fault();
>>>> +}
>>>> +
>>> I'm confused. What is the point of this wrapper?
>> We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from
>> xe_is_injection_active() function. This is to make function names meaningful.
> But why?
>
> Why not just have a single function that is named 
> "xe_is_injection_active" and has the fault injected into it directly? It 
> has a meaningful name, it does what it needs to do, it keeps things 
> simple. Why have an extra level of function wrapper for no benefit?
>
> And in terms of meaningful names, "xe_inject_fault" sounds like a 
> function which is going to actually inject a fault somewhere. Whereas, 
> you are wanting something more like 
> "xe_stub_function_for_injecting_fault_into".  But as above, why bother? 
> The comment tells you exactly what is happening here. There is no need 
> for extra levels of wrapping just to change the name of a function.

I think all of this begs the question, does this belong in the kernel at
all? The problem is all in userspace ("CI test systems are going out of
space and crashing"). Userspace asks for error injection. Why does the
kernel need to apply the policy of not logging something specific in
this case? What else should the kernel not log, and who's going to
decide?

Personally I think logging core dumps in dmesg even in CI setting is
questionable, but let's just assume it's needed. Then there needs to be
a (debugfs?) knob for userspace to enable/disable that behaviour. The
fault injection test is very specific. Have the test disable dmesg
coredumps during fault injection tests. That's all. Don't make the
kernel more complicated and ridden with special casing for userspace
problems.

BR,
Jani.

>
> John.
>
>>>>    /**
>>>>     * xe_device_declare_wedged - Declare device wedged
>>>>     * @xe: xe device instance
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>> b/drivers/gpu/drm/xe/xe_device.h
>>>> index 0bc3bc8e6803..13a43fe4fb64 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device
>>> *xe);
>>>>    struct xe_file *xe_file_get(struct xe_file *xef);
>>>>    void xe_file_put(struct xe_file *xef);
>>>>
>>>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
>>>> +/*
>>>> + * As fault is injected using this function, need to make sure that
>>>> + * the compiler does not optimize and make it as a inline function.
>>>> + * To prevent compile optimization, "noinline" is added.
>>>> + */
>>>> +static noinline int xe_inject_fault(void) { return 0; }
>>>> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
>>>> +#else
>>>> +static inline int xe_inject_fault(void) { return 0; }
>>> Why not name this xe_is_injection_active() and just call it directly?
>>>
>>> John.
>>>
>>>> +#endif
>>>> +
>>>> +int xe_is_injection_active(void);
>>>> +
>>>>    /*
>>>>     * Occasionally it is seen that the G2H worker starts running after a delay of
>>> more than
>>>>     * a second even after being queued and activated by the Linux workqueue
>>> subsystem. This
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
>>> b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 822f4c33f730..89f992feba31 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct
>>> *ct, struct guc_ctb *ctb, u32 reaso
>>>>    	if (ctb)
>>>>    		ctb->info.broken = true;
>>>> +	/*
>>>> +	 * Huge dump is getting generated when injecting error for guc
>>> CT/MMIO
>>>> +	 * functions. So, let us suppress the dump when fault is injected.
>>>> +	 */
>>>> +	if (xe_is_injection_active())
>>>> +		return;
>>>>
>>>>    	/* Ignore further errors after the first dump until a reset */
>>>>    	if (ct->dead.reported)
>

-- 
Jani Nikula, Intel