[PATCH v6] drm/xe: Add helper function to inject fault into ct_dead_capture()

John Harrison john.c.harrison at intel.com
Sat Jun 7 00:06:57 UTC 2025


On 6/6/2025 6:19 AM, Jani Nikula wrote:
> On Thu, 05 Jun 2025, John Harrison <john.c.harrison at intel.com> wrote:
>> On 5/24/2025 7:46 AM, Satyanarayana K V P wrote:
>>> When injecting faults into the xe_guc_ct_send_recv() &
>>> xe_guc_mmio_send_recv() functions, the CI test systems run out of
>>> disk space and crash. To avoid this issue, a new helper function,
>>> xe_inject_fault(), is created; when a fault is injected into this
>>> helper, CT dead capture is skipped, which suppresses the CT dumps
>>> in the log.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Suggested-by: John Harrison <John.C.Harrison at Intel.com>
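For anyone reading this from the archive without the patch to hand: the
mechanism described above boils down to a single dedicated injection
point. The sketch below is illustrative only - xe_inject_fault() is the
name from the patch, but the exact definition and the
ALLOW_ERROR_INJECTION wiring shown here are an assumption about how
such a helper is typically built, not the literal diff.

    #include <linux/error-injection.h>

    /*
     * Dedicated injection point: a fault-injection run injects an
     * error here, alongside the real targets such as
     * xe_guc_ct_send_recv(), so the capture path can detect that the
     * failure was self-inflicted.
     */
    noinline int xe_inject_fault(void)
    {
            return 0;  /* error injection overrides this return value */
    }
    ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);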
>> Reviewed-by: John Harrison <John.C.Harrison at Intel.com>
>>
>> This seems like the simplest and cleanest solution to me (for both the
>> KMD and the IGT sides). I don't know if Jani or Michal still have
>> objections to it.
> Simple it may be, but I still think it conflates two orthogonal things
> that should both be decided by userspace.
It is being decided by userspace. And the two things are not 
orthogonal but completely dependent.

> If userspace wants error injection, why should the kernel decide it
> means no error capture in dmesg? Especially when that decision is to
> tackle an arbitrary self-inflicted *userspace* issue i.e. disk space
> limitation during testing.
Ignoring the CI limitation, it is also unhelpful to have huge logs 
describing hardware state for a bug that is purely software-injected.

The dump to dmesg occurs only when something has gone wrong that is not 
supposed to be possible. There should certainly be no way for a 
userland test to cause such an event. It is there to debug the rare 
and hard-to-repro cases that cannot be debugged any other way.

This specific test, by contrast, is about making sure the KMD does not 
die horribly when an unexpected fault occurs. It is not an attempt to 
debug such occurrences, merely a check that the universe survives them. 
That means any debug output related to the source of the error is 
fundamentally useless - the test itself is the source. So output aimed 
at debugging the source is simply noise, and not relevant to debugging 
why the driver died *after* such an error occurred.

As fault injection is, by definition, the only way to inject such 
errors, it is sensible that the fault injection test is the only thing 
that needs to suppress such output. Concretely, the suppression is just 
an early-out in the capture path, sketched below.
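
A rough illustration (the xe_inject_fault() check is the part the
patch adds; the rest of ct_dead_capture()'s body and its signature are
paraphrased here, not quoted from the driver):

    static void ct_dead_capture(struct xe_guc_ct *ct)
    {
            /*
             * If the error was injected by the test, the hardware
             * state is irrelevant - don't flood dmesg with it.
             */
            if (xe_inject_fault())
                    return;

            /* ... normal path: capture CT state and dump it ... */
    }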

In other words, the test and the output suppression are inherently 
connected and not orthogonal.

So adding extra machinery to make the suppression mechanism generic 
and unrelated to the injection test is wasted effort and unnecessary 
complication.

If you want a more general-purpose way to control debug output for 
specific local testing, we already have that via config options and 
kernel/drm debug levels. This change is specifically and solely about 
improving the usability of the error injection test. Config options 
are not viable for that because they are fixed across CI runs; debug 
levels are not useful either, because we do need to see the normal 
driver debug output to debug any test failures.

John.

>
> Mechanism, not policy.
>
> There, I've said it, but this is not a hill I'm going to die on.
>
>
> BR,
> Jani.
>
>


