[PATCH v6] drm/xe: Add helper function to inject fault into ct_dead_capture()
John Harrison
john.c.harrison at intel.com
Sat Jun 7 00:06:57 UTC 2025
On 6/6/2025 6:19 AM, Jani Nikula wrote:
> On Thu, 05 Jun 2025, John Harrison <john.c.harrison at intel.com> wrote:
>> On 5/24/2025 7:46 AM, Satyanarayana K V P wrote:
>>> When injecting faults into the xe_guc_ct_send_recv() and
>>> xe_guc_mmio_send_recv() functions, the CI test systems run out of
>>> disk space and crash. To avoid this issue, add a new helper
>>> function; when a fault is injected into this xe_inject_fault()
>>> helper, CT dead capture is skipped, which suppresses the CT dumps
>>> in the log.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Suggested-by: John Harrison <John.C.Harrison at Intel.com>
>> Reviewed-by: John Harrison <John.C.Harrison at Intel.com>
>>
>> This seems like the simplest and cleanest solution to me (for both the
>> KMD and the IGT sides). I don't know if Jani or Michal still have
>> objections to it.
> Simple it may be, but I still think it conflates two orthogonal things
> that should both be decided by userspace.
It is being decided by userspace. And they are not orthogonal but
completely dependent.
> If userspace wants error injection, why should the kernel decide it
> means no error capture in dmesg? Especially when that decision is to
> tackle an arbitrary self-inflicted *userspace* issue i.e. disk space
> limitation during testing.
Ignoring the CI limitation, it is also unhelpful to have huge logs
describing hardware state for a bug that is purely software injected.
The dump to dmesg only occurs when something has gone wrong that is not
supposed to be possible. There should certainly not be any way that a
userland test can cause such an event. It is there to debug the rare
and hard-to-repro cases that cannot be debugged any other way.
This specific test, by contrast, is about making sure the KMD does not
die horribly when an unexpected fault occurs. It is not an attempt to
debug such occurrences, merely a check that the universe survives them.
That means any debug output related to the source of the error is
fundamentally useless: the test itself is the source. Such output is
simply noise, not relevant to debugging why the driver died *after* the
error occurred.
As the fault injection test is the only way to inject such errors, by
definition, it is sensible that the fault injection test is the only
thing that needs to suppress such output.
In other words, the test and the output suppression are inherently
connected and not orthogonal.
So adding extra machinery to make the suppression mechanism generic and
independent of the injection test is wasted effort and unnecessary
complication.
If you want a more general purpose way to control debug output for
specific local testing, we already have that via config options and
kernel/drm debug levels. This is specifically and solely about improving
the usability of the error injection test. Config options are not viable
for that because they are fixed across CI runs; likewise, debug levels
do not help because we still need the normal driver debug output to
debug any test failures.
John.
>
> Mechanism, not policy.
>
> There, I've said it, but this is not a hill I'm going to die on.
>
>
> BR,
> Jani.
>
>