[PATCH v5 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

Wed Jul 16 04:21:07 UTC 2025

On 7/15/2025 10:28 PM, Summers, Stuart wrote:
> On Tue, 2025-07-15 at 22:09 +0530, Riana Tauro wrote:
>> Hi Stuart
>>
>> On 7/15/2025 7:40 PM, Summers, Stuart wrote:
>>> On Tue, 2025-07-15 at 16:17 +0530, Riana Tauro wrote:
>>>> Add a debugfs fault handler to trigger csc error handler that
>>>> wedges the device and sends drm uevent
>>>>
>>>> v2: add debugfs only for bmg (Umesh)
>>>>
>>>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_debugfs.c  |  3 +++
>>>>    drivers/gpu/drm/xe/xe_hw_error.c | 11 +++++++++++
>>>>    2 files changed, 14 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
>>>> index 26e9d146ccbf..95057c04a022 100644
>>>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>>>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>>>> @@ -31,6 +31,7 @@
>>>>    #endif
>>>>    
>>>>    DECLARE_FAULT_ATTR(gt_reset_failure);
>>>> +DECLARE_FAULT_ATTR(inject_csc_hw_error);
>>>>    
>>>>    static struct xe_device *node_to_xe(struct drm_info_node *node)
>>>>    {
>>>> @@ -294,6 +295,8 @@ void xe_debugfs_register(struct xe_device *xe)
>>>>           xe_pxp_debugfs_register(xe->pxp);
>>>>    
>>>>           fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
>>>> +       if (xe->info.platform == XE_BATTLEMAGE)
>>>
>>> I'm still not clear why we need to limit this to BMG.
>>
>> CSC errors are only applicable for BMG. This bit is not in prior
>> products. Adding a fault injection and testing flow will make sense
>> only for supported products
> 
> I'm looking in bspec, for instance at the non-fatal errors (53075), and
> I see different platforms implementing bit 17 for CSC non-fatal,
> including, as I had mentioned in the first review, for LNL. So this
> isn't only applicable to BMG. 

The recovery mechanism in fwupd is supported only for BMG. For LNL,
logging should be sufficient, and that is already done in the irq patch.
Testing the wedging and runtime survivability flow is not applicable
where recovery is not supported.

Thanks
Riana

> We might be taking a call that we only
> need this on BMG, but IMO we should be clear about that since it isn't
> a case of not being supported by the hardware.
> 
> Thanks,
> Stuart
> 
>>
>> Thanks
>> Riana
>>>
>>> -Stuart
>>>
>>>> +               fault_create_debugfs_attr("inject_csc_hw_error", root, &inject_csc_hw_error);
>>>>    
>>>>           if (IS_SRIOV_PF(xe))
>>>>                   xe_sriov_pf_debugfs_register(xe, root);
>>>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>>>> index c5b3e8c207fa..db7417c390ff 100644
>>>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>>>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>>> @@ -3,6 +3,8 @@
>>>>     * Copyright © 2025 Intel Corporation
>>>>     */
>>>>    
>>>> +#include <linux/fault-inject.h>
>>>> +
>>>>    #include "regs/xe_gsc_regs.h"
>>>>    #include "regs/xe_hw_error_regs.h"
>>>>    #include "regs/xe_irq_regs.h"
>>>> @@ -13,6 +15,7 @@
>>>>    #include "xe_survivability_mode.h"
>>>>    
>>>>    #define  HEC_UNCORR_FW_ERR_BITS 4
>>>> +extern struct fault_attr inject_csc_hw_error;
>>>>    
>>>>    /* Error categories reported by hardware */
>>>>    enum hardware_error {
>>>> @@ -43,6 +46,11 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>>>>           }
>>>>    }
>>>>    
>>>> +static bool fault_inject_csc_hw_error(void)
>>>> +{
>>>> +       return should_fail(&inject_csc_hw_error, 1);
>>>> +}
>>>> +
>>>>    static void csc_hw_error_work(struct work_struct *work)
>>>>    {
>>>>           struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>>>> @@ -130,6 +138,9 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
>>>>    {
>>>>           enum hardware_error hw_err;
>>>>    
>>>> +       if (fault_inject_csc_hw_error())
>>>> +               schedule_work(&tile->csc_hw_error_work);
>>>> +
>>>>           for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>>>>                   if (master_ctl & ERROR_IRQ(hw_err))
>>>>                           hw_error_source_handler(tile, hw_err);
>>>
>>
>>
>
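
One way to make the distinction discussed above explicit (the hardware
reporting CSC errors versus the recovery flow being supported) is to
gate the debugfs attribute on a named capability rather than on a bare
platform check. The following is only a userspace sketch of that idea,
not xe driver code. The enum, the struct, and both helper functions are
hypothetical names invented for illustration; only the BMG and LNL
behaviour is taken from the thread, and the "other" entry is a
placeholder assumption.

```c
#include <stdbool.h>

/* Hypothetical platform list; only BMG and LNL are named in the thread. */
enum platform {
	PLATFORM_LNL,
	PLATFORM_BMG,
	PLATFORM_OTHER,	/* placeholder for platforms not discussed */
};

/* Hypothetical capability pair that keeps the two questions separate. */
struct csc_caps {
	bool hw_reports_csc_error;	/* hardware has a CSC error bit */
	bool recovery_supported;	/* a recovery flow (fwupd) exists */
};

struct csc_caps csc_caps_for(enum platform p)
{
	struct csc_caps caps = { false, false };

	switch (p) {
	case PLATFORM_BMG:
		/* Per the thread: error is reported and recovery exists. */
		caps.hw_reports_csc_error = true;
		caps.recovery_supported = true;
		break;
	case PLATFORM_LNL:
		/* Per the thread: error is reported, logging only. */
		caps.hw_reports_csc_error = true;
		break;
	default:
		/* Assumption for the sketch: nothing is known, so nothing is set. */
		break;
	}
	return caps;
}

/*
 * Register the injection knob only where wedging and recovery can
 * actually be exercised, independent of whether the bit exists.
 */
bool should_register_csc_fault_attr(enum platform p)
{
	return csc_caps_for(p).recovery_supported;
}
```

With this shape, a caller would check should_register_csc_fault_attr()
before creating the attribute, and the reason LNL is skipped (no
recovery flow, even though the hardware reports the error) is visible
in the table rather than implied by the platform comparison.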