[Intel-xe] [PATCH 10/11] drm/xe: Clear SOC CORRECTABLE error registers.

Aravind Iddamsetty aravind.iddamsetty at linux.intel.com
Thu Oct 12 02:59:49 UTC 2023


On 11/10/23 12:22, Ghimiray, Himal Prasad wrote:
>
> On 11-10-2023 12:18, Aravind Iddamsetty wrote:
>> On 27/09/23 17:16, Himal Prasad Ghimiray wrote:
>>> PVC doesn't support correctable SOC errors, if we receive MSI due to
>> statement looks incomplete/inappropriate,
>>
>> better rephrase to "PVC doesn't support correctable SOC error reporting"
> ok.
>>
>> Thanks,
>> Aravind.
>>> correctable error, classify them as Undefined and clear the registers.
>>>
>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_hw_error.c | 24 +++++++++++++++++++++++-
>>>   1 file changed, 23 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>>> index dcf395bd985f..0bcb1bea7ffb 100644
>>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>> @@ -616,9 +616,30 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>         lockdep_assert_held(&tile_to_xe(tile)->irq.lock);
>>>   -    if ((tile_to_xe(tile)->info.platform != XE_PVC) && hw_err == HARDWARE_ERROR_CORRECTABLE)
>>> +    if ((tile_to_xe(tile)->info.platform != XE_PVC))
>>>           return;
>>>   +    if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
>>> +        for (i = 0; i < PVC_NUM_IEH; i++)
>>> +            xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>>> +                    ~REG_BIT(hw_err));
>>> +
>>> +        xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>>> +                REG_GENMASK(31, 0));
>>> +        xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
>>> +                REG_GENMASK(31, 0));
>>> +        xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>>> +                REG_GENMASK(31, 0));
>>> +        xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>>> +                REG_GENMASK(31, 0));
>>> +
>>> +        drm_info(&tile_to_xe(tile)->drm, HW_ERR
>>> +              "Tile%d Undefine SOC %s error.",
>>> +              tile->id, hwerr_to_str);
>> I still feel that in this scenario we should at least flag this as drm_err: even though the error
>> is correctable and corrected by HW, isn't it spurious, since we don't expect to receive it,
>> and therefore a HW misbehaviour? Thoughts?
>
> Agreed. IMO this change should be part of low-level driver error reporting. Not only SOC: we need to report other GT and tile errors

The category will be added as part of the low-level driver error work, but since you are adding the print, I suggest changing it to drm_err.
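
For illustration only, here is a minimal userspace sketch of what reporting the unexpected correctable SOC error at error severity could look like. It is not the driver code: the enum values, hwerr_to_str_sketch() and format_undefined_soc_error() are hypothetical stand-ins, and snprintf() with an "[ERROR]" prefix stands in for drm_err(). In the kernel the change under discussion would simply be replacing drm_info() with drm_err() in the hunk above.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Assumed to mirror the driver's enum hardware_error; the actual values
 * are not shown in this thread.
 */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Hypothetical stand-in for the driver's error-class-to-string helper. */
static const char *hwerr_to_str_sketch(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return "CORRECTABLE";
	case HARDWARE_ERROR_NONFATAL:
		return "NONFATAL";
	case HARDWARE_ERROR_FATAL:
		return "FATAL";
	default:
		return "UNKNOWN";
	}
}

/*
 * Formats the message at "error" severity. The "[ERROR]" prefix is a
 * userspace placeholder for what drm_err() would convey in the kernel log.
 * Returns the snprintf() result (length of the formatted message).
 */
static int format_undefined_soc_error(char *buf, size_t len,
				      unsigned int tile_id,
				      enum hardware_error hw_err)
{
	return snprintf(buf, len, "[ERROR] Tile%u undefined SOC %s error.",
			tile_id, hwerr_to_str_sketch(hw_err));
}
```

Note that the sketch passes the error class to the string helper explicitly; whether hwerr_to_str in the quoted hunk is a variable or a helper that needs an argument is not visible from this thread.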

Thanks,

Aravind.

>
> too as spurious interrupt errors when they are undefined, irrespective of error class (correctable/uncorrectable).
>
>>
>>
>> Thanks,
>> Aravind.
>>> +
>>> +        goto unmask_gsysevtctl;
>>> +    }
>>> +
>>>       if (hw_err == HARDWARE_ERROR_FATAL) {
>>>           soc_mstr_glbl_err_reg = soc_mstr_glbl_err_reg_fatal;
>>>           soc_mstr_lcl_err_reg = soc_mstr_lcl_err_reg_fatal;
>>> @@ -709,6 +730,7 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>       xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>>>               mst_glb_errstat);
>>>   +unmask_gsysevtctl:
>>>       for (i = 0; i < PVC_NUM_IEH; i++)
>>>           xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>>>                   (HARDWARE_ERROR_MAX << 1) + 1);
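
As a rough aid to reading the two SOC_GSYSEVTCTL_REG writes in the hunk above, the following userspace sketch works out the values they would carry. It assumes the enum numbering shown in the comment (not visible in this thread) and uses SKETCH_REG_BIT() as a stand-in for the kernel's REG_BIT(); the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed numbering of the driver's enum hardware_error; the thread does
 * not show the definition.
 */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Userspace stand-in for the kernel's REG_BIT() on a 32-bit register. */
#define SKETCH_REG_BIT(n) ((uint32_t)1 << (n))

/*
 * Value written in the early path: every bit set except the one that
 * corresponds to the error class being handled.
 */
static uint32_t gsysevtctl_mask_for(enum hardware_error hw_err)
{
	return ~SKETCH_REG_BIT(hw_err);
}

/*
 * Value written at the unmask_gsysevtctl label. With HARDWARE_ERROR_MAX
 * equal to 3 this is 0x7, i.e. the low three bits set.
 */
static uint32_t gsysevtctl_unmask_all(void)
{
	return (HARDWARE_ERROR_MAX << 1) + 1;
}
```

Under these assumptions, gsysevtctl_mask_for(HARDWARE_ERROR_CORRECTABLE) is 0xFFFFFFFE and gsysevtctl_unmask_all() is 0x7, one bit per error class.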

