[Intel-xe] [PATCH 10/11] drm/xe: Clear SOC CORRECTABLE error registers.
Ghimiray, Himal Prasad
himal.prasad.ghimiray at intel.com
Wed Oct 11 06:52:01 UTC 2023
On 11-10-2023 12:18, Aravind Iddamsetty wrote:
> On 27/09/23 17:16, Himal Prasad Ghimiray wrote:
>> PVC doesn't support correctable SOC errors, if we receive MSI due to
> statement looks incomplete/inappropriate,
>
> better rephrase to "PVC doesn't support correctable SOC error reporting"
ok.
>
> Thanks,
> Aravind.
>> correctable error, classify them as Undefined and clear the registers.
>>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_hw_error.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index dcf395bd985f..0bcb1bea7ffb 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -616,9 +616,30 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>
>> lockdep_assert_held(&tile_to_xe(tile)->irq.lock);
>>
>> - if ((tile_to_xe(tile)->info.platform != XE_PVC) && hw_err == HARDWARE_ERROR_CORRECTABLE)
>> + if ((tile_to_xe(tile)->info.platform != XE_PVC))
>> return;
>>
>> + if (hw_err == HARDWARE_ERROR_CORRECTABLE) {
>> + for (i = 0; i < PVC_NUM_IEH; i++)
>> + xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>> + ~REG_BIT(hw_err));
>> +
>> + xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>> + REG_GENMASK(31, 0));
>> + xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_MASTER_REG(base, hw_err),
>> + REG_GENMASK(31, 0));
>> + xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>> + REG_GENMASK(31, 0));
>> + xe_mmio_write32(gt, SOC_LOCAL_ERR_STAT_SLAVE_REG(slave_base, hw_err),
>> + REG_GENMASK(31, 0));
>> +
>> + drm_info(&tile_to_xe(tile)->drm, HW_ERR
>> + "Tile%d Undefined SOC %s error.",
>> + tile->id, hwerr_to_str);
> I still feel that in this scenario we should at least flag this with drm_err. Even though
> the error is correctable and was corrected by HW, isn't it spurious, since we don't expect
> to receive it, and therefore a sign of HW misbehaviour? Thoughts?
Agreed. IMO this change should be part of low driver error reporting.
Beyond SOC, other GT and tile errors should also be reported as spurious
interrupt errors when they are undefined, irrespective of error class
(correctable/uncorrectable).
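
As an illustration only (not code from this series), the policy proposed above could be sketched as a small helper that picks a log level. Every name here is hypothetical, the enum values are assumed, and the level chosen for defined errors is an assumption as well; the thread only agrees that undefined errors deserve error-level reporting regardless of class.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed enum layout; the patch shows the names but not the values. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Hypothetical log levels standing in for drm_info/drm_warn/drm_err. */
enum log_level { LOG_LEVEL_INFO, LOG_LEVEL_WARN, LOG_LEVEL_ERR };

/*
 * Pick a log level for a reported hardware error.
 * defined_for_platform is false when the platform is not expected to
 * raise this error at all, i.e. the interrupt is spurious.
 */
static enum log_level hw_error_log_level(enum hardware_error hw_err,
					 bool defined_for_platform)
{
	assert(hw_err < HARDWARE_ERROR_MAX);

	/* Spurious errors are flagged loudly, whatever their class. */
	if (!defined_for_platform)
		return LOG_LEVEL_ERR;

	/* Assumption: defined correctable errors only warrant a warning. */
	return hw_err == HARDWARE_ERROR_CORRECTABLE ? LOG_LEVEL_WARN
						    : LOG_LEVEL_ERR;
}
```

With such a helper, a correctable SOC error on a platform that does not define it would be reported at error level instead of info level.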
>
>
> Thanks,
> Aravind.
>> +
>> + goto unmask_gsysevtctl;
>> + }
>> +
>> if (hw_err == HARDWARE_ERROR_FATAL) {
>> soc_mstr_glbl_err_reg = soc_mstr_glbl_err_reg_fatal;
>> soc_mstr_lcl_err_reg = soc_mstr_lcl_err_reg_fatal;
>> @@ -709,6 +730,7 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> xe_mmio_write32(gt, SOC_GLOBAL_ERR_STAT_MASTER_REG(base, hw_err),
>> mst_glb_errstat);
>>
>> +unmask_gsysevtctl:
>> for (i = 0; i < PVC_NUM_IEH; i++)
>> xe_mmio_write32(gt, SOC_GSYSEVTCTL_REG(base, slave_base, i),
>> (HARDWARE_ERROR_MAX << 1) + 1);
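
The correctable path added by the patch above follows a mask, clear, unmask sequence. Below is a minimal userspace sketch of that sequence against a fake register file. It assumes the status registers are write-1-to-clear, that a set bit in GSYSEVTCTL enables reporting of the matching error class, and that the enum has the values shown; none of this is confirmed by the patch itself, and the fake_* names and sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed enum layout; the patch shows the names but not the values. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

#define FAKE_NUM_IEH  2	/* hypothetical stand-in for PVC_NUM_IEH */
#define FAKE_NUM_STAT 4	/* global + local status, master + slave */
#define BIT32(n) ((uint32_t)1 << (n))

/* Fake register file replacing real MMIO. */
struct fake_soc {
	uint32_t gsysevtctl[FAKE_NUM_IEH];
	uint32_t err_stat[FAKE_NUM_STAT];
};

/* Assumed write-1-to-clear semantics for the status registers. */
static void fake_stat_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

static void fake_clear_correctable(struct fake_soc *soc)
{
	const enum hardware_error hw_err = HARDWARE_ERROR_CORRECTABLE;
	size_t i;

	assert(soc != NULL);

	/* 1. Mask the error class so no new events arrive while clearing. */
	for (i = 0; i < FAKE_NUM_IEH; i++)
		soc->gsysevtctl[i] = ~BIT32(hw_err);

	/* 2. Clear every status bit by writing all ones. */
	for (i = 0; i < FAKE_NUM_STAT; i++)
		fake_stat_write(&soc->err_stat[i], UINT32_MAX);

	/* 3. Unmask all error classes again (the unmask_gsysevtctl label). */
	for (i = 0; i < FAKE_NUM_IEH; i++)
		soc->gsysevtctl[i] = (HARDWARE_ERROR_MAX << 1) + 1;
}
```

With the assumed enum values, (HARDWARE_ERROR_MAX << 1) + 1 evaluates to 7, so the final write sets the three low bits, one per error class.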