[PATCH 1/3] drm/amdgpu: implement RAS interrupt handler for poison creation

Zhang, Hawking Hawking.Zhang at amd.com
Wed Apr 20 06:18:08 UTC 2022


Please also keep the naming style consistent by using "amdgpu_ras" prefix

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1 at amd.com> 
Sent: Wednesday, April 20, 2022 14:17
To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Ziya, Mohammad zafar <Mohammadzafar.Ziya at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
Subject: RE: [PATCH 1/3] drm/amdgpu: implement RAS interrupt handler for poison creation

[AMD Official Use Only]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Wednesday, April 20, 2022 12:10 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; 
> Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley 
> <Stanley.Yang at amd.com>; Ziya, Mohammad zafar 
> <Mohammadzafar.Ziya at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
> Subject: Re: [PATCH 1/3] drm/amdgpu: implement RAS interrupt handler 
> for poison creation
> 
> 
> 
> On 4/20/2022 9:23 AM, Tao Zhou wrote:
> > Prepare for the implementation of poison consumption handler.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 62 ++++++++++++++-----------
> >   1 file changed, 35 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 4bbed76b79c8..2d066cff70ea 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -1515,12 +1515,44 @@ static int amdgpu_ras_fs_fini(struct
> amdgpu_device *adev)
> >   /* ras fs end */
> >
> >   /* ih begin */
> > +static void ras_interrupt_poison_creation_handler(struct ras_manager *obj,
> > +				struct amdgpu_iv_entry *entry)
> > +{
> > +	struct ras_ih_data *data = &obj->ih_data;
> > +	struct ras_err_data err_data = {0, 0, 0, NULL};
> > +	int ret;
> > +
> > +	if (!data->cb)
> > +		return;
> > +
> > +	if (amdgpu_ras_is_poison_mode_supported(obj->adev) &&
> > +	    obj->head.block == AMDGPU_RAS_BLOCK__UMC)
> > +		dev_info(obj->adev->dev,
> > +			"Poison is created, no user action is needed.\n");
> > +	else {
> 
> It is confusing to handle RAS interrupts under poison_creation_handler 
> in non- poison mode.

[Tao] will split it into ras_interrupt_poison_creation_handler and ras_interrupt_umc_handler.

> 
> Thanks,
> Lijo
> 
> > +		/* Let IP handle its data, maybe we need get the output
> > +		 * from the callback to update the error type/count, etc
> > +		 */
> > +		ret = data->cb(obj->adev, &err_data, entry);
> > +		/* ue will trigger an interrupt, and in that case
> > +		 * we need do a reset to recovery the whole system.
> > +		 * But leave IP do that recovery, here we just dispatch
> > +		 * the error.
> > +		 */
> > +		if (ret == AMDGPU_RAS_SUCCESS) {
> > +			/* these counts could be left as 0 if
> > +			 * some blocks do not count error number
> > +			 */
> > +			obj->err_data.ue_count += err_data.ue_count;
> > +			obj->err_data.ce_count += err_data.ce_count;
> > +		}
> > +	}
> > +}
> > +
> >   static void amdgpu_ras_interrupt_handler(struct ras_manager *obj)
> >   {
> >   	struct ras_ih_data *data = &obj->ih_data;
> >   	struct amdgpu_iv_entry entry;
> > -	int ret;
> > -	struct ras_err_data err_data = {0, 0, 0, NULL};
> >
> >   	while (data->rptr != data->wptr) {
> >   		rmb();
> > @@ -1531,31 +1563,7 @@ static void 
> > amdgpu_ras_interrupt_handler(struct
> ras_manager *obj)
> >   		data->rptr = (data->aligned_element_size +
> >   				data->rptr) % data->ring_size;
> >
> > -		if (data->cb) {
> > -			if (amdgpu_ras_is_poison_mode_supported(obj->adev)
> &&
> > -			    obj->head.block == AMDGPU_RAS_BLOCK__UMC)
> > -				dev_info(obj->adev->dev,
> > -						"Poison is created, no user
> action is needed.\n");
> > -			else {
> > -				/* Let IP handle its data, maybe we need get
> the output
> > -				 * from the callback to udpate the error
> type/count, etc
> > -				 */
> > -				memset(&err_data, 0, sizeof(err_data));
> > -				ret = data->cb(obj->adev, &err_data, &entry);
> > -				/* ue will trigger an interrupt, and in that case
> > -				 * we need do a reset to recovery the whole
> system.
> > -				 * But leave IP do that recovery, here we just
> dispatch
> > -				 * the error.
> > -				 */
> > -				if (ret == AMDGPU_RAS_SUCCESS) {
> > -					/* these counts could be left as 0 if
> > -					 * some blocks do not count error
> number
> > -					 */
> > -					obj->err_data.ue_count +=
> err_data.ue_count;
> > -					obj->err_data.ce_count +=
> err_data.ce_count;
> > -				}
> > -			}
> > -		}
> > +		ras_interrupt_poison_creation_handler(obj, &entry);
> >   	}
> >   }
> >
> >


More information about the amd-gfx mailing list