[PATCH] drm/amdgpu: error handling for xgmi and ras

Quan, Evan Evan.Quan at amd.com
Fri Mar 15 09:14:47 UTC 2019


Hw_init -> psp_hw_init -> psp_hw_start (failed) -> ras_pre_fini -> hw_fini -> psp_hw_fini

On psp_hw_start failure, it destroys the cmd/fence/fw buffers. But the ras/xgmi deinit was delayed to psp_hw_fini.

In ras_pre_fini, it calls the ras APIs to disable previously enabled ras features. And that needs the cmd/fence buffers
and trigger a NULL pointer dereference.

Maybe "defer the cmd/fence/fw buffers destroys to psp_hw_fini also" is a  more proper way..

Regards,
Evan
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: 2019年3月15日 16:40
> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: error handling for xgmi and ras
> 
> Am 15.03.19 um 09:33 schrieb Evan Quan:
> > This is a quick workaround. A more complete error handling around
> > psp_hw_start is actually needed.
> >
> > Change-Id: I398efd652584e022debf237950207199a4ea78fc
> > Signed-off-by: Evan Quan <evan.quan at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 5888e24219d9..f8d712d306f1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -941,6 +941,11 @@ static int psp_load_fw(struct amdgpu_device
> *adev)
> >   	amdgpu_bo_free_kernel(&psp->fw_pri_bo,
> >   			      &psp->fw_pri_mc_addr, &psp->fw_pri_buf);
> >   failed:
> > +	if (psp->ras.ras_initialized)
> > +		psp->ras.ras_initialized = 0;
> > +	if (psp->ras.ras_initialized)
> > +		psp->ras.ras_initialized = 0;
> 
> No idea what that code does, but it looks really odd. Just copy&pasted typo
> maybe?
> 
> A simple "psp->ras.ras_initialized = 0;" in the error path should have the
> same effect.
> 
> Christian.
> 
> > +
> >   	kfree(psp->cmd);
> >   	psp->cmd = NULL;
> >   	return ret;
> > @@ -1061,6 +1066,10 @@ static int psp_resume(void *handle)
> >
> >   failed:
> >   	DRM_ERROR("PSP resume failed\n");
> > +	if (psp->ras.ras_initialized)
> > +		psp->ras.ras_initialized = 0;
> > +	if (psp->ras.ras_initialized)
> > +		psp->ras.ras_initialized = 0;
> >   	mutex_unlock(&adev->firmware.mutex);
> >   	return ret;
> >   }



More information about the amd-gfx mailing list