[PATCH] drm/amdgpu: error handling for xgmi and ras

Quan, Evan Evan.Quan at amd.com
Mon Mar 18 02:44:02 UTC 2019


Drop this patch as a new refined one was just sent out.

Regards,
Evan
> -----Original Message-----
> From: Quan, Evan
> Sent: 2019年3月15日 17:15
> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
> gfx at lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: error handling for xgmi and ras
> 
> Hw_init -> psp_hw_init -> psp_hw_start (failed) -> ras_pre_fini -> hw_fini ->
> psp_hw_fini
> 
> On psp_hw_start failure, it destroys the cmd/fence/fw buffers. But the
> ras/xgmi deinit was delayed to psp_hw_fini.
> 
> In ras_pre_fini, it calls the ras APIs to disable previously enabled ras features.
> And that needs the cmd/fence buffers and trigger a NULL pointer
> dereference.
> 
> Maybe "defer the cmd/fence/fw buffers destroys to psp_hw_fini also" is a
> more proper way..
> 
> Regards,
> Evan
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken at gmail.com>
> > Sent: 2019年3月15日 16:40
> > To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: error handling for xgmi and ras
> >
> > Am 15.03.19 um 09:33 schrieb Evan Quan:
> > > This is a quick workaround. A more complete error handling around
> > > psp_hw_start is actually needed.
> > >
> > > Change-Id: I398efd652584e022debf237950207199a4ea78fc
> > > Signed-off-by: Evan Quan <evan.quan at amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 +++++++++
> > >   1 file changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > > index 5888e24219d9..f8d712d306f1 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > > @@ -941,6 +941,11 @@ static int psp_load_fw(struct amdgpu_device
> > *adev)
> > >   	amdgpu_bo_free_kernel(&psp->fw_pri_bo,
> > >   			      &psp->fw_pri_mc_addr, &psp->fw_pri_buf);
> > >   failed:
> > > +	if (psp->ras.ras_initialized)
> > > +		psp->ras.ras_initialized = 0;
> > > +	if (psp->ras.ras_initialized)
> > > +		psp->ras.ras_initialized = 0;
> >
> > No idea what that code does, but it looks really odd. Just copy&pasted
> > typo maybe?
> >
> > A simple "psp->ras.ras_initialized = 0;" in the error path should have
> > the same effect.
> >
> > Christian.
> >
> > > +
> > >   	kfree(psp->cmd);
> > >   	psp->cmd = NULL;
> > >   	return ret;
> > > @@ -1061,6 +1066,10 @@ static int psp_resume(void *handle)
> > >
> > >   failed:
> > >   	DRM_ERROR("PSP resume failed\n");
> > > +	if (psp->ras.ras_initialized)
> > > +		psp->ras.ras_initialized = 0;
> > > +	if (psp->ras.ras_initialized)
> > > +		psp->ras.ras_initialized = 0;
> > >   	mutex_unlock(&adev->firmware.mutex);
> > >   	return ret;
> > >   }



More information about the amd-gfx mailing list