[PATCH] drm/amdgpu: error handling for xgmi and ras
Quan, Evan
Evan.Quan at amd.com
Fri Mar 15 09:14:47 UTC 2019
Hw_init -> psp_hw_init -> psp_hw_start (failed) -> ras_pre_fini -> hw_fini -> psp_hw_fini
On psp_hw_start failure, it destroys the cmd/fence/fw buffers. But the ras/xgmi deinit was delayed to psp_hw_fini.
In ras_pre_fini, it calls the ras APIs to disable previously enabled ras features. And that needs the cmd/fence buffers
and trigger a NULL pointer dereference.
Maybe "defer the cmd/fence/fw buffers destroys to psp_hw_fini also" is a more proper way..
Regards,
Evan
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: 2019年3月15日 16:40
> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: error handling for xgmi and ras
>
> Am 15.03.19 um 09:33 schrieb Evan Quan:
> > This is a quick workaround. A more complete error handling around
> > psp_hw_start is actually needed.
> >
> > Change-Id: I398efd652584e022debf237950207199a4ea78fc
> > Signed-off-by: Evan Quan <evan.quan at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 5888e24219d9..f8d712d306f1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -941,6 +941,11 @@ static int psp_load_fw(struct amdgpu_device
> *adev)
> > amdgpu_bo_free_kernel(&psp->fw_pri_bo,
> > &psp->fw_pri_mc_addr, &psp->fw_pri_buf);
> > failed:
> > + if (psp->ras.ras_initialized)
> > + psp->ras.ras_initialized = 0;
> > + if (psp->ras.ras_initialized)
> > + psp->ras.ras_initialized = 0;
>
> No idea what that code does, but it looks really odd. Just copy&pasted typo
> maybe?
>
> A simple "psp->ras.ras_initialized = 0;" in the error path should have the
> same effect.
>
> Christian.
>
> > +
> > kfree(psp->cmd);
> > psp->cmd = NULL;
> > return ret;
> > @@ -1061,6 +1066,10 @@ static int psp_resume(void *handle)
> >
> > failed:
> > DRM_ERROR("PSP resume failed\n");
> > + if (psp->ras.ras_initialized)
> > + psp->ras.ras_initialized = 0;
> > + if (psp->ras.ras_initialized)
> > + psp->ras.ras_initialized = 0;
> > mutex_unlock(&adev->firmware.mutex);
> > return ret;
> > }
More information about the amd-gfx
mailing list