[PATCH] drm/amdgpu: Fix mutex lock from atomic context.

Thu Sep 12 11:35:09 UTC 2019

Hi Andrey:

Are you sure of the VRAM content loss after gpu reset? I'm not very familiar with the detail of gpu reset and I'll do experiment to confirm the case you mentioned.

Regards,
Tao

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: 2019年9月12日 10:32
> To: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> That not what I meant. Let's say you handled one bad page interrupt and as
> a result have one bad page reserved. Now unrelated gfx ring timeout
> happens which triggers GPU reset and VRAM loss. When you come back from
> reset amdgpu_ras_reserve_bad_pages will be called but since last_reserved
> == data_count the bad page will not be reserved again, maybe we should just
> set data->last_reserved to 0 again if VRAM was lost during ASIC reset...
> 
> Andrey
> 
> ________________________________________
> From: Chen, Guchun <Guchun.Chen at amd.com>
> Sent: 11 September 2019 21:53:03
> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander
> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> Comment inline.
> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: Wednesday, September 11, 2019 10:41 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Chen, Guchun <Guchun.Chen at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> 
> On second though this will break  what about reserving bad pages when
> resetting GPU for non RAS error reason such as manual reset ,S3 or ring
> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
> code as is.
> 
> Another possible issue in existing code - looks like no reservation will take
> place in those case even now as amdgpu_ras_reserve_bad_pages
> data->last_reserved will be equal to data->count , no ? Looks like for
> this case you need to add flag to FORCE reservation for all pages from
> 0 to data->counnt.
> [Guchun]Yes, last_reserved is not updated any more, unless we unload our
> driver. So it maybe always equal to data->count, then no new bad page will
> be reserved.
> I see we have one eeprom reset by user, can we put this last_reserved clean
> operation to user in the same stack as well?
> 
> Andrey
> 
> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> > I like this much more, I will relocate to
> > amdgpu_umc_process_ras_data_cb an push.
> >
> > Andrey
> >
> > On 9/10/19 11:08 PM, Zhou1, Tao wrote:
> >> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
> >> approach is to move it into amdgpu_umc_process_ras_data_cb.
> >> Anyway, either way is OK and the patch is:
> >>
> >> Reviewed-by: Tao Zhou <tao.zhou1 at amd.com>
> >>
> >>> -----Original Message-----
> >>> From: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> >>> Sent: 2019年9月11日 3:41
> >>> To: amd-gfx at lists.freedesktop.org
> >>> Cc: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao
> >>> <Tao.Zhou1 at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>;
> >>> Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> >>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
> >>>
> >>> Problem:
> >>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
> >>> because writing to EEPROM during ASIC reset was unstable.
> >>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
> >>> directly from ISR context and so locking is not allowed. Also it's
> >>> irrelevant for this partilcular interrupt as this is generic RAS
> >>> interrupt and not memory errors specific.
> >>>
> >>> Fix:
> >>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
> >>>
> >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
> >>>   1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> index 012034d..dd5da3c 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
> >>> amdgpu_device *adev,
> >>>       /* save bad page to eeprom before gpu reset,
> >>>        * i2c may be unstable in gpu reset
> >>>        */
> >>> -    amdgpu_ras_reserve_bad_pages(adev);
> >>> +    if (in_task())
> >>> +        amdgpu_ras_reserve_bad_pages(adev);
> >>> +
> >>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >>>           schedule_work(&ras->recovery_work);
> >>>       return 0;
> >>> --
> >>> 2.7.4