[PATCH] drm/amdgpu: Fix mutex lock from atomic context.

Wed Sep 11 14:41:12 UTC 2019

On second thought, this will break reserving bad pages when the GPU is 
reset for a non-RAS reason, such as manual reset, S3 or ring timeout 
(amdgpu_ras_resume->amdgpu_ras_reset_gpu), so I will keep the code as 
is.
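
For illustration only, here is a minimal userspace sketch of the behaviour 
the patch keeps: reservation runs from amdgpu_ras_reset_gpu when called 
from task context (manual reset, S3, ring timeout) and is skipped when 
called from the ISR. All sim_* names are hypothetical stand-ins, the 
in_task() result is set by hand, and the in_recovery check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's in_task(); in the driver the
 * kernel answers this, here the caller sets it by hand. */
static bool sim_in_task;

static int sim_reserve_calls;  /* how often the reservation ran */
static int sim_resets_queued;  /* how often recovery work was queued */

/* Models amdgpu_ras_reserve_bad_pages(): it takes a mutex in the driver,
 * so it may only run where sleeping is allowed. */
static void sim_reserve_bad_pages(void)
{
    sim_reserve_calls++;
}

/* Models the patched amdgpu_ras_reset_gpu(): reserve only from task
 * context, queue the recovery work in either case. The in_recovery
 * cmpxchg of the real function is omitted for brevity. */
static int sim_ras_reset_gpu(void)
{
    if (sim_in_task)
        sim_reserve_bad_pages();

    sim_resets_queued++;
    return 0;
}
```

With sim_in_task set to false (the ERREVENT_ATHUB_INTERRUPT path) only the 
recovery work is queued; with it set to true the reservation runs first.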

Another possible issue in the existing code: it looks like no reservation 
will take place in those cases even now, since in 
amdgpu_ras_reserve_bad_pages data->last_reserved will be equal to 
data->count, no? It looks like for this case you need to add a flag to 
FORCE reservation for all pages from 0 to data->count.
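
Roughly what I mean, as a standalone sketch and not driver code: the 
struct below models only the two fields mentioned above, and the function 
name and the force parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the bad-page bookkeeping; only count and
 * last_reserved from the discussion above are modelled. */
struct sim_bad_page_data {
    int count;          /* number of bad pages recorded so far */
    int last_reserved;  /* pages in [0, last_reserved) already reserved */
};

/* Hypothetical helper, returns how many pages were (re)reserved.
 * force == false: only pages recorded since the last call are handled,
 *                 so nothing happens when last_reserved == count.
 * force == true:  the whole range [0, count) is reserved again, which is
 *                 what a reset for a non-RAS reason would need. */
static int sim_reserve_bad_pages_range(struct sim_bad_page_data *data,
                                       bool force)
{
    int start = force ? 0 : data->last_reserved;
    int reserved = 0;

    for (int i = start; i < data->count; i++) {
        /* the real driver would reserve the memory backing page i here */
        reserved++;
    }

    data->last_reserved = data->count;
    return reserved;
}
```

Without the flag, a second call with an unchanged data->count reserves 
nothing; with the flag set it walks all pages from 0 again.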

Andrey

On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
> I like this much more, I will relocate it to 
> amdgpu_umc_process_ras_data_cb and push.
>
> Andrey
>
> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another 
>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>> Anyway, either way is OK and the patch is:
>>
>> Reviewed-by: Tao Zhou <tao.zhou1 at amd.com>
>>
>>> -----Original Message-----
>>> From: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> Sent: September 11, 2019 3:41
>>> To: amd-gfx at lists.freedesktop.org
>>> Cc: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao
>>> <Tao.Zhou1 at amd.com>; Deucher, Alexander
>>> <Alexander.Deucher at amd.com>; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com>
>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>
>>> Problem:
>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>> because writing to EEPROM during ASIC reset was unstable.
>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>> directly from ISR context and so locking is not allowed. Also it's
>>> irrelevant for this particular interrupt, as this is a generic RAS
>>> interrupt and not specific to memory errors.
>>>
>>> Fix:
>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 012034d..dd5da3c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>>       /* save bad page to eeprom before gpu reset,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>> -    amdgpu_ras_reserve_bad_pages(adev);
>>> +    if (in_task())
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>> +
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>>       return 0;
>>> -- 
>>> 2.7.4