[PATCH] drm/amdgpu: Fix mutex lock from atomic context.
Grodzovsky, Andrey
Andrey.Grodzovsky at amd.com
Thu Sep 12 15:19:17 UTC 2019
RAS error will be triggered if the HW identified a faulty physical page,
the error comes through an interrupt where the data payload will have
information that can be translated into the bad page address, we then as
a recovery measure reset the ASIC, reserve this bad page so it cannot be
used by anyone else and store it's address to EEPROM for future
reservation after boot.
Andrey
On 9/12/19 11:15 AM, Christian König wrote:
> Well I still hope to avoid VRAM lost in most of the cases, but that is
> really not guaranteed.
>
> What is the bad page and why do you need to reserve it?
>
> Christian.
>
> Am 12.09.19 um 16:09 schrieb Grodzovsky, Andrey:
>> I am not sure VRAM loss happens every time, but when it does I would
>> assume you would have to reserve them again as the page tables content
>> was lost. On the other hand I do remember we keep shadow system memory
>> copies of all page tables so maybe that not an issue, so yes, just try
>> to allocate the bad page after reset and if it's still reserved you will
>> fail.
>>
>> Andrey
>>
>> On 9/12/19 7:35 AM, Zhou1, Tao wrote:
>>> Hi Andrey:
>>>
>>> Are you sure of the VRAM content loss after gpu reset? I'm not very
>>> familiar with the detail of gpu reset and I'll do experiment to
>>> confirm the case you mentioned.
>>>
>>> Regards,
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>> Sent: 2019年9月12日 10:32
>>>> To: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao
>>>> <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> That not what I meant. Let's say you handled one bad page interrupt
>>>> and as
>>>> a result have one bad page reserved. Now unrelated gfx ring timeout
>>>> happens which triggers GPU reset and VRAM loss. When you come back
>>>> from
>>>> reset amdgpu_ras_reserve_bad_pages will be called but since
>>>> last_reserved
>>>> == data_count the bad page will not be reserved again, maybe we
>>>> should just
>>>> set data->last_reserved to 0 again if VRAM was lost during ASIC
>>>> reset...
>>>>
>>>> Andrey
>>>>
>>>> ________________________________________
>>>> From: Chen, Guchun <Guchun.Chen at amd.com>
>>>> Sent: 11 September 2019 21:53:03
>>>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander
>>>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> Comment inline.
>>>>
>>>> Regards,
>>>> Guchun
>>>>
>>>> -----Original Message-----
>>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>> Sent: Wednesday, September 11, 2019 10:41 PM
>>>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Chen, Guchun <Guchun.Chen at amd.com>; Deucher, Alexander
>>>> <Alexander.Deucher at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>
>>>> On second though this will break what about reserving bad pages when
>>>> resetting GPU for non RAS error reason such as manual reset ,S3 or
>>>> ring
>>>> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the
>>>> code as is.
>>>>
>>>> Another possible issue in existing code - looks like no reservation
>>>> will take
>>>> place in those case even now as amdgpu_ras_reserve_bad_pages
>>>> data->last_reserved will be equal to data->count , no ? Looks like for
>>>> this case you need to add flag to FORCE reservation for all pages from
>>>> 0 to data->counnt.
>>>> [Guchun]Yes, last_reserved is not updated any more, unless we
>>>> unload our
>>>> driver. So it maybe always equal to data->count, then no new bad
>>>> page will
>>>> be reserved.
>>>> I see we have one eeprom reset by user, can we put this
>>>> last_reserved clean
>>>> operation to user in the same stack as well?
>>>>
>>>> Andrey
>>>>
>>>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote:
>>>>> I like this much more, I will relocate to
>>>>> amdgpu_umc_process_ras_data_cb an push.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote:
>>>>>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another
>>>>>> approach is to move it into amdgpu_umc_process_ras_data_cb.
>>>>>> Anyway, either way is OK and the patch is:
>>>>>>
>>>>>> Reviewed-by: Tao Zhou <tao.zhou1 at amd.com>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>> Sent: 2019年9月11日 3:41
>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>> Cc: Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao
>>>>>>> <Tao.Zhou1 at amd.com>; Deucher, Alexander
>>>> <Alexander.Deucher at amd.com>;
>>>>>>> Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context.
>>>>>>>
>>>>>>> Problem:
>>>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu
>>>>>>> because writing to EEPROM during ASIC reset was unstable.
>>>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called
>>>>>>> directly from ISR context and so locking is not allowed. Also it's
>>>>>>> irrelevant for this partilcular interrupt as this is generic RAS
>>>>>>> interrupt and not memory errors specific.
>>>>>>>
>>>>>>> Fix:
>>>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> index 012034d..dd5da3c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> /* save bad page to eeprom before gpu reset,
>>>>>>> * i2c may be unstable in gpu reset
>>>>>>> */
>>>>>>> - amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> + if (in_task())
>>>>>>> + amdgpu_ras_reserve_bad_pages(adev);
>>>>>>> +
>>>>>>> if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>>>> schedule_work(&ras->recovery_work);
>>>>>>> return 0;
>>>>>>> --
>>>>>>> 2.7.4
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
More information about the amd-gfx
mailing list