[PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Thu Dec 23 15:42:58 UTC 2021
Adding back public list and Leslie specifically.
Lijo is right and it's not MTTR release only, also all the unmaps in
amdgpu_device_unmap_mmio
since this patch makes call to amdgpu_device_unmap_mmio conditioned on
device unplugged
we need then to still take care of unmapping even when device is NOT
unplugged - for this i suggest
to look at 'drm/amdgpu: Unmap all MMIO mappings' and just conditionally
call all the deleted unmaps
in the patch where the condition is 'if (drm_dev_enter(dev))'.
Andrey
On 2021-12-23 8:47 a.m., Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> Actually, I was asking specifically about this -
>
> gmc_v9_0_sw_init -> amdgpu_bo_init-> adev->gmc.vram_mtrr = arch_phys_wc_add(adev->gmc.aper_base,
> adev->gmc.aper_size);
>
>
> As per this patch, if the driver load failed due to some error which happens during hw_init(), this action is not undone. I was thinking why it's not considered important to skip this on driver failure to load. For ex: when some MTRR register is reserved for this mapping.
>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: Friday, December 17, 2021 9:05 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>
> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
>
> We do unmap on unload, this function is a collection of all MMIO unmapps we already do on unload, it's just does them early on in case of device hot removal. Before pci_driver.remove callback (amdgpu_pci_remove) finish execution.
>
> Andrey
>
> On 2021-12-17 10:23 a.m., Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> As a side note, even if it's a failed driver load, why it is not important to undo the mappings created during driver load? I'm wondering what is the impact on a system like MI200 A+A.
>>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>> Andrey Grodzovsky
>> Sent: Friday, December 17, 2021 8:32 PM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Shi, Leslie
>> <Yuliang.Shi at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; Deucher,
>> Alexander <Alexander.Deucher at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Chen, Guchun <Guchun.Chen at amd.com>
>> Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if
>> device is unplugged to prevent crash in GPU initialization failure
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>
>> Andrey
>>
>> On 2021-12-17 3:49 a.m., Christian König wrote:
>>> Am 17.12.21 um 03:26 schrieb Leslie Shi:
>>>> [Why]
>>>> In amdgpu_driver_load_kms, when amdgpu_device_init returns error
>>>> during driver modprobe, it will start the error handle path
>>>> immediately and call into amdgpu_device_unmap_mmio as well to
>>>> release mapped VRAM. However, in the following release callback,
>>>> driver stills visits the unmapped memory like
>>>> vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.
>>>>
>>>> [How]
>>>> call amdgpu_device_unmap_mmio() if device is unplugged to prevent
>>>> invalid memory address in
>>>> vcn_v3_0_sw_fini() when GPU initialization failure.
>>>>
>>>> Signed-off-by: Leslie Shi <Yuliang.Shi at amd.com>
>>> Looks sane to me, but Andrey should probably nood as well.
>>>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f31caec669e7..f6a47b927cfd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct
>>>> amdgpu_device
>>>> *adev)
>>>> amdgpu_gart_dummy_page_fini(adev);
>>>> - amdgpu_device_unmap_mmio(adev);
>>>> + if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>> + amdgpu_device_unmap_mmio(adev);
>>>> +
>>>> }
>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
More information about the amd-gfx
mailing list