[Bug 216373] New: Uncorrected errors reported for AMD GPU

Lazar, Lijo lijo.lazar at amd.com
Thu Aug 25 07:54:46 UTC 2022



On 8/25/2022 1:04 PM, Christian König wrote:
> Am 25.08.22 um 08:40 schrieb Stefan Roese:
>> On 24.08.22 16:45, Tom Seewald wrote:
>>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>>> 'untested-patch' based on your trace logs.
>>>>
>>>> Thanks,
>>>> Lijo
>>>
>>> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
>>> booting that kernel I no longer see any messages about PCI errors. I
>>> have uploaded a dmesg log to the bug report:
>>> https://bugzilla.kernel.org/attachment.cgi?id=301642
>>>
>>
>> I did not follow this thread in depth, but FWICT the bug is solved now
>> with this patch. So is it correct that the now fully enabled AER
>> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
>> GPU driver?
> 
> It looks like it, but I'm not 100% sure about the rationale behind it.
> 
> Lijo can you explain more on this?
> 

From the trace, gmc hw_init takes this route:

gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb -> 
amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP flush)

The HDP flush is done through a remapped offset, which is
MMIO_REG_HOLE_OFFSET (0x80000 - PAGE_SIZE):

WREG32_NO_KIQ((adev->rmmio_remap.reg_offset + 
KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);

However, the remapping has not been done yet at this point. It is done
later, during common block initialization. The access to the still
unmapped offset '(0x80000 - PAGE_SIZE)' seems to come back as an
unsupported request, which is then reported through AER.
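
To make the offset concrete, here is a small standalone sketch of the
arithmetic behind that WREG32_NO_KIQ() call. It is not driver code. The
SKETCH_* names are made up for illustration, a 4 KiB PAGE_SIZE is
assumed, and the flush control is assumed to sit at byte offset 0 of the
remapped page.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the real value depends on the kernel config. */
#define SKETCH_PAGE_SIZE            0x1000u
/* From the mail: MMIO_REG_HOLE_OFFSET is (0x80000 - PAGE_SIZE). */
#define SKETCH_MMIO_REG_HOLE_OFFSET (0x80000u - SKETCH_PAGE_SIZE)
/* Assumption: the HDP mem flush control is at byte offset 0 of the page. */
#define SKETCH_HDP_MEM_FLUSH_CNTL   0u

/* Byte offset inside the MMIO BAR that the non-ring HDP flush writes to. */
static uint32_t sketch_hdp_flush_byte_offset(void)
{
    return SKETCH_MMIO_REG_HOLE_OFFSET + SKETCH_HDP_MEM_FLUSH_CNTL;
}

/* The register accessor takes a dword index, hence the ">> 2" in the call. */
static uint32_t sketch_hdp_flush_reg_index(void)
{
    return sketch_hdp_flush_byte_offset() >> 2;
}
```

Under these assumptions the write lands at byte offset 0x7F000, that is
dword index 0x1FC00, which is the location that has no register behind
it until the remap is programmed.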

In the patch, I just moved the remapping before gmc block initialization.
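
The ordering problem can be modelled with a toy program. This is only a
sketch of the sequence described above, not the patch itself. Every
sketch_* name is hypothetical, and the "unsupported request" counter
stands in for what the hardware and AER would report.

```c
#include <stdbool.h>

/* Toy device state: tracks only what matters for the ordering issue. */
struct sketch_dev {
    bool remap_done;                   /* has the HDP remap been programmed? */
    unsigned int unsupported_requests; /* writes that hit the unmapped hole */
};

/* Stand-in for programming the remap of the HDP flush registers. */
static void sketch_program_remap(struct sketch_dev *d)
{
    d->remap_done = true;
}

/* Stand-in for the non-ring HDP flush through the remapped offset. */
static void sketch_flush_hdp(struct sketch_dev *d)
{
    if (!d->remap_done)
        d->unsupported_requests++; /* nothing is mapped there yet */
}

/* gmc hw_init ends up flushing HDP (gart_enable -> ... -> flush_hdp). */
static void sketch_gmc_hw_init(struct sketch_dev *d)
{
    sketch_flush_hdp(d);
}

/* Original order: gmc init runs first, the remap is programmed later. */
static unsigned int sketch_init_original_order(void)
{
    struct sketch_dev d = { false, 0 };

    sketch_gmc_hw_init(&d);
    sketch_program_remap(&d); /* happens during common block init */
    return d.unsupported_requests;
}

/* Patched order: the remap is programmed before gmc init runs. */
static unsigned int sketch_init_patched_order(void)
{
    struct sketch_dev d = { false, 0 };

    sketch_program_remap(&d);
    sketch_gmc_hw_init(&d);
    return d.unsupported_requests;
}
```

In this model the original order records one unsupported request and
the patched order records none, which matches the dmesg result Tom
reported.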

Thanks,
Lijo

> Thanks,
> Christian.
> 
>>
>> Thanks,
>> Stefan
> 

