[Bug 216373] New: Uncorrected errors reported for AMD GPU

Christian König christian.koenig at amd.com
Thu Aug 25 08:18:28 UTC 2022


Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
>
>
> On 8/25/2022 1:04 PM, Christian König wrote:
>> Am 25.08.22 um 08:40 schrieb Stefan Roese:
>>> On 24.08.22 16:45, Tom Seewald wrote:
>>>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar at amd.com> 
>>>> wrote:
>>>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>>>> 'untested-patch' based on your trace logs.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>
>>>> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
>>>> booting that kernel I no longer see any messages about PCI errors. I
>>>> have uploaded a dmesg log to the bug report:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=301642
>>>>
>>>
>>> I did not follow this thread in depth, but FWICT the bug is solved now
>>> with this patch. So is it correct that the now fully enabled AER
>>> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
>>> GPU driver?
>>
>> It looks like it, but I'm not 100% sure about the rationale behind it.
>>
>> Lijo, can you explain more on this?
>>
>
> From the trace, during gmc hw_init it takes this route -
>
> gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb -> 
> amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP 
> flush)
>
> The HDP flush is done using the remapped offset, which is
> MMIO_REG_HOLE_OFFSET (0x80000 - PAGE_SIZE)
>
> WREG32_NO_KIQ((adev->rmmio_remap.reg_offset + 
> KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
>
> However, the remapping is not yet done at this point. It's done at a 
> later point during common block initialization. Access to the unmapped 
> offset '(0x80000 - PAGE_SIZE)' seems to come back as an unsupported 
> request and is reported through AER.

That's interesting behavior. So far, AER has always indicated some kind
of transmission error.

If it also happens on accesses to unmapped areas of the MMIO BAR, then
we need to keep that in mind.
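
To make the ordering issue described above concrete, here is a minimal
toy model in C. It is only a sketch under stated assumptions and not the
amdgpu code: struct model_dev, every model_* function, MODEL_PAGE_SIZE
and the flush register offset of 0 are hypothetical names and values
introduced for illustration. Only MMIO_REG_HOLE_OFFSET (0x80000 -
PAGE_SIZE) and the order "gmc hw_init before the remap" are taken from
the explanation above. The model counts a write into the hole as an
unsupported request whenever the remap has not been programmed yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of this is the real amdgpu driver code. */
#define MODEL_PAGE_SIZE          0x1000u  /* assumed 4 KiB pages */
#define MMIO_REG_HOLE_OFFSET     (0x80000u - MODEL_PAGE_SIZE)
#define MODEL_HDP_MEM_FLUSH_CNTL 0x0u     /* assumed offset in the hole */

struct model_dev {
    bool         remap_done;            /* is the HDP remap programmed? */
    uint32_t     remap_reg_offset;      /* stands in for rmmio_remap.reg_offset */
    unsigned int unsupported_requests;  /* what AER would report here */
};

/* Device side: a write into the hole only lands once the remap exists. */
static void model_mmio_write(struct model_dev *dev, uint32_t byte_offset)
{
    bool in_hole = byte_offset >= MMIO_REG_HOLE_OFFSET &&
                   byte_offset < MMIO_REG_HOLE_OFFSET + MODEL_PAGE_SIZE;

    assert((byte_offset & 3u) == 0);    /* registers are 32-bit aligned */
    if (in_hole && !dev->remap_done)
        dev->unsupported_requests++;
}

/* Common block init: this is where the remap gets programmed. */
static void model_common_hw_init(struct model_dev *dev)
{
    dev->remap_done = true;
}

/* gmc init: ends in a non-ring HDP flush through the remapped offset. */
static void model_gmc_hw_init(struct model_dev *dev)
{
    model_mmio_write(dev, dev->remap_reg_offset + MODEL_HDP_MEM_FLUSH_CNTL);
}

/* Order seen in the trace: flush first, remap afterwards. */
unsigned int model_init_gmc_first(void)
{
    struct model_dev dev = { false, MMIO_REG_HOLE_OFFSET, 0 };

    model_gmc_hw_init(&dev);
    model_common_hw_init(&dev);
    return dev.unsupported_requests;
}

/* Order after the patch: remap first, then gmc init. */
unsigned int model_init_remap_first(void)
{
    struct model_dev dev = { false, MMIO_REG_HOLE_OFFSET, 0 };

    model_common_hw_init(&dev);
    model_gmc_hw_init(&dev);
    return dev.unsupported_requests;
}
```

In this model, model_init_gmc_first() returns 1 (one unsupported
request) and model_init_remap_first() returns 0. Whether real hardware
answers such a write with an unsupported request is exactly the open
question here, so the model bakes that behavior in as an assumption.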

Thanks,
Christian.

>
> In the patch, I just moved the remapping before gmc block initialization.
>
> Thanks,
> Lijo
>
>> Thanks,
>> Christian.
>>
>>>
>>> Thanks,
>>> Stefan
>>


