[Bug 216373] New: Uncorrected errors reported for AMD GPU

Thu Aug 25 17:48:45 UTC 2022

On Thu, Aug 25, 2022 at 10:18:28AM +0200, Christian König wrote:
> Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
> > On 8/25/2022 1:04 PM, Christian König wrote:
> > > Am 25.08.22 um 08:40 schrieb Stefan Roese:
> > > > On 24.08.22 16:45, Tom Seewald wrote:
> > > > > On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo
> > > > > <lijo.lazar at amd.com> wrote:
> > > > > > Unfortunately, I don't have any NV platforms to test. Attached is an
> > > > > > 'untested-patch' based on your trace logs.
> > > > > ...
> > > > 
> > > > I did not follow this thread in depth, but FWICT the bug is solved now
> > > > with this patch. So is it correct, that the now fully enabled AER
> > > > support in the PCI subsystem in v6.0 helped detecting a bug in the AMD
> > > > GPU driver?
> > > 
> > > It looks like it, but I'm not 100% sure about the rationale behind it.
> > > 
> > > Lijo can you explain more on this?
> > 
> > From the trace, during gmc hw_init it takes this route -
> > 
> > gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
> > amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP
> > flush)
> > 
> > HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET
> > (0x80000 - PAGE_SIZE)
> > 
> > WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
> > KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
> > 
> > However, the remapping is not yet done at this point. It's done at a
> > later point during common block initialization. Access to the unmapped
> > offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported request
> > and reported through AER.
> 
> That's interesting behavior. So far AER always indicated some kind of
> transmission error.
> 
> If that also happens on unmapped areas of the MMIO BAR, then we need to
> keep that in mind.
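
For reference, the offset arithmetic in the WREG32_NO_KIQ() call quoted
above can be sketched as below. This is only an illustration: it assumes
4 KiB pages and a flush-control offset of 0 within the remapped page, and
every EX_/ex_ name is hypothetical, not a driver symbol.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define EX_PAGE_SIZE            0x1000u
/* The "register hole" described in the quoted text: 0x80000 - PAGE_SIZE. */
#define EX_MMIO_REG_HOLE_OFFSET (0x80000u - EX_PAGE_SIZE)
/* Assumption: the HDP memory flush control sits at offset 0 of that page. */
#define EX_HDP_MEM_FLUSH_CNTL   0x0u

/* Byte offset inside the MMIO BAR that the flush write targets. */
static inline uint32_t ex_hdp_flush_byte_offset(void)
{
    return EX_MMIO_REG_HOLE_OFFSET + EX_HDP_MEM_FLUSH_CNTL;
}

/* The quoted call shifts the byte offset right by 2 to get a dword index. */
static inline uint32_t ex_reg_index(uint32_t byte_offset)
{
    return byte_offset >> 2;
}
```

Under those assumptions the write lands at byte offset 0x7F000 (dword
index 0x1FC00) of the BAR, which is the range that is not yet remapped
during gmc hw_init.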

AER can log many different kinds of errors, some related to hardware
issues and some related to software.

PCI writes are normally posted and get no response, so AER is the main
way to find out about writes to unimplemented addresses.

Reads do get a response, of course, and reads to unimplemented
addresses cause errors that most hardware turns into a ~0 data return
(in addition to reporting via AER if enabled).
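
As a rough user-space model of the read side (not driver code), the
sketch below treats a BAR as an array with a limited number of
implemented dwords and returns ~0 for anything beyond it. The struct and
the ex_ helpers are made up for illustration. Note that ~0 can also be a
legitimate register value, so the check is only a hint that a read may
have failed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ALL_ONES 0xFFFFFFFFu

/* Hypothetical model of an MMIO BAR: only the first 'implemented'
 * dwords are backed by real registers. */
struct ex_bar {
    const uint32_t *regs;
    size_t implemented;
};

/* Reads of unimplemented dwords come back as all ones, mimicking what
 * most hardware does for a failed read. */
static uint32_t ex_read32(const struct ex_bar *bar, size_t index)
{
    if (index >= bar->implemented)
        return EX_ALL_ONES;
    return bar->regs[index];
}

/* True if the value looks like the ~0 pattern of a failed read. */
static bool ex_read_possibly_failed(uint32_t val)
{
    return val == EX_ALL_ONES;
}
```

A posted write has no equivalent check, since nothing comes back to the
CPU; that is why the write case depends on AER.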

Bjorn