[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
Christian König
christian.koenig at amd.com
Fri Apr 17 07:46:09 UTC 2020
Why do you think that 3efed000 and befed000 are misaligned addresses?
Also, see amdgpu_ttm_access_memory(): misaligned accesses are always
routed to the MM path.
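
A minimal sketch of the alignment check behind that question, assuming the addresses in the [vram dbg] lines are byte offsets printed as-is. The helper names are hypothetical and not part of amdgpu:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when a byte address is a multiple of 4. */
static int is_dword_aligned(uint64_t addr)
{
	return (addr & 0x3) == 0;
}

/* Checks the four addresses from the [vram dbg] output quoted in this thread. */
static void check_logged_addresses(void)
{
	assert(is_dword_aligned(0x3e7ffff8ULL));
	assert(is_dword_aligned(0x3efed000ULL));
	assert(is_dword_aligned(0xbe7ffff8ULL));
	assert(is_dword_aligned(0xbefed000ULL));

	/* Made-up offset, shown only for contrast with the logged values. */
	assert(!is_dword_aligned(0x3efed001ULL));
}
```

All four logged addresses pass the check, so if misalignment plays a role it would have to come from the access size or from an offset that the debug print does not show.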
Regards,
Christian.
On 16.04.20 at 18:08, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Felix,
>
> You’re probably right.
>
> Passing Vega20 system:
>
> [ 56.683273] amdgpu: [vram dbg] addr 3e7ffff8, val deadbeef
>
> [ 56.683349] amdgpu: [vram dbg] addr 3efed000, val cafebabe <- potential misaligned access
>
> Failing Vega20 system:
>
> [Apr16 12:00] amdgpu: [vram dbg] addr be7ffff8, val deadbeef
>
> [ +0.000082] amdgpu: [vram dbg] addr befed000, val ffffffff <- potential misaligned access
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Wednesday, April 15, 2020 11:02 AM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> The test does not access outside of the allocated memory. But it
> deliberately crosses a boundary where memory can be allocated
> non-contiguously. This is meant to catch problems where the access
> function doesn't handle non-contiguous VRAM allocations correctly.
> Given the way that VRAM allocation has been optimized, though, I
> expect that most allocations are contiguous nowadays. The more
> interesting aspect of the test is that it performs misaligned memory
> accesses. The MMIO method of accessing VRAM explicitly handles
> misaligned accesses and breaks them down into dword-aligned accesses
> with proper masking and shifting.
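
The masking and shifting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the amdgpu implementation: VRAM is modelled as a plain array of dwords, bytes are numbered little-endian within each dword, and read_dword(), read_bytes() and write_byte() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dword-wide access at an aligned byte offset. */
static uint32_t read_dword(const uint32_t *vram, uint64_t aligned_pos)
{
	assert((aligned_pos & 3) == 0);
	return vram[aligned_pos / 4];
}

/*
 * Copy size bytes starting at an arbitrary byte offset pos while issuing
 * only dword-aligned reads. Each byte is extracted by shifting and masking.
 */
static void read_bytes(const uint32_t *vram, uint64_t pos,
		       uint8_t *out, size_t size)
{
	while (size > 0) {
		uint64_t aligned = pos & ~(uint64_t)3;
		unsigned int shift = (unsigned int)(pos & 3) * 8;
		uint32_t word = read_dword(vram, aligned);

		*out++ = (uint8_t)((word >> shift) & 0xff);
		pos++;
		size--;
	}
}

/* Read-modify-write of one byte inside the dword that contains pos. */
static void write_byte(uint32_t *vram, uint64_t pos, uint8_t val)
{
	unsigned int shift = (unsigned int)(pos & 3) * 8;
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t *word = &vram[pos / 4];

	*word = (*word & ~mask) | ((uint32_t)val << shift);
}
```

A 4-byte read at offset 3 touches two dwords here, which is the kind of access the test issues and a normal page table update would not.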
>
> Could the unaligned nature of the memory access have something to do
> with hitting RAS errors? That's something unique to this test that we
> wouldn't see on a normal page table update or memory eviction.
>
> Regards,
> Felix
>
> ------------------------------------------------------------------------
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, April 15, 2020 6:58 AM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s
> only about 4MB to the boundary). Then we POKE to swap the DWORD
> positions across the boundary. The RAS event on the single
> failing machine happens on the out of boundary PEEK.
>
>
> Well, when you access outside of an allocated buffer I would expect
> that we never get as far as even touching the hardware because the
> kernel should block the access with an -EPERM or -EFAULT. So it sounds
> like I'm not understanding something correctly here.
>
> Apart from that I completely agree that we need to sort out any other
> RAS event first to make sure that the system is simply not failing
> randomly.
>
> Regards,
> Christian.
>
> On 15.04.20 at 11:49, Kim, Jonathan wrote:
>
> [AMD Public Use]
>
> Hi Christian,
>
> That could potentially be it. With additional testing, 2 of 3
> Vega20 machines never hit an error over BAR access with the PTRACE
> test. 3 of 3 machines (from the same pool) always hit an error with
> CWSR.
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s
> only about 4MB to the boundary). Then we POKE to swap the DWORD
> positions across the boundary. The RAS event on the single
> failing machine happens on the out of boundary PEEK.
>
> Felix mentioned we don’t hit errors over general HDP access but
> that may not be true. The posted system logs from an Arcturus
> failure (which I did not test myself) show that someone launched the
> ROCm bandwidth test, hit a VM fault, and a RAS event ensued during
> evictions (I can point to the internal ticket or log snippet offline
> if interested). Whether the RAS event is triggered by BAR access or
> is the result of HW instability is beyond me since I don’t have
> access to the machine.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, April 15, 2020 4:11 AM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
> in amdgpu_device_vram_access v2"
>
> Hi Jon,
>
> Also cwsr tests fail on Vega20 with or without the revert with
> the same RAS error.
>
>
> That sounds like the system/setup has a more general problem.
>
> Could it be that we are seeing RAS errors because there really is
> some hardware failure, but with the MM path we don't trigger a RAS
> interrupt?
>
> Thanks,
> Christian.
>
> On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing
> that’s different is we’re not invalidating HDP and doing a
> copy to host anymore in amdgpu_device_vram_access since the
> function is still called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with
> the same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. A revert is the usual practice when
> there is a serious regression that isn't fully understood or
> root-caused. As far as I can tell, the problem has been
> reproduced on multiple systems, different GPUs, and clearly
> regressed to Christian's commit. I think that justifies
> reverting it for now.
>
> I agree with Christian that a general HDP memory access
> problem causing RAS errors would potentially cause problems in
> other tests as well. For example, common operations like GART
> table updates, GPUVM page table updates, and PCIe peer2peer
> accesses in ROCm applications use HDP. But we're not seeing
> obvious problems from those. So we need to understand what's
> special about this test. I asked questions to that effect on
> our other email thread.
>
> Regards,
> Felix
>
> On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> I think it’s premature to push this revert.
>
> With more testing, I’m getting failures from different
> tests or sometimes none at all on my machine.
>
> Kent, let’s continue the discussion on the original thread.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:47 AM
> *To:* Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>;
> amd-gfx at lists.freedesktop.org; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> That's exactly my concern as well.
>
> This looks a bit like the test creates erroneous data
> somehow, but there doesn't seem to be a RAS check in the
> MM data path.
>
> And now that we use the BAR path it goes up in flames.
>
> I just don't see how we can create erroneous data in a
> test case?
>
> Christian.
>
> On 14.04.2020 at 16:35, "Deucher, Alexander"
> <Alexander.Deucher at amd.com> wrote:
>
> [AMD Public Use]
>
> If this causes an issue, any access to vram via the
> BAR could cause an issue.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
> behalf of Russell, Kent <Kent.Russell at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org
> *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR
> if possible in amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we
> get the dmesg output below, and then the kernel ends
> up hanging. I don't know enough about the test itself
> to know why this is occurring, but Jon Kim and Felix
> were discussing it on a separate thread when the issue
> was first reported, so they can hopefully provide some
> additional information.
>
> Kent
>
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken at gmail.com>
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> > amdgpu_device_vram_access v2"
> >
> > On 13.04.20 at 20:20, Kent Russell wrote:
> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > Arcturus
> > >
> > > dmesg output at hang time:
> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > amdgpu 0000:67:00.0: GPU reset begin!
> > > Evicting PASID 0x8000 queues
> > > Started evicting pasid 0x8000
> > > qcm fence wait loop timeout expired
> > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > > Failed to evict process queues
> > > Failed to suspend process 0x8000
> > > Finished evicting pasid 0x8000
> > > Started restoring pasid 0x8000
> > > Finished restoring pasid 0x8000
> > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
> >
> > Do you have more information on what's going wrong here? This is a
> > really important patch for KFD debugging.
> >
> > >
> > > Signed-off-by: Kent Russell <kent.russell at amd.com>
> >
> > Reviewed-by: Christian König <christian.koenig at amd.com>
> >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> > >  1 file changed, 26 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index cf5d6e585634..a3f997f84020 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> > >  	uint32_t hi = ~0;
> > >  	uint64_t last;
> > >
> > > -
> > > -#ifdef CONFIG_64BIT
> > > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > > -	if (last > pos) {
> > > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > -		size_t count = last - pos;
> > > -
> > > -		if (write) {
> > > -			memcpy_toio(addr, buf, count);
> > > -			mb();
> > > -			amdgpu_asic_flush_hdp(adev, NULL);
> > > -		} else {
> > > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > > -			mb();
> > > -			memcpy_fromio(buf, addr, count);
> > > -		}
> > > -
> > > -		if (count == size)
> > > -			return;
> > > -
> > > -		pos += count;
> > > -		buf += count / 4;
> > > -		size -= count;
> > > -	}
> > > -#endif
> > > -
> > >  	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > >  	for (last = pos + size; pos < last; pos += 4) {
> > >  		uint32_t tmp = pos >> 31;
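
The removed hunk splits each request at the end of CPU-visible VRAM: the part below adev->gmc.visible_vram_size goes through the BAR, the remainder falls through to the MMIO loop. A standalone sketch of just that arithmetic, assuming a made-up visible size in the usage below (struct vram_split and split_access() are hypothetical names, not driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: bytes served via the BAR and via MMIO. */
struct vram_split {
	uint64_t bar_bytes;
	uint64_t mmio_bytes;
};

/*
 * Mirrors the "last = min(pos + size, visible_vram_size)" logic from the
 * reverted patch, without touching any hardware.
 */
static struct vram_split split_access(uint64_t pos, uint64_t size,
				      uint64_t visible_size)
{
	struct vram_split s = { 0, size };
	uint64_t end = pos + size;
	uint64_t last = end < visible_size ? end : visible_size;

	if (last > pos) {
		s.bar_bytes = last - pos;
		s.mmio_bytes = size - s.bar_bytes;
	}
	return s;
}
```

With a hypothetical 256 MiB visible window, a 16-byte access starting 8 bytes below the window end is split 8/8, while an access at 0x3efed000 lies entirely above the window and never takes the BAR path in this model.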
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
> <Alexander.Deucher at amd.com
> <mailto:Alexander.Deucher at amd.com>>:
>
> [AMD Public Use]
>
> If this causes an issue, any access to vram via the BAR
> could cause an issue.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf
> of Russell, Kent <Kent.Russell at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org
> *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we get
> the dmesg output below, and then the kernel ends up
> hanging. I don't know enough about the test itself to know
> why this is occurring, but Jon Kim and Felix were
> discussing it on a separate thread when the issue was
> first reported, so they can hopefully provide some
> additional information.
>
> Kent
>
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken at gmail.com>
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> > amdgpu_device_vram_access v2"
> >
> > On 13.04.20 at 20:20, Kent Russell wrote:
> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > Arcturus
> > >
> > > dmesg output at hang time:
> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > amdgpu 0000:67:00.0: GPU reset begin!
> > > Evicting PASID 0x8000 queues
> > > Started evicting pasid 0x8000
> > > qcm fence wait loop timeout expired
> > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > > Failed to evict process queues
> > > Failed to suspend process 0x8000
> > > Finished evicting pasid 0x8000
> > > Started restoring pasid 0x8000
> > > Finished restoring pasid 0x8000
> > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
> >
> > Do you have more information on what's going wrong here? This is a
> > really important patch for KFD debugging.
> >
> > >
> > > Signed-off-by: Kent Russell <kent.russell at amd.com>
> >
> > Reviewed-by: Christian König <christian.koenig at amd.com>
> >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> > >  1 file changed, 26 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index cf5d6e585634..a3f997f84020 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> > >  	uint32_t hi = ~0;
> > >  	uint64_t last;
> > >
> > > -
> > > -#ifdef CONFIG_64BIT
> > > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > > -	if (last > pos) {
> > > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > -		size_t count = last - pos;
> > > -
> > > -		if (write) {
> > > -			memcpy_toio(addr, buf, count);
> > > -			mb();
> > > -			amdgpu_asic_flush_hdp(adev, NULL);
> > > -		} else {
> > > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > > -			mb();
> > > -			memcpy_fromio(buf, addr, count);
> > > -		}
> > > -
> > > -		if (count == size)
> > > -			return;
> > > -
> > > -		pos += count;
> > > -		buf += count / 4;
> > > -		size -= count;
> > > -	}
> > > -#endif
> > > -
> > >  	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > >  	for (last = pos + size; pos < last; pos += 4) {
> > >  		uint32_t tmp = pos >> 31;
>
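
For readers following the quoted patch: the block being reverted splits one request into a part that lies inside the CPU-visible VRAM window (copied through the BAR) and a remainder that falls through to the MMIO loop. The sketch below only mirrors that arithmetic in user space. The names split_access, struct vram_split and split_access_selfcheck are made up for illustration and are not kernel functions, and the 256 MiB window is an example value, not taken from either test system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type, not part of amdgpu. */
struct vram_split {
	uint64_t bar_bytes;   /* bytes the fast path would copy through the BAR */
	uint64_t mmio_pos;    /* offset where the MMIO loop would start */
	uint64_t mmio_bytes;  /* bytes left over for the MMIO loop */
};

/* Mirrors "last = min(pos + size, visible); if (last > pos) ..." from the
 * removed hunk. No hardware access happens here. */
static struct vram_split split_access(uint64_t pos, uint64_t size,
				      uint64_t visible_vram_size)
{
	struct vram_split s = { 0, pos, size };
	uint64_t last = pos + size;

	if (last > visible_vram_size)
		last = visible_vram_size;

	if (last > pos) {
		s.bar_bytes = last - pos;
		s.mmio_pos = pos + s.bar_bytes;
		s.mmio_bytes = size - s.bar_bytes;
	}
	return s;
}

/* Small self-check with an assumed 256 MiB visible window. */
void split_access_selfcheck(void)
{
	const uint64_t visible = 256ull << 20;
	struct vram_split s;

	/* Entirely inside the window: everything goes through the BAR. */
	s = split_access(0x1000, 64, visible);
	assert(s.bar_bytes == 64 && s.mmio_bytes == 0);

	/* Straddles the end of the window: 8 bytes BAR, 8 bytes MMIO. */
	s = split_access(visible - 8, 16, visible);
	assert(s.bar_bytes == 8 && s.mmio_pos == visible && s.mmio_bytes == 8);

	/* Entirely outside the window: the fast path is skipped. */
	s = split_access(visible + 4, 4, visible);
	assert(s.bar_bytes == 0 && s.mmio_pos == visible + 4 && s.mmio_bytes == 4);
}
```

The straddling case is the one the PtraceAccessInvisibleVram subtest name suggests is being exercised. Whether the split itself is related to the RAS event is not something this sketch can show.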