[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

Wed Apr 15 10:58:28 UTC 2020

> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDs outside that boundary (it’s only
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions
> across the boundary.  The RAS event on the single failing machine
> happens on the out-of-boundary PEEK.
>

Well, when you access memory outside of an allocated buffer, I would expect
that we never get as far as even touching the hardware, because the kernel
should block the access with -EPERM or -EFAULT. So it sounds like I'm
not understanding something correctly here.

Apart from that, I completely agree that we need to sort out any other
RAS events first to make sure that the system is not simply failing randomly.

Regards,
Christian.
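
For reference, the hunk removed by the revert (quoted further down) splits each VRAM access at `adev->gmc.visible_vram_size`: the part below that limit goes through the BAR mapping with `memcpy_toio()`/`memcpy_fromio()`, and whatever remains falls through to the MMIO index/data loop (the "MM path" in this thread). Below is a minimal userspace sketch of only that range arithmetic. The struct and function names are hypothetical and are not part of the driver; no hardware access, HDP flush, or locking is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how an access is divided between the two paths. */
struct vram_split {
	uint64_t bar_bytes;   /* bytes served through the CPU-visible BAR mapping */
	uint64_t mmio_bytes;  /* bytes left for the MMIO index/data loop */
	uint64_t mmio_pos;    /* VRAM offset where the MMIO part starts */
};

/*
 * Mirror the arithmetic of the reverted hunk:
 *   last  = min(pos + size, visible_vram_size);
 *   count = last - pos;   (only when last > pos)
 * Everything past `count` is handled by the fallback loop.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
					   uint64_t visible_vram_size)
{
	struct vram_split s = { 0, size, pos };
	uint64_t end = pos + size;
	uint64_t last = end < visible_vram_size ? end : visible_vram_size;

	/* This sketch assumes pos + size does not wrap around. */
	assert(end >= pos);

	if (last > pos) {
		s.bar_bytes = last - pos;
		s.mmio_bytes = size - s.bar_bytes;
		s.mmio_pos = pos + s.bar_bytes;
	}
	return s;
}
```

With a hypothetical 256-byte visible window, a 16-byte access at offset 248 is split into 8 bytes over the BAR and 8 bytes over MMIO starting at offset 256; an access that lies entirely above the window never touches the BAR path.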

On 15.04.20 at 11:49, Kim, Jonathan wrote:
>
> [AMD Public Use]
>
> Hi Christian,
>
> That could potentially be it.  With additional testing, 2 of 3 Vega20
> machines never hit an error over BAR access with the PTRACE test.  3 of 3
> machines (from the same pool) always hit an error with CWSR.
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDs outside that boundary (it’s only
> about 4MB to the boundary).  Then we POKE to swap the DWORD positions
> across the boundary.  The RAS event on the single failing machine
> happens on the out-of-boundary PEEK.
>
> Felix mentioned we don’t hit errors over general HDP access, but that
> may not be true.  Sys logs posted for an Arcturus failure (which I
> didn’t test) show that someone launched the rocm bandwidth test, hit a
> VM fault, and a RAS event ensued during evictions (I can point to the
> internal ticket or log snippet offline if interested).  Whether the
> RAS event is triggered by BAR access or is the result of HW instability
> is beyond me, since I don’t have access to the machine.
>
> Thanks,
>
> Jon
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, April 15, 2020 4:11 AM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix 
> <Felix.Kuehling at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
> Hi Jon,
>
>     Also cwsr tests fail on Vega20 with or without the revert with the
>     same RAS error.
>
>
> That sounds like the system/setup has a more general problem.
>
> Could it be that we are seeing RAS errors because there really is some 
> hardware failure, but with the MM path we don't trigger a RAS interrupt?
>
> Thanks,
> Christian.
>
> On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     If we’re passing the test on the revert, then the only thing
>     that’s different is we’re not invalidating HDP and doing a copy to
>     host anymore in amdgpu_device_vram_access since the function is
>     still called in ttm access_memory with BAR.
>
>     Also cwsr tests fail on Vega20 with or without the revert with the
>     same RAS error.
>
>     Thanks,
>
>     Jon
>
>     *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
>     *Sent:* Tuesday, April 14, 2020 2:32 PM
>     *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Koenig, Christian
>     <Christian.Koenig at amd.com>; Deucher, Alexander
>     <Alexander.Deucher at amd.com>
>     *Cc:* Russell, Kent <Kent.Russell at amd.com>;
>     amd-gfx at lists.freedesktop.org
>     *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     I wouldn't call it premature. Revert is a usual practice when
>     there is a serious regression that isn't fully understood or
>     root-caused. As far as I can tell, the problem has been reproduced
>     on multiple systems, different GPUs, and clearly regressed to
>     Christian's commit. I think that justifies reverting it for now.
>
>     I agree with Christian that a general HDP memory access problem
>     causing RAS errors would potentially cause problems in other tests
>     as well. For example, common operations like GART table updates,
>     GPUVM page table updates, and PCIe peer2peer accesses in ROCm
>     applications use HDP. But we're not seeing obvious problems from
>     those. So we need to understand what's special about this test. I
>     asked questions to that effect on our other email thread.
>
>     Regards,
>       Felix
>
>     On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         I think it’s premature to push this revert.
>
>         With more testing, I’m getting failures from different tests
>         or sometimes none at all on my machine.
>
>         Kent, let’s continue the discussion on the original thread.
>
>         Thanks,
>
>         Jon
>
>         *From:* Koenig, Christian <Christian.Koenig at amd.com>
>         *Sent:* Tuesday, April 14, 2020 10:47 AM
>         *To:* Deucher, Alexander <Alexander.Deucher at amd.com>
>         *Cc:* Russell, Kent <Kent.Russell at amd.com>;
>         amd-gfx at lists.freedesktop.org; Kuehling, Felix
>         <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>
>         *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         That's exactly my concern as well.
>
>         This looks a bit like the test creates erroneous data somehow,
>         but there doesn't seem to be a RAS check in the MM data path.
>
>         And now that we use the BAR path it goes up in flames.
>
>         I just don't see how we can create erroneous data in a test case.
>
>         Christian.
>
>         On 14.04.2020 at 16:35, "Deucher, Alexander"
>         <Alexander.Deucher at amd.com> wrote:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf
>             of Russell, Kent <Kent.Russell at amd.com>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig at amd.com>;
>             amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>             *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
>             <Jonathan.Kim at amd.com>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken at gmail.com>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell at amd.com>;
>             > amd-gfx at lists.freedesktop.org
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > On 13.04.20 at 20:20, Kent Russell wrote:
>             > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent kernel hard-hang
>             > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>             > > Failed to evict process queues
>             > > Failed to suspend process 0x8000
>             > > Finished evicting pasid 0x8000
>             > > Started restoring pasid 0x8000
>             > > Finished restoring pasid 0x8000
>             > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here?
>             > This is a really important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell at amd.com>
>             >
>             > Reviewed-by: Christian König <christian.koenig at amd.com>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > -                   memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > -                   memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx at lists.freedesktop.org
>             https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org
>             <mailto:amd-gfx-bounces at lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig at amd.com
>             <mailto:Christian.Koenig at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             <amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com
>             <mailto:Felix.Kuehling at amd.com>>; Kim, Jonathan
>             <Jonathan.Kim at amd.com <mailto:Jonathan.Kim at amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken at gmail.com
>             <mailto:ckoenig.leichtzumerken at gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell at amd.com
>             <mailto:kent.russell at amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig at amd.com
>             <mailto:christian.koenig at amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org
>             <mailto:amd-gfx-bounces at lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig at amd.com
>             <mailto:Christian.Koenig at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             <amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com
>             <mailto:Felix.Kuehling at amd.com>>; Kim, Jonathan
>             <Jonathan.Kim at amd.com <mailto:Jonathan.Kim at amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken at gmail.com
>             <mailto:ckoenig.leichtzumerken at gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit
>             c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent
>             kernel hard-hang
>             > > when running the
>             KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an
>             unsuccessful
>             > > queues preemption Failed to evict process queues
>             Failed to suspend
>             > > process 0x8000 Finished evicting pasid 0x8000 Started
>             restoring pasid
>             > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU
>             state may lost
>             > > due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26,
>             response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7,
>             response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to
>             disable all smu
>             > features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to
>             disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
>             suspend of IP
>             > > block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here
>             since this is a really
>             > important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell at amd.com
>             <mailto:kent.russell at amd.com>>
>             >
>             > Reviewed-by: Christian König <christian.koenig at amd.com
>             <mailto:christian.koenig at amd.com>>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>             ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct
>             > amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr =
>             adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > - memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > - amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > - amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > - memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0
>
>         Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>         <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>:
>
>             [AMD Public Use]
>
>             If this causes an issue, any access to vram via the BAR
>             could cause an issue.
>
>             Alex
>
>             ------------------------------------------------------------------------
>
>             *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org
>             <mailto:amd-gfx-bounces at lists.freedesktop.org>> on behalf
>             of Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>
>             *Sent:* Tuesday, April 14, 2020 10:19 AM
>             *To:* Koenig, Christian <Christian.Koenig at amd.com
>             <mailto:Christian.Koenig at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             <amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>>
>             *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com
>             <mailto:Felix.Kuehling at amd.com>>; Kim, Jonathan
>             <Jonathan.Kim at amd.com <mailto:Jonathan.Kim at amd.com>>
>             *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in amdgpu_device_vram_access v2"
>
>             [AMD Official Use Only - Internal Distribution Only]
>
>             On VG20 or MI100, as soon as we run the subtest, we get
>             the dmesg output below, and then the kernel ends up
>             hanging. I don't know enough about the test itself to know
>             why this is occurring, but Jon Kim and Felix were
>             discussing it on a separate thread when the issue was
>             first reported, so they can hopefully provide some
>             additional information.
>
>              Kent
>
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken at gmail.com
>             <mailto:ckoenig.leichtzumerken at gmail.com>>
>             > Sent: Tuesday, April 14, 2020 9:52 AM
>             > To: Russell, Kent <Kent.Russell at amd.com
>             <mailto:Kent.Russell at amd.com>>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
>             possible in
>             > amdgpu_device_vram_access v2"
>             >
>             > Am 13.04.20 um 20:20 schrieb Kent Russell:
>             > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>             > > The original patch causes a RAS event and subsequent kernel hard-hang
>             > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>             > > Arcturus
>             > >
>             > > dmesg output at hang time:
>             > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>             > > amdgpu 0000:67:00.0: GPU reset begin!
>             > > Evicting PASID 0x8000 queues
>             > > Started evicting pasid 0x8000
>             > > qcm fence wait loop timeout expired
>             > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>             > > Failed to evict process queues
>             > > Failed to suspend process 0x8000
>             > > Finished evicting pasid 0x8000
>             > > Started restoring pasid 0x8000
>             > > Finished restoring pasid 0x8000
>             > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
>             > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>             > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>             > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>             > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>             > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
>             > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
>             > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>             > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>             >
>             > Do you have more information on what's going wrong here?
>             > This is a really important patch for KFD debugging.
>             >
>             > >
>             > > Signed-off-by: Kent Russell <kent.russell at amd.com>
>             >
>             > Reviewed-by: Christian König <christian.koenig at amd.com>
>             >
>             > > ---
>             > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
>             > >   1 file changed, 26 deletions(-)
>             > >
>             > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > index cf5d6e585634..a3f997f84020 100644
>             > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>             > >      uint32_t hi = ~0;
>             > >      uint64_t last;
>             > >
>             > > -
>             > > -#ifdef CONFIG_64BIT
>             > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>             > > -   if (last > pos) {
>             > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>             > > -           size_t count = last - pos;
>             > > -
>             > > -           if (write) {
>             > > -                   memcpy_toio(addr, buf, count);
>             > > -                   mb();
>             > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>             > > -           } else {
>             > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>             > > -                   mb();
>             > > -                   memcpy_fromio(buf, addr, count);
>             > > -           }
>             > > -
>             > > -           if (count == size)
>             > > -                   return;
>             > > -
>             > > -           pos += count;
>             > > -           buf += count / 4;
>             > > -           size -= count;
>             > > -   }
>             > > -#endif
>             > > -
>             > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>             > >      for (last = pos + size; pos < last; pos += 4) {
>             > >              uint32_t tmp = pos >> 31;
>             _______________________________________________
>             amd-gfx mailing list
>             amd-gfx at lists.freedesktop.org
>             https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>         On 14.04.2020 at 16:35, "Deucher, Alexander"
>         <Alexander.Deucher at amd.com> wrote:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>
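
For readers following the reverted hunk: its arithmetic splits a request
into a part below adev->gmc.visible_vram_size, which went through the BAR
mapping, and a remainder left to the indexed MMIO loop under mmio_idx_lock.
A minimal userspace sketch of that split is given here. The names
struct vram_split and split_vram_access are hypothetical and not part of
the driver; only the min()/count arithmetic is taken from the quoted diff.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical result type: how one request divides between the two paths. */
struct vram_split {
    uint64_t bar_bytes;   /* bytes below the visible-VRAM limit (BAR path) */
    uint64_t mmio_bytes;  /* bytes left for the indexed MMIO fallback loop */
};

/*
 * Mirrors the arithmetic of the removed hunk:
 *   last  = min(pos + size, visible_vram_size);
 *   count = last - pos;   (only when last > pos)
 * Like the original, this does not guard against pos + size overflowing.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, size };
    uint64_t end = pos + size;
    uint64_t last = end < visible_vram_size ? end : visible_vram_size;

    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mmio_bytes = size - s.bar_bytes;
    }
    return s;
}
```

With a 256-byte visible window, a 16-byte access at offset 248 would be
split into 8 bytes for each path, while an access starting at offset 512
would go entirely to the MMIO loop. With the revert applied, every access
takes the MMIO loop regardless of position.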
