[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
Christian König
christian.koenig at amd.com
Wed Apr 15 10:58:28 UTC 2020
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only
> about 4MB to the boundary). Then we POKE to swap the DWORD positions
> across the boundary. The RAS event on the single failing machine
> happens on the out of boundary PEEK.
>
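For readers unfamiliar with the mechanics, the PEEK/POKE swap described in the quote above can be sketched in plain C roughly as follows. This is only an illustration on ordinary host memory of a forked child, not the KFD test itself and not VRAM. The helper names (peek_word, poke_word, swap_words, demo_swap) are hypothetical, and ptrace transfers one long-sized word per call rather than a 32-bit DWORD.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read one word from the tracee. Returns 0 on success, -errno on failure. */
static int peek_word(pid_t pid, void *addr, long *out)
{
    long v;

    errno = 0; /* -1 is a valid data value, so errno is the only error signal */
    v = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (v == -1 && errno != 0)
        return -errno;
    *out = v;
    return 0;
}

/* Write one word into the tracee. Returns 0 on success, -errno on failure. */
static int poke_word(pid_t pid, void *addr, long val)
{
    if (ptrace(PTRACE_POKEDATA, pid, addr, (void *)val) == -1)
        return -errno;
    return 0;
}

/* Swap the words at addresses a and b inside the tracee. */
static int swap_words(pid_t pid, void *a, void *b)
{
    long va, vb;
    int r;

    if ((r = peek_word(pid, a, &va)) < 0)
        return r;
    if ((r = peek_word(pid, b, &vb)) < 0)
        return r;
    if ((r = poke_word(pid, a, vb)) < 0)
        return r;
    return poke_word(pid, b, va);
}

/*
 * Hypothetical demo: fork a child that stops itself under ptrace, swap two
 * words in the child's copy of the array, and read them back.
 * Returns 0 if the swap was observed, a negative errno value otherwise.
 */
static int demo_swap(void)
{
    static volatile long words[2] = { 0x1111, 0x2222 };
    long a = 0, b = 0;
    int status, r;
    pid_t pid = fork();

    if (pid < 0)
        return -errno;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1); /* tracing is not permitted in this environment */
        raise(SIGSTOP);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -errno;
    if (!WIFSTOPPED(status))
        return -EPERM; /* child exited instead of stopping under trace */

    r = swap_words(pid, (void *)&words[0], (void *)&words[1]);
    if (r == 0)
        r = peek_word(pid, (void *)&words[0], &a);
    if (r == 0)
        r = peek_word(pid, (void *)&words[1], &b);

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    if (r < 0)
        return r;
    return (a == 0x2222 && b == 0x1111) ? 0 : -EIO;
}
```

In the real test the two addresses would sit on either side of the allocation boundary; here they are simply adjacent array elements.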
Well, when you access memory outside of an allocated buffer, I would
expect that we never get as far as even touching the hardware, because
the kernel should block the access with -EPERM or -EFAULT. So it sounds
like I'm not understanding something correctly here.
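To make that expectation concrete, here is a minimal userspace sketch of the kind of range check I have in mind. It is not the actual kernel code path; struct buffer_range and check_access are hypothetical names, and the 4MB size in the usage note is only borrowed from the test description.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical description of one allocated buffer. */
struct buffer_range {
    uint64_t start; /* first valid address */
    uint64_t size;  /* length in bytes */
};

/*
 * Return 0 if [addr, addr + len) lies entirely inside the buffer,
 * -EFAULT otherwise. Written so that the arithmetic cannot overflow.
 */
static int check_access(const struct buffer_range *bo,
                        uint64_t addr, uint64_t len)
{
    uint64_t off;

    if (len == 0)
        return 0;
    if (addr < bo->start)
        return -EFAULT;
    off = addr - bo->start;
    if (off >= bo->size || len > bo->size - off)
        return -EFAULT;
    return 0;
}
```

With a buffer of 0x400000 bytes starting at 0x1000, an 8-byte access ending exactly at the boundary passes, while one that starts 4 bytes before the boundary is rejected with -EFAULT before any hardware would be touched.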
Apart from that, I completely agree that we need to sort out any other
RAS event first to make sure that the system is not simply failing randomly.
Regards,
Christian.
On 15.04.20 at 11:49, Kim, Jonathan wrote:
>
> [AMD Public Use]
>
> Hi Christian,
>
> That could potentially be it. With additional testing, 2 of 3 Vega20
> machines never hit an error over BAR access with the PTRACE test. 3 of 3
> machines (from the same pool) always hit an error with CWSR.
>
> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk
> allocated mapped memory and 2 DWORDS outside that boundary (it’s only
> about 4MB to the boundary). Then we POKE to swap the DWORD positions
> across the boundary. The RAS event on the single failing machine
> happens on the out of boundary PEEK.
>
> Felix mentioned we don’t hit errors over general HDP access, but that
> may not be true. A posted syslog from an Arcturus failure (which I did
> not test myself) shows that someone launched the rocm bandwidth test,
> hit a VM fault, and a RAS event ensued during evictions (I can point to
> the internal ticket or log snippet offline if interested). Whether the
> RAS event is triggered by BAR access or is the result of HW instability
> is beyond me, since I don’t have access to the machine.
>
> Thanks,
>
> Jon
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, April 15, 2020 4:11 AM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> Hi Jon,
>
> Also cwsr tests fail on Vega20 with or without the revert with the
> same RAS error.
>
>
> That sounds like the system/setup has a more general problem.
>
> Could it be that we are seeing RAS errors because there really is some
> hardware failure, but with the MM path we don't trigger a RAS interrupt?
>
> Thanks,
> Christian.
>
> On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing
> that’s different is that we’re no longer invalidating HDP and doing
> a copy to host in amdgpu_device_vram_access, since the function is
> still called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with the
> same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> <mailto:Felix.Kuehling at amd.com>
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>
> <mailto:Jonathan.Kim at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>
> <mailto:Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>
> <mailto:Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
> in amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. A revert is the usual practice when
> there is a serious regression that isn't fully understood or
> root-caused. As far as I can tell, the problem has been reproduced
> on multiple systems, different GPUs, and clearly regressed to
> Christian's commit. I think that justifies reverting it for now.
>
> I agree with Christian that a general HDP memory access problem
> causing RAS errors would potentially cause problems in other tests
> as well. For example, common operations like GART table updates,
> GPUVM page table updates, and PCIe peer2peer accesses in ROCm
> applications use HDP. But we're not seeing obvious problems from
> those. So we need to understand what's special about this test. I
> asked questions to that effect on our other email thread.
>
> Regards,
> Felix
>
> On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> I think it’s premature to push this revert.
>
> With more testing, I’m getting failures from different tests
> or sometimes none at all on my machine.
>
> Kent, let’s continue the discussion on the original thread.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:47 AM
> *To:* Deucher, Alexander <Alexander.Deucher at amd.com>
> <mailto:Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>
> <mailto:Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Kuehling, Felix
> <Felix.Kuehling at amd.com> <mailto:Felix.Kuehling at amd.com>; Kim,
> Jonathan <Jonathan.Kim at amd.com> <mailto:Jonathan.Kim at amd.com>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> That's exactly my concern as well.
>
> This looks a bit like the test creates erroneous data somehow,
> but there doesn't seem to be a RAS check in the MM data path.
>
> And now that we use the BAR path it goes up in flames.
>
> I just don't see how we can create erroneous data in a test case?
>
> Christian.
>
> On 14.04.2020 at 16:35, "Deucher, Alexander"
> <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>> wrote:
>
> [AMD Public Use]
>
> If this causes an issue, any access to vram via the BAR
> could cause an issue.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org
> <mailto:amd-gfx-bounces at lists.freedesktop.org>> on behalf
> of Russell, Kent <Kent.Russell at amd.com
> <mailto:Kent.Russell at amd.com>>
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian <Christian.Koenig at amd.com
> <mailto:Christian.Koenig at amd.com>>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> <amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>>
> *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com
> <mailto:Felix.Kuehling at amd.com>>; Kim, Jonathan
> <Jonathan.Kim at amd.com <mailto:Jonathan.Kim at amd.com>>
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we get
> the dmesg output below, and then the kernel ends up
> hanging. I don't know enough about the test itself to know
> why this is occurring, but Jon Kim and Felix were
> discussing it on a separate thread when the issue was
> first reported, so they can hopefully provide some
> additional information.
>
> Kent
>
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken at gmail.com
> <mailto:ckoenig.leichtzumerken at gmail.com>>
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent <Kent.Russell at amd.com
> <mailto:Kent.Russell at amd.com>>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in
> > amdgpu_device_vram_access v2"
> >
> > On 13.04.20 at 20:20, Kent Russell wrote:
> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > Arcturus
> > >
> > > dmesg output at hang time:
> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > amdgpu 0000:67:00.0: GPU reset begin!
> > > Evicting PASID 0x8000 queues
> > > Started evicting pasid 0x8000
> > > qcm fence wait loop timeout expired
> > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > > Failed to evict process queues
> > > Failed to suspend process 0x8000
> > > Finished evicting pasid 0x8000
> > > Started restoring pasid 0x8000
> > > Finished restoring pasid 0x8000
> > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
> >
> > Do you have more information on what's going wrong here? This is a
> > really important patch for KFD debugging.
> >
> > >
> > > Signed-off-by: Kent Russell <kent.russell at amd.com
> <mailto:kent.russell at amd.com>>
> >
> > Reviewed-by: Christian König <christian.koenig at amd.com
> <mailto:christian.koenig at amd.com>>
> >
> > > ---
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
> ----------------------
> > > 1 file changed, 26 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index cf5d6e585634..a3f997f84020 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> > > uint32_t hi = ~0;
> > > uint64_t last;
> > >
> > > -
> > > -#ifdef CONFIG_64BIT
> > > - last = min(pos + size, adev->gmc.visible_vram_size);
> > > - if (last > pos) {
> > > - void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > - size_t count = last - pos;
> > > -
> > > - if (write) {
> > > - memcpy_toio(addr, buf, count);
> > > - mb();
> > > - amdgpu_asic_flush_hdp(adev, NULL);
> > > - } else {
> > > - amdgpu_asic_invalidate_hdp(adev, NULL);
> > > - mb();
> > > - memcpy_fromio(buf, addr, count);
> > > - }
> > > -
> > > - if (count == size)
> > > - return;
> > > -
> > > - pos += count;
> > > - buf += count / 4;
> > > - size -= count;
> > > - }
> > > -#endif
> > > -
> > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > > for (last = pos + size; pos < last; pos += 4) {
> > > uint32_t tmp = pos >> 31;
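The arithmetic in the removed hunk above splits one request into a part served through the CPU-visible BAR aperture and a remainder that falls through to the MM_INDEX/MM_DATA loop. A minimal userspace model of just that split, assuming nothing about the hardware, might look like the following. The names struct vram_split and split_vram_access are hypothetical; only the min(pos + size, visible_vram_size) and count = last - pos logic is taken from the diff.

```c
#include <stdint.h>

/* Hypothetical result type: how many bytes each path would handle. */
struct vram_split {
    uint64_t bar_bytes;  /* served via the CPU-visible BAR aperture */
    uint64_t mmio_bytes; /* left for the MM_INDEX/MM_DATA register path */
};

/*
 * Model of the split in the removed hunk:
 *   last  = min(pos + size, visible_vram_size);
 *   count = last - pos;   (only when last > pos)
 * No hardware access happens here; this only does the bookkeeping.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, size };
    uint64_t end = pos + size;
    uint64_t last = end < visible_vram_size ? end : visible_vram_size;

    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mmio_bytes = size - s.bar_bytes;
    }
    return s;
}
```

For example, with 256MB of visible VRAM, a 16-byte access starting 8 bytes below the visible limit is split into 8 bytes over the BAR and 8 bytes over MMIO, which is the situation a test straddling the visible/invisible boundary would exercise.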
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0
>