[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
Christian König
christian.koenig at amd.com
Wed Apr 15 08:11:11 UTC 2020
Hi Jon,
> Also cwsr tests fail on Vega20 with or without the revert with the
> same RAS error.
That sounds like the system/setup has a more general problem.
Could it be that we are seeing RAS errors because there really is some
hardware failure, but with the MM path we don't trigger a RAS interrupt?
Thanks,
Christian.
On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing that’s
> different is we’re not invalidating HDP and doing a copy to host
> anymore in amdgpu_device_vram_access since the function is still
> called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with the
> same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. Revert is a usual practice when there is
> a serious regression that isn't fully understood or root-caused. As
> far as I can tell, the problem has been reproduced on multiple
> systems, different GPUs, and clearly regressed to Christian's commit.
> I think that justifies reverting it for now.
>
> I agree with Christian that a general HDP memory access problem
> causing RAS errors would potentially cause problems in other tests as
> well. For example, common operations like GART table updates, GPUVM
> page table updates, and PCIe peer2peer accesses in ROCm applications
> all use HDP, but we're not seeing obvious problems from those. So we
> need to understand what's special about this test. I asked questions
> to that effect on our other email thread.
>
> Regards,
> Felix
>
> On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> I think it’s premature to push this revert.
>
> With more testing, I’m getting failures from different tests or
> sometimes none at all on my machine.
>
> Kent, let’s continue the discussion on the original thread.
>
> Thanks,
>
> Jon
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:47 AM
> *To:* Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>;
> amd-gfx at lists.freedesktop.org; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
> in amdgpu_device_vram_access v2"
>
> That's exactly my concern as well.
>
> This looks a bit like the test creates erroneous data somehow, but
> there doesn't seem to be a RAS check in the MM data path.
>
> And now that we use the BAR path it goes up in flames.
>
> I just don't see how we can create erroneous data in a test case?
>
> Christian.
>
> On 14.04.2020 at 16:35, "Deucher, Alexander"
> <Alexander.Deucher at amd.com> wrote:
>
> [AMD Public Use]
>
> If this causes an issue, any access to vram via the BAR could
> cause an issue.
>
> Alex
>
> ------------------------------------------------------------------------
>
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> Russell, Kent <Kent.Russell at amd.com>
> *Sent:* Tuesday, April 14, 2020 10:19 AM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>;
> amd-gfx at lists.freedesktop.org
> *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>
> *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
> possible in amdgpu_device_vram_access v2"
>
> [AMD Official Use Only - Internal Distribution Only]
>
> On VG20 or MI100, as soon as we run the subtest, we get the
> dmesg output below, and then the kernel ends up hanging. I
> don't know enough about the test itself to know why this is
> occurring, but Jon Kim and Felix were discussing it on a
> separate thread when the issue was first reported, so they can
> hopefully provide some additional information.
>
> Kent
>
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken at gmail.com>
> > Sent: Tuesday, April 14, 2020 9:52 AM
> > To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> > amdgpu_device_vram_access v2"
> >
> > > On 13.04.20 at 20:20, Kent Russell wrote:
> > > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > > > The original patch causes a RAS event and subsequent kernel hard-hang
> > > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > > > Arcturus
> > > >
> > > > dmesg output at hang time:
> > > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > > > amdgpu 0000:67:00.0: GPU reset begin!
> > > > Evicting PASID 0x8000 queues
> > > > Started evicting pasid 0x8000
> > > > qcm fence wait loop timeout expired
> > > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > > > Failed to evict process queues
> > > > Failed to suspend process 0x8000
> > > > Finished evicting pasid 0x8000
> > > > Started restoring pasid 0x8000
> > > > Finished restoring pasid 0x8000
> > > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > > > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
> >
> > > Do you have more information on what's going wrong here? This is a
> > > really important patch for KFD debugging.
> >
> > >
> > > > Signed-off-by: Kent Russell <kent.russell at amd.com>
> > >
> > > Reviewed-by: Christian König <christian.koenig at amd.com>
> >
> > > ---
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> > > > 1 file changed, 26 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index cf5d6e585634..a3f997f84020 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> > > >  	uint32_t hi = ~0;
> > > >  	uint64_t last;
> > > >
> > > > -
> > > > -#ifdef CONFIG_64BIT
> > > > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > > > -	if (last > pos) {
> > > > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > > > -		size_t count = last - pos;
> > > > -
> > > > -		if (write) {
> > > > -			memcpy_toio(addr, buf, count);
> > > > -			mb();
> > > > -			amdgpu_asic_flush_hdp(adev, NULL);
> > > > -		} else {
> > > > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > > > -			mb();
> > > > -			memcpy_fromio(buf, addr, count);
> > > > -		}
> > > > -
> > > > -		if (count == size)
> > > > -			return;
> > > > -
> > > > -		pos += count;
> > > > -		buf += count / 4;
> > > > -		size -= count;
> > > > -	}
> > > > -#endif
> > > > -
> > > >  	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> > > >  	for (last = pos + size; pos < last; pos += 4) {
> > > >  		uint32_t tmp = pos >> 31;
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
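For readers following the thread, the range split that the reverted hunk performed can be sketched in plain user-space C. This is only an illustration under stated assumptions, not driver code: `bar_bytes` and `indexed_bytes` are hypothetical helper names, and `visible_size` stands in for `adev->gmc.visible_vram_size`. The part of the access below the visible limit would go through the BAR aperture, and the remainder would fall through to the indexed MM loop that the revert leaves as the only path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of amdgpu: returns how many bytes of the
 * range [pos, pos + size) lie below visible_size and could therefore be
 * copied through the CPU-visible BAR aperture.
 */
static uint64_t bar_bytes(uint64_t pos, uint64_t size, uint64_t visible_size)
{
	uint64_t last;

	/* This sketch does not handle a range that wraps around 2^64. */
	assert(size <= UINT64_MAX - pos);

	/* Same clamp as min(pos + size, visible_vram_size) in the hunk. */
	last = pos + size;
	if (last > visible_size)
		last = visible_size;

	return last > pos ? last - pos : 0;
}

/* Hypothetical helper: bytes left over for the indexed MM loop. */
static uint64_t indexed_bytes(uint64_t pos, uint64_t size, uint64_t visible_size)
{
	return size - bar_bytes(pos, size, visible_size);
}
```

With a 256 MiB visible aperture, a 4096-byte access that starts 1024 bytes below the limit would be split into 1024 bytes via the BAR and 3072 bytes via the indexed path. An access that lies entirely above the limit yields zero BAR bytes and is handled as it is after the revert.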