[PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

Christian König christian.koenig at amd.com
Wed Apr 15 08:11:11 UTC 2020


Hi Jon,

> Also cwsr tests fail on Vega20 with or without the revert with the 
> same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some 
hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.
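
For reference, the range split done by the reverted hunk (quoted in full further down) can be modelled in user space roughly as follows. This is only a sketch under assumptions: `struct vram_split` and `split_vram_access()` are hypothetical names that do not exist in the driver, and the model only computes how many bytes would go through the BAR and how many would be left for the MM_INDEX/MM_DATA loop. It does not touch hardware or HDP.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel implementation. */
struct vram_split {
	uint64_t bar_bytes; /* bytes inside the CPU-visible VRAM window */
	uint64_t mm_bytes;  /* bytes left for the MM_INDEX/MM_DATA loop */
};

/*
 * Mirrors the arithmetic of the reverted hunk:
 *   last = min(pos + size, visible_vram_size);
 *   if (last > pos) count = last - pos;
 * Everything below `last` would be copied through the BAR, the
 * remainder falls through to the register-based path.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
					   uint64_t visible_vram_size)
{
	struct vram_split s = { 0, size };
	uint64_t end = pos + size;
	uint64_t last = end < visible_vram_size ? end : visible_vram_size;

	if (last > pos) {
		s.bar_bytes = last - pos;
		s.mm_bytes = size - s.bar_bytes;
	}
	return s;
}
```

With a 256-byte visible window, an access at offset 240 of length 32 would be split into 16 bytes via the BAR and 16 bytes via the MM path, while an access that starts beyond the window stays entirely on the MM path.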

On 14.04.20 at 22:30, Kim, Jonathan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> If we’re passing the test on the revert, then the only thing that’s 
> different is we’re not invalidating HDP and doing a copy to host 
> anymore in amdgpu_device_vram_access since the function is still 
> called in ttm access_memory with BAR.
>
> Also cwsr tests fail on Vega20 with or without the revert with the 
> same RAS error.
>
> Thanks,
>
> Jon
>
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Tuesday, April 14, 2020 2:32 PM
> *To:* Kim, Jonathan <Jonathan.Kim at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in 
> amdgpu_device_vram_access v2"
>
> I wouldn't call it premature. Revert is a usual practice when there is 
> a serious regression that isn't fully understood or root-caused. As 
> far as I can tell, the problem has been reproduced on multiple 
> systems, different GPUs, and clearly regressed to Christian's commit. 
> I think that justifies reverting it for now.
>
> I agree with Christian that a general HDP memory access problem 
> causing RAS errors would potentially cause problems in other tests as 
> well. For example common operations like GART table updates, and GPUVM 
> page table updates and PCIe peer2peer accesses in ROCm applications 
> use HDP. But we're not seeing obvious problems from those. So we need 
> to understand what's special about this test. I asked questions to 
> that effect on our other email thread.
>
> Regards,
>   Felix
>
> On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     I think it’s premature to push this revert.
>
>     With more testing, I’m getting failures from different tests or
>     sometimes none at all on my machine.
>
>     Kent, let’s continue the discussion on the original thread.
>
>     Thanks,
>
>     Jon
>
>     *From:* Koenig, Christian <Christian.Koenig at amd.com>
>     *Sent:* Tuesday, April 14, 2020 10:47 AM
>     *To:* Deucher, Alexander <Alexander.Deucher at amd.com>
>     *Cc:* Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org;
>     Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
>     <Jonathan.Kim at amd.com>
>     *Subject:* Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     That's exactly my concern as well.
>
>     This looks a bit like the test creates erroneous data somehow, but
>     there doesn't seem to be a RAS check in the MM data path.
>
>     And now that we use the BAR path it goes up in flames.
>
>     I just don't see how we can create erroneous data in a test case?
>
>     Christian.
>
>     On 14.04.2020 at 16:35, "Deucher, Alexander"
>     <Alexander.Deucher at amd.com> wrote:
>
>         [AMD Public Use]
>
>         If this causes an issue, any access to vram via the BAR could
>         cause an issue.
>
>         Alex
>
>         ------------------------------------------------------------------------
>
>         *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>         Russell, Kent <Kent.Russell at amd.com>
>         *Sent:* Tuesday, April 14, 2020 10:19 AM
>         *To:* Koenig, Christian <Christian.Koenig at amd.com>;
>         amd-gfx at lists.freedesktop.org
>         *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
>         <Jonathan.Kim at amd.com>
>         *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if
>         possible in amdgpu_device_vram_access v2"
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         On VG20 or MI100, as soon as we run the subtest, we get the
>         dmesg output below, and then the kernel ends up hanging. I
>         don't know enough about the test itself to know why this is
>         occurring, but Jon Kim and Felix were discussing it on a
>         separate thread when the issue was first reported, so they can
>         hopefully provide some additional information.
>
>          Kent
>
>         > -----Original Message-----
>         > From: Christian König <ckoenig.leichtzumerken at gmail.com>
>         > Sent: Tuesday, April 14, 2020 9:52 AM
>         > To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
>         > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>         > amdgpu_device_vram_access v2"
>         >
>         > On 13.04.20 at 20:20, Kent Russell wrote:
>         > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>         > > The original patch causes a RAS event and subsequent kernel hard-hang
>         > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>         > > Arcturus
>         > >
>         > > dmesg output at hang time:
>         > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>         > > amdgpu 0000:67:00.0: GPU reset begin!
>         > > Evicting PASID 0x8000 queues
>         > > Started evicting pasid 0x8000
>         > > qcm fence wait loop timeout expired
>         > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>         > > Failed to evict process queues
>         > > Failed to suspend process 0x8000
>         > > Finished evicting pasid 0x8000
>         > > Started restoring pasid 0x8000
>         > > Finished restoring pasid 0x8000
>         > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
>         > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>         > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>         > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>         > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>         > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
>         > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
>         > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>         > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>         >
>         > Do you have more information on what's going wrong here? This is
>         > a really important patch for KFD debugging.
>         >
>         > >
>         > > Signed-off-by: Kent Russell <kent.russell at amd.com>
>         >
>         > Reviewed-by: Christian König <christian.koenig at amd.com>
>         >
>         > > ---
>         > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
>         > >   1 file changed, 26 deletions(-)
>         > >
>         > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > index cf5d6e585634..a3f997f84020 100644
>         > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>         > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>         > >      uint32_t hi = ~0;
>         > >      uint64_t last;
>         > >
>         > > -
>         > > -#ifdef CONFIG_64BIT
>         > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>         > > -   if (last > pos) {
>         > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>         > > -           size_t count = last - pos;
>         > > -
>         > > -           if (write) {
>         > > -                   memcpy_toio(addr, buf, count);
>         > > -                   mb();
>         > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>         > > -           } else {
>         > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>         > > -                   mb();
>         > > -                   memcpy_fromio(buf, addr, count);
>         > > -           }
>         > > -
>         > > -           if (count == size)
>         > > -                   return;
>         > > -
>         > > -           pos += count;
>         > > -           buf += count / 4;
>         > > -           size -= count;
>         > > -   }
>         > > -#endif
>         > > -
>         > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>         > >      for (last = pos + size; pos < last; pos += 4) {
>         > >              uint32_t tmp = pos >> 31;
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx at lists.freedesktop.org
>         https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>     Am 14.04.2020 16:35 schrieb "Deucher, Alexander"
>     <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>:
>
>     [AMD Public Use]
>
>     If this causes an issue, any access to vram via the BAR could
>     cause an issue.
>
>     Alex
>
>     ------------------------------------------------------------------------
>
>     *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>     Russell, Kent <Kent.Russell at amd.com>
>     *Sent:* Tuesday, April 14, 2020 10:19 AM
>     *To:* Koenig, Christian <Christian.Koenig at amd.com>;
>     amd-gfx at lists.freedesktop.org
>     *Cc:* Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
>     <Jonathan.Kim at amd.com>
>     *Subject:* RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible
>     in amdgpu_device_vram_access v2"
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     On VG20 or MI100, as soon as we run the subtest, we get the dmesg
>     output below, and then the kernel ends up hanging. I don't know
>     enough about the test itself to know why this is occurring, but
>     Jon Kim and Felix were discussing it on a separate thread when the
>     issue was first reported, so they can hopefully provide some
>     additional information.
>
>      Kent
>
>     > -----Original Message-----
>     > From: Christian König <ckoenig.leichtzumerken at gmail.com>
>     > Sent: Tuesday, April 14, 2020 9:52 AM
>     > To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
>     > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
>     > amdgpu_device_vram_access v2"
>     >
>     > On 13.04.20 at 20:20, Kent Russell wrote:
>     > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
>     > > The original patch causes a RAS event and subsequent kernel hard-hang
>     > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
>     > > Arcturus
>     > >
>     > > dmesg output at hang time:
>     > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
>     > > amdgpu 0000:67:00.0: GPU reset begin!
>     > > Evicting PASID 0x8000 queues
>     > > Started evicting pasid 0x8000
>     > > qcm fence wait loop timeout expired
>     > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>     > > Failed to evict process queues
>     > > Failed to suspend process 0x8000
>     > > Finished evicting pasid 0x8000
>     > > Started restoring pasid 0x8000
>     > > Finished restoring pasid 0x8000
>     > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
>     > > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
>     > > amdgpu: [powerplay] Failed to set soft min gfxclk !
>     > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
>     > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
>     > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
>     > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
>     > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
>     > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>     >
>     > Do you have more information on what's going wrong here? This is a
>     > really important patch for KFD debugging.
>     >
>     > >
>     > > Signed-off-by: Kent Russell <kent.russell at amd.com>
>     >
>     > Reviewed-by: Christian König <christian.koenig at amd.com>
>     >
>     > > ---
>     > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
>     > >  1 file changed, 26 deletions(-)
>     > >
>     > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > index cf5d6e585634..a3f997f84020 100644
>     > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>     > >      uint32_t hi = ~0;
>     > >      uint64_t last;
>     > >
>     > > -
>     > > -#ifdef CONFIG_64BIT
>     > > -   last = min(pos + size, adev->gmc.visible_vram_size);
>     > > -   if (last > pos) {
>     > > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
>     > > -           size_t count = last - pos;
>     > > -
>     > > -           if (write) {
>     > > -                   memcpy_toio(addr, buf, count);
>     > > -                   mb();
>     > > -                   amdgpu_asic_flush_hdp(adev, NULL);
>     > > -           } else {
>     > > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
>     > > -                   mb();
>     > > -                   memcpy_fromio(buf, addr, count);
>     > > -           }
>     > > -
>     > > -           if (count == size)
>     > > -                   return;
>     > > -
>     > > -           pos += count;
>     > > -           buf += count / 4;
>     > > -           size -= count;
>     > > -   }
>     > > -#endif
>     > > -
>     > >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>     > >      for (last = pos + size; pos < last; pos += 4) {
>     > >              uint32_t tmp = pos >> 31;
>
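
For readers following the quoted diff: the reverted hunk splits a VRAM access into a part served through the CPU-visible BAR and a remainder left to the register-based loop under mmio_idx_lock. Below is a minimal user-space sketch of that range arithmetic only. The names split_vram_access and struct vram_split are hypothetical and exist purely for illustration; they are not amdgpu functions, and the sketch performs no HDP flush/invalidate and no hardware access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type, not part of the amdgpu driver. */
struct vram_split {
	uint64_t bar_bytes;  /* bytes that fall inside the CPU-visible window */
	uint64_t mmio_bytes; /* bytes left for the register-based loop */
	uint64_t mmio_pos;   /* VRAM offset at which that loop would start */
};

/*
 * Mirrors the arithmetic of the removed hunk:
 *   last = min(pos + size, visible_vram_size);
 *   if (last > pos) { count = last - pos; ... pos += count; size -= count; }
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
					   uint64_t visible_vram_size)
{
	/* Default: nothing is visible, everything goes through the loop. */
	struct vram_split s = { 0, size, pos };
	uint64_t end, last;

	assert(size <= UINT64_MAX - pos); /* sketch assumes no wrap-around */

	end = pos + size;
	last = end < visible_vram_size ? end : visible_vram_size;

	if (last > pos) {
		s.bar_bytes = last - pos;
		s.mmio_bytes = size - s.bar_bytes;
		s.mmio_pos = pos + s.bar_bytes;
	}
	return s;
}
```

With the revert applied, the driver behaves as if bar_bytes were always zero: the whole range is handled by the loop that follows spin_lock_irqsave() in the diff above.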
