[PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Mon Oct 21 10:08:09 UTC 2024

Hi Chong,

Andjelkovic just shared a bunch of ROCm traces with me on Teams, which 
I analyzed.

Once you know what to look for, it's actually pretty obvious what's 
going on. Just compare the timestamp of the fault with the timestamp of 
the operation that maps something at the given address.

When an address is mapped only after it has been accessed, there is 
clearly something wrong in the code that coordinates the two, and in 
this case that is the ROCm stress test tool.
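
To illustrate the check, here is a minimal sketch in C. The record 
layout, the names (trace_event, find_access_before_map) and the 
nanosecond timestamps are assumptions made up for this example; they are 
not the actual trace format or any existing tool. It only shows the 
idea of flagging a fault that no earlier mapping covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified trace record (not the real trace format). */
enum trace_kind { TRACE_MAP, TRACE_FAULT };

struct trace_event {
	enum trace_kind kind;
	uint64_t timestamp_ns; /* when the event was logged */
	uint64_t addr;         /* MAP: start of the range, FAULT: faulting address */
	uint64_t size;         /* MAP: length in bytes, ignored for FAULT */
};

/* True if mapping m covers addr and was logged no later than time t. */
static int mapped_by(const struct trace_event *m, uint64_t addr, uint64_t t)
{
	return m->kind == TRACE_MAP && m->timestamp_ns <= t &&
	       addr >= m->addr && addr - m->addr < m->size;
}

/*
 * Return the index of the first fault that no earlier mapping covers,
 * or -1 if every fault hits an address that was already mapped.
 * Unmap events are ignored to keep the sketch short.
 */
long find_access_before_map(const struct trace_event *ev, size_t n)
{
	assert(ev != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		size_t j;

		if (ev[i].kind != TRACE_FAULT)
			continue;
		for (j = 0; j < n; j++)
			if (mapped_by(&ev[j], ev[i].addr, ev[i].timestamp_ns))
				break;
		if (j == n)
			return (long)i;
	}
	return -1;
}
```

A non-negative return value points at a fault whose address only shows 
up in a mapping later in the log, or not at all.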

Regards,
Christian.

On 21.10.24 at 11:02, Li, Chong(Alan) wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> Hi, Christian, Raina, Yera.
>
> If this issue is in ROCm, I need to assign my ticket SWDEV-459983 
> <https://ontrack-internal.amd.com/browse/SWDEV-459983> to the ROCm team.
>
> Is there anything I can share with the ROCm PM, such as the email or 
> chat history, or the ticket you discussed with Andjelkovic?
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Monday, October 21, 2024 4:00 PM
> *To:* Li, Chong(Alan) <Chong.Li at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* cao, lin <lin.cao at amd.com>
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode 
> to sync mode.
>
> On 21.10.24 at 07:56, Chong Li wrote:
>
>     change the gpu tlb flush mode to sync mode to
>     solve the issue in the rocm stress test.
>
>
> And again, a complete NAK to this.
>
> I've already proven together with Andjelkovic that the problem is that 
> the ROCm stress test is broken.
>
> The test tries to access memory before it is properly mapped, and that 
> is provable by looking at the trace logs.
>
> Regards,
> Christian.
>
>
>     Signed-off-by: Chong Li <chongli2 at amd.com>
>     ---
>       drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 4 ++--
>       1 file changed, 2 insertions(+), 2 deletions(-)
>
>     diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>     index 51cddfa3f1e8..4d9ff7b31618 100644
>     --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>     +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>     @@ -98,7 +98,6 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>        f->adev = adev;
>        f->dependency = *fence;
>        f->pasid = vm->pasid;
>     - INIT_WORK(&f->work, amdgpu_tlb_fence_work);
>        spin_lock_init(&f->lock);
>
>        dma_fence_init(&f->base, &amdgpu_tlb_fence_ops, &f->lock,
>     @@ -106,7 +105,8 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>
>        /* TODO: We probably need a separate wq here */
>        dma_fence_get(&f->base);
>     - schedule_work(&f->work);
>
>        *fence = &f->base;
>     +
>     + amdgpu_tlb_fence_work(&f->work);
>       }
>