[PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Fri Oct 25 06:46:27 UTC 2024

[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian.

The log file is too large to paste into the email.

I copied it to the directory “\\ark\incoming\chong\log”; the file name is “kern.log”.

Can you access this directory?





Thanks,
Chong.


From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Thursday, October 24, 2024 7:22 PM
To: Li, Chong(Alan) <Chong.Li at amd.com>; Andjelkovic, Dejan <Dejan.Andjelkovic at amd.com>
Cc: cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine) <Tiantian.Zhang at amd.com>; Raina, Yera <Yera.Raina at amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Do you have the full log as text file? As image it's pretty much useless.

Regards,
Christian.
On 24.10.24 at 09:41, Li, Chong(Alan) wrote:

Hi, Christian.

As the dmesg log shows, the page fault still happens after the PTEs for address “7ef90be00” have already been updated.


[cid:image001.png at 01DB26CB.DBEED2C0]

[cid:image002.png at 01DB26CB.DBEED2C0]



Thanks,
Chong.

From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Wednesday, October 23, 2024 5:26 PM
To: Li, Chong(Alan) <Chong.Li at amd.com>; Andjelkovic, Dejan <Dejan.Andjelkovic at amd.com>
Cc: cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine) <Tiantian.Zhang at amd.com>; Raina, Yera <Yera.Raina at amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Hi Chong,

oh that could indeed be.

I suggest adding a trace point for the page fault so that we can guarantee that both events use the same time base.

That should make it trivial to compare them.

Regards,
Christian.
On 23.10.24 at 10:17, Li, Chong(Alan) wrote:

Hi, Christian.

I added a log line in the kernel, and it shows that the timestamps in the tracing log are later than those in the dmesg log for the same event,
so we can’t conclude that the issue is in ROCm.


------------------------ the information I synced with Andjelkovic, Dejan ----------------------------------------
dmesg shows that the page fault happens at address “0x000072e5f4401000” at time “6587.772178”,

[cid:image003.png at 01DB26CB.DBEED2C0]

the tracing log shows that the function “amdgpu_vm_update_ptes” is called at time “6587.790869”,
[cid:image004.png at 01DB26CB.DBEED2C0]
------------------------ the information I synced with Andjelkovic, Dejan ----------------------------------------


From the log timestamps, you concluded that “The test tries to access memory before it is probably mapped and that is provable by looking into the tracelogs.”.

But after reviewing the code, I see that the function “amdgpu_vm_ptes_update” is called within the function “svm_range_set_attr”.

So by the time dmesg prints the line “[ 6587.772136] amdgpu: pasid 0x8002 svms 0x000000008b03ff39 [0x72e5f4400 0x72e5fc3ff] done, r=0” shown above,
the function “svm_range_set_attr” is about to return and “amdgpu_vm_ptes_update” has already been called, so the timestamp in the tracing log is not plausible.

I suspected that the timestamps in the tracing log are delayed, so I added a log line in the kernel to verify this:

[cid:image005.png at 01DB26CB.DBEED2C0]

The result is below:
the tracing log shows the address “ffffffc00” at time “227.298607”,
while the dmesg log prints the address “ffffffc00” at time “226.756137”.


tracing log:
[cid:image006.png at 01DB26CB.DBEED2C0]

dmesg log:
[cid:image007.png at 01DB26CB.DBEED2C0]








Thanks,
Chong.

From: Li, Chong(Alan)
Sent: Monday, October 21, 2024 6:38 PM
To: Koenig, Christian <Christian.Koenig at amd.com>; Raina, Yera <Yera.Raina at amd.com>; Andjelkovic, Dejan <Dejan.Andjelkovic at amd.com>
Cc: cao, lin <lin.cao at amd.com>; Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Zhang, Tiantian (Celine) <Tiantian.Zhang at amd.com>
Subject: RE: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Hi, Christian.
Thanks for your reply.
Do you have any advice about this issue?


Hi, Raina, Yera.
Shall I assign this ticket SWDEV-459983 <https://ontrack-internal.amd.com/browse/SWDEV-459983> to the ROCm team?



Thanks,
Chong.

From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Monday, October 21, 2024 6:08 PM
To: Li, Chong(Alan) <Chong.Li at amd.com>; Raina, Yera <Yera.Raina at amd.com>
Cc: cao, lin <lin.cao at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

Hi Chong,

Andjelkovic just shared a bunch of traces from rocm on teams with me which I analyzed.

When you know what you look for it's actually pretty obvious what's going on. Just look at the timestamp of the fault and compare that with the timestamp of the operation mapping something at the given address.

When mapping an address happens only after accessing an address then there is clearly something wrong in the code which coordinates this and that is the ROCm stress test tool in this case.

Regards,
Christian.
On 21.10.24 at 11:02, Li, Chong(Alan) wrote:

Hi, Christian, Raina, Yera.

If this issue is in ROCm, I need to assign my ticket SWDEV-459983 <https://ontrack-internal.amd.com/browse/SWDEV-459983> to the ROCm team.

Is there anything I can share with the ROCm PM,
such as the email, chat history, or ticket from your discussion with Andjelkovic?

Thanks,
Chong.

From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Monday, October 21, 2024 4:00 PM
To: Li, Chong(Alan) <Chong.Li at amd.com>; amd-gfx at lists.freedesktop.org
Cc: cao, lin <lin.cao at amd.com>
Subject: Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

On 21.10.24 at 07:56, Chong Li wrote:

change the gpu tlb flush mode to sync mode to
solve the issue in the rocm stress test.

And again complete NAK to this.

I've already proven together with Andjelkovic that the problem is that the rocm stress test is broken.

The test tries to access memory before it is probably mapped and that is provable by looking into the tracelogs.

Regards,
Christian.









Signed-off-by: Chong Li <chongli2 at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
index 51cddfa3f1e8..4d9ff7b31618 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
@@ -98,7 +98,6 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
  f->adev = adev;
  f->dependency = *fence;
  f->pasid = vm->pasid;
- INIT_WORK(&f->work, amdgpu_tlb_fence_work);
  spin_lock_init(&f->lock);
 
  dma_fence_init(&f->base, &amdgpu_tlb_fence_ops, &f->lock,
@@ -106,7 +105,8 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
 
  /* TODO: We probably need a separate wq here */
  dma_fence_get(&f->base);
- schedule_work(&f->work);
 
  *fence = &f->base;
+
+ amdgpu_tlb_fence_work(&f->work);
 }




-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 83231 bytes
Desc: image001.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0007.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 74007 bytes
Desc: image002.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0008.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 34357 bytes
Desc: image003.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0009.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 7556 bytes
Desc: image004.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0010.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image005.png
Type: image/png
Size: 15396 bytes
Desc: image005.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0011.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image006.png
Type: image/png
Size: 37011 bytes
Desc: image006.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0012.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image007.png
Type: image/png
Size: 24298 bytes
Desc: image007.png
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20241025/a1134be6/attachment-0013.png>