<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2021-08-04 5:01 a.m., Christian
      König wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:920b7b04-8f8d-a57a-724e-811a4c7e581c@gmail.com">On
      03.08.21 at 00:33, Philip Yang wrote:
      <br>
      <blockquote type="cite">Take the vm->invalidated_lock spinlock
        when removing VM PD and PT BOs, to avoid
        <br>
        linked list corruption with this crash backtrace:
        <br>
        <br>
        [ 2290.505111] list_del corruption. next->prev should be
        <br>
          ffff9b2514ec0018, but was 4e03280211010f04
        <br>
        [ 2290.505154] kernel BUG at lib/list_debug.c:56!
        <br>
        [ 2290.505176] invalid opcode: 0000 [#1] SMP NOPTI
        <br>
        [ 2290.505254] Workqueue: events delayed_fput
        <br>
        [ 2290.505271] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
        <br>
        [ 2290.505513] Call Trace:
        <br>
        [ 2290.505623]  amdgpu_vm_free_table+0x26/0x80 [amdgpu]
        <br>
        [ 2290.505705]  amdgpu_vm_free_pts+0x7a/0xf0 [amdgpu]
        <br>
        [ 2290.505786]  amdgpu_vm_fini+0x1f0/0x440 [amdgpu]
        <br>
        [ 2290.505864]  amdgpu_driver_postclose_kms+0x172/0x290 [amdgpu]
        <br>
        [ 2290.505893]  drm_file_free.part.10+0x1d4/0x270 [drm]
        <br>
        [ 2290.505916]  drm_release+0xa9/0xe0 [drm]
        <br>
        [ 2290.505930]  __fput+0xb7/0x230
        <br>
        [ 2290.505942]  delayed_fput+0x1c/0x30
        <br>
        [ 2290.505957]  process_one_work+0x1a7/0x360
        <br>
        [ 2290.505971]  worker_thread+0x30/0x390
        <br>
        [ 2290.505985]  ? create_worker+0x1a0/0x1a0
        <br>
        [ 2290.505999]  kthread+0x112/0x130
        <br>
        [ 2290.506011]  ? kthread_flush_work_fn+0x10/0x10
        <br>
        [ 2290.506027]  ret_from_fork+0x22/0x40
        <br>
      </blockquote>
      <br>
      Wow, well this is a big NAK.
      <br>
      <br>
      The page tables should never ever be on the invalidation list;
      otherwise we would try to point PTEs to them, which is a huge
      security issue.
      <br>
      <br>
      Taking the lock just works around that. Can you investigate how
      a page table ends up on that list?
      <br>
    </blockquote>
    <p>I will be off for 3 days, so I can investigate this further next
      Monday. From debugging, amdgpu_vm_bo_invalidate is called many
      times. The background: the app has 8 processes running on 8 GPUs,
      with 64 VMs, and VRAM usage is around 85% according to rocm-info.
      There is a 1 in 5 chance that this happens (kernel BUG, the server
      dies) after the application crashes with a segmentation fault
      (a separate app issue). It is a new issue on ROCm release 4.3 and
      never happened on ROCm 4.1 (there was no app crash on 4.1 either).
      The kernel version is 4.18 on CentOS.<br>
    </p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:920b7b04-8f8d-a57a-724e-811a4c7e581c@gmail.com">
      <br>
      Thanks in advance,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">
        <br>
        Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com">&lt;Philip.Yang@amd.com&gt;</a>
        <br>
        ---
        <br>
          drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 ++
        <br>
          1 file changed, 2 insertions(+)
        <br>
        <br>
        diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
        b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
        <br>
        index 2a88ed5d983b..5c4c355e7d6b 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
        <br>
        @@ -1045,7 +1045,9 @@ static void amdgpu_vm_free_table(struct
        amdgpu_vm_bo_base *entry)
        <br>
                  return;
        <br>
              shadow = amdgpu_bo_shadowed(entry->bo);
        <br>
              entry->bo->vm_bo = NULL;
        <br>
        +    spin_lock(&entry->vm->invalidated_lock);
        <br>
              list_del(&entry->vm_status);
        <br>
        +    spin_unlock(&entry->vm->invalidated_lock);
        <br>
              amdgpu_bo_unref(&shadow);
        <br>
              amdgpu_bo_unref(&entry->bo);
        <br>
          }
        <br>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>