<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2023-12-19 3:10, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com">Am
      15.12.23 um 16:19 schrieb Felix Kuehling:
      <br>
      <blockquote type="cite">
        <br>
        On 2023-12-15 07:30, Christian König wrote:
        <br>
        <blockquote type="cite">
          <blockquote type="cite">@@ -1425,11 +1451,21 @@ int
            amdgpu_vm_handle_moved(struct amdgpu_device *adev,
            <br>
                      }
            <br>
                        r = amdgpu_vm_bo_update(adev, bo_va, clear);
            <br>
            -        if (r)
            <br>
            -            return r;
            <br>
                        if (unlock)
            <br>
                          dma_resv_unlock(resv);
            <br>
            +        if (r)
            <br>
            +            return r;
            <br>
            +
            <br>
            +        /* Remember evicted DMABuf imports in compute VMs
            for later
            <br>
            +         * validation
            <br>
            +         */
            <br>
            +        if (vm->is_compute_context &&
            bo_va->base.bo &&
            <br>
            +            bo_va->base.bo->tbo.base.import_attach
            &&
            <br>
            +            (!bo_va->base.bo->tbo.resource ||
            <br>
            +            
            bo_va->base.bo->tbo.resource->mem_type ==
            TTM_PL_SYSTEM))
            <br>
            +            amdgpu_vm_bo_evicted(&bo_va->base);
            <br>
            +
            <br>
          </blockquote>
          <br>
          The change looks mostly good now. Just one thing which worries
          me is that when GFX and compute is mixed in the same VM this
          here might cause problems when we run into an error during
          command submission.
          <br>
          <br>
          E.g. GFX validates the VM BOs, but then the IOCTL fails before
          calling amdgpu_vm_handle_moved().
          <br>
          <br>
          In this case the DMA-buf wouldn't be validated any more.
          <br>
        </blockquote>
        <br>
        This code path shouldn't be relevant for command submission, but
        for the amdgpu_vm_handle_moved call in
        amdgpu_dma_buf_move_notify. That's where the BO is first found
        to be evicted and its PTEs invalidated. That's where we need to
        remember it so it can be validated in the next call to
        amdgpu_vm_validate.
        <br>
        <br>
        Currently the amdgpu_cs code path calls amdgpu_vm_validate with
        ticket=NULL, so it won't validate these imports. The only place
        that auto-validates evicted imports is
        amdgpu_amdkfd_restore_process_bos. So none of this should affect
        amdgpu_cs command submission.
        <br>
      </blockquote>
      <br>
      Yeah, but ticket=NULL will result in removing those imports from
      the validation list.</blockquote>
    <p>I have a comment for that in amdgpu_vm_validate:</p>
    <pre>                        if (!ticket) {
                                /* We need to move the BO out of the evicted
                                 * list to avoid an infinite loop. It will be
                                 * moved back to evicted in the next
                                 * amdgpu_vm_handle_moved.
                                 */
                                amdgpu_vm_bo_invalidated(bo_base);
                                spin_lock(&vm->status_lock);
                                continue;
                        }
</pre>
    <p>The net result is that the BO is still tracked as evicted.<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com"> This
      could potentially result in not validating them should we ever mix
      GFX and Compute submissions in the same VM.
      <br>
    </blockquote>
    <p>My eviction test does exactly this, and it's working just fine.</p>
    <p>Regards,<br>
        Felix<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com">
      <br>
      For now that works, but we need to clean that up more thoughtfully
      I think.
      <br>
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">
        <br>
        <br>
        The flow for amdgpu_amdkfd_restore_process_bos is:
        <br>
        <br>
         * amdgpu_vm_validate validates the evicted dmabuf imports
        (moves the
        <br>
           bo_vas from "evicted" to "invalidated")
        <br>
         * amdgpu_vm_handle_moved iterates over invalidated bo_vas and
        updates
        <br>
           the PTEs with valid entries (moves the bo_vas to "done")
        <br>
        <br>
        <br>
        Regards,
        <br>
          Felix
        <br>
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Regards,
          <br>
          Christian. </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>