<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2023-12-20 8:58, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:db8e0eee-1f43-472f-b1dc-138240fbf4af@gmail.com">
      
      Am 19.12.23 um 23:43 schrieb Felix Kuehling:<br>
      <blockquote type="cite" cite="mid:f37e19fe-4fc9-4580-9961-ccdcf7f9b35a@amd.com"> On
        2023-12-19 3:10, Christian König wrote:
        <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com">Am
          15.12.23 um 16:19 schrieb Felix Kuehling: <br>
          <blockquote type="cite"> <br>
            On 2023-12-15 07:30, Christian König wrote: <br>
            <blockquote type="cite">
              <blockquote type="cite">@@ -1425,11 +1451,21 @@ int
                amdgpu_vm_handle_moved(struct amdgpu_device *adev, <br>
                          } <br>
                            r = amdgpu_vm_bo_update(adev, bo_va, clear);
                <br>
                -        if (r) <br>
                -            return r; <br>
                            if (unlock) <br>
                              dma_resv_unlock(resv); <br>
                +        if (r) <br>
                +            return r; <br>
                + <br>
                +        /* Remember evicted DMABuf imports in compute
                VMs for later <br>
                +         * validation <br>
                +         */ <br>
                +        if (vm->is_compute_context &&
                bo_va->base.bo && <br>
                +           
                bo_va->base.bo->tbo.base.import_attach &&
                <br>
                +            (!bo_va->base.bo->tbo.resource || <br>
                +            
                bo_va->base.bo->tbo.resource->mem_type ==
                TTM_PL_SYSTEM)) <br>
                +            amdgpu_vm_bo_evicted(&bo_va->base);
                <br>
                + <br>
              </blockquote>
              <br>
              The change looks mostly good now. Just one thing which
              worries me is that when GFX and compute is mixed in the
              same VM this here might cause problems when we run into an
              error during command submission. <br>
              <br>
              E.g. GFX validates the VM BOs, but then the IOCTL fails
              before calling amdgpu_vm_handle_moved(). <br>
              <br>
              In this case the DMA-buf wouldn't be validated any more. <br>
            </blockquote>
            <br>
            This code path shouldn't be relevant for command submission,
            but for the amdgpu_vm_handle_moved call in
            amdgpu_dma_buf_move_notify. That's where the BO is first
            found to be evicted and its PTEs invalidated. That's where
            we need to remember it so it can be validated in the next
            call to amdgpu_vm_validate. <br>
            <br>
            Currently the amdgpu_cs code path calls amdgpu_vm_validate
            with ticket=NULL, so it won't validate these imports. The
            only place that auto-validates evicted imports is
            amdgpu_amdkfd_restore_process_bos. So none of this should
            affect amdgpu_cs command submission. <br>
          </blockquote>
          <br>
          Yeah, but ticket=NULL will result in removing those imports
          from the validation list.</blockquote>
        <p>I have a comment for that in amdgpu_vm_validate:</p>
        <pre>                        if (!ticket) {
                                /* We need to move the BO out of the evicted
                                 * list to avoid an infinite loop. It will be
                                 * moved back to evicted in the next
                                 * amdgpu_vm_handle_moved.
                                 */
                                amdgpu_vm_bo_invalidated(bo_base);
                                spin_lock(&vm->status_lock);
                                continue;
                        }
</pre>
        <p>The net result is that the BO is still tracked as evicted.<br>
        </p>
      </blockquote>
      <br>
      Yeah, that's exactly what I mean:<br>
      <br>
      What happens if amdgpu_vm_validate() is called, removes the BOs
      from the evicted list, but then an error happens (or just an
      interrupted system call) and amdgpu_vm_handle_moved is never
      called?<br>
      <br>
      In this case the DMA-bufs would be on the moved list and
      amdgpu_vm_handle moved would have to be called once before we can
      validate them again.<br>
    </blockquote>
    <p>And that would be a good argument for tracking evicted user BOs
      with a separate state after all. That way we could just leave that
      list alone if ticket==NULL in amdgpu_vm_validate.</p>
    <p>Regards,<br>
        Felix</p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:db8e0eee-1f43-472f-b1dc-138240fbf4af@gmail.com"> <br>
      Regards,<br>
      Christian.<br>
      <br>
      <blockquote type="cite" cite="mid:f37e19fe-4fc9-4580-9961-ccdcf7f9b35a@amd.com">
        <p> </p>
        <p><br>
        </p>
        <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com"> This
          could potentially result in not validating them should we ever
          mix GFX and Compute submissions in the same VM. <br>
        </blockquote>
        <p>My eviction test does exactly this, and it's working just
          fine.</p>
        <p>Regards,<br>
            Felix<br>
        </p>
        <p><br>
        </p>
        <blockquote type="cite" cite="mid:346c1009-2930-4424-9cd4-589e5872877e@amd.com"> <br>
          For now that works, but we need to clean that up more
          thoughtfully I think. <br>
          <br>
          Christian. <br>
          <br>
          <blockquote type="cite"> <br>
            <br>
            The flow for amdgpu_amdkfd_restore_process_bos is: <br>
            <br>
             * amdgpu_vm_validate validates the evicted dmabuf imports
            (moves the <br>
               bo_vas from "evicted" to "invalidated") <br>
             * amdgpu_vm_handle_moved iterates over invalidated bo_vas
            and updates <br>
               the PTEs with valid entries (moves the bo_vas to "done")
            <br>
            <br>
            <br>
            Regards, <br>
              Felix <br>
            <br>
            <br>
            <blockquote type="cite"> <br>
              Regards, <br>
              Christian. </blockquote>
          </blockquote>
          <br>
        </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>