[PATCH 03/14] drm/amdgpu: Add helper to turn an existing VM into a compute VM

Felix Kuehling felix.kuehling at amd.com
Mon Mar 5 16:58:24 UTC 2018


On 2018-03-05 07:17 AM, Christian König wrote:
> Am 04.03.2018 um 03:34 schrieb Felix Kuehling:
>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 82 ++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  1 +
>>   2 files changed, 83 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 5afbc5e..58153ef 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2350,6 +2350,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>       INIT_KFIFO(vm->faults);
>>       vm->fault_credit = 16;
>>   +    vm->vm_context = vm_context;
>
> I think we should drop the vm_context parameter and all the related
> code in amdgpu_vm_init(). But that can be part of a later cleanup
> patch as well.

Yep. It will be a prerequisite for sharing VMs between graphics and
compute. But then the CPU vs. SDMA page table update would need to be
handled differently, on a per-call basis, and would probably need some
synchronization between the two update paths.

Also, we'd need to fix shadow page table handling with CPU updates, or
get rid of shadow page tables. I'm not sure why they're needed, TBH.

>
>> +
>>       return 0;
>>     error_free_root:
>> @@ -2364,6 +2366,86 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>   }
>>     /**
>> + * amdgpu_vm_make_compute - Turn a GFX VM into a compute VM
>> + *
>> + * This only works on GFX VMs that don't have any BOs added and no
>> + * page tables allocated yet.
>> + *
>> + * Changes the following VM parameters:
>> + * - vm_context
>> + * - use_cpu_for_update
>> + * - pte_supports_ats
>> + * - pasid (old PASID is released, because compute manages its own PASIDs)
>> + *
>> + * Reinitializes the page directory to reflect the changed ATS
>> + * setting. May leave behind an unused shadow BO for the page
>> + * directory when switching from SDMA updates to CPU updates.
>> + *
>> + * Returns 0 for success, -errno for errors.
>> + */
>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>> +{
>> +    bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);
>> +    int r;
>> +
>> +    r = amdgpu_bo_reserve(vm->root.base.bo, true);
>> +    if (r)
>> +        return r;
>> +
>> +    /* Sanity checks */
>> +    if (vm->vm_context == AMDGPU_VM_CONTEXT_COMPUTE) {
>> +        /* Can happen if ioctl is interrupted by a signal after
>> +         * this function already completed. Just return success.
>> +         */
>> +        r = 0;
>> +        goto error;
>> +    }
>
> Ok, that is actually a show stopper. An interrupted IOCTL should never
> have a visible effect.
>
> Is that just a theoretical issue or did you run into this in practice?

It's theoretical. init_kfd_vm in patch 4 reserves the page directory
again, validates it and waits for the validation to complete. Both the
reservation and the waiting can be interrupted by signals. This happens
after amdgpu_vm_make_compute has completed.

If we wanted to achieve "no visible effect", we'd need to roll back the
conversion to a compute VM, which may involve more waiting.

We have a similar issue in map_memory_to_gpu. We map on all GPUs and
wait for a sync object at the end. If SDMA is used, this lets the SDMA
engines on all GPUs work concurrently. That final wait can be
interrupted. Instead of rolling back the mappings on all GPUs (which
would require more waiting), we just let the system call return
-ERESTARTSYS and make sure the restart doesn't cause any trouble.
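
To illustrate the pattern, here is a minimal sketch of the tail end of
such an ioctl. This is not the actual KFD code; the function name is
made up, and only dma_fence_wait()/dma_fence_put() are real kernel APIs:

#include <linux/dma-fence.h>

/* Simplified sketch, not the actual KFD code: once the mapping work has
 * been submitted on all GPUs, do a single interruptible wait on the
 * accumulated fence. An arriving signal makes dma_fence_wait() return
 * -ERESTARTSYS, which we propagate instead of rolling back the
 * mappings that already completed.
 */
static int wait_for_all_mappings(struct dma_fence *fence)   /* hypothetical */
{
        long r;

        if (!fence)
                return 0;

        r = dma_fence_wait(fence, true);        /* intr = true */
        dma_fence_put(fence);

        return r;       /* 0 or -ERESTARTSYS */
}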

I've been looking for documentation about the expected behaviour for
system calls interrupted by signals. All I can find is this statement in
Documentation/kernel-hacking/hacking.rst:
> After you slept you should check if a signal occurred: the Unix/Linux
> way of handling signals is to temporarily exit the system call with the
> ``-ERESTARTSYS`` error. The system call entry code will switch back to
> user context, process the signal handler and then your system call will
> be restarted (unless the user disabled that). So you should be prepared
> to process the restart, e.g. if you're in the middle of manipulating
> some data structure.
The same paragraph is also copied in
https://www.kernel.org/doc/htmldocs/kernel-hacking/ioctls.html.

The key sentence is "should be prepared to process the restart, e.g. if
you're in the middle of manipulating some data structure". Rolling back
everything would achieve that, but it's not the only way and not the
most efficient way.

In this case, I'm handling the restart by checking whether the VM is
already a compute VM. So on the restart the conversion to a compute VM
becomes a no-op.

Similarly, in the map_memory_to_gpu ioctl, mapping on a GPU where the
memory is already mapped becomes a no-op, which handles the restart
case. For that case we even added a specific test in kfdtest, where we
deliberately interrupt a big mapping operation with a signal.
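
For illustration, a rough sketch of that idempotent loop (again not the
real amdgpu/KFD code; the struct and helper below are made up):

#include <linux/dma-fence.h>
#include <linux/list.h>

struct gpu_mapping {                    /* hypothetical per-GPU state */
        struct list_head node;
        bool mapped;
};

/* hypothetical helper: submits the mapping work for one GPU and
 * accumulates its fence in *fence
 */
int map_on_one_gpu(struct gpu_mapping *m, struct dma_fence **fence);

/* Restart-safe mapping loop: GPUs that were already mapped before the
 * interrupting signal are skipped, so a restarted ioctl only redoes
 * the work that didn't complete yet.
 */
static int map_on_all_gpus(struct list_head *mappings,
                           struct dma_fence **fence)
{
        struct gpu_mapping *m;
        int r;

        list_for_each_entry(m, mappings, node) {
                if (m->mapped)
                        continue;       /* no-op on restart */

                r = map_on_one_gpu(m, fence);
                if (r)
                        return r;

                m->mapped = true;
        }

        return 0;
}

The conversion in amdgpu_vm_make_compute follows the same idea: the
vm_context check at the top of the function turns a restarted
conversion into a no-op.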

Regards,
  Felix

>
> Regards,
> Christian.
>
>> +    if (!RB_EMPTY_ROOT(&vm->va.rb_root) || vm->root.entries) {
>> +        r = -EINVAL;
>> +        goto error;
>> +    }
>> +
>> +    /* Check if PD needs to be reinitialized and do it before
>> +     * changing any other state, in case it fails.
>> +     */
>> +    if (pte_support_ats != vm->pte_support_ats) {
>> +        /* TODO: This depends on a patch series by Christian.
>> +         * It's only needed for GFX9 GPUs, which aren't
>> +         * supported by upstream KFD yet.
>> +        r = amdgpu_vm_clear_bo(adev, vm, vm->root.base.bo,
>> +                   adev->vm_manager.root_level,
>> +                   pte_support_ats);
>> +        if (r)
>> +            goto error;
>> +        */
>> +    }
>> +
>> +    /* Update VM state */
>> +    vm->vm_context = AMDGPU_VM_CONTEXT_COMPUTE;
>> +    vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
>> +                    AMDGPU_VM_USE_CPU_FOR_COMPUTE);
>> +    vm->pte_support_ats = pte_support_ats;
>> +    DRM_DEBUG_DRIVER("VM update mode is %s\n",
>> +             vm->use_cpu_for_update ? "CPU" : "SDMA");
>> +    WARN_ONCE((vm->use_cpu_for_update & !amdgpu_vm_is_large_bar(adev)),
>> +          "CPU update of VM recommended only for large BAR system\n");
>> +
>> +    if (vm->pasid) {
>> +        unsigned long flags;
>> +
>> +        spin_lock_irqsave(&adev->vm_manager.pasid_lock, flags);
>> +        idr_remove(&adev->vm_manager.pasid_idr, vm->pasid);
>> +        spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>> +
>> +        vm->pasid = 0;
>> +    }
>> +
>> +error:
>> +    amdgpu_bo_unreserve(vm->root.base.bo);
>> +    return r;
>> +}
>> +
>> +/**
>>    * amdgpu_vm_free_levels - free PD/PT levels
>>    *
>>    * @adev: amdgpu device structure
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 40b4e09..7f50a38 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -263,6 +263,7 @@ void amdgpu_vm_manager_init(struct amdgpu_device *adev);
>>   void amdgpu_vm_manager_fini(struct amdgpu_device *adev);
>>   int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>              int vm_context, unsigned int pasid);
>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm);
>>   void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm);
>>   bool amdgpu_vm_pasid_fault_credit(struct amdgpu_device *adev,
>>                     unsigned int pasid);
>


