[PATCH 03/14] drm/amdgpu: Add helper to turn an existing VM into a compute VM
Christian König
christian.koenig at amd.com
Mon Mar 5 17:14:08 UTC 2018
Am 05.03.2018 um 17:58 schrieb Felix Kuehling:
> On 2018-03-05 07:17 AM, Christian König wrote:
>> Am 04.03.2018 um 03:34 schrieb Felix Kuehling:
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 82 ++++++++++++++++++++++++++++++++++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 +
>>> 2 files changed, 83 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 5afbc5e..58153ef 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -2350,6 +2350,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev,
>>> struct amdgpu_vm *vm,
>>> INIT_KFIFO(vm->faults);
>>> vm->fault_credit = 16;
>>> + vm->vm_context = vm_context;
>> I think we should drop the vm_context parameter and all the related
>> code in amdgpu_vm_init(). But that can be part of a later cleanup
>> patch as well.
> Yep. It will be a prerequisite for sharing VMs between graphics and
> compute. But then CPU vs SDMA page table update would need to be handled
> differently, on a per-call basis and probably some synchronization
> between them.
>
> Also, we'd need to fix shadow page table handling with CPU updates, or
> get rid of shadow page tables. I'm not sure why they're needed, TBH.
The idea behind shadow page tables is that you still have the current state of
the pipeline available if the GPU crashes. Since with CPU updates there is no
pipeline, they don't make sense at all in that case.
>
>>> +
>>> return 0;
>>> error_free_root:
>>> @@ -2364,6 +2366,86 @@ int amdgpu_vm_init(struct amdgpu_device *adev,
>>> struct amdgpu_vm *vm,
>>> }
>>> /**
>>> + * amdgpu_vm_make_compute - Turn a GFX VM into a compute VM
>>> + *
>>> + * This only works on GFX VMs that don't have any BOs added and no
>>> + * page tables allocated yet.
>>> + *
>>> + * Changes the following VM parameters:
>>> + * - vm_context
>>> + * - use_cpu_for_update
>>> + * - pte_supports_ats
>>> + * - pasid (old PASID is released, because compute manages its own
>>> PASIDs)
>>> + *
>>> + * Reinitializes the page directory to reflect the changed ATS
>>> + * setting. May leave behind an unused shadow BO for the page
>>> + * directory when switching from SDMA updates to CPU updates.
>>> + *
>>> + * Returns 0 for success, -errno for errors.
>>> + */
>>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct
>>> amdgpu_vm *vm)
>>> +{
>>> + bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);
>>> + int r;
>>> +
>>> + r = amdgpu_bo_reserve(vm->root.base.bo, true);
>>> + if (r)
>>> + return r;
>>> +
>>> + /* Sanity checks */
>>> + if (vm->vm_context == AMDGPU_VM_CONTEXT_COMPUTE) {
>>> + /* Can happen if ioctl is interrupted by a signal after
>>> + * this function already completed. Just return success.
>>> + */
>>> + r = 0;
>>> + goto error;
>>> + }
>> Ok, that is actually a show stopper. An interrupted IOCTL should never
>> have a visible effect.
>>
>> Is that just a theoretical issue or did you run into this in practice?
> It's theoretical. init_kfd_vm in patch 4 reserves the page directory
> again, validates it and waits for the validation to complete. Both the
> reservation and the waiting can be interrupted by signals.
No, both have parameters to control whether they are interruptible or not, for
exactly this reason.
> This happens
> after amdgpu_vm_make_compute has completed.
>
> If we wanted to achieve "no visible effect", we'd need to roll back the
> conversion to a compute VM, which may involve more waiting.
>
> We have a similar issue in map_memory_to_gpu. We map on all GPUs and
> wait for a sync object in the end. If SDMA is used, this allows the SDMA
> on all GPUs to work concurrently. The wait in the end can be
> interrupted. Instead of rolling back the mapping on all GPUs (which
> would require more waiting), we just let the system call return
> -ERESTARTSYS and make sure the restart doesn't cause any trouble.
Whether waits are interruptible or not is controllable by parameters. At
least for importing the VM it is probably OK to disable interruptibility.
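To illustrate (a rough sketch only, not the actual kfd import code, and the
function name is invented): amdgpu_bo_reserve() takes a no_intr flag and
dma_fence_wait() takes an intr flag, so the import path could simply use the
non-interruptible variants:

#include <linux/dma-fence.h>
#include "amdgpu.h"

/* Hypothetical example, not an existing function: reserve the root PD
 * and wait for pending updates without allowing signal interruption.
 */
static int example_import_vm(struct amdgpu_device *adev, struct amdgpu_vm *vm)
{
        int r;

        /* second parameter is no_intr: true means the reservation cannot
         * be interrupted by a signal
         */
        r = amdgpu_bo_reserve(vm->root.base.bo, true);
        if (r)
                return r;

        /* ... validate the page directory here ... */

        /* second parameter is intr: false means an uninterruptible wait,
         * so this path never returns -ERESTARTSYS halfway through
         */
        if (vm->last_update)
                r = dma_fence_wait(vm->last_update, false);

        amdgpu_bo_unreserve(vm->root.base.bo);
        return r;
}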
> I've been looking for documentation about the expected behaviour for
> system calls interrupted by signals. All I can find is this statement in
> Documentation/kernel-hacking/hacking.rst:
>> After you slept you should check if a signal occurred: the Unix/Linux
>> way of handling signals is to temporarily exit the system call with the
>> ``-ERESTARTSYS`` error. The system call entry code will switch back to
>> user context, process the signal handler and then your system call will
>> be restarted (unless the user disabled that). So you should be prepared
>> to process the restart, e.g. if you're in the middle of manipulating
>> some data structure.
> The same paragraph is also copied in
> https://www.kernel.org/doc/htmldocs/kernel-hacking/ioctls.html.
>
> The key sentence is "should be prepared to process the restart, e.g. if
> you're in the middle of manipulating some data structure". Rolling back
> everything would achieve that, but it's not the only way and not the
> most efficient way.
>
> In this case, I'm handling the restart by checking whether the VM is
> already a compute VM. So on the restart the conversion to a compute VM
> becomes a no-op.
>
> Similarly, in the map_memory_to_gpu ioctl, mapping on a GPU where the
> memory is already mapped becomes a no-op, which handles the restart
> case. In this case we even added a specific test for this condition in
> kfdtest, where we deliberately interrupt a big mapping operation with a
> signal.
This approach is a clear no-go.
Changes done before a system call is interrupted don't necessarily need to
be reverted, but only if they don't have a user-visible effect.
For example: when we map something we need to allocate multiple page
directories. It is not necessary to release all previously allocated ones
if one of the allocations suddenly returns that it was interrupted, but only
because those page directories are not user accessible.
Similar exceptions can be made by deferring work when the IOCTL is
interrupted. E.g. in the VA IOCTL we usually submit the page table update
directly, but if that is interrupted we just return success instead of
restarting the IOCTL and defer the update until the first command
submission.
The easiest way to work around this is to add a separate IOCTL which waits
for VM updates to complete. It should be trivial to use vm->last_update for
this.
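A minimal sketch of what such an IOCTL handler could call, assuming
vm->last_update is the fence of the last page table update (the function
name is invented):

#include <linux/dma-fence.h>
#include "amdgpu.h"

/* Hypothetical helper backing a "wait for VM updates" IOCTL. */
static int example_vm_wait_idle(struct amdgpu_vm *vm)
{
        long r;

        if (!vm->last_update)
                return 0;

        /* Interruptible wait: returning -ERESTARTSYS here is harmless
         * because this IOCTL has no other side effects and can simply be
         * restarted.
         */
        r = dma_fence_wait(vm->last_update, true);
        return r;
}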
Regards,
Christian.
>
> Regards,
> Felix
>
>> Regards,
>> Christian.
>>
>>> + if (!RB_EMPTY_ROOT(&vm->va.rb_root) || vm->root.entries) {
>>> + r = -EINVAL;
>>> + goto error;
>>> + }
>>> +
>>> + /* Check if PD needs to be reinitialized and do it before
>>> + * changing any other state, in case it fails.
>>> + */
>>> + if (pte_support_ats != vm->pte_support_ats) {
>>> + /* TODO: This depends on a patch series by Christian.
>>> + * It's only needed for GFX9 GPUs, which aren't
>>> + * supported by upstream KFD yet.
>>> + r = amdgpu_vm_clear_bo(adev, vm, vm->root.base.bo,
>>> + adev->vm_manager.root_level,
>>> + pte_support_ats);
>>> + if (r)
>>> + goto error;
>>> + */
>>> + }
>>> +
>>> + /* Update VM state */
>>> + vm->vm_context = AMDGPU_VM_CONTEXT_COMPUTE;
>>> + vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
>>> + AMDGPU_VM_USE_CPU_FOR_COMPUTE);
>>> + vm->pte_support_ats = pte_support_ats;
>>> + DRM_DEBUG_DRIVER("VM update mode is %s\n",
>>> + vm->use_cpu_for_update ? "CPU" : "SDMA");
>>> + WARN_ONCE((vm->use_cpu_for_update & !amdgpu_vm_is_large_bar(adev)),
>>> + "CPU update of VM recommended only for large BAR system\n");
>>> +
>>> + if (vm->pasid) {
>>> + unsigned long flags;
>>> +
>>> + spin_lock_irqsave(&adev->vm_manager.pasid_lock, flags);
>>> + idr_remove(&adev->vm_manager.pasid_idr, vm->pasid);
>>> + spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>>> +
>>> + vm->pasid = 0;
>>> + }
>>> +
>>> +error:
>>> + amdgpu_bo_unreserve(vm->root.base.bo);
>>> + return r;
>>> +}
>>> +
>>> +/**
>>> * amdgpu_vm_free_levels - free PD/PT levels
>>> *
>>> * @adev: amdgpu device structure
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> index 40b4e09..7f50a38 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> @@ -263,6 +263,7 @@ void amdgpu_vm_manager_init(struct amdgpu_device
>>> *adev);
>>> void amdgpu_vm_manager_fini(struct amdgpu_device *adev);
>>> int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>> int vm_context, unsigned int pasid);
>>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct
>>> amdgpu_vm *vm);
>>> void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm);
>>> bool amdgpu_vm_pasid_fault_credit(struct amdgpu_device *adev,
>>> unsigned int pasid);