[PATCH] drm/amdgpu: Fix multi-level page table bugs for large BOs v3

Thu Mar 30 01:59:39 UTC 2017

On 03/30/2017 02:07 AM, Felix Kuehling wrote:
> On 17-03-29 01:40 PM, Christian König wrote:
>> Am 29.03.2017 um 19:22 schrieb Felix Kuehling:
>>> Fix the start/end address calculation for address ranges that span
>>> multiple page directories in amdgpu_vm_alloc_levels.
>>>
>>> Add error messages if page tables aren't found. Otherwise the page
>>> table update would just fail silently.
>>>
>>> v2:
>>>    * Change WARN_ON to WARN_ON_ONCE
>>>    * Move masking of high address bits to caller
>>>    * Add range-check for "from" and "to"
>>> v3:
>>>    * Replace WARN_ON_ONCE in get_pt with pr_err in caller
>>>
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>
>> Mhm, I still don't like that we modify saddr/eaddr here because I
>> wanted to resolve the recursion sooner or later.
>>
>> But anyway that is work for a future patch, this one is Reviewed-by:
>> Christian König <christian.koenig at amd.com>.
>
> Thank you, I appreciate it. I realize it's probably not perfect. But I'm
> just looking at the intention of the original code and trying to make it
> work correctly for KFD so we can continue integrating amdgpu changes
> (and important fixes, such as VM fault handling) into the KFD branch.
> That's also why I'm a bit in a rush. This was actually blocking KFD.

Anyway, let's make it work for all of us.
It will become better in coding, I believe. :)

Reviewed-by: Junwei Zhang <Jerry.Zhang at amd.com>

>
> We take a different approach from hybrid. We do a nightly git-merge from
> amd-staging-4.9 and running it through our automated pre-submission
> tests. If the merge doesn't have conflicts that process is completely
> automated (except for someone pressing the Submit button in the end). If
> the testing fails, we have to block integrations until the problem is
> fixed upstream. So I'm fixing the problem upstream. ;)
>
> Regards,
>    Felix
>
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 24 ++++++++++++++++++------
>>>    1 file changed, 18 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 818747f..c11d15e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -275,13 +275,18 @@ static int amdgpu_vm_alloc_levels(struct
>>> amdgpu_device *adev,
>>>            memset(parent->entries, 0 , sizeof(struct amdgpu_vm_pt));
>>>        }
>>>    -    from = (saddr >> shift) % amdgpu_vm_num_entries(adev, level);
>>> -    to = (eaddr >> shift) % amdgpu_vm_num_entries(adev, level);
>>> +    from = saddr >> shift;
>>> +    to = eaddr >> shift;
>>> +    if (from >= amdgpu_vm_num_entries(adev, level) ||
>>> +        to >= amdgpu_vm_num_entries(adev, level))
>>> +        return -EINVAL;
>>>          if (to > parent->last_entry_used)
>>>            parent->last_entry_used = to;
>>>          ++level;
>>> +    saddr = saddr & ((1 << shift) - 1);
>>> +    eaddr = eaddr & ((1 << shift) - 1);
>>>          /* walk over the address space and allocate the page tables */
>>>        for (pt_idx = from; pt_idx <= to; ++pt_idx) {
>>> @@ -312,8 +317,11 @@ static int amdgpu_vm_alloc_levels(struct
>>> amdgpu_device *adev,
>>>            }
>>>              if (level < adev->vm_manager.num_level) {
>>> -            r = amdgpu_vm_alloc_levels(adev, vm, entry, saddr,
>>> -                           eaddr, level);
>>> +            uint64_t sub_saddr = (pt_idx == from) ? saddr : 0;
>>> +            uint64_t sub_eaddr = (pt_idx == to) ? eaddr :
>>> +                ((1 << shift) - 1);
>>> +            r = amdgpu_vm_alloc_levels(adev, vm, entry, sub_saddr,
>>> +                           sub_eaddr, level);
>>>                if (r)
>>>                    return r;
>>>            }
>>> @@ -990,8 +998,10 @@ static void amdgpu_vm_update_ptes(struct
>>> amdgpu_pte_update_params *params,
>>>        /* initialize the variables */
>>>        addr = start;
>>>        pt = amdgpu_vm_get_pt(params, addr);
>>> -    if (!pt)
>>> +    if (!pt) {
>>> +        pr_err("PT not found, aborting update_ptes\n");
>>>            return;
>>> +    }
>>>          if (params->shadow) {
>>>            if (!pt->shadow)
>>> @@ -1015,8 +1025,10 @@ static void amdgpu_vm_update_ptes(struct
>>> amdgpu_pte_update_params *params,
>>>        /* walk over the address space and update the page tables */
>>>        while (addr < end) {
>>>            pt = amdgpu_vm_get_pt(params, addr);
>>> -        if (!pt)
>>> +        if (!pt) {
>>> +            pr_err("PT not found, aborting update_ptes\n");
>>>                return;
>>> +        }
>>>              if (params->shadow) {
>>>                if (!pt->shadow)
>>
>>
>