[PATCH 3/3] drm/amdgpu: Fix multi-level page table bugs for large BOs

Wed Mar 29 15:42:36 UTC 2017

On 17-03-29 11:22 AM, Christian König wrote:
> Am 29.03.2017 um 16:46 schrieb Michel Dänzer:
>> On 29/03/17 11:32 PM, Felix Kuehling wrote:
>>> On 17-03-29 02:52 AM, Christian König wrote:
>>>> Please clean that up and also remove all those WARN_ON(), we don't
>>>> want to spam the system log with backtraces like this.
>>> I disagree. When you see the backtraces from WARN_ONs in
>>> amdgpu_vm_get_pt, it means the driver is trying to update a page table
>>> that hasn't been allocated. Without those WARN_ONs there is absolutely
>>> no warning of any kind - amdgpu_vm_update_pte just fails silently (it's
>>> a void function).
>>>
>>> Therefore I wouldn't call those warnings spam. Given that the VM PT
>>> allocation code for multi-levels is non-trivial, I think that's a good
>>> sanity check to have and really helpful to see in a log when user
>>> report
>>> random VM faults.
>> WARN_ON_ONCE might be a compromise. We'll still get a backtrace, but it
>> won't cause crazy log spam if the bad condition is consistently hit.
> Good idea. We should also only add one at the beginning of the
> function and not multiple catches for each dependent variable.

I don't understand what you mean here. Are we still talking about
amdgpu_vm_update_pte? You'll have to walk the page table before you know
whether a PT BO actually exists. So I don't understand what check would
make sense at the start (before the page table walk).

I removed the WARN_ON from amdgpu_vm_alloc_levels and instead return an
error if "from" or "to" are out of range. I'll send out v2 of my patch
in a minute.

Regards,
  Felix