KFDTest failures on latest drm-next rebase

Wed Sep 27 18:08:33 UTC 2017

[+Christian, I know you're on vacation. When you get back ...]
[+amd-gfx]
[+Harish]

I can't imagine how the static queue change can trigger this.

The error message looks like some kind of overflow in the lock dependency
tracking: too many locks depending on a held lock. I don't understand the
lock dependency code well enough to really know what that means, though.
Christian, or anyone on amd-gfx, have you seen this type of problem before?

That said, the backtrace shows that the problem was triggered by an
eviction. The Fragmentation test should not trigger an eviction unless
we have a memory leak somewhere, or another process uses a bunch of
memory at the same time, or some bug is accidentally triggering an
eviction fence again.

Harish, do you have time to help Kent investigate this issue on his
drm-next rebase branch?

Thanks,
  Felix

On 2017-09-27 12:07 PM, Kent Russell wrote:
> It passed when I reverted your last change (drm/amdkfd: Make unmapping
> of static queues conditional). Want me to push the recent changes up
> to the gerritgit branch to make it easier for you to look at?
>
>  Kent
>
>
> On 2017-09-27 07:18 AM, Kent Russell wrote:
>> Tried a rebase onto drm-next (no conflicts aside from a new header
>> object in amdgpu.h) and got this new failure/hang on Fiji during
>> the Fragmentation test:
>>
>> [  413.275694] DEBUG_LOCKS_WARN_ON(hlock->references == (1 << 12)-1)
>> [  413.275699] ------------[ cut here ]------------
>> [  413.286970] WARNING: CPU: 2 PID: 61 at
>> /home/krussell/git/testingcompute/kernel/kernel/locking/lockdep.c:3289
>> __lock_acquire+0xc08/0x1540
>> [  413.299986] Modules linked in: fuse x86_pkg_temp_thermal video
>> acpi_pad amdkfd amd_iommu_v2 amdgpu chash ttm
>> [  413.310339] CPU: 2 PID: 61 Comm: kworker/2:1 Tainted: G W      
>> 4.13.0-rc5-kfd-krussell #1
>> [  413.319680] Hardware name: ASUS All Series/Z97-A-USB31, BIOS 2801
>> 11/11/2015
>> [  413.327141] Workqueue: events kfd_restore_bo_worker [amdkfd]
>> [  413.333108] task: ffff98c85a420000 task.stack: ffffa87c03368000
>> [  413.339360] RIP: 0010:__lock_acquire+0xc08/0x1540
>> [  413.344317] RSP: 0018:ffffa87c0336ba00 EFLAGS: 00010086
>> [  413.349820] RAX: 0000000000000035 RBX: 00000000000000c8 RCX:
>> 0000000000000000
>> [  413.357338] RDX: ffff98c85a420000 RSI: 0000000000000001 RDI:
>> ffffffff9d0c5eab
>> [  413.364848] RBP: ffffa87c0336bab0 R08: 0000000000000000 R09:
>> 0000000000000001
>> [  413.372357] R10: 0000000000000000 R11: ffffffffc0364642 R12:
>> 0000000000000000
>> [  413.379868] R13: 0000000000000005 R14: ffff98c6de9782b8 R15:
>> ffff98c85a420000
>> [  413.387380] FS:  0000000000000000(0000) GS:ffff98c87ec80000(0000)
>> knlGS:0000000000000000
>> [  413.395897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  413.401942] CR2: 000000000261f000 CR3: 0000000815769000 CR4:
>> 00000000001406e0
>> [  413.409450] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [  413.416961] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [  413.424470] Call Trace:
>> [  413.427039]  ? __lock_is_held+0x76/0xc0
>> [  413.431078]  lock_acquire+0x59/0x80
>> [  413.434743]  ? lock_acquire+0x59/0x80
>> [  413.438594]  ? ttm_eu_reserve_buffers+0x242/0x5a0 [ttm]
>> [  413.444100]  ? ttm_eu_reserve_buffers+0x242/0x5a0 [ttm]
>> [  413.449598]  __ww_mutex_lock.constprop.10+0x95/0xfb0
>> [  413.454809]  ? ttm_eu_reserve_buffers+0x242/0x5a0 [ttm]
>> [  413.460301]  ? __lock_is_held+0x76/0xc0
>> [  413.464340]  ww_mutex_lock+0x42/0x60
>> [  413.468104]  ? ww_mutex_lock+0x42/0x60
>> [  413.472052]  ttm_eu_reserve_buffers+0x242/0x5a0 [ttm]
>> [  413.477423] amdgpu_amdkfd_gpuvm_restore_process_bos+0x1a9/0x480
>> [amdgpu]
>> [  413.484581]  ? mark_held_locks+0x6f/0xa0
>> [  413.488713]  ? console_unlock+0x384/0x4a0
>> [  413.492936]  ? trace_hardirqs_on_caller+0x11f/0x190
>> [  413.498076]  ? irq_work_queue+0x9c/0xb0
>> [  413.502115]  ? wake_up_klogd+0x36/0x40
>> [  413.506062]  ? console_unlock+0x482/0x4a0
>> [  413.510283]  ? vprintk_emit+0x221/0x320
>> [  413.514329]  kfd_restore_bo_worker+0x4a/0xd0 [amdkfd]
>> [  413.519652]  process_one_work+0x190/0x430
>> [  413.523874]  ? process_one_work+0x133/0x430
>> [  413.528281]  worker_thread+0x46/0x400
>> [  413.532136]  kthread+0x10a/0x140
>> [  413.535533]  ? process_one_work+0x430/0x430
>> [  413.539938]  ? kthread_create_on_node+0x40/0x40
>> [  413.544709]  ret_from_fork+0x2a/0x40
>>
>> The only kfd-staging change is your static-queue unmapping, and the
>> only upstream changes were the chash typo fix, gem_prime_mmap support
>> and the GPUVM fault storms (plus a little fuzzy_fan stuff for CI).
>> Any thoughts?
>>
>>  Kent
>>
>