[PATCH] drm/nouveau/dmem: missing mutex_lock in error path
Ralph Campbell
rcampbell at nvidia.com
Fri Jun 14 17:39:38 UTC 2019
On 6/13/19 5:49 PM, John Hubbard wrote:
> On 6/13/19 5:11 PM, Ralph Campbell wrote:
>> In nouveau_dmem_pages_alloc(), the drm->dmem->mutex is unlocked before
>> calling nouveau_dmem_chunk_alloc().
>> Reacquire the lock before continuing to the next page.
>>
>> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
>> ---
>>
>> I found this while testing Jason Gunthorpe's hmm tree, but it is
>> independent of those changes. I guess it could go through
>> David Airlie's tree for nouveau or through Jason's tree.
>>
>
> Hi Ralph,
>
> btw, was this the fix for the crash you were seeing? It might be nice
> to mention that in the commit description, if you are seeing real symptoms.
>
>
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index 27aa4e72abe9..00f7236af1b9 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -379,9 +379,10 @@ nouveau_dmem_pages_alloc(struct nouveau_drm *drm,
>>  			ret = nouveau_dmem_chunk_alloc(drm);
>>  			if (ret) {
>>  				if (c)
>> -					break;
>
> Actually, the pre-existing code is a little concerning. Your change preserves
> the behavior, but it seems questionable to be doing a "return 0" (whether
> via the above break, or your change) when it's in this partially allocated
> state. It's reporting success when it has only allocated part of what was
> requested, and it doesn't fill in the rest of the pages array either.
>
>
>
>> +					return 0;
>>  				return ret;
>>  			}
>> +			mutex_lock(&drm->dmem->mutex);
>>  			continue;
>>  		}
>>
>>
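For reference, this is roughly the shape of the surrounding loop with the
fix applied (a simplified sketch from memory, not a verbatim quote of
nouveau_dmem.c; first_free_chunk() is a made-up placeholder for the
chunk-list scan):

	mutex_lock(&drm->dmem->mutex);
	for (c = 0; c < npages;) {
		/* first_free_chunk() stands in for the chunk-list scan */
		struct nouveau_dmem_chunk *chunk = first_free_chunk(drm);

		if (chunk == NULL) {
			/* nouveau_dmem_chunk_alloc() must run unlocked */
			mutex_unlock(&drm->dmem->mutex);
			ret = nouveau_dmem_chunk_alloc(drm);
			if (ret) {
				if (c)
					return 0;
				return ret;
			}
			/* the loop body runs with the lock held, so retake it */
			mutex_lock(&drm->dmem->mutex);
			continue;
		}
		/* ... hand out pages from the chunk, advancing c ... */
	}
	mutex_unlock(&drm->dmem->mutex);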
>
> The above comment is about pre-existing potential problems, but your patch itself
> looks correct, so:
>
> Reviewed-by: John Hubbard <jhubbard at nvidia.com>
>
>
> thanks,
>
The crash was the NULL pointer bug in Christoph's patch #10; I sent a
separate reply for that. I made the changes just based on code
inspection, but the console output I got is below. Do you think I should
include it in the change log?

As for the "return 0": if you follow the call chain,
nouveau_dmem_pages_alloc() is only ever called for a single page, so
this currently "works", but I agree it is a bit of a time bomb. There
are a number of other bugs that I can see that need fixing, but I think
those should be separate patches.
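To make the time bomb concrete: a hypothetical future caller asking for
more than one page (the names and count below are made up for
illustration) could see a short allocation reported as success:

	unsigned long pages[16];
	int ret;

	/* hypothetical multi-page caller; today npages is always 1 */
	ret = nouveau_dmem_pages_alloc(drm, 16, pages);
	/*
	 * If nouveau_dmem_chunk_alloc() fails after some pages have
	 * been handed out, ret is 0 but only the first c entries of
	 * pages[] are valid, and the caller has no way of knowing how
	 * many it actually got.
	 */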
[ 1294.871933] =====================================
[ 1294.876656] WARNING: bad unlock balance detected!
[ 1294.881375] 5.2.0-rc3+ #5 Not tainted
[ 1294.885048] -------------------------------------
[ 1294.889773] test-malloc-vra/6299 is trying to release lock (&drm->dmem->mutex) at:
[ 1294.897482] [<ffffffffa01a220f>] nouveau_dmem_migrate_alloc_and_copy+0x79f/0xbf0 [nouveau]
[ 1294.905782] but there are no more locks to release!
[ 1294.910690]
[ 1294.910690] other info that might help us debug this:
[ 1294.917249] 1 lock held by test-malloc-vra/6299:
[ 1294.921881] #0: 0000000016e10454 (&mm->mmap_sem#2){++++}, at: nouveau_svmm_bind+0x142/0x210 [nouveau]
[ 1294.931313]
[ 1294.931313] stack backtrace:
[ 1294.935702] CPU: 4 PID: 6299 Comm: test-malloc-vra Not tainted 5.2.0-rc3+ #5
[ 1294.942786] Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1401 05/21/2018
[ 1294.949590] Call Trace:
[ 1294.952059] dump_stack+0x7c/0xc0
[ 1294.955469] ? nouveau_dmem_migrate_alloc_and_copy+0x79f/0xbf0 [nouveau]
[ 1294.962213] print_unlock_imbalance_bug.cold.52+0xca/0xcf
[ 1294.967641] lock_release+0x306/0x380
[ 1294.971383] ? nouveau_dmem_migrate_alloc_and_copy+0x79f/0xbf0 [nouveau]
[ 1294.978089] ? lock_downgrade+0x2d0/0x2d0
[ 1294.982121] ? find_held_lock+0xac/0xd0
[ 1294.985979] __mutex_unlock_slowpath+0x8f/0x3f0
[ 1294.990540] ? wait_for_completion+0x230/0x230
[ 1294.995002] ? rwlock_bug.part.2+0x60/0x60
[ 1294.999197] nouveau_dmem_migrate_alloc_and_copy+0x79f/0xbf0 [nouveau]
[ 1295.005751] ? page_mapping+0x98/0x110
[ 1295.009511] migrate_vma+0xa74/0x1090
[ 1295.013186] ? move_to_new_page+0x480/0x480
[ 1295.017400] ? __kmalloc+0x153/0x300
[ 1295.021052] ? nouveau_dmem_migrate_vma+0xd8/0x1e0 [nouveau]
[ 1295.026796] nouveau_dmem_migrate_vma+0x157/0x1e0 [nouveau]
[ 1295.032466] ? nouveau_dmem_init+0x490/0x490 [nouveau]
[ 1295.037612] ? vmacache_find+0xc2/0x110
[ 1295.041537] nouveau_svmm_bind+0x1b4/0x210 [nouveau]
[ 1295.046583] ? nouveau_svm_fault+0x13e0/0x13e0 [nouveau]
[ 1295.051912] drm_ioctl_kernel+0x14d/0x1a0
[ 1295.055930] ? drm_setversion+0x330/0x330
[ 1295.059971] drm_ioctl+0x308/0x530
[ 1295.063384] ? drm_version+0x150/0x150
[ 1295.067153] ? find_held_lock+0xac/0xd0
[ 1295.070996] ? __pm_runtime_resume+0x3f/0xa0
[ 1295.075285] ? mark_held_locks+0x29/0xa0
[ 1295.079230] ? _raw_spin_unlock_irqrestore+0x3c/0x50
[ 1295.084232] ? lockdep_hardirqs_on+0x17d/0x250
[ 1295.088768] nouveau_drm_ioctl+0x9a/0x100 [nouveau]
[ 1295.093661] do_vfs_ioctl+0x137/0x9a0
[ 1295.097341] ? ioctl_preallocate+0x140/0x140
[ 1295.101623] ? match_held_lock+0x1b/0x230
[ 1295.105646] ? match_held_lock+0x1b/0x230
[ 1295.109660] ? find_held_lock+0xac/0xd0
[ 1295.113512] ? __do_page_fault+0x324/0x630
[ 1295.117617] ? lock_downgrade+0x2d0/0x2d0
[ 1295.121648] ? mark_held_locks+0x79/0xa0
[ 1295.125583] ? handle_mm_fault+0x352/0x430
[ 1295.129687] ksys_ioctl+0x60/0x90
[ 1295.133020] ? mark_held_locks+0x29/0xa0
[ 1295.136964] __x64_sys_ioctl+0x3d/0x50
[ 1295.140726] do_syscall_64+0x68/0x250
[ 1295.144400] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1295.149465] RIP: 0033:0x7f1a3495809b
[ 1295.153053] Code: 0f 1e fa 48 8b 05 ed bd 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd bd 0c 00 f7 d8 64 89 01 48
[ 1295.171850] RSP: 002b:00007ffef7ed1358 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1295.179451] RAX: ffffffffffffffda RBX: 00007ffef7ed1628 RCX: 00007f1a3495809b
[ 1295.186601] RDX: 00007ffef7ed13b0 RSI: 0000000040406449 RDI: 0000000000000004
[ 1295.193759] RBP: 00007ffef7ed13b0 R08: 0000000000000000 R09: 000000000157e770
[ 1295.200917] R10: 000000000151c010 R11: 0000000000000246 R12: 0000000040406449
[ 1295.208083] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000