[PATCH 6/6] drm/amdgpu: keep the MMU lock until the update ends

Tue Sep 5 22:40:29 UTC 2017

I believe this can lead to lock warnings when multiple MMU notifier
pairs are happening at the same time. They're probably harmless, because
read locks are not exclusive. Below is an example I got during process
termination of an HSA process.

Is there some kind of lock annotation that can prevent this message?

Regards,
  Felix

[  232.510984] ============================================
[  232.510984] WARNING: possible recursive locking detected
[  232.510985] 4.12.0-kfd-fkuehlin #1 Tainted: G            E  
[  232.510986] --------------------------------------------
[  232.510986] test_integer_op/2692 is trying to acquire lock:
[  232.510987]  (&rmn->lock){++++.+}, at: [<ffffffffc01bf368>] amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511029] 
               but task is already holding lock:
[  232.511029]  (&rmn->lock){++++.+}, at: [<ffffffffc01bf368>] amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511052] 
               other info that might help us debug this:
[  232.511052]  Possible unsafe locking scenario:

[  232.511053]        CPU0
[  232.511053]        ----
[  232.511053]   lock(&rmn->lock);
[  232.511053]   lock(&rmn->lock);
[  232.511054] 
                *** DEADLOCK ***

[  232.511054]  May be due to missing lock nesting notation

[  232.511055] 3 locks held by test_integer_op/2692:
[  232.511055]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8d1a34e1>] SyS_madvise+0x361/0x880
[  232.511058]  #1:  (&rmn->lock){++++.+}, at: [<ffffffffc01bf368>] amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511080]  #2:  (srcu){......}, at: [<ffffffff8d1b6d90>] __mmu_notifier_invalidate_range_start+0x0/0xb0
[  232.511083] 
               stack backtrace:
[  232.511084] CPU: 4 PID: 2692 Comm: test_integer_op Tainted: G            E   4.12.0-kfd-fkuehlin #1
[  232.511085] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, BIOS 2401 04/27/2015
[  232.511085] Call Trace:
[  232.511088]  dump_stack+0x85/0xc7
[  232.511090]  __lock_acquire+0x14da/0x1540
[  232.511091]  ? __lock_acquire+0x28d/0x1540
[  232.511093]  lock_acquire+0x59/0x80
[  232.511093]  ? lock_acquire+0x59/0x80
[  232.511115]  ? amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511117]  down_read+0x3e/0x70
[  232.511139]  ? amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511161]  amdgpu_mn_invalidate_range_start_hsa+0x28/0xc0 [amdgpu]
[  232.511162]  __mmu_notifier_invalidate_range_start+0x71/0xb0
[  232.511164]  __split_huge_pmd+0x296/0x890
[  232.511166]  unmap_page_range+0x353/0x930
[  232.511167]  ? reacquire_held_locks+0x67/0x170
[  232.511168]  unmap_single_vma+0x54/0xd0
[  232.511170]  zap_page_range+0xac/0x110
[  232.511170]  ? lock_acquire+0x59/0x80
[  232.511171]  ? find_vma+0x63/0x70
[  232.511172]  SyS_madvise+0x40a/0x880
[  232.511173]  ? trace_hardirqs_on_caller+0x11f/0x190
[  232.511174]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[  232.511175] RIP: 0033:0x7f35d0e827c7
[  232.511176] RSP: 002b:00007f35c3ffef18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[  232.511176] RAX: ffffffffffffffda RBX: 00007f35970028b0 RCX: 00007f35d0e827c7
[  232.511177] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007f35c37ff000
[  232.511177] RBP: 00007f35970028e8 R08: 0000000000000000 R09: 000000000000001e
[  232.511177] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000004
[  232.511178] R13: 00007f35970028c0 R14: 0000000000630250 R15: 0000000000000007
[  232.511179]  ? entry_SYSCALL_64_fastpath+0x1f/0xbe

On 2017-09-05 11:37 AM, Christian König wrote:
> From: Christian König <christian.koenig at amd.com>
>
> This is quite controversial because it adds another lock which is held during
> page table updates, but I don't see much other option.
>
> Signed-off-by: Christian König <christian.koenig at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 651a746..a5a560b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -196,6 +196,24 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
>  
>  		amdgpu_mn_invalidate_node(node, start, end);
>  	}
> +}
> +
> +/**
> + * amdgpu_mn_invalidate_range_end - callback to notify about mm change
> + *
> + * @mn: our notifier
> + * @mn: the mm this callback is about
> + * @start: start of updated range
> + * @end: end of updated range
> + *
> + * Release the lock again to allow new command submissions.
> + */
> +static void amdgpu_mn_invalidate_range_end(struct mmu_notifier *mn,
> +					   struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long end)
> +{
> +	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>  
>  	up_read(&rmn->lock);
>  }
> @@ -204,6 +222,7 @@ static const struct mmu_notifier_ops amdgpu_mn_ops = {
>  	.release = amdgpu_mn_release,
>  	.invalidate_page = amdgpu_mn_invalidate_page,
>  	.invalidate_range_start = amdgpu_mn_invalidate_range_start,
> +	.invalidate_range_end = amdgpu_mn_invalidate_range_end,
>  };
>  
>  /**