[Intel-gfx] [PATCH] drm: Give the DRM device's anon_inode a unique lockclass for its mmap_rwsem

Mon Dec 11 17:44:28 UTC 2017

On Mon, Dec 11, 2017 at 6:27 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-12-11 17:20:32)
>> On Mon, Dec 11, 2017 at 11:39 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > Teach lockdep to track the device's internal mmapping separately
>> > from the generic lockclass over all other inodes. Since this is device
>> > private we wish to allow a different locking hierarchy than is typified
>> > by the requirement for the mmap_rwsem being the outermost lock for
>> > handling pagefaults. By giving the internal mmap_rwsem a distinct
>> > lockclass, lockdep can identify it and learn/enforce its distinct locking
>> > requirements.
>> >
>> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104209
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>
>> I think both the commit message and comment are a bit too fluffy - the
>> critical bit is that we're biting ourselves on gtt mmaps from
>> userptr, and that's explicitly not allowed exactly because it would
>> deadlock.
>>
>> I'm also not sure it's a good idea to implement this in generic code,
>> since this is a very i915 specific issue, and other drivers (who might
>> be a lot less sloppy here) will now no longer get reports about this
>> deadlock.
>
> I was thinking that in a more general sense manipulation of the
> vma_manager's inode is independent of the process's mappings. As such
> we do not want to tie the two together and force them to conform to the
> same rules, because the core mapping semaphore will be held on entry to
> driver code, but the internal mapping will be used from within driver
> code.

I think they're the same locks really. Maybe I'm missing something,
but I thought the mapping->rwsem we get on mmap/fault is exactly the
one we want/need to use for zap_pte.

Looking at the bugzilla trace I think the deadlock happens when the
i915_gem_userptr_mn_invalidate_range_start callback calls
flush_workqueue for a range that is itself not allowed to be
userptr-mapped. But because it does that, we end up in a deadlock. I
think if the userptr callback checked the range it gets against all
the userptr mappings, we'd avoid this deadlock: userptr is not allowed
to map a gtt range, which means this should avoid calling
flush_workqueue while holding our drm mapping->rwsem.
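
As a rough illustration of the range check suggested above, here is a
userspace sketch, not i915 code. The struct userptr_range type and the
invalidate_needs_flush() helper are hypothetical names made up for this
example, and the flat array stands in for whatever structure the driver
actually uses to track its userptr objects. The idea is only that the
callback flushes when the invalidated range overlaps a userptr mapping
and skips the flush otherwise:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one userptr object's address range [start, end). */
struct userptr_range {
	unsigned long start;
	unsigned long end;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool ranges_overlap(unsigned long a_start, unsigned long a_end,
			   unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}

/*
 * Return true only if the invalidated range [start, end) touches at least
 * one tracked userptr range. A caller would flush its workqueue only in
 * that case; an invalidate of a range that holds no userptr mapping
 * (e.g. a gtt mmap) would return false and skip the flush.
 */
static bool invalidate_needs_flush(const struct userptr_range *objs,
				   size_t count,
				   unsigned long start, unsigned long end)
{
	size_t i;

	assert(start < end);

	for (i = 0; i < count; i++)
		if (ranges_overlap(objs[i].start, objs[i].end, start, end))
			return true;

	return false;
}
```

The linear scan is only for brevity. This sketch also says nothing about
whether skipping the flush is enough to silence lockdep, since lockdep
would still see the same lock classes.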

So there seems to be a real deadlock, at least in my current understanding.

Of course if we'd fix that deadlock we'd still have lockdep
complaining, but maybe the deadlock fix also gets rid of the lockdep
splat (but that would be more rework than just making the flush_work
conditional).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch