vmwgfx leaking bo pins?

Thomas Hellström (Intel) thomas_os at shipmail.org
Mon Mar 15 22:35:26 UTC 2021


On 3/15/21 9:38 PM, Daniel Vetter wrote:
> On Mon, Mar 15, 2021 at 6:57 PM Zack Rusin <zackr at vmware.com> wrote:
>> On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
>>> On 3/12/21 12:02 AM, Zack Rusin wrote:
>>>>> On Mar 11, 2021, at 17:35, Thomas Hellström (Intel)
>>>>> <thomas_os at shipmail.org> wrote:
>>>>>
>>>>> Hi, Zack
>>>>>
>>>>> On 3/11/21 10:07 PM, Zack Rusin wrote:
>>>>>>> On Mar 11, 2021, at 05:46, Thomas Hellström (Intel)
>>>>>>> <thomas_os at shipmail.org> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I tried latest drm-fixes today and saw a lot of these: Fallout from
>>>>>>> ttm rework?
>>>>>> Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was
>>>>>> in drm-misc-next in the drm-misc tree for a while but hasn’t been
>>>>>> merged for 5.12.
>>>>>>
>>>>>> z
>>>>>>
>>>>> Hmm, yes but doesn't that fix trip the ttm_bo_unpin()
>>>>> dma_resv_assert_held(bo->base.resv)?
>>>> No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be
>>>> working fine.
>>>>
>>>>
>>> With CONFIG_PROVE_LOCKING=y I see this:
>>>
>>> [    7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB
>>> [    7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB
>>> [    7.117291] INFO: trying to register non-static key.
>>> [    7.117295] the code is fine but needs lockdep annotation.
>>> [    7.117298] turning off the locking correctness validator
>>>
>>> Which will probably mask that dma_resv_assert_held(bo->base.resv)
>>>
>> Ah, yes, you're right. After fixing that I do see the
>> dma_resv_assert_held triggered. Technically trivially fixable with
>> ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little
>> ugly that some places e.g. vmw_bo_unreference will require
>> ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g.
>> vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete
>> without a very clear indication within the function which is which.

It looks like, like Daniel hints below, for the mob pagetable bos since 
they are pinned and hence not on a LRU list, the parent bo is holding 
the only reference, which is utilized in vmw_mob_unbind() to make sure 
the tryreserve always succeeds. (unpin could be called in vmw_mob_unbind 
for the pagetable bo just after fencing if necessary), and similarly for 
the other vmwgfx_mob error paths, but in that case one should probably 
keep the bo pointers in stack variables until you know you don't have to 
unpin. Then it should be ok to tryreserve around unpinning in the error 
paths.

But it's constructs like that, that really makes me think we shouldn't 
need to reserve to unpin.

> Not sure it applies here, but if the refcount is down to 0 we know
> we're in destruction code and can't race with anything anymore, so
> maybe we can lift the debug check.
>
> Otoh I think at that point we might still be on lru lists, so the
> rules become rather tricky whether it's really always legal to skip
> the dma_resv_lock. But we could perhaps figure out something if it's
> too annoying to have a consistent calling context in drivers.
>
> I'm not a huge fan of dropping the requirement from unpin and
> switching to atomic_t for the pin count, since pin/unpin is an
> extremely slow path, adding complexity in how we protect stuff for a
> function that's maybe called 60/s (for page flipping we pin/unpin)
> doesn't strike me as the right balance.

*If* we can protect the bo LRU state with the lru lock instead of with 
reservation it shuld probably only be a matter of extending the lru lock 
critical section over a couple of assignments. If we change the bo lru 
state we'd need to grab the lru lock sooner or later anyway, so I think 
the added complexity should be minimal.

/Thomas




More information about the dri-devel mailing list