vmwgfx leaking bo pins?
Thomas Hellström (Intel)
thomas_os at shipmail.org
Mon Mar 15 22:35:26 UTC 2021
On 3/15/21 9:38 PM, Daniel Vetter wrote:
> On Mon, Mar 15, 2021 at 6:57 PM Zack Rusin <zackr at vmware.com> wrote:
>> On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
>>> On 3/12/21 12:02 AM, Zack Rusin wrote:
>>>>> On Mar 11, 2021, at 17:35, Thomas Hellström (Intel)
>>>>> <thomas_os at shipmail.org> wrote:
>>>>>
>>>>> Hi, Zack
>>>>>
>>>>> On 3/11/21 10:07 PM, Zack Rusin wrote:
>>>>>>> On Mar 11, 2021, at 05:46, Thomas Hellström (Intel)
>>>>>>> <thomas_os at shipmail.org> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I tried latest drm-fixes today and saw a lot of these: Fallout from
>>>>>>> ttm rework?
>>>>>> Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was
>>>>>> in drm-misc-next in the drm-misc tree for a while but hasn’t been
>>>>>> merged for 5.12.
>>>>>>
>>>>>> z
>>>>>>
>>>>> Hmm, yes but doesn't that fix trip the ttm_bo_unpin()
>>>>> dma_resv_assert_held(bo->base.resv)?
>>>> No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be
>>>> working fine.
>>>>
>>>>
>>> With CONFIG_PROVE_LOCKING=y I see this:
>>>
>>> [ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB
>>> [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB
>>> [ 7.117291] INFO: trying to register non-static key.
>>> [ 7.117295] the code is fine but needs lockdep annotation.
>>> [ 7.117298] turning off the locking correctness validator
>>>
>>> Which will probably mask that dma_resv_assert_held(bo->base.resv)
>>>
>> Ah, yes, you're right. After fixing that I do see the
>> dma_resv_assert_held triggered. Technically trivially fixable with
>> ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little
>> ugly that some places e.g. vmw_bo_unreference will require
>> ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g.
>> vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete
>> without a very clear indication within the function which is which.
It looks like, like Daniel hints below, for the mob pagetable bos since
they are pinned and hence not on a LRU list, the parent bo is holding
the only reference, which is utilized in vmw_mob_unbind() to make sure
the tryreserve always succeeds. (unpin could be called in vmw_mob_unbind
for the pagetable bo just after fencing if necessary), and similarly for
the other vmwgfx_mob error paths, but in that case one should probably
keep the bo pointers in stack variables until you know you don't have to
unpin. Then it should be ok to tryreserve around unpinning in the error
paths.
But it's constructs like that, that really makes me think we shouldn't
need to reserve to unpin.
> Not sure it applies here, but if the refcount is down to 0 we know
> we're in destruction code and can't race with anything anymore, so
> maybe we can lift the debug check.
>
> Otoh I think at that point we might still be on lru lists, so the
> rules become rather tricky whether it's really always legal to skip
> the dma_resv_lock. But we could perhaps figure out something if it's
> too annoying to have a consistent calling context in drivers.
>
> I'm not a huge fan of dropping the requirement from unpin and
> switching to atomic_t for the pin count, since pin/unpin is an
> extremely slow path, adding complexity in how we protect stuff for a
> function that's maybe called 60/s (for page flipping we pin/unpin)
> doesn't strike me as the right balance.
*If* we can protect the bo LRU state with the lru lock instead of with
reservation it shuld probably only be a matter of extending the lru lock
critical section over a couple of assignments. If we change the bo lru
state we'd need to grab the lru lock sooner or later anyway, so I think
the added complexity should be minimal.
/Thomas
More information about the dri-devel
mailing list