[PATCH -fixes 5/5] drm/vmwgfx: Fix a buffer object eviction regression

Thu Sep 13 18:17:06 UTC 2018

On 09/13/2018 07:38 PM, Matthew Wilcox wrote:
> On Thu, Sep 13, 2018 at 06:52:43PM +0200, Thomas Hellstrom wrote:
>> On 09/13/2018 05:28 PM, Matthew Wilcox wrote:
>>> On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
>>>> On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
>>>>> I think this could be better though ... if ida_alloc() ever starts
>>>>> returning a different errno in the future, you'll hit the same problem,
>>>>> right?  So how about this ...
>>>>>
>>>>>     	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>>>> +	if (id == -ENOMEM)
>>>>> +		return -ENOMEM;
>>>>> +	if (id < 0)
>>>>> +		return 0;
>>>>>     	spin_lock(&gman->lock);
>>>>>
>>>>> But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
>>>>> some memory, you'll free up an ID, so the next time around you should
>>>>> be able to allocate an ID.  So shouldn't this function just have
>>>>> been doing this all along?
>>>>>
>>>>>     	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>>>> +	if (id < 0)
>>>>> +		return 0;
>>>>>
>>>> Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
>>>> buffer eviction will free up IDA space, so basically we need to target only
>>>> the error code that indicates we've run out of IDA space.
>>> Yes, but the following situation can happen:
>>>
>>>    - Allocate 1024 IDs
>>>    - Run very low on memory
>>>    - Allocating ID 1025 will fail (very very unlikely)
>>>    - ida_alloc_max() returns -ENOMEM
>>>
>>> In this situation, we want ttm_mem_evict_first() to be called which will
>>> free up one of the 1024 existing IDs and then we can allocate that ID for
>>> our new node.
>>>
>>> I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
>>> here.
>> Well, that's true, but that situation depends I guess very much on the radix
>> tree implementation of IDA?
> The specific number 1024 depends on the current implementation, but
> generally speaking at some point, the IDA has to allocate memory to store
> one extra ID.  Because the IDA cannot allocate memory when freeing an
> ID, there is no more compact representation of the allocated IDs smaller
> than a bitmap.
>
>> Also I would expect the eviction paths to try to
>> allocate more memory here and there, so to me the preferred option when
>> -ENOMEM happens, is really to back off as soon as possible to avoid
>> interfering with shrinker work going on etc.
> I would be surprised if freeing resources needs memory to be allocated.
> That's not supposed to happen; the filesystems go to great lengths to
> pre-allocate enough memory that they can always write at least one dirty
> page back to storage without allocating any memory, for example.  Maybe
> the DRM subsystem is different; I'm not an expert in your subsystem.

It's different. In particular when evicting a large buffer from VRAM 
(which is on-card memory) to system memory, the subsystem may allocate a 
huge amount of memory. But this particular case is not evicting from 
VRAM (although it may lead to it).

>>> My point was that your solution (detect the one error which should be
>>> deemed as non-fatal) was not as robust as its inverse (detect the one
>>> error which the previous code deemed as fatal).  But I now believe no
>>> error from the IDA should be seen as fatal.
>> If you insist, I can test on -ENOMEM instead of -ENOSPC to mimic the
>> pre-change behaviour. We should really focus on the IDA api changes here,
>> and defer changing -ENOMEM to non-fatal to a follow-up patch if needed.
> I'd be comfortable with that solution for now.

OK, I'll respin and check for -ENOMEM instead.

Thanks,
Thomas