[PATCH 05/13] drm/ttm: overhaul memory accounting

Fri Nov 11 08:22:26 PST 2011

On 11/11/2011 04:47 PM, Jerome Glisse wrote:
> On Fri, Nov 11, 2011 at 08:49:39AM +0100, Thomas Hellstrom wrote:
>    
>> On 11/11/2011 12:33 AM, Jerome Glisse wrote:
>>      
>>> On Thu, Nov 10, 2011 at 09:05:22PM +0100, Thomas Hellstrom wrote:
>>>        
>>>> On 11/10/2011 07:05 PM, Jerome Glisse wrote:
>>>>          
>>>>> On Thu, Nov 10, 2011 at 11:27:33AM +0100, Thomas Hellstrom wrote:
>>>>>            
>>>>>> On 11/09/2011 09:22 PM, j.glisse at gmail.com wrote:
>>>>>>              
>>>>>>> From: Jerome Glisse<jglisse at redhat.com>
>>>>>>>
>>>>>>> This is an overhaul of the ttm memory accounting. This tries to keep
>>>>>>> the same global behavior while removing the whole zone concept. It
>>>>>>> keeps a distinction for dma32 so that we make sure that ttm doesn't
>>>>>>> starve the dma32 zone.
>>>>>>>
>>>>>>> There are 4 thresholds for memory allocation:
>>>>>>> - max_mem is the maximum memory the whole ttm infrastructure is
>>>>>>>    going to allow allocation for (exception of system process see
>>>>>>>    below)
>>>>>>> - emer_mem is the maximum memory allowed for system process, this
>>>>>>>    limit is > max_mem
>>>>>>> - swap_limit is the threshold at which point ttm will start to
>>>>>>>    try to swap objects because ttm is getting close to the max_mem
>>>>>>>    limit
>>>>>>> - swap_dma32_limit is the threshold at which point ttm will start
>>>>>>>    to swap objects to try to reduce the pressure on the dma32 zone.
>>>>>>>    Note that we don't specifically target which objects to swap, so
>>>>>>>    it might very well free more memory from highmem rather than
>>>>>>>    from dma32
>>>>>>>
>>>>>>> Accounting is done through used_mem & used_dma32_mem, whose sum gives
>>>>>>> the total amount of memory actually accounted by ttm.
>>>>>>>
>>>>>>> The idea is that allocation will fail if (used_mem + used_dma32_mem) >
>>>>>>> max_mem and if swapping fails to make enough room.
>>>>>>>
>>>>>>> The used_dma32_mem can be updated at a later stage, allowing us to
>>>>>>> perform the accounting test before allocating a whole batch of pages.
>>>>>>>
>>>>>>>                
>>>>>> Jerome, you're removing a fair amount of functionality here, without
>>>>>> justifying why it could be removed.
>>>>>>              
>>>>> All this code was overkill.
>>>>>            
>>>> [1] I don't agree, and since it's well tested, thought through and
>>>> working, I see no obvious reason to alter it within the context of
>>>> this patch series unless it's absolutely required for the
>>>> functionality.
>>>>          
>>> Well one thing I can tell is that it doesn't work on radeon. I pushed
>>> a test to libdrm and here it's the oom that starts doing its beating.
>>> Anyway I won't alter it. I was just trying to make it work, i.e. be
>>> useful while also being simpler.
>>>        
>> Well if it doesn't work it should of course be fixed.
>>
>> I'm not against fixing it nor making it simpler, but I think that
>> requires a detailed understanding of what's going wrong and how it
>> needs to be fixed. Not as part of a patch series that really tries
>> to accomplish something else.
>>
>> The current code was tested extensively with psb and unichrome.
>> One good test for drivers with bo-backed textures is to continuously
>> create fairly large texture images. The end result should be the
>> swap space starting to fill up and once there is no more swap space,
>> the OOM killer should kill your app, and kmalloc failures should be
>> avoided. It should be tricky to get a failure from the global alloc
>> system, but a huge amount of small buffer objects or fence objects
>> should probably do it.
>>
>> Naturally, that requires that all persistent drm objects created
>> from user-space are registered with their correct sizes, or at least
>> a really good size approximation. That includes things like gem
>> flinks, that could otherwise easily be exploited to bring a system
>> down, simply by guessing a gem name and creating flinks to that name
>> in an infinite loop.
>>
>> What are the symptoms of the failure you're seeing with Radeon? Any
>> suggestions on why it happens?
>>
>> Thanks,
>> Thomas
>>      
> I pushed my test case to libdrm yesterday. I basically alloc ttm objects
> of 1 page in a loop and expect it to fail. I modified the kernel to
> account 2 pages for the ttm_buffer_object struct size so that the kernel
> area should be exhausted long before I run out of memory on an 8G
> config. What happens is that the oom starts killing everything except
> my app; even the kernel logger daemon got killed before my app ...
>
> I think the ttm_memory accounting for kernel object is not the right
> way.
>    
....

So, yet again, TTM gets incorrectly blamed when things are not
working as expected.

The TTM memory accounting is designed to avoid pinning too much memory
for graphics, so that it can't be used by the rest of the system. It's
working well doing exactly that.

However, it can't stop your app from wanting to store too much data. It
just shuffles that data to swap. If too many apps want to store too much
data, eventually the computer runs out of swap space and the OOM killer
kicks in and tries to guess which app to kill. That's not TTM's
business. Nor is it DRM's business.

The only time the TTM memory accounting system blocks an allocation is
if there is too much pinned memory allocated (kmalloc, vmalloc) that it
can't release to swap space. It protects against kmalloc failures, but
it makes no attempt to stop your app from wanting to store too much
data.
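
To make the distinction concrete, here is a rough user-space sketch of
that policy. It is not the ttm_memory.c code: the struct, the field
names (max_pinned, swap_limit, swappable) and the helper functions are
all hypothetical, and "swapping" is modelled as simply dropping bytes
from the accounted total. The point is only that a request is denied
when the accounted memory cannot be brought under the ceiling by
swapping, and in no other case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting state; names are illustrative only. */
struct acct {
	uint64_t max_pinned;  /* ceiling on accounted bytes */
	uint64_t swap_limit;  /* above this, try to swap buffers out */
	uint64_t used;        /* bytes currently accounted */
	uint64_t swappable;   /* part of 'used' that could go to swap */
};

enum acct_result { ACCT_OK, ACCT_OK_AFTER_SWAP, ACCT_DENIED };

/* Stand-in for swapping: release up to 'want' swappable bytes. */
static uint64_t acct_swap_out(struct acct *a, uint64_t want)
{
	uint64_t freed = want < a->swappable ? want : a->swappable;

	a->swappable -= freed;
	a->used -= freed;
	return freed;
}

/*
 * Account 'size' bytes. Swapping is attempted once the request would
 * cross swap_limit; the request is refused only if, after swapping,
 * it would still exceed max_pinned.
 */
static enum acct_result acct_alloc(struct acct *a, uint64_t size,
				   bool is_swappable)
{
	enum acct_result res = ACCT_OK;

	if (a->used + size > a->swap_limit) {
		uint64_t over = a->used + size - a->swap_limit;

		if (acct_swap_out(a, over) > 0)
			res = ACCT_OK_AFTER_SWAP;
	}

	if (a->used + size > a->max_pinned)
		return ACCT_DENIED;

	a->used += size;
	if (is_swappable)
		a->swappable += size;
	return res;
}
```

In this model an app that keeps creating swappable buffers never sees
ACCT_DENIED; its data just moves to swap until swap runs out, which is
where the OOM killer comes in. Only unswappable allocations can hit
the ceiling.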

/Thomas