CONFIG_DMA_CMA causes ttm performance problems/hangs.

Mario Kleiner at
Tue Aug 12 05:12:07 PDT 2014

On 08/11/2014 05:17 PM, Jerome Glisse wrote:
> On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
>> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
>>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>>>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>>>> Resent this time without HTML formatting which lkml doesn't like.
>>>>> Sorry.
>>>>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>>>>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>>>>>> Hellstrom<thellstrom at>  wrote:
>>>>>>>> Hi.
>>>>>>> Hey Thomas!
>>>>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more
>>>>>>>> than
>>>>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>>>>>> the
>>>>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>>>>>> shed
>>>>>>>> some light over this?
>>>>>>> It should allocate in batches and keep them in the TTM DMA pool for
>>>>>>> some time to be reused.
>>>>>>> The pages that it gets are in 4kb granularity though.
>>>>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>>>>>> allocations shouldn't get routed to CMA.
>>>>>> /Thomas
>>>>> Yes, seems you're both right. I read through the code a bit more and
>>>>> indeed the TTM DMA pool allocates only one page during each
>>>>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
>>>>> allocators don't check for single page CMA allocations and therefore
>>>>> try to get it from the CMA area anyway, instead of skipping to the
>>>>> much cheaper fallback.
>>>>> So the callers of dma_alloc_from_contiguous() could need that little
>>>>> optimization of skipping it if only one page is requested. For
>>>>> dma_generic_alloc_coherent
>>>>> <>
>>>>> andintel_alloc_coherent
>>>>> <>
>>>>> this
>>>>> seems easy to do. Looking at the arm arch variants, e.g.,
>>>>> and
>>>>> i'm not sure if it is that easily done, as there aren't any fallbacks
>>>>> for such a case and the code looks to me as if that's at least
>>>>> somewhat intentional.
>>>>> As far as TTM goes, one quick one-line fix to prevent it from using
>>>>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
>>>>> above methods) would be to clear the __GFP_WAIT
>>>>> <>
>>>>> flag from the
>>>>> passed gfp_t flags. That would trigger the well working fallback.
>>>>> So, is
>>>>> __GFP_WAIT
>>>>> <>
>>>>> needed
>>>>> for those single page allocations that go through__ttm_dma_alloc_page
>>>>> <>?
>>>>> It would be nice to have such a simple, non-intrusive one-line patch
>>>>> that we still could get into 3.17 and then backported to older stable
>>>>> kernels to avoid the same desktop hangs there if CMA is enabled. It
>>>>> would be also nice for actual users of CMA to not use up lots of CMA
>>>>> space for gpu's which don't need it. I think DMA_CMA was introduced
>>>>> around 3.12.
>>>> I don't think that's a good idea. Omitting __GFP_WAIT would cause
>>>> unnecessary memory allocation errors on systems under stress.
>>>> I think this should be filed as a DMA subsystem kernel bug / regression
>>>> and an appropriate solution should be worked out together with the DMA
>>>> subsystem maintainers and then backported.
>>> Ok, so it is needed. I'll file a bug report.
>>>>> The other problem is that probably TTM does not reuse pages from the
>>>>> DMA pool. If i trace the __ttm_dma_alloc_page
>>>>> <>
>>>>> and
>>>>> __ttm_dma_free_page
>>>>> <>
>>>>> calls for
>>>>> those single page allocs/frees, then over a 20 second interval of
>>>>> tracing and switching tabs in firefox, scrolling things around etc. i
>>>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>>>>> 1648 frees.
>>>> This is because historically the pools have been designed to keep only
>>>> pages with nonstandard caching attributes since changing page caching
>>>> attributes have been very slow but the kernel page allocators have been
>>>> reasonably fast.
>>>> /Thomas
>>> Ok. A bit more ftraceing showed my hang problem case goes through the
>>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
>>> it bouncing up and down by 4 pages all the time.
>>> But for the non-cached case, which i don't hit with my problem, could
>>> one of you look at line 954...
>>> ... and tell me why that unconditional npages = count; assignment
>>> makes sense? It seems to essentially disable all recycling for the dma
>>> pool whenever the pool isn't filled up to/beyond its maximum with free
>>> pages? When the pool is filled up, lots of stuff is recycled, but when
>>> it is already somewhat below capacity, it gets "punished" by not
>>> getting refilled? I'd just like to understand the logic behind that line.
>>> thanks,
>>> -mario
>> I'll happily forward that question to Konrad who wrote the code (or it
>> may even stem from the ordinary page pool code which IIRC has Dave
>> Airlie / Jerome Glisse as authors)
> This is effectively bogus code, i now wonder how it came to stay alive.
> Attached patch will fix that.

Yes, that makes sense to me. Fwiw,

Reviewed-by: Mario Kleiner < at>


More information about the dri-devel mailing list