[PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

Mon Apr 30 16:33:24 UTC 2018

On 2018-04-29 09:02 AM, Christian König wrote:
> Am 29.04.2018 um 01:02 schrieb Michel Dänzer:
>> On 2018-04-28 06:30 PM, Ilia Mirkin wrote:
>>> On Fri, Apr 27, 2018 at 9:08 AM, Michel Dänzer <michel at daenzer.net>
>>> wrote:
>>>> From: Michel Dänzer <michel.daenzer at amd.com>
>>>>
>>>> Previously, TTM would always (with CONFIG_TRANSPARENT_HUGEPAGE enabled)
>>>> try to allocate huge pages. However, not all drivers can take advantage
>>>> of huge pages, but they would incur the overhead for allocating and
>>>> freeing them anyway.
>>>>
>>>> Now, drivers which can take advantage of huge pages need to set the new
>>>> flag TTM_PAGE_FLAG_TRANSHUGE to get them. Drivers not setting this flag
>>>> no longer incur any overhead for allocating or freeing huge pages.
>>>>
>>>> v2:
>>>> * Also guard swapping of consecutive pages in ttm_get_pages
>>>> * Reword commit log, hopefully clearer now
>>>>
>>>> Cc: stable at vger.kernel.org
>>>> Signed-off-by: Michel Dänzer <michel.daenzer at amd.com>
>>> Both I and lots of other people, based on reports, are still seeing
>>> plenty of issues with this as late as 4.16.4.
>> "lots of other people", "plenty of issues" sounds a bit exaggerated from
>> what I've seen. FWIW, while I did see the original messages myself, I
>> haven't seen any since Christian's original fix (see below), neither
>> with amdgpu nor radeon, even before this patch you followed up to.
>>
>>
>>> Admittedly I'm on nouveau, but others have reported issues with
>>> radeon/amdgpu as well. It's been going on since the feature was merged
>>> in v4.15, with what seems like little investigation from the authors
>>> introducing the feature.
>> That's not a fair assessment. See
>> https://bugs.freedesktop.org/show_bug.cgi?id=104082#c40 and following
>> comments.
>>
>> Christian fixed the original issue in
>> d0bc0c2a31c95002d37c3cc511ffdcab851b3256 "swiotlb: suppress warning when
>> __GFP_NOWARN is set". Christian did his best to try and get the fix in
>> before 4.15 final, but for reasons beyond his control, it was delayed
>> until 4.16-rc1 and then backported to 4.15.5.
>>
>> Unfortunately, there was an swiotlb regression (not directly related to
>> Christian's work) shortly after this fix, also in 4.16-rc1, which is now
>> fixed in 4.17-rc1 and will be backported to 4.16.y.
> 
> And that's exactly the reason why I intentionally kept this enabled for
> all users of the TTM DMA page pool and not put it behind a flag.
> 
> This change has surfaced quite a number of bugs in the swiotlb code
> which could have caused issues before. It's just that those code path
> where never exercised massively before.
> 
> Additional to that using huge pages is beneficial for the MM and CPU TLB
> (not implemented yet) even when the GPU driver can't make much use of it.

Do I understand correctly that you're against this patch?

AFAIU the only benefit of huge pages with a driver which doesn't take
advantage of them directly is "for the MM". Can you describe a bit more
what that benefit is exactly? Is it expected to outweigh the cost of
allocating / freeing huge pages?

>> It looks like there's at least one more bug left, but it's not clear yet
>> when that was introduced, whether it's directly related to Christian's
>> work, or indeed what the impact is. Let's not get ahead of ourselves.
> 
> Well my patches surfaced the problems, but the underlying issues where
> present even before those changes and I'm very well involved in fixing
> the underlying issues.
> 
> I even considered to just revert the huge page path for the DMA pool
> allocator, but it's just that the TTM patches seem to work exactly as
> they are intended. So that doesn't feel like doing the right thing here.

I agree. Has anyone reported this to the DMA/SWIOTLB developers?

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer