[PATCH 1/2] drm/prime: Iterate SG DMA addresses separately

Thu Apr 12 13:18:38 UTC 2018

On 12/04/18 11:33, Christian König wrote:
> Am 12.04.2018 um 11:49 schrieb Lucas Stach:
>> Am Donnerstag, den 12.04.2018, 11:35 +0200 schrieb Christian König:
>>> Am 12.04.2018 um 11:11 schrieb Lucas Stach:
>>>> Am Mittwoch, den 11.04.2018, 20:26 +0200 schrieb Christian König:
>>>>> Am 11.04.2018 um 19:11 schrieb Robin Murphy:
>>>>>> For dma_map_sg(), DMA API implementations are free to merge 
>>>>>> consecutive
>>>>>> segments into a single DMA mapping if conditions are suitable, 
>>>>>> thus the
>>>>>> resulting DMA addresses may be packed into fewer entries than
>>>>>> ttm->sg->nents implies.
>>>>>>
>>>>>> drm_prime_sg_to_page_addr_arrays() does not account for this, meaning
>>>>>> its callers either have to reject the 0 < count < nents case or risk
>>>>>> getting bogus addresses back later. Fortunately this is relatively 
>>>>>> easy
>>>>>> to deal with having to rejig structures to also store the mapped 
>>>>>> count,
>>>>>> since the total DMA length should still be equal to the total buffer
>>>>>> length. All we need is a separate scatterlist cursor to iterate 
>>>>>> the DMA
>>>>>> addresses separately from the CPU addresses.
>>>>> Mhm, I think I like Sinas approach better.
>>>>>
>>>>> See the hardware actually needs the dma_address on a page by page 
>>>>> basis.
>>>>>
>>>>> Joining multiple consecutive pages into one entry is just additional
>>>>> overhead which we don't need.

Note that the merging done inside dma_map_sg() is pretty trivial in 
itself (it's effectively just the inverse of the logic in this patch). 
The "overhead" here is inherent in calling sg_alloc_table_from_pages() 
and dma_map_sg() at all.

>>>> But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that might
>>>> be in front of your GPU to map large pages as such, causing additional
>>>> overhead by means of additional TLB misses and page walks on the IOMMU
>>>> side.
>>>>
>>>> And wouldn't you like to use huge pages at the GPU side, if the IOMMU
>>>> already provides you the service of producing one large consecutive
>>>> address range, rather than mapping them via a number of small pages?
>>> No, I wouldn't like to use that. We're already using that :)
>>>
>>> But we use huge pages by allocating consecutive chunks of memory so that
>>> both the CPU as well as the GPU can increase their TLB hit rate.
>> I'm not convinced that this is a universal win. Many GPU buffers aren't
>> accessed by the CPU and allocating huge pages puts much more strain on
>> the kernel MM subsystem.
> 
> Yeah, we indeed see the extra overhead during allocation.
> 
>>> What currently happens is that the DMA subsystem tries to coalesce
>>> multiple pages into on SG entry and we de-coalesce that inside the
>>> driver again to create our random access array.
>> I guess the right thing would be to have a flag that tells the the DMA
>> implementation to not coalesce the entries. But (ab-)using max segment
>> size tells the DMA API to split the segments if they are larger than
>> the given size, which is probably not what you want either as you now
>> need to coalesce the segments again when you are mapping real huge
>> pages.
> 
> No, even with huge pages I need an array with every single dma address 
> for a 4K pages (or whatever the normal page size is).
> 
> The problem is that I need a random access array for the DMA addresses, 
> even when they are continuously.

OK, that makes me wonder if you even need dma_map_sg() in this case. 
 From the sound of that it would be a lot simpler to just call 
dma_map_page() in a loop over the pair of arrays. AFAICS that's what TTM 
already does in most places.

> I agree that max segment size is a bit ugly here, but at least for 
> radeon, amdgpu and pretty much TTM in general it is exactly what I need.
> 
> I could fix TTM to not need that, but for radeon and amdgpu it is the 
> hardware which needs this.

Sorry, I don't follow - how does the hardware care about the format of 
an intermediate data structure used to *generate* the dma_address array? 
That's all that I'm proposing to fix here.

Robin.

> 
> Christian.
> 
>>
>>> That is a huge waste of memory and CPU cycles and I actually wanted to
>>> get rid of it for quite some time now. Sinas approach seems to be a good
>>> step into the right direction to me to actually clean that up.
>>>
>>>> Detecting such a condition is much easier if the DMA map implementation
>>>> gives you the coalesced scatter entries.
>>> A way which preserves both path would be indeed nice to have, but that
>>> only allows for the TLB optimization for the GPU and not the CPU any
>>> more. So I actually see that as really minor use case.
>> Some of the Tegras have much larger TLBs and better page-walk
>> performance on the system IOMMU compared to the GPU MMU, so they would
>> probably benefit a good deal even if the hugepage optimization only
>> targets the GPU side.
>>
>> Regards,
>> Lucas
>