"Fixes" for page flipping under PRIME on AMD & nouveau

Wed Aug 17 17:02:48 UTC 2016

On 17.08.2016 at 18:35, Mario Kleiner wrote:
> On 08/17/2016 06:27 PM, Christian König wrote:
>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>> scanout mode from tiled to linear on the fly during flips.
>> Well I'm not an expert on this, but as far as I know the bigger problem
>> is that the dedicated AMD hardware generations you are targeting usually
>> can't reliably scan out from system memory without a rather complicated
>> setup.
>>
>> So that is a complete NAK to the radeon changes.
>
> Hi Christian,
>
> thanks for the feedback, but i think that's a misunderstanding. The 
> patches don't make them scanout from system memory, they just enforce 
> a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. 
> I just assume there is a more elegant/clean way than this "fake" 
> pin/unpin to GTT to essentially tell the driver that its current VRAM 
> content is stale and needs a refresh from the up to date dmabuf in 
> system RAM.

I was already wondering how the heck you got that working.

What do you mean by a fresh copy from GTT to VRAM? A buffer exported 
by DMA-buf should never move as long as it is exported, same for a 
buffer pinned to VRAM.

So using a DMA-buf for scanout is impossible, and actually not valuable 
either, because it shouldn't matter whether we copy from GTT to VRAM 
because of a buffer migration or because of a copy triggered by the DDX.

What are you actually trying to do here?

Regards,
Christian.

>
> Btw. i'll be offline for the next few hours, just wanted to get this 
> out now.
>
> thanks,
> -mario
>
>>
>> Regards,
>> Christian.
>>
>> On 17.08.2016 at 18:12, Mario Kleiner wrote:
>>> Hi,
>>>
>>> i spent some time playing with DRI3/Present + PRIME for testing
>>> how well it works for Optimus/Enduro style setups wrt. page flipping
>>> on the current kernel/mesa/xorg. I want page flipping, because
>>> neuroscience/medical applications need the reliable timing/timestamping
>>> and tear-free presentation that we can currently get only via page
>>> flipping, not via the copy-swap path.
>>>
>>> Intel as display gpu + nouveau for render offload worked nicely
>>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>>> and all.
>>>
>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>> scanout mode from tiled to linear on the fly during flips. That's
>>> a todo in itself. For the moment i used the ati-ddx with Option
>>> "ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
>>> HD-5770's into linear mode so page flipping can be used for
>>> prime. The current modesetting-ddx will use page flipping in
>>> any case as it doesn't detect the tiling format mismatch.
>>>
>>> nouveau uses page flips.
>>>
>>> Turns out that prime + page flipping currently doesn't work
>>> on nouveau and amd. The first offload rendered images from
>>> the imported dmabufs show up properly, but then the display
>>> is stuck alternating between the first two or three rendered
>>> frames.
>>>
>>> The problem is that during the pageflip ioctl we pin the
>>> dmabuf into VRAM in preparation for scanout, then unpin it
>>> when we are done with it at next flip, but the buffer stays
>>> in the VRAM memory domain. Next time we flip to the buffer
>>> again, the driver skips the DMA copy from GTT to VRAM during
>>> pinning, because the buffer's content apparently already resides
>>> in VRAM. Therefore it doesn't update the VRAM copy with the updated
>>> dmabuf content in system RAM, so freshly rendered frames from the
>>> prime export/render offload gpu never reach the display gpu and one
>>> only sees stale images.
>>>
>>> The attached patches for nouveau and radeon kms seem to work
>>> pretty ok, page flipping works, display updates, tear-free,
>>> dmabuf fence sync works, onset timing/timestamping is correct.
>>> They simply pin the buffer back into GTT, then unpin, to force
>>> a move of the buffer into the GTT domain, and thereby force the
>>> following pin to do a new copy from GTT -> VRAM. The code tries
>>> to avoid a useless copy from VRAM -> GTT during the pin op.
>>>
>>> However, the approach feels very much like a hack, so i assume
>>> this is not the proper way of doing it? I looked at what ttm has
>>> to offer, but couldn't find anything elegant and obvious. Maybe
>>> there is a way to evict a bo without actually copying data back
>>> to RAM? Or to invalidate the VRAM copy as stale? Maybe i just
>>> missed something, as i'm not very familiar with ttm.
>>>
>>> Thoughts or suggestions?
>>>
>>> Another insight from my hacks so far is that nouveau seems to
>>> be fast as prime exporter/renderoffload, but rather slow as
>>> display gpu/prime importer, as tested on a 2008 or 2009
>>> MacBookPro dual-Nvidia laptop.
>>>
>>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>>> importer/display gpu, but very slow as prime exporter/render offload,
>>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>>> to draw a textured triangle strip to detile the gpu renderbuffer and
>>> copy it into GTT. As drawing a textured fullscreen quad is normally
>>> much faster, something special seems to be going on there wrt. DMA?
>>> However, I don't have a realistic Enduro test setup with an AMD
>>> iGPU + dGPU, only this cobbled-together pair of HD-5770s in a MacPro,
>>> so this could be wrong.
>>>
>>> thanks,
>>> -mario
>>>
>>
>>