"Fixes" for page flipping under PRIME on AMD & nouveau

Wed Aug 17 16:35:20 UTC 2016

On 08/17/2016 06:27 PM, Christian König wrote:
>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>> scanout mode from tiled to linear on the fly during flips.
> Well I'm not an expert on this, but as far as I know the bigger problem
> is that the dedicated AMD hardware generations you are targeting usually
> can't reliably scan out from system memory without a rather complicated
> setup.
>
> So that is a complete NAK to the radeon changes.

Hi Christian,

thanks for the feedback, but i think that's a misunderstanding. The
patches don't make the GPUs scan out from system memory. They just force
a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I
assume there is a more elegant/clean way than this "fake" pin/unpin
to GTT to tell the driver that its current VRAM content is stale and
needs a refresh from the up-to-date dmabuf in system RAM.
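
To illustrate what i mean, here is a minimal toy model of the problem and
of the workaround. It is not the patch code and not any real driver or ttm
API: ImportedBo, pin(), unpin(), flip() and run() are hypothetical names
made up for this sketch, and it tracks a single buffer only, so the display
gets stuck on one frame instead of alternating between two or three.

```python
from dataclasses import dataclass

VRAM, GTT = "VRAM", "GTT"


@dataclass
class ImportedBo:
    """Toy model of an imported dmabuf-backed buffer (hypothetical)."""
    domain: str = GTT   # memory domain the driver believes the bo is in
    shared: int = 0     # frame number currently in the dmabuf (system RAM)
    vram: int = -1      # frame number in the VRAM copy used for scanout
    pins: int = 0

    def pin(self, target):
        # A move (and thus a copy) only happens if the bo is not already
        # in the requested domain. This models the skipped DMA copy.
        if self.domain != target:
            if target == VRAM:
                self.vram = self.shared  # GTT -> VRAM copy
            # Moving to GTT copies nothing here: the dmabuf in system RAM
            # is already the up-to-date content.
            self.domain = target
        self.pins += 1

    def unpin(self):
        self.pins -= 1  # the placement is kept after unpin


def flip(bo, force_refresh):
    """Pin for scanout and return the frame that would be displayed."""
    if force_refresh and bo.domain == VRAM:
        # Workaround: "fake" pin/unpin into GTT, so the following pin
        # into VRAM has to copy again.
        bo.pin(GTT)
        bo.unpin()
    bo.pin(VRAM)
    shown = bo.vram
    bo.unpin()
    return shown


def run(force_refresh, frames=4):
    bo = ImportedBo()
    shown = []
    for frame in range(frames):
        bo.shared = frame  # render offload gpu writes a new frame
        shown.append(flip(bo, force_refresh))
    return shown


print(run(force_refresh=False))  # [0, 0, 0, 0] -> stale image
print(run(force_refresh=True))   # [0, 1, 2, 3] -> display updates
```

Without the forced move the second pin finds the buffer already in the
VRAM domain and skips the copy. With it, every flip picks up the new frame.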

Btw. i'll be offline for the next few hours, just wanted to get this out 
now.

thanks,
-mario

>
> Regards,
> Christian.
>
> Am 17.08.2016 um 18:12 schrieb Mario Kleiner:
>> Hi,
>>
>> i spent some time playing with DRI3/Present + PRIME for testing
>> how well it works for Optimus/Enduro style setups wrt. page flipping
>> on the current kernel/mesa/xorg. I want page flipping, because
>> neuroscience/medical applications need the reliable timing/timestamping
>> and tear-free presentation that we can currently only get via page
>> flipping, not via the copy-swap path.
>>
>> Intel as display gpu + nouveau for render offload worked nicely
>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>> and all.
>>
>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>> scanout mode from tiled to linear on the fly during flips. That's
>> a todo in itself. For the moment i used the ati-ddx with Option
>> "ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
>> HD-5770's into linear mode so page flipping can be used for
>> prime. The current modesetting-ddx will use page flipping in
>> any case as it doesn't detect the tiling format mismatch.
>>
>> nouveau uses page flips.
>>
>> Turns out that prime + page flipping currently doesn't work
>> on nouveau and amd. The first offload rendered images from
>> the imported dmabufs show up properly, but then the display
>> is stuck alternating between the first two or three rendered
>> frames.
>>
>> The problem is that during the pageflip ioctl we pin the
>> dmabuf into VRAM in preparation for scanout, then unpin it
>> when we are done with it at next flip, but the buffer stays
>> in the VRAM memory domain. Next time we flip to the buffer
>> again, the driver skips the DMA copy from GTT to VRAM during
>> pinning, because the buffer's content apparently already resides
>> in VRAM. Therefore it doesn't update the VRAM copy with the updated
>> dmabuf content in system RAM, so freshly rendered frames from the
>> prime export/render offload gpu never reach the display gpu and one
>> only sees stale images.
>>
>> The attached patches for nouveau and radeon kms seem to work
>> pretty ok: page flipping works, the display updates tear-free,
>> dmabuf fence sync works, and onset timing/timestamping is correct.
>> They simply pin the buffer back into GTT, then unpin, to force
>> a move of the buffer into the GTT domain, and thereby force the
>> following pin to do a new copy from GTT -> VRAM. The code tries
>> to avoid a useless copy from VRAM -> GTT during the pin op.
>>
>> However, the approach feels very much like a hack, so i assume
>> this is not the proper way of doing it? I looked at what ttm has
>> to offer, but couldn't find anything elegant and obvious. Maybe
>> there is a way to evict a bo without actually copying data back
>> to RAM? Or to invalidate the VRAM copy as stale? Maybe i just
>> missed something, as i'm not very familiar with ttm.
>>
>> Thoughts or suggestions?
>>
>> Another insight from my hacks so far is that nouveau seems to
>> be fast as prime exporter/render offload, but rather slow as
>> display gpu/prime importer, as tested on a 2008 or 2009
>> MacBookPro dual-Nvidia laptop.
>>
>> AMD, as tested with dual Radeon HD-5770, seems to be fast as prime
>> importer/display gpu, but very slow as prime exporter/render offload,
>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>> to draw a textured triangle strip to detile the gpu renderbuffer and
>> copy it into GTT. As drawing a textured fullscreen quad is normally
>> much faster, something special seems to be going on there wrt. DMA?
>> However, i don't have a realistic Enduro test setup with AMD
>> iGPU + dGPU, only this cobbled-together pair of HD-5770's in a MacPro,
>> so this could be wrong.
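
For scale, a rough back-of-the-envelope calculation of what 16 msecs per
1920x1080 frame means. The 4 bytes per pixel and the 60 Hz refresh rate
are assumptions made for this sketch, they are not stated above.

```python
width, height = 1920, 1080
bytes_per_pixel = 4      # assumption: 32 bpp framebuffer
copy_time_s = 0.016      # the 16 msecs quoted above
refresh_hz = 60          # assumption: 60 Hz display

frame_bytes = width * height * bytes_per_pixel
throughput_mb_s = frame_bytes / copy_time_s / 1e6
budget_fraction = copy_time_s * refresh_hz

print(f"{frame_bytes} bytes per frame")                # 8294400 bytes per frame
print(f"{throughput_mb_s:.0f} MB/s effective copy rate")
print(f"{budget_fraction:.0%} of one refresh interval")
```

Under those assumptions the copy alone runs at roughly half a GB/s and
takes up almost the whole refresh interval.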
>>
>> thanks,
>> -mario
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>